paragraph.getRuns() to separate a paragraph - apache-poi

When I use Apache poi to change the date of a contract automatically, i am very confused at the how dose paragraph.getRuns() separate a paragraph. I have the following paragraph
自 2014 年 10 月 1 日起至 2014 年 10 月 31 日止
I use the following code to see how many XWPFRun does paragraph.getRuns() return
String currentParagraph = "";
for(XWPFRun xwpfRun : paragraph.getRuns()){
currentParagraph += xwpfRun.getText(0);
System.out.println(currentParagraph);
}
I find the first five number are all a xwpfRun independently,eg ,2014,10,1
but the last number "31" was separated into two xwpfRun :"3", and "1";
This makes it hard to change the date by xwpfRun ,and I want to know How to deal with this and how does paragraph.getRuns() works?

Sometimes text in DOCX files gets broken up into an arbitrary number of runs. While inconvenient, it's not to difficult to deal with.
The solution is to iterate all runs in the paragraph, and concatenate the text from each to a string. Then, update the date and store it as the text of the first run. Lastly, you can remove or set the text in the other runs to "".

Related

Using REGEX to grab the information after the match

I ran a PDF through a series of processes to extra the text from it. I was successful in that regard. However, now I want to extract specific text from documents.
The document is set up as a multi lined string (I believe. when I paste it into Word the paragraph character is at the end of each line):
Send Unit: COMPLETE
NOA Selection: 20-0429.07
#for some reason, in this editor, despite the next line having > infront of it, the following line (Pni/Trk) keeps wrapping up to the line above. This doesn't exist in the actual doc.
Pni/Trk: 3 Panel / 3 Track
Panel Stack: STD
Width: 142.0000
The information is want to extract are the numbers following "NOA Selection:".
I know I can do a regex something to the effect of:
pattern = re.compile(r'NOA\sSelection:\s\d*-\d*\.\d*)
but I only want the numbers after the NOA selection, especially because NOA Selection will always be the same but the format of the numbers/letters/./-/etc. can vary pretty wildly. This looked promising but it is in Java and I haven't had much luck recreating it in Python.
I think I need to use (?<=...), but haven't been able to implement it.
Also, several of the examples show the string stored in the python file as a variable, but I'm trying to access it from a .txt file, so I might be going wrong there. This is what I have so far.
with open('export1.txt', 'r') as d:    
contents = d.read()    
p = re.compile('(?<=NOA)')
s = re.search(p, contents)
print(s.group())
Thank you for any help you can provide.
With your shown samples, you could try following too. For sample 20-0429.07 I have kept .07 part optional in regex in case you have values 20-0429 only it should work for those also.
import re
val = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'NOA\s+Selection:\s+(\d+-\d+(?:\.\d+)?)', val)
print(matches)
['20-0429.07']
Explanation: Adding detailed explanation(only for explanation purposes).
NOA\s+Selection:\s+ ##matching NOA spaces(1 or more occurrences) Selection: spaces(1 or more occurrences)
(\d+-\d+(?:\.\d+)?) ##Creating capturing group matching(1 or more occurrences) digits-digits(1 or more occurrences)
##and in a non-capturing group matching dot followed by digits keeping it optional.
Keeping it simple, you could use re.findall here:
inp = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'\bNOA Selection: (\S+)', inp)
print(matches) # ['20-0429.07']

how to modify textfile using U-SQL

I have a large file of around 130MB containing 10 A characters in each line and \t at the end of 10th "A" character, I want to extract this text file and then change all A's to B's. Can any one help with its code snippet?
this is what I have wrote till now
USE DATABASE imodelanalytics;
#searchlog =
EXTRACT characters string
FROM "/iModelAnalytics/Samples/Data/dummy.txt"
USING Extractors.Text(delimiter: '\t', skipFirstNRows: 1);
#modify =
SELECT characters AS line
FROM #searchlog;
OUTPUT #modify
TO "/iModelAnalytics/Samples/Data/B.txt"
USING Outputters.Text();
I'm new to this, so any suggestions will be helpful ! Thanks
Assuming all of the field would be AAAAAAAAAA then you could write:
#modify = SELECT "BBBBBBBBBB" AS characters FROM #searchlog;
If only some are all As, then you would do it in the SELECT clause:
#modify =
SELECT (characters == "AAAAAAAAAA" ? "BBBBBBBBBB" : characters) AS characters
FROM #searchlog;
If there are other characters around the AAAAAAAAAA then you would use more of the C# string functions to find them and replace them in a similar pattern.

NPOI: Achieve Currency format as if formatted by Excel

I have seen some questions (like this one) here asking about if a cell in Excel can be formatted by NPOI/POI as if formatted by Excel. As most of you, I have to deal with issues with Currency and DateTime. Here let me ask how the formatting can be achieved as if it has been formatted by Excel? (I will answer this question myself as to demonstrate how to do it.)
Setting: Windows 10, English, Region: Taiwan
Excel format: XLSX (version 2007 and later)
(Sorry about various edit of this question as I have pressed the 'Enter' button at unexpected time.)
If you format a cell as Currency, you have 4 choices:
The internal format of each style is as follow:
-NT$1,234.10
<numFmt formatCode=""NT$"#,##0.00" numFmtId="164"/>
[RED]NT$1,234.10
<numFmt formatCode=""NT$"#,##0.00;[Red]"NT$"#,##0.00" numFmtId="164"/>
-NT$1,234.10
<numFmt formatCode=""NT$"#,##0.00_);("NT$"#,##0.00)" numFmtId="7"/>
[RED]-NT$1,234.10
<numFmt formatCode=""NT$"#,##0.00_);[Red]("NT$"#,##0.00)" numFmtId="8"/>
Note: There is a pair of double quote (") comes before and after NT$.
(To get internal format of XLSX, just unzip it. The Style information is available in <unzip dir>\xl\Styles.xml Check out this answer if you need more information.)
(FYI: In formatCode, the '0' represent a digit. The '#' also represent a digit, but will not appear if the number is not large enough. So any number less than 1000 will not have the comma inside it. The '_' is a space holder. In format 3, '1.75' appears as 'NT$1.75 '. The last one is a space.)
(FYI: In numFmtId, for case 1 and case 2, number 164 is for user-defined. For case 3 and 4, number 7 and 8 are build-in style.)
For developers using POI/NPOI, you may find out if you format your currency column using Build In Format using 0x7 or 0x8, you can get only the third or fourth choice. You cannot get the first or second choice.
To get the first choice, you build upon style 0x7 "$#,##0.00);($#,##0.00)". You need to add the currency symbol and the pair of double quotes in front of it.
styleCurrency.DataFormat = workbook.CreateDataFormat().GetFormat("\"NT$\"#,##0.00");
Apply this format to a cell with number. Once you open the Excel result file, right click to check formatting, you will see the first choice.
Please feel free to comment on this post.
var cell5 = row.CreateCell(5, CellType.Numeric);
cell5.SetCellValue(item.OrderTotal);
var styleCurrency = workbook.CreateCellStyle();
styleCurrency.DataFormat= workbook.CreateDataFormat().GetFormat(string.Format("\"{0}\"#,##0.00", item.CurrencySymbol));//styleCurrency;
cell5.CellStyle = styleCurrency;
styleCurrency = null;
Iterate over loop for multiple currency.
Function to GetCurrencySymbol against currency Code on C#
private string GetCurencySymbol(string isOcurrencyCode)
{
return CultureInfo.GetCultures(CultureTypes.AllCultures).Where(c => !c.IsNeutralCulture)
.Select(culture =>
{
try
{
return new RegionInfo(culture.LCID);
}
catch
{
return null;
}
})
.Where(ri => ri != null && ri.ISOCurrencySymbol == isOcurrencyCode)
.Select(ri => ri.CurrencySymbol).FirstOrDefault();}

Tell if specific char in string is a long char or a short char

Be prepared, this is one of those hard questions.
In Farsi or Persian language ی which sounds like y or i and is written in 4 different shapes according to it's place in word. I'll call ی as YA from now for simplification.
take a look at this image
All YA characters are painted in red, in the first word YA is attached to it's previous (right , in Farsi we right from RIGHT to LEFT) character and is free at the end whereas the last YA (3rd word, left-most red char) is free both from left or right.
Having said this long story, I want to find out if a part of a string ends with long YA (YA without points) or short YA (YA with two points beneath it).
i.e تحصیلداری (the 3rd word) ends with long YA but تحصیـ which is a part of 3rd word does not ends with short YA.
Question: How can I say تحصیلداری ends whit which unicode? I just have a simple string, "تحصیلداری", how can I convert its characters to unicode?
I tried the unicodes
string unicodes = "";
foreach (char c in "تحصیلداری")
{
unicodes += c+" "+((int)c).ToString() + Environment.NewLine;
}
MessageBox.Show(unicodes);
result :
but at the end of the day unfortunately all YAs have the same unicode.
Bad news : YA was an example, a real one though. There are also a dozen of other characters like YA with different appearances too.
Additional info :
using this useful link about unicodes I found unicode of different YAs
We solved similar problem the way bellow:
We had a core banking application, the customer sub-system needed a full text search on customers name, family, father name etc.
Different encoding, legacy migrated data, keyboard layouts and Farsi fonts ... made search process inaccurate.
We overcame the problem by replacing problematic characters with some standard one and saving the standard string for search purpose.
After several iterations, the replacement is as bellow that may come in handy:
Formula="UPPER(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(FirsName || LastName || FatherName,
chr(32),''),
chr(13),''),
chr(9),''),
chr(10),''),
'-',''),
'-',''),
'آ','ا'),
'أ', 'ا'),
'ئ', 'ي'),
'ي', 'ي'),
'ك', 'ک'),
'آإئؤةي','اايوهي'),
'ء',''),
'شأل','شاال'),
'ا.','اله'),
'.',''),
'الله','اله'),
'ؤ','و'),
'إ','ا'),
'ة','ه'),
' ا لله','اله'),
'ا لله','اله'),
' ا لله','اله'))"
Despite there are different YEHs in Unicode, it must noticed that all presentation forms of YEHs are same Unicode character with code 0x06cc. You can not determine presentation forms by their Unicode code.
But you can reach your goal be checking to see what characters is before or after YEH.
You can also use Fardis to see Unicode codes of strings.

Seperated text line in Apache POI XWPFRun object

I 'm trying to replace a template DOCX document with Apache POI by using the XWPFDocument class. I have tags in the doc and a JSON file to read the replacement data. My problem is that a text line seems separated in a certain way in DOCX when I change its extension to ZIP file and open document.xml. For example [MEMBER_CONTACT_INFO] text becomes [MEMBER_CONTACT_INFO and ] separately. POI reads this in the same way since the DOCX original is like this. This creates 2 XWPFRun objects in the paragraph which show the text as [MEMBER_CONTACT_INFO and ] separately.
My question is, is there a way to force POI to run like Word via merging related runs or something like that? Or how can I solve this problem? I 'm matching run texts while replacing and I can't find my tag because it is split into 2 different run object.
Best
This wasted so much of my time once...
Basically, an XWPFParagraph is composed of multiple XWPFRuns, and XWPFRun is a contagious text that has a fixed same style.
So when you try writing something like "[PLACEHOLDER_NAME]" in MS-Word it will create a single XWPFRun. But if you somehow add a few things more, and then you go back and change "[PLACEHOLDER_NAME]" to something else it is never guaranteed that it will remain a single XWPFRun it is quite possible that it will split to two Runs. AFAIK this is how MS-Word works.
How to avoid splitting of Runs in such cases?
Solution: There are two solutions that I know of:
Copy text "[PLACEHOLDER_NAME]" to Notepad or something. Make your necessary modification and copy it back and paste it instead of "[PLACEHOLDER_NAME]" in your word file, this way your whole "[PLACEHOLDER_NAME]" will be replaced with new text avoiding splitting of XWPFRuns.
Select "[PLACEHOLDER_NAME]" and then click of MS-Word "Replace" option and Replace with "[Your-new-edited-placeholder]" and this will guarantee that your new placeholder will consume a single XWPFRun.
If you have to change your new placeholder again, follow step 1 or 2.
Here is the java code to fix that separate text line issue. It will also handle the mult-format string replacement.
public static void replaceString(XWPFDocument doc, String search, String replace) throws Exception{
for (XWPFParagraph p : doc.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
List<Integer> group = new ArrayList<Integer>();
if (runs != null) {
String groupText = search;
for (int i=0 ; i<runs.size(); i++) {
XWPFRun r = runs.get(i);
String text = r.getText(0);
if (text != null)
if(text.contains(search)) {
String safeToUseInReplaceAllString = Pattern.quote(search);
text = text.replaceAll(safeToUseInReplaceAllString, replace);
r.setText(text, 0);
}
else if(groupText.startsWith(text)){
group.add(i);
groupText = groupText.substring(text.length());
if(groupText.isEmpty()){
runs.get(group.get(0)).setText(replace, 0);
for(int j = 1; j<group.size(); j++){
p.removeRun(group.get(j));
}
group.clear();
groupText = search;
}
}else{
group.clear();
groupText = search;
}
}
}
}
for (XWPFTable tbl : doc.getTables()) {
for (XWPFTableRow row : tbl.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph p : cell.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text.contains(search)) {
String safeToUseInReplaceAllString = Pattern.quote(search);
text = text.replaceAll(safeToUseInReplaceAllString, replace);
r.setText(text);
}
}
}
}
}
}
}
For me it didn't work as I expected (every time). In my case I used "${PLACEHOLDER} in the text. At first we need to take a look how Apache Poi recognize each Paragraph which we want to iterate through with Runs. If you go deeper with docx file construction you will know that one run is a sequence of characters of text with the same font style/font size/colour/bold/italic etc. That way placeholder sometimes was divided into parts OR sometimes whole paragraph was recognized as a one Run and it was impossible to iterate through words. What I did is to bold placeholder name in a template document. Than when iterating through RUN I was able to iterate through whole placeholder name ${PLACEHOLDER}. When I replaced that value with
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("originalText")) {
text = text.replace("originalText", "newText");
r.setText(text,0);
}
}
I've added just r.isBold(false); after setText.
That way placeholder is recognized as a different run -> I'm able to replace specific placeholder, and in the processed document I have no bolding, just a plain text. For me one of a additional advantage was that visualy I'm able to faster find placeholders in text.
So finally above loop looks like that:
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("originalText")) {
text = text.replace("originalText", "newText");
r.setText(text,0);
r.isBold(false);
}
}
I hope it will help to someone, while I spend too much time for that :)
I also had this issue few days ago and I couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This is working fine for me and it's seen like a single XWPFRun object.
To be sure that a word will be consider as a single XWPFRun,
You can use merge_field as variable in word like that
Place cursor on the word you want to be a single run.
Press CTRL and F9 together and { } in gray will appear.
Right-click on the { } field and select Edit Field.
In pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
Click OK.

Resources