I'm trying to fill in a template DOCX document with Apache POI, using the XWPFDocument class. I have tags in the document and a JSON file that holds the replacement data. My problem is that a line of text is sometimes split up inside the DOCX, which I can see when I change the file extension to ZIP and open document.xml. For example, the text [MEMBER_CONTACT_INFO] is stored as [MEMBER_CONTACT_INFO and ] separately. POI reads it the same way, because that is how the original DOCX stores it, so the paragraph ends up with 2 XWPFRun objects whose texts are [MEMBER_CONTACT_INFO and ].
My question is: is there a way to force POI to behave like Word, by merging related runs or something like that? Or how else can I solve this problem? I'm matching run texts while replacing, and I can't find my tag because it is split across 2 different run objects.
This wasted so much of my time once...
Basically, an XWPFParagraph is composed of multiple XWPFRuns, and an XWPFRun is a contiguous piece of text with one fixed style.
So when you type something like "[PLACEHOLDER_NAME]" in MS Word, it will create a single XWPFRun. But if you then add a few more things and later go back and change "[PLACEHOLDER_NAME]" to something else, it is not guaranteed to remain a single XWPFRun; it may well be split into two runs. AFAIK this is how MS Word works.
How to avoid splitting of Runs in such cases?
Solution: There are two solutions that I know of:
1. Copy the text "[PLACEHOLDER_NAME]" to Notepad or a similar plain-text editor. Make the necessary modification there, then copy it back and paste it over "[PLACEHOLDER_NAME]" in your Word file. This way the whole "[PLACEHOLDER_NAME]" is replaced with the new text in one go, which avoids splitting it into several XWPFRuns.
2. Select "[PLACEHOLDER_NAME]", click the MS Word "Replace" option and replace it with "[Your-new-edited-placeholder]". This makes sure that your new placeholder occupies a single XWPFRun.
If you have to change your new placeholder again, follow step 1 or 2.
Here is the Java code to fix that split text line issue. It also handles a placeholder that is spread over runs with different formatting.
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.*;

public static void replaceString(XWPFDocument doc, String search, String replace) {
    // Paragraphs in the document body
    for (XWPFParagraph p : doc.getParagraphs()) {
        replaceInParagraph(p, search, replace);
    }
    // Paragraphs inside table cells get the same treatment
    for (XWPFTable tbl : doc.getTables()) {
        for (XWPFTableRow row : tbl.getRows()) {
            for (XWPFTableCell cell : row.getTableCells()) {
                for (XWPFParagraph p : cell.getParagraphs()) {
                    replaceInParagraph(p, search, replace);
                }
            }
        }
    }
}

// Note: a split tag is only found if it starts at the beginning of a run
// and ends at the end of a run.
private static void replaceInParagraph(XWPFParagraph p, String search, String replace) {
    List<XWPFRun> runs = p.getRuns();
    List<Integer> group = new ArrayList<Integer>(); // indexes of the runs that together spell the tag
    String groupText = search;                      // part of the tag that still has to be matched
    for (int i = 0; i < runs.size(); i++) {
        XWPFRun r = runs.get(i);
        String text = r.getText(0);
        if (text == null || text.isEmpty()) {
            continue;
        }
        if (text.contains(search)) {
            // The whole tag sits in one run; String.replace needs no regex quoting
            r.setText(text.replace(search, replace), 0);
            group.clear();
            groupText = search;
        } else if (groupText.startsWith(text)) {
            group.add(i);
            groupText = groupText.substring(text.length());
            if (groupText.isEmpty()) {
                // Put the replacement into the first run of the group
                runs.get(group.get(0)).setText(replace, 0);
                // Remove the other runs from the end so the earlier indexes stay valid
                for (int j = group.size() - 1; j >= 1; j--) {
                    p.removeRun(group.get(j));
                }
                i = group.get(0); // continue right after the run we kept
                group.clear();
                groupText = search;
            }
        } else {
            group.clear();
            groupText = search;
        }
    }
}
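Another way to look at the split-tag problem is to search the concatenated text of all runs and map the match back onto the individual runs. The following is only a sketch under assumptions: it models the runs of a paragraph as a plain List<String> (one entry per run text) instead of calling POI, and the names RunTextMerger and replaceAcrossRuns are made up for illustration. With POI you would presumably read each entry with r.getText(0) and write it back with r.setText(text, 0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunTextMerger {

    /**
     * Replaces every occurrence of search in the concatenated run texts.
     * The replacement goes into the run where the match starts; the matched
     * characters are cut out of the runs that follow. Runs are never removed,
     * so an emptied run stays in the list as "".
     */
    static List<String> replaceAcrossRuns(List<String> runTexts, String search, String replace) {
        List<String> runs = new ArrayList<>(runTexts);
        if (search.isEmpty()) {
            return runs;
        }
        int from = 0;
        while (true) {
            String joined = String.join("", runs);
            int start = joined.indexOf(search, from);
            if (start < 0) {
                return runs;
            }
            int end = start + search.length();
            int offset = 0;
            for (int i = 0; i < runs.size(); i++) {
                String text = runs.get(i);
                int runStart = offset;
                int runEnd = offset + text.length();
                offset = runEnd;
                if (runEnd <= start || runStart >= end) {
                    continue; // this run does not overlap the match
                }
                // Keep what lies before and after the match inside this run
                String before = text.substring(0, Math.max(start, runStart) - runStart);
                String after = text.substring(Math.min(end, runEnd) - runStart);
                boolean holdsMatchStart = runStart <= start;
                runs.set(i, before + (holdsMatchStart ? replace : "") + after);
            }
            from = start + replace.length(); // skip past the inserted text
        }
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("Dear ", "[MEMBER_CONTACT_INFO", "]", "!");
        System.out.println(replaceAcrossRuns(runs, "[MEMBER_CONTACT_INFO]", "Jane"));
    }
}
```

Unlike the code above, this also finds a tag that starts or ends in the middle of a run, and because no run is removed, the formatting of the untouched runs stays as it was.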
For me it didn't work as I expected (not every time). In my case I used "${PLACEHOLDER}" in the text. First we need to look at how Apache POI splits each paragraph into the runs we want to iterate over. If you dig into how a DOCX file is built, you will see that one run is a sequence of characters with the same font style, font size, colour, bold, italic and so on. Because of that, the placeholder was sometimes divided into parts, and sometimes a whole paragraph was recognized as one run, so it was impossible to iterate through the words. What I did was make the placeholder name bold in the template document. Then, when iterating through the runs, I got the whole placeholder name ${PLACEHOLDER} as a single run. I replaced the value with:
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("originalText")) {
text = text.replace("originalText", "newText");
r.setText(text,0);
}
}
I just added r.setBold(false); after setText.
That way the placeholder is recognized as a separate run, so I'm able to replace that specific placeholder, and in the processed document there is no bold, just plain text. An additional advantage for me was that I can visually find the placeholders in the text faster.
So finally the loop above looks like this:
for (XWPFRun r : p.getRuns()) {
    String text = r.getText(0);
    if (text != null && text.contains("originalText")) {
        text = text.replace("originalText", "newText");
        r.setText(text, 0);
        // isBold() only reads the flag; setBold(false) removes the bold formatting
        r.setBold(false);
    }
}
I hope it helps someone, as I spent too much time on this :)
I also had this issue a few days ago and couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This is working fine for me, and it is read as a single XWPFRun object.
To be sure that a word will be treated as a single XWPFRun, you can use a merge field as the variable in Word, like this:
1. Place the cursor on the word you want to be a single run.
2. Press CTRL and F9 together; a pair of { } on a gray background will appear.
3. Right-click the { } field and select Edit Field.
4. In the pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
5. Click OK.
I have a UITextView and I set its attributedText when the app starts, so with an empty text, because the user hasn't filled in anything yet. The problem is that with an empty string the attributes don't seem to be applied to the new text I enter. How can I do this?
As far as I know there is no straightforward solution for that. Setting attributedText with "" doesn't work.
However, you can use an easy fix:
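if let text = field.attributedText?.string, !text.isEmpty {
    // normal way: the existing attributed text carries the attributes
} else {
    // empty text carries no attributes, so set them on the view itself
    field.font = ...
    field.textColor = ...
    // sorry, no shadow and other nice tricks
}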
Of course you could implement the text field's delegate and adjust the attributes in textFieldDidChange, but that doesn't work well when typing in Chinese, where one character can be composed from multiple keystrokes, so I couldn't use that.
What is the best way to find a string (a sentence of 1-3 lines) in a multiline textfield?
I have a textfield with a list of messages. In order to change the color of every second message, I have to get the index where that message begins.
Any ideas?
I solved my problem. Maybe it will be useful for someone.
As I'm appending text, I use textfield.caretIndex to see where the inserts end. So I'm switching formats using this code:
if (i % 2 != 0) {
textfield.setTextFormat(colorFormat, lastCaret , textfield.caretIndex);
formatStart = textfield.caretIndex;
}
else {
textfield.setTextFormat(textFormat, formatStart, textfield.caretIndex);
lastCaret = textfield.caretIndex;
}
I've got a problem with importing data from a CSV file into Excel.
I've parsed data from a website, and it contains the line break tag <br>. I convert this tag into "\n" and write the data to a CSV file. However, when I import this CSV file into Excel, the line break is displayed incorrectly: it starts a new row instead of a new line within the same cell.
Has anyone faced this problem before? I'd really appreciate your suggestions.
Thanks!
Edit: Here is a sample to demonstrate my situation:
static void TestLine()
{
string sampleData = "日찬양 까페에 올린 충격적인 <br>글코리아타임스";
string formattedData = sampleData.Replace("<br>", "\n");
using (StreamWriter writer = new StreamWriter(@"C:\SampleData.csv", false, Encoding.Unicode))
{
writer.WriteLine(formattedData);
}
}
I want sampleData to be displayed in a single cell, but the result ends up in 2 cells.
It looks to me like a malformed CSV file. To handle this situation correctly, a field that contains a line break has to be enclosed in quotation marks.
UPDATE after the sample:
This works fine:
static void Main(string[] args)
{
string sampleData = "\"日찬양 까페에 올린 충격적인 <br>글코리아타임스\"";
string formattedData = sampleData.Replace("<br>", "\n");
using (StreamWriter writer = new StreamWriter(@"C:\SampleData.csv", false, Encoding.Unicode))
{
writer.WriteLine(formattedData);
}
}
You need to wrap the field in the quotation marks (")
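The quoting rule can be wrapped in a small helper. This is a sketch in Java, not the code from the question: the class name CsvField and the method quote are invented for illustration, and it assumes a comma as the delimiter and the common convention (as described in RFC 4180) that a quotation mark inside a quoted field is written twice.

```java
final class CsvField {

    /**
     * Returns the field ready to be written into a CSV line. Fields that
     * contain a comma, a quotation mark or a line break are wrapped in
     * quotation marks, and embedded quotation marks are doubled.
     */
    static String quote(String field) {
        boolean needsQuotes = field.contains(",")
                || field.contains("\"")
                || field.contains("\n")
                || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // The line break stays inside the quoted field
        String raw = "first line<br>second line".replace("<br>", "\n");
        System.out.println(quote(raw));
    }
}
```

Quoting only when needed keeps simple fields readable; quoting every field would also be valid CSV.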
Please take a look at CSV-1203 File Format Specification, in particular the sections on "End-of-Record Marker" and "Field Payload Protection". Hopefully this should give clear guidance on the inner workings of the CSV file format.
I want to test if a CKEditor ( Rich Text ) field is empty as part of some business logic.
I do not want to use the built in validation features.
If a CK Editor field has previously had text and then this text is deleted there is still content e.g.
<p dir="ltr">
</p>
I can get a handle to this text string using :
dataVar = xspdoc.getDocument().getMIMEEntity(dataNamevar).getContentAsText();
Is there a way to test if the CKEditor field is empty of visible text ?
Technically speaking, if it has what amounts to a single visible newline in it, as you've shown in your question, it isn't really "empty".
Realistically, you'll have to parse the content value to find out whether there is content that is neither inside tags nor one of the few special character entities like &nbsp; and so on.
I tend to do this in JS, if I have to, by taking the whole string of text and splitting it into an array on "<", then taking each element of the array, removing any text to the left of a ">", and trimming. That leaves me an array of either empty strings or text that is outside any tags. From there it's easy enough to check whether any of the strings in the array is neither empty nor "&nbsp;".
This may be more cumbersome than some built-in parser that I don't know of, but it's fairly reliable and quick (and a very similar method can be used in formula language as well).
In SSJS, using the formula functions, you could do:
var checkString = @Trim(@ReplaceSubstring(@Implode(@Trim(@Right(@Explode(sourceHTMLstring, "<"), ">")), " "), " ", ""));
if (checkString == "") {
    // *** You have no content
} else {
    // *** You have content
}
Obviously this could be done just as easily in pure javascript, but the old formula language is so ingrained in my head, I'd go this way just out of habit.
** Also note: You may want to check for an <img> tag in there somewhere in case someone has done absolutely nothing other than put an image in the rich text.
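Since Java is available on the server side as well, the same idea can be sketched there. This is a rough illustration under assumptions, not a tested XPages snippet: the class RichTextCheck and the method hasVisibleContent are hypothetical names, the tag stripping uses a simple regular expression rather than a real HTML parser, and only the &nbsp; entity and the non-breaking space character are treated as invisible.

```java
import java.util.Locale;

final class RichTextCheck {

    /**
     * Returns true if the HTML contains an image or any text outside of
     * tags that is not just whitespace or non-breaking spaces.
     */
    static boolean hasVisibleContent(String html) {
        if (html == null) {
            return false;
        }
        // An image counts as content even when there is no text at all
        if (html.toLowerCase(Locale.ROOT).contains("<img")) {
            return true;
        }
        String text = html.replaceAll("<[^>]*>", "") // drop all tags
                .replace("&nbsp;", "")               // drop the nbsp entity
                .replace("\u00A0", "")               // drop the nbsp character
                .replaceAll("\\s+", "");             // drop ordinary whitespace
        return !text.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleContent("<p dir=\"ltr\">\n</p>"));
        System.out.println(hasVisibleContent("<p>Hello</p>"));
    }
}
```

Other entities or tags without text (for example a table or a horizontal rule) would need their own rules if they should count as content.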
CKEditor has its own API, I guess this is the right method to use:
http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.editor.html#getData
This might be helpful: http://xpagetips.blogspot.com/2011/10/be-careful-with-empty-ckeditor-rich.html
Check if CKEditor is empty
For any browser
var editor=CKEDITOR.instances.editorName.getData();
I found best answer for this
function validateCKEDITORforBlank(ckData)
{
    // Strip every tag and all whitespace; whatever remains is visible text
    var visibleText = ckData.replace(/<[^>]*>|\s/g, '');
    // true means the editor is blank
    return visibleText.length === 0;
}
I currently have an Event Receiver that is attached to a custom list. My current requirement is to implement column level security for a Rich Text field (Multiple lines of text with enhanced rich text).
According to this post[webarchive], I can get the field's before and after values like so:
object oBefore = properties.ListItem[f.InternalName];
object oAfter = properties.AfterProperties[f.InternalName];
The problem is that I'm running into issues comparing these two values, which leads to false positives (the code detects a change when there wasn't one).
Exhibit A: Using ToString on both objects
oBefore.ToString()
<div class=ExternalClass271E860C95FF42C6902BE21043F01572>
<p class=MsoNormal style="margin:0in 0in 0pt">Text.
</div>
oAfter.ToString()
<DIV class=ExternalClass271E860C95FF42C6902BE21043F01572>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt">Text.
</DIV>
Problems?
HTML tags are capitalized
Random spaces (see the additional space after margin:)
Using GetFieldValueForEdit or GetFieldValueAsHTML seems to result in the same values.
"OK," you say, so let's just compare the plain text values.
Exhibit B: Using GetFieldValueAsText
Fortunately, this method strips all of the HTML tags out of the value and only plain text is displayed. However, using this method led me to discover additional issues with whitespace characters:
In the before value:
Sometimes there are additional newline characters.
Sometimes spaces are displayed as non-breaking spaces (character code 160, U+00A0)
Question:
How can I detect if the user changed a rich text field in an event receiver?
[Ideal] Detect any change to HTML or text or white space
[Acceptable] Detect changes to text or white space
[Not so good] Detect changes to text characters only (strip all non-alphanumeric characters)
What happens if you set the ListItem field with the new value and read it back out? Does that give the same formatting?
object oBefore = properties.ListItem[f.InternalName];
properties.ListItem[f.InternalName] = properties.AfterProperties[f.InternalName];
object oAfter = properties.ListItem[f.InternalName];
// don't call Update(); restore the original value instead
properties.ListItem[f.InternalName] = oBefore;
I would probably try something between choices 2 and 3:
bool changed =
valueAsTextBefore != valueAsTextAfter ||
0 != string.Compare(
oBefore.ToString().Replace(" ", ""),
oAfter.ToString().Replace(" ", ""),
true);
The left half checks if the text (including case) has changed while the right half checks if the tags or attributes have changed. Very kludgy, but should fit your case.
The only other thing I can think of is to run an XML transform on the HTML in order to standardize case and spacing. But that not only seems like overkill, it also assumes the HTML will always be well formed.
I'm currently testing a combination approach: GetFieldValueAsText and then stripping out all characters except alphanumeric/punctuation:
static string GetRichTextValue(string value)
{
if (null == value)
{
return string.Empty;
}
StringBuilder sb = new StringBuilder(value.Length);
foreach (char c in value)
{
if (char.IsLetterOrDigit(c) || char.IsPunctuation(c))
{
sb.Append(c);
}
}
return sb.ToString();
}
This only detects changes to the text of a rich text field but seems to work consistently.
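If changes to spacing between words should still count, a milder normalization might be enough: map non-breaking spaces to ordinary spaces, collapse every run of whitespace into one space, and trim. The sketch below shows the idea in Java; the names PlainTextNormalizer, normalize and changed are hypothetical, and it assumes the inputs are the plain-text values from GetFieldValueAsText. In C#, string.Replace and Regex.Replace would be the counterparts.

```java
final class PlainTextNormalizer {

    /** Maps nbsp to space, collapses whitespace runs and trims the ends. */
    static String normalize(String value) {
        if (value == null) {
            return "";
        }
        return value.replace('\u00A0', ' ')
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** True if the two values still differ after normalization. */
    static boolean changed(String before, String after) {
        return !normalize(before).equals(normalize(after));
    }

    public static void main(String[] args) {
        // Extra newlines and a non-breaking space in the "before" value
        System.out.println(changed("Some\u00A0text.\n\n", "Some text."));
    }
}
```

This ignores the extra newlines and non-breaking spaces described in the question, but it cannot tell one space from several, so it sits between the "acceptable" and "not so good" options.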