Extract wrong paragraph direction in Word using Apache POI library

Extract wrong paragraph direction in Word using Apache POI library - apache-poi

I need to extract the paragraph directions in Word documents (both .doc and .docx).
For .doc documents, I use paragraph.cloneProperties().getFBiDi to get the directional information for paragraphs.
For .docx documents, I use paragraph.getCTP != null && paragraph.getCTP.isSetPPr && paragraph.getCTP.getPPr.isSetBidi.
When HWPFDocument or XWPFDocument reads Arabic/Hebrew documents, I would expect that getFBiDi/isSetBiDi returns TRUE for the RTL paragraphs. However, it returns FALSE that represents "LTR" direction for some documents see the example.
Did I use the wrong API to get the paragraph directions?

Use below lines to set direction:
if ( par.getCTP().getPPr()==null ) {
par.getCTP().addNewPPr();
}
if ( par.getCTP().getPPr().getBidi()==null ) {
par.getCTP().getPPr().addNewBidi();
}
par.getCTP().getPPr().getBidi().setVal(STOnOff.ON); //or STOnOff.OFF

Related

Editing XML ProtectedString CDATA

Third time posting on StackoverFlow so I apologize for any mistakes. :)
I am trying to edit the Character Data of ProtectedString
<ProtectedString name="Source"><![CDATA[print("CHANGEME")]]><ProtectedString>
It prints on the output
<![CDATA[print("CHANGEME")]]>
But does not save.
Code so far:
fs.readFile('./1.rbxlx', 'utf8', function(err, data) {
var result = xmljs.xml2js(data, {
compact: true,
spaces: 2
})
for (var i = 17; i <= 17; i++) {
result.roblox["Item"][0]["Item"][i].Properties.ProtectedString = '<![CDATA[print("CHANGED")]]>'
console.log(result.roblox["Item"][0]["Item"][i].Properties.ProtectedString)
}
fs.writeFile("./1.rbxlx", xmljs.js2xml(result, {
compact: true,
spaces: 4,
fullTagEmptyElement: true
}), function(err, data) {
if (err) console.error(err)
})
})
Thank you in advance :)

CDATA does not really exist.
It's a way to represent a text value during serialization. "Serialization" means "converting a complex data structure to text" - and "XML" is the representation of a tree of nodes. The tree of nodes (a.k.a. the document) is the important thing here, "XML" is nothing more than a container format.
CDATA is not part of the document tree. After you read a file with an XML parser (such as xml2js), CDATA will be gone, just like all the angle brackets and the quotes from the XML are gone - what remains is the value it stood for:
console.log(result.roblox["Item"][0]["Item"][i].Properties.ProtectedString);
This will log the actual text value:
print("CHANGEME")
So you need to update the actual text value:
result.roblox["Item"][0]["Item"][i].Properties.ProtectedString = 'print("CHANGED")';
When the document gets converted to XML again ("serialized"), there probably won't be a <![CDATA[...]]> wrapper around the value anymore. That's a minor detail that you can safely ignore. CDATA is completely optional and XML does not need it to function.
The component that writes the XML (the serializer) decides whether it wants to use CDATA for a text value or not. In xml2js, this component is the Builder. You can tell the Builder to use CDATA via an option, but your control over where it does it is limited:
cdata (default: false): wrap text nodes in <![CDATA[ ... ]]> instead of escaping when necessary. Does not add <![CDATA[ ... ]]>
if it is not required. Added in 0.4.5.
It might even add CDATA where there was none before.
Overall: Don't worry about it. The text values will be correct with or without CDATA. Don't build software or processes that depend on CDATA being there, that's always a mistake.

Gradle: How to filter and search through text?

I'm fairly new to gradle. How do I filter text in the following manner?
Pretend that the output/result I want to filter will be the two URLs below.
"http://localhost/artifactory/appNameIwant/moreStuffHereThatsDynamic"
> I want this URL
"http://localhost/artifactory/differentAppName"
> I don't want this URL
I want to put up a "match" variable that would be something like
variable = http://localhost/artifactory/appnameIwant
So essentially, the string will not be a perfect match. I want it to filter and provide back any URLs that start with the variable listed above. It cannot be a perfect match as the characters after the /appnameIwant/ will be changing.
I want to use a for loop to cycle through an array, with an if then statement to return any matches. For instance.
for (i=0; i < results.length; i++){
if (results[i] strings matches (http://localhost/artifactory/appnameIwant) {
return results[i] }
I am just filtering the URL strings themselves, not anything complicated inside the webpages.
Let me know if further explanation would be helpful.
Thanks so much for your time and help!

I figured it out - I just used
if (string.startsWith"texthere")) {println string}
A lot easier than I thought!

Seperated text line in Apache POI XWPFRun object

I 'm trying to replace a template DOCX document with Apache POI by using the XWPFDocument class. I have tags in the doc and a JSON file to read the replacement data. My problem is that a text line seems separated in a certain way in DOCX when I change its extension to ZIP file and open document.xml. For example [MEMBER_CONTACT_INFO] text becomes [MEMBER_CONTACT_INFO and ] separately. POI reads this in the same way since the DOCX original is like this. This creates 2 XWPFRun objects in the paragraph which show the text as [MEMBER_CONTACT_INFO and ] separately.
My question is, is there a way to force POI to run like Word via merging related runs or something like that? Or how can I solve this problem? I 'm matching run texts while replacing and I can't find my tag because it is split into 2 different run object.
Best

This wasted so much of my time once...
Basically, an XWPFParagraph is composed of multiple XWPFRuns, and XWPFRun is a contagious text that has a fixed same style.
So when you try writing something like "[PLACEHOLDER_NAME]" in MS-Word it will create a single XWPFRun. But if you somehow add a few things more, and then you go back and change "[PLACEHOLDER_NAME]" to something else it is never guaranteed that it will remain a single XWPFRun it is quite possible that it will split to two Runs. AFAIK this is how MS-Word works.
How to avoid splitting of Runs in such cases?
Solution: There are two solutions that I know of:
Copy text "[PLACEHOLDER_NAME]" to Notepad or something. Make your necessary modification and copy it back and paste it instead of "[PLACEHOLDER_NAME]" in your word file, this way your whole "[PLACEHOLDER_NAME]" will be replaced with new text avoiding splitting of XWPFRuns.
Select "[PLACEHOLDER_NAME]" and then click of MS-Word "Replace" option and Replace with "[Your-new-edited-placeholder]" and this will guarantee that your new placeholder will consume a single XWPFRun.
If you have to change your new placeholder again, follow step 1 or 2.

Here is the java code to fix that separate text line issue. It will also handle the mult-format string replacement.
public static void replaceString(XWPFDocument doc, String search, String replace) throws Exception{
for (XWPFParagraph p : doc.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
List<Integer> group = new ArrayList<Integer>();
if (runs != null) {
String groupText = search;
for (int i=0 ; i<runs.size(); i++) {
XWPFRun r = runs.get(i);
String text = r.getText(0);
if (text != null)
if(text.contains(search)) {
String safeToUseInReplaceAllString = Pattern.quote(search);
text = text.replaceAll(safeToUseInReplaceAllString, replace);
r.setText(text, 0);
}
else if(groupText.startsWith(text)){
group.add(i);
groupText = groupText.substring(text.length());
if(groupText.isEmpty()){
runs.get(group.get(0)).setText(replace, 0);
for(int j = 1; j<group.size(); j++){
p.removeRun(group.get(j));
}
group.clear();
groupText = search;
}
}else{
group.clear();
groupText = search;
}
}
}
}
for (XWPFTable tbl : doc.getTables()) {
for (XWPFTableRow row : tbl.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph p : cell.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text.contains(search)) {
String safeToUseInReplaceAllString = Pattern.quote(search);
text = text.replaceAll(safeToUseInReplaceAllString, replace);
r.setText(text);
}
}
}
}
}
}
}

For me it didn't work as I expected (every time). In my case I used "${PLACEHOLDER} in the text. At first we need to take a look how Apache Poi recognize each Paragraph which we want to iterate through with Runs. If you go deeper with docx file construction you will know that one run is a sequence of characters of text with the same font style/font size/colour/bold/italic etc. That way placeholder sometimes was divided into parts OR sometimes whole paragraph was recognized as a one Run and it was impossible to iterate through words. What I did is to bold placeholder name in a template document. Than when iterating through RUN I was able to iterate through whole placeholder name ${PLACEHOLDER}. When I replaced that value with
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("originalText")) {
text = text.replace("originalText", "newText");
r.setText(text,0);
}
}
I've added just r.isBold(false); after setText.
That way placeholder is recognized as a different run -> I'm able to replace specific placeholder, and in the processed document I have no bolding, just a plain text. For me one of a additional advantage was that visualy I'm able to faster find placeholders in text.
So finally above loop looks like that:
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("originalText")) {
text = text.replace("originalText", "newText");
r.setText(text,0);
r.isBold(false);
}
}
I hope it will help to someone, while I spend too much time for that :)

I also had this issue few days ago and I couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This is working fine for me and it's seen like a single XWPFRun object.

To be sure that a word will be consider as a single XWPFRun,
You can use merge_field as variable in word like that
Place cursor on the word you want to be a single run.
Press CTRL and F9 together and { } in gray will appear.
Right-click on the { } field and select Edit Field.
In pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
Click OK.

How can I test if a CK Editor field is empty

I want to test if a CKEditor ( Rich Text ) field is empty as part of some business logic.
I do not want to use the built in validation features.
If a CK Editor field has previously had text and then this text is deleted there is still content e.g.
<p dir="ltr">
</p>
I can get a handle to this text string using :
dataVar = xspdoc.getDocument().getMIMEEntity(dataNamevar).getContentAsText();
Is there a way to test if the CKEditor field is empty of visible text ?

Technically speaking, if it has what amounts to a a single visible newline in it as you've shown in your question, it isn't really "empty".
Realistically, you'll have to parse the content value to find out if there is content that is not either inside tags or the few special characters like and so on.
I tend to do this in js, if I have to, by taking the whole string of text and splitting it into an array based on "<" then taking each element of the array and removing an text to the left of an ">", then trim. That leaves me an array of either empty strings or text that is outside any tags. From there it's easy enough check for any of strings in the array to see if they are not empty, and not " ".
This may be more cumbersome then some built in parser that I don't know, but it's fairly reliable and quick. (and a very similar method can be used in formula language as well).
In ssjs formula you could:
var checkString = #trim(#replacesubstring(#implode( #trim (#right( #explode( sourceHTMLstring , "<" ) , ">" ) ) , " "), " " , ""));
if(checkstring == "") {
// *** You have no content
} else {
// *** you have content
}
Obviously this could be done just as easily in pure javascript, but the old formula language is so ingrained in my head, I'd go this way just out of habit.
** Also note: You may want to check for an <img> tag in there somewhere in case someone has done absolutely nothing other than put an image in the rich text.

CKEditor has its own API, I guess this is the right method to use:
http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.editor.html#getData

This might be helpful: http://xpagetips.blogspot.com/2011/10/be-careful-with-empty-ckeditor-rich.html

Check if CKEditor is empty
For any browser
var editor=CKEDITOR.instances.editorName.getData();

I found best answer for this
function validateCKEDITORforBlank(ckData)
{
ckData = ckData.replace(/<[^>]*>|\s/g, '');
var vArray = new Array();
vArray = ckData.split(" ");
var vFlag = 0;
for(var i=0;i<vArray.length;i++)
{
if(vArray[i] == '' || vArray[i] == "")
{
continue;
}
else
{
vFlag = 1;
break;
}
}
if(vFlag == 0)
{
return true;
}
else
{
return false;
}
}
Link

Detect a change in a rich text field's value in SPItemEventReceiver?

I currently have an Event Receiver that is attached to a custom list. My current requirement is to implement column level security for a Rich Text field (Multiple lines of text with enhanced rich text).
According to this post[webarchive], I can get the field's before and after values like so:
object oBefore = properties.ListItem[f.InternalName];
object oAfter = properties.AfterProperties[f.InternalName];
The problem is that I'm running to issues comparing these two values, which lead to false positives (code is detecting a change when there wasn't one).
Exhibit A: Using ToString on both objects
oBefore.ToString()
<div class=ExternalClass271E860C95FF42C6902BE21043F01572>
<p class=MsoNormal style="margin:0in 0in 0pt">Text.
</div>
oAfter.ToString()
<DIV class=ExternalClass271E860C95FF42C6902BE21043F01572>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt">Text.
</DIV>
Problems?
HTML tags are capitalized
Random spaces (see the additional space after margin:)
Using GetFieldValueForEdit or GetFieldValueAsHTML seem to result in the same values.
"OK," you say, so lets just compare the plain text values.
Exhibit B: Using GetFieldValueAsText
Fortunately, this method strips all of the HTML tags out of the value and only plain text is displayed. However, using this method led me to discover additional issues with whitespace characters:
In the before value:
Sometimes there are additional newline characters.
Sometimes spaces are displayed as non-breaking spaces (ASCII char code 160)
Question:
How can I detect if the user changed a rich text field in an event receiver?
[Ideal] Detect any change to HTML or text or white space
[Acceptable] Detect changes to text or white space
[Not so good] Detect changes to text characters only (strip all non-alphanumeric characters)

What happens if you set the ListItem field with the new value and read it back out? Does that give the same formatting?
object oBefore = properties.ListItem[f.InternalName];
properties.ListItem[f.InternalName] = properties.AfterProperties[f.InternalName]
object oAfter = properties.ListItem[f.InternalName];
//dont update
properties.ListItem[f.InternalName] = oBefore;

I would probably try something between choices 2 and 3:
bool changed =
valueAsTextBefore != valueAsTextAfter ||
0 != string.Compare(
oBefore.ToString().Replace(" ", ""),
oAfter.ToString().Replace(" ", ""),
true);
The left half checks if the text (including case) has changed while the right half checks if the tags or attributes have changed. Very kludgy, but should fit your case.
The only other thing I can think of is to run an XML transform on the HTML in order to standardize on case and spacing. But not only does that seem like overkill, but it assumes the HTML will always be well formed.

I'm currently testing a combination approach: GetFieldValueAsText and then stripping out all characters except alphanumeric/punctuation:
static string GetRichTextValue(string value)
{
if (null == value)
{
return string.Empty;
}
StringBuilder sb = new StringBuilder(value.Length);
foreach (char c in value)
{
if (char.IsLetterOrDigit(c) || char.IsPunctuation(c))
{
sb.Append(c);
}
}
return sb.ToString();
}
This only detects changes to the text of a rich text field but seems to work consistently.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Extract wrong paragraph direction in Word using Apache POI library - apache-poi

Use below lines to set direction: if ( par.getCTP().getPPr()==null ) { par.getCTP().addNewPPr(); } if ( par.getCTP().getPPr().getBidi()==null ) { par.getCTP().getPPr().addNewBidi(); } par.getCTP().getPPr().getBidi().setVal(STOnOff.ON); //or STOnOff.OFF

Related

Editing XML ProtectedString CDATA

Gradle: How to filter and search through text?

Seperated text line in Apache POI XWPFRun object

How can I test if a CK Editor field is empty

Detect a change in a rich text field's value in SPItemEventReceiver?

Categories

Resources