Structure of docx field

Structure of docx field - xsd

A field in docx is represented this way.
<w:r>
<w:fldChar w:fldCharType="begin"/>
</w:r>
AAA
<w:r>
<w:instrText xml:space="preserve"> NOTEREF _Ref111111 \h </w:instrText>
</w:r>
BBB
<w:r>
<w:fldChar w:fldCharType="separate"/>
</w:r>
CONTENT
<w:r>
<w:fldChar w:fldCharType="end"/>
</w:r>
The field content goes in the CONTENT placeholder. My question is: can anything go in AAA or BBB, or are they always empty? I suspect the creators of this format had something in mind when they used four separator elements instead of just two, but I haven't seen any examples of this.

It's better to think of it as only three separator elements and two slots for content, which can be complex thanks to the separators.
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
LABEL
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
VALUE
<w:r><w:fldChar w:fldCharType="end"/></w:r>
So your AAA and BBB are just extra content for the LABEL.
There's an example in the spec, where LABEL is:
<w:r><w:rPr><w:b/><w:color w:val="ED1C24"/><w:u w:val="single"/></w:rPr>
<w:instrText>D</w:instrText></w:r>
<w:r><w:instrText xml:space="preserve">ATE</w:instrText></w:r>
to make the D in DATE a different style.
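To see the two slots in practice, here is a minimal Python sketch that walks the runs of one paragraph and collects everything before the "separate" marker as the instruction (LABEL) and everything after it as the result (VALUE). The helper name split_field, the sample paragraph, and the sample date value are made up for illustration, and the sketch assumes a single, non-nested field.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical sample: the instruction DATE is split across two runs,
# as in the spec example; the result text is invented.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:instrText>D</w:instrText></w:r>
  <w:r><w:instrText xml:space="preserve">ATE</w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>1/1/2024</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""


def split_field(paragraph):
    """Return (instruction, result) for the first simple, non-nested field."""
    instruction, result = [], []
    state = None  # last fldCharType seen: None, 'begin' or 'separate'
    for run in paragraph.iter(f'{{{W}}}r'):
        marker = run.find(f'{{{W}}}fldChar')
        if marker is not None:
            state = marker.get(f'{{{W}}}fldCharType')
            if state == 'end':
                break
            continue
        if state == 'begin':
            # Every run between "begin" and "separate" adds to the instruction.
            instruction += [t.text or '' for t in run.iter(f'{{{W}}}instrText')]
        elif state == 'separate':
            # Every run between "separate" and "end" adds to the result.
            result += [t.text or '' for t in run.iter(f'{{{W}}}t')]
    return ''.join(instruction), ''.join(result)


instruction, result = split_field(ET.fromstring(SAMPLE))
```

Concatenating the instrText pieces is what lets the instruction be split across differently styled runs.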

Related

Excel - parent-child from a list of values

I have a list of values (column 1) that contains multiple parent and child levels, and I need to put the parent of each cell in another column.
Any ideas on how to do that easily in Excel?
Thanks a lot!
COLUMN-1         COLUMN-2
A
A.01             A
A.01.01          A.01
A.01.01.01       A.01.01
A.01.01.01.01    A.01.01.01
A.01.01.01.02    A.01.01.01
A.01.01.01.03    A.01.01.01
A.01.01.01.04    A.01.01.01
A.01.01.02       A.01.01
A.01.01.02.01    A.01.01.02
A.01.01.02.02    A.01.01.02
A.01.01.02.03    A.01.01.02
A.01.01.02.04    A.01.01.02
A.01.01.03       A.01.01
A.01.01.03.01    A.01.01.03
A.01.01.03.02    A.01.01.03
A.01.01.03.03    A.01.01.03
A.01.01.03.04    A.01.01.03
FINAL GOAL

Jos Woolley's approach looks likely.
For how to do it, see How can I perform a reverse string search in Excel without using VBA? (adapting it to look for . instead of spaces)
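The underlying logic is "keep everything before the last dot". This is not an Excel formula, only a small Python sketch of that rule, with a hypothetical helper name parent_of, which may help when checking the formula's results:

```python
def parent_of(code: str) -> str:
    """Return the part of code before its last '.', or '' for a top-level code."""
    head, _sep, _tail = code.rpartition('.')
    return head


codes = ['A', 'A.01', 'A.01.01', 'A.01.01.01', 'A.01.01.01.01']
parents = [parent_of(c) for c in codes]
```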

Netsuite Custom Field with REGEXP_REPLACE to strip HTML code except carriage return

I have a custom field with some HTML code in it:
<h1>A H1 Heading</h1>
<h2>A H2 Heading</h2>
<b>Rich Text</b><br>
fsdfafsdaf df fsda f asdfa f asdfsa fa sfd<br>
<ol><li>numbered list</li><li>fgdsfsd f sa</li></ol>Another List<br>
<ul><li>bulleted</li></ul>
I also have another non-stored field where I want to display the plain text version of the above using REGEXP_REPLACE, while preserving the carriage returns/line breaks, maybe even converting <br> and <br/> to \r\n
However the patterns etc... seem to be different in NetSuite fields compared to using ?replace(...) in freemarker... and I'm terrible with remembering regexp patterns :)
Assuming the HTML text is stored in custitem_htmltext, what expression could I use as the default value of the NetSuite Text Area custom field to display the HTML code above as:
A H1 Heading
A H2 Heading
Rich Text
fsdfafsdaf df fsda f asdfa f asdfsa fa sfd
etc...
I understand the bulleted or numbered lists will look crap.
My current non-working formula is:
REGEXP_REPLACE({custitem_htmltext},'<[^<>]*>','')
I've also tried:
REGEXP_REPLACE({custitem_htmltext},'<[^>]+>','') - didn't work

When you use a Text Area type of custom field and input HTML, NetSuite seems to change the control characters ('<' and '>') to HTML entities ('&lt;' and '&gt;'). You can see this if you input the HTML and then change the field type to Long Text.
If you change both fields to Long Text, and re-input the data and formula, the REGEXP_REPLACE() should work as expected.
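The effect of the entity encoding on the pattern can be reproduced outside NetSuite. The sketch below uses Python's re module, not NetSuite's REGEXP_REPLACE, so it only illustrates the order of operations: decode the entities, turn <br> and <br/> into line breaks, then strip the remaining tags. The function name html_to_text is made up for the example.

```python
import html
import re


def html_to_text(raw: str) -> str:
    """Decode entities, convert <br> to CRLF, then drop all other tags."""
    text = html.unescape(raw)  # '&lt;b&gt;' becomes '<b>'
    text = re.sub(r'<br\s*/?>', '\r\n', text, flags=re.IGNORECASE)
    return re.sub(r'<[^<>]*>', '', text)


encoded = '&lt;b&gt;Rich Text&lt;/b&gt;&lt;br&gt;second line&lt;br/&gt;'
plain = html_to_text(encoded)

# Without the decoding step the tag pattern finds nothing to remove.
untouched = re.sub(r'<[^<>]*>', '', encoded)
```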

From what I have learned recently, NetSuite HTML-encodes data by default, so < becomes &lt; and > becomes &gt;.
Try using triple handlebars e.g. {{{custitem_htmltext}}}
https://docs.celigo.com/hc/en-us/articles/360038856752-Handlebars-syntax
This should stop the default behaviour and allow you to use in a formula/saved search.

obtain en-US title tag text

I'm trying to obtain the text in only the title#lang=en-US elements in an XML file.
This code obtains all the title text for all languages.
entries = root.xpath('//prefix:new-item', namespaces={'prefix': 'http://mynamespace'})
for entry in entries:
    all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
    for title in all_titles:
        print(title.text)
I tried this code to get the title#lang=en-US text, but it does not work.
all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
for title in all_titles:
    test = title.xpath("#lang='en-US'")
    print(test)
How do I obtain the text for only the English-language items?

The expression
//prefix:title[lang('en')]
will select all the English-language titles. Specifically:
title elements that have an xml:lang attribute identifying the title as English, for example <title xml:lang="en-US"> or <title xml:lang="en-GB">
title elements within some container that identifies all the contents as English, for example <section xml:lang="en-US"><title/></section>.
If you specifically want only US English titles, excluding other forms of English, then you can use the predicate [lang('en-US')].
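Applied to the lxml code from the question, the predicate might be used as in the sketch below. The sample document is invented for illustration (the question does not show the real XML), and it assumes the language is given by an xml:lang attribute on each title.

```python
from lxml import etree

NS = {'prefix': 'http://mynamespace'}

# Hypothetical sample data; the real file's structure may differ.
SAMPLE = b"""<root xmlns="http://mynamespace">
  <new-item>
    <title xml:lang="en-US">Color</title>
    <title xml:lang="en-GB">Colour</title>
    <title xml:lang="fr-FR">Couleur</title>
  </new-item>
</root>"""

root = etree.fromstring(SAMPLE)

# US English only
us_titles = [t.text for t in root.xpath(
    "//prefix:new-item/prefix:title[lang('en-US')]", namespaces=NS)]

# Any variety of English
english_titles = [t.text for t in root.xpath(
    "//prefix:new-item/prefix:title[lang('en')]", namespaces=NS)]
```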

docx4j reports differences on unchanged table data

I have created a *.docx file with a 2x2 table, each cell containing the text Cell x-y where x=row number and y=column number.
When I pass this document through a simple transformation process, docx4j's Differencer.diff() method reports no differences (i.e. no w:ins or w:del tags).
This is expected and handled cleanly, in spite of the fact that the .docx has the text of the original document broken up like this inside the <w:tc> -> <w:p> tags:
<w:r>
<w:t>Cell</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> 1-1</w:t>
</w:r>
and this in the transformed document:
<w:r>
<w:t xml:space="preserve">Cell 1-1</w:t>
</w:r>
However, if I add the text "Table Title" above the table, the text of each cell in the original document merges into a single <w:r> (this is Word's handling; there is nothing I can do about it):
<w:r>
<w:t>Cell 1-1</w:t>
</w:r>
And the only difference in the transformed document is that xml:space="preserve" is inserted:
<w:r>
<w:t xml:space="preserve">Cell 1-1</w:t>
</w:r>
However, docx4j's Differencer.diff() method now reports that the content of each cell is inserted, and shows the following as the content of each w:tc's w:p in the generated diff document:
<w:ins xmlns:xalan="http://xml.apache.org/xalan" xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage" w:date="2009-03-11T17:57:00Z" w:author="someone" w:id="1">
<w:r>
<w:t xml:space="preserve">Cell 1-1</w:t>
</w:r>
</w:ins>
and shows the content of each cell as deleted, immediately following the closing <w:tbl> tag:
<!--Handling simple deleted w:p-->
<w:p xmlns:xalan="http://xml.apache.org/xalan" xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
<w:del w:date="2009-03-11T17:57:00Z" w:author="someone" w:id="5">
<w:r>
<w:delText>Cell 1-1</w:delText>
</w:r>
</w:del>
</w:p>
I know that the Differencer is capable of ignoring the xml:space="preserve" attributes because it does so with the inserted text before the table, so I doubt that's the cause.
Are these table scenarios outside the intended use case for the Differencer? Is it an error in usage / invocation? Bug?
Any guidance is appreciated.

XSD: how to define type with (1) attributes, (2) nested elements and (3) plain text content?

Is it possible to define an element like HTML's "font" tag, which can contain all three kinds of content?
For example, I can write
<font size=3>This is <b>the</b> text</font>
How can I define in XSD that font can contain:
1) attribute size
2) nested element B
3) text around it
?
Thanks

Define the type with mixed content: set mixed="true" on its xs:complexType.
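A mixed-content declaration for this case could look like the schema embedded in the sketch below, which uses lxml to check it against the example from the question. The element and attribute types (xs:string for b, xs:integer for size) are assumptions, and the size value has to be quoted for the instance to be well-formed XML.

```python
from lxml import etree

# mixed="true" allows character data between the child elements.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="font">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="b" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="size" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

doc = etree.fromstring(b'<font size="3">This is <b>the</b> text</font>')
is_valid = schema.validate(doc)

# An element the schema does not declare is still rejected.
bad = etree.fromstring(b'<font size="3">This is <i>the</i> text</font>')
bad_is_valid = schema.validate(bad)
```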
