I have a situation in which an XML document has information in varying depth (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element that's containing text contains other elements that in turn contain text, I need to flatten/concatenate it into one string/sentence. The nested elements shall not be visited again if this is done.
Using Python's lxml library and applying the tostring function works OK if the source XML is pretty-printed, because I can then split the concatenated string on newlines to get each sentence. If the source isn't pretty-printed and sits on one single line, there are no newlines to split on.
I have tried the iter function and applying xpaths to each node, but this often renders other results in Python than what I get when applying the xpath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file, and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot get Saxon to resolve (such as &nbsp;). I have no problem parsing files with lxml, so I lean towards a cleaner Python solution.
lxml doesn't seem to have xpath support that can give me all nodes with text that contain one or more children containing text, and all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no white space, and multiple sentences/strings if the source had been pretty-printed. This doesn't feel OK.
What are my options? Use XSLT 1.0 in Python, other parsers to get a better handle on where I am in the tree, ...
Just to reiterate the issue here; I am looking for a generic way to extract text, and the only rules to the XML source are that a sentence may be built from an element with child elements with text, but there won't be additional levels. The other possibility is the simple element, but this one cannot be included in a parent element with text since this is included in the first rule.
Help/thoughts are appreciated.
This is downright ugly code, a hasty hack with no real thought given to form, beauty or finesse. All I am after is one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me a better way to do this.
The problem has been to write xpath expressions that get me all elements with text content, and then to act upon them depending on their context. All my xpath expressions have given me the correct nodes, but also a root or ancestor that pulled in a more or less complete string at the beginning, so I gave up on those. My xpath works as it should in XSLT, but not in Python - don't know why...
I had to resort to regex to find nodes that contain strings that are not whitespace only.
Using lxml with xpath and tostring gives different results depending on how the source XML is formatted, so I had to get around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
</c> and back to b.</b>
</a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

# EXSLT regular-expression namespace, needed for re:match() in the XPath expressions
NS = {"re": "http://exslt.org/regular-expressions"}

dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)  # Pass the parser, otherwise its options are never applied
for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=NS):
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=NS)
    if tmp and tmp[0].text and tmp[0].text.strip():  # The first two checks can yield None; if there is something, check that it is not only white space
        continue  # The parent already carries text, so discard this node
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=NS):  # A child node also contains text
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())  # Collapse all unwanted whitespace
        if line:
            print(line)
    else:  # Simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.
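One way to avoid depending on pretty-printing at all is to skip tostring and XPath and walk the tree directly: treat any element that has non-whitespace text of its own as a sentence, flatten it with itertext(), and do not descend into it again. The following is a minimal sketch under stated assumptions, not a tested S1000D solution. It uses the standard-library xml.etree.ElementTree parser, which drops comments by default; lxml elements also offer itertext(), but lxml keeps comment nodes in the tree, so they would need to be skipped there. The helper names has_own_text and extract_sentences are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def has_own_text(elem):
    """Return True if the element directly contains non-whitespace text."""
    if elem.text and elem.text.strip():
        return True
    # Text that follows a child element is stored in that child's tail
    return any(child.tail and child.tail.strip() for child in elem)


def extract_sentences(elem):
    """Yield one whitespace-normalised string per text-bearing element."""
    if has_own_text(elem):
        # Flatten this element and all descendants, then stop descending
        yield re.sub(r"\s+", " ", "".join(elem.itertext())).strip()
    else:
        for child in elem:
            yield from extract_sentences(child)


single_line = (
    "<root><subroot><a>Intro, element a: <b>Nested b to be included in a, "
    "<c>and yet another nested c-element</c> and back to b.</b></a>"
    "<!-- Comment --><a>Simple element.</a></subroot></root>"
)
for sentence in extract_sentences(ET.fromstring(single_line)):
    print(sentence)
```

Because whitespace is only normalised after the text has been collected per element, the same list of strings should come out whether the source is pretty-printed or on a single line.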
I will be receiving the following XML data in a variable.
<order>
<name>xyz</name>
<city>abc</city>
<string>aGVsbG8gd29ybGQgMQ==</string>
<string>aGVsbG8gd29ybGQgMg==</string>
<string>aGVsbG8gd29ybGQgMw==</string>
</order>
Output:
<order>
<name>xyz</name>
<city>abc</city>
<string>hello world 1</string>
<string>hello world 2</string>
<string>hello world 3</string>
</order>
I know how I can decode from base64 but the problem is some of the values are decoded already and some are encoded. What is the best approach to decode this data using groovy so that I get the output as shown?
The <string> tag values will always be encoded. All other tags and values will already be decoded.
Since there's no uncertainty about which nodes arrive encoded and which don't, there's no need to detect base64 encoding, and the way to do it is pretty simple:
Parse it. There are two preferred ways to do that in Groovy: XmlSlurper and XmlParser. They differ in how they compute and in memory consumption, but both give you an object/structure representation in the end.
Work with that object structure: traverse all required elements and decode the content/attributes you need to decode.
Then proceed further with the decoded data and/or serialize it back to XML text.
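The three steps above can be sketched as follows. The sketch is in Python with the standard-library xml.etree.ElementTree rather than Groovy, purely to illustrate the parse / traverse-and-decode / serialize flow, and the same shape should carry over to XmlSlurper or XmlParser. The helper name decode_string_elements is hypothetical, and the code assumes, as stated in the question, that every <string> element holds base64-encoded UTF-8 text.

```python
import base64
import xml.etree.ElementTree as ET


def decode_string_elements(xml_text):
    """Decode the base64 content of every <string> element and return the XML."""
    root = ET.fromstring(xml_text)            # Step 1: parse
    for node in root.iter("string"):          # Step 2: traverse and decode
        if node.text:
            node.text = base64.b64decode(node.text).decode("utf-8")
    return ET.tostring(root, encoding="unicode")  # Step 3: serialize


order = (
    "<order><name>xyz</name><city>abc</city>"
    "<string>aGVsbG8gd29ybGQgMQ==</string>"
    "<string>aGVsbG8gd29ybGQgMg==</string></order>"
)
print(decode_string_elements(order))
```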
Articles to look at:
Load, modify, and write an XML document in Groovy
https://www.baeldung.com/groovy-xml
https://groovy-lang.org/processing-xml.html
and many, many more.
Another cheat sheet always useful for Groovy noobs: http://groovy-lang.org/groovy-dev-kit.html
Check out how to traverse the structures there, for instance.
How does one convert an ASTNode (or at least a CompilationUnit) into a valid piece of source code?
The documentation says that one shouldn't use toString, but doesn't mention any alternatives:
Returns a string representation of this node suitable for debugging purposes only.
CompilationUnits have rewrite, but that one does not work for ASTs created by hand.
Formatting options would be nice to have, but I'd basically be satisfied with anything that turns arbitrary ASTNodes into semantically equivalent source code.
In JDT the normal way for AST manipulation is to start with a basic CompilationUnit and then use a rewriter to add content. Then ASTRewriteAnalyzer / ASTRewriteFormatter should take care of creating formatted source code. Creating a CU just containing a stub type declaration shouldn't be hard, so that's one option.
If that doesn't suit your needs, you may want to experiment with directly calling the internal org.eclipse.jdt.internal.core.dom.rewrite.ASTRewriteFlattener.asString(ASTNode, RewriteEventStore). If you are not editing existing files, you can probably ignore the events collected in the RewriteEventStore and just use the returned String.
I'm interested in knowing the data structure that a phonebook would use. One that contains objects with fields like a name string, a number string, etc. and allows searching (and partial searching, like the first few letters of the name) via ALL the fields.
What is the method that a phonebook would use? I was thinking it would be some version of a tree, but I'm having difficulty wrapping my head around efficient methods of doing so.
You could use an Array of Maps:
ArrayList<Map<String, String>> a;
// ...
a.get(i).get("name")
But XML is much better:
org.w3c.dom is quite easy to use and XML is extremely simple to save to a file etc.
<contacts>
<contact name="..." phone="..." />
</contacts>
or
<contacts>
<contact>
<name>...</name>
<phone>...</phone>
</contact>
</contacts>
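Neither a list of maps nor an XML file gives you prefix search by itself; both need a scan over every contact. One commonly used alternative is to keep a sorted index per field and find the start of the matching range by binary search (a trie per field is another option). Below is a minimal sketch in Python rather than Java, assuming all field values are strings and matching is case-insensitive. The class name PhoneBook and the method search_prefix are hypothetical.

```python
import bisect


class PhoneBook:
    def __init__(self, contacts):
        self.contacts = list(contacts)
        # One sorted list of (lowercased value, contact position) per field
        self.index = {}
        for pos, contact in enumerate(self.contacts):
            for field, value in contact.items():
                self.index.setdefault(field, []).append((value.lower(), pos))
        for entries in self.index.values():
            entries.sort()

    def search_prefix(self, field, prefix):
        """Return all contacts whose given field starts with the prefix."""
        entries = self.index.get(field, [])
        prefix = prefix.lower()
        # First entry whose value is >= prefix; matches are contiguous from here
        start = bisect.bisect_left(entries, (prefix,))
        results = []
        for value, pos in entries[start:]:
            if not value.startswith(prefix):
                break
            results.append(self.contacts[pos])
        return results


book = PhoneBook([
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Albert", "phone": "555-0199"},
    {"name": "Bob", "phone": "444-1234"},
])
print(book.search_prefix("name", "al"))
```

Each lookup costs a binary search plus the number of matches, at the price of one extra sorted list per searchable field that has to be kept up to date when contacts change.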
I'm using Schematron and I need to do the following:
Sometimes in the XML document I want to validate, there are elements like this:
<Var.X name="B">
For these elements (whose name() has a dot in the middle) I need to check whether the XML file contains a directory named Var with a child element whose attribute name = X (in this case), like this:
<Var>
<Obj name="X"/>
</Var>
I thought of transforming the name() of those objects to a string representing the path, so for this case particularly:
Var.X would be /*/Var/child::*[@name="X"]
Having this string, then I wanted to check if there's, actually, an element belonging to the path the string represents, but I can't cast the string to path type, and I don't even know if that's possible...
Is there a simpler way of doing this?
You can also use the name() function without a Saxon extension!
<rule context="*[matches(name(),'\w\.\w')]">
<let name="beforePoint" value="substring-before(name(),'.')"/>
<let name="afterPoint" value="substring-after(name(),'.')"/>
<assert test="/*/*[name() = $beforePoint]/*[@name=$afterPoint]">error message</assert>
</rule>
I've realised that what I wanted to achieve is done with saxon:evaluate function... and I already achieved what I wanted