How to build text from mixed xml content using Python? - python-3.x

I have a situation in which an XML document has information in varying depth (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element that's containing text contains other elements that in turn contain text, I need to flatten/concatenate it into one string/sentence. The nested elements shall not be visited again if this is done.
Using Python's lxml library and applying the tostring function works fine if the source XML is pretty-printed, because I can then split the concatenated string on newlines to get each sentence. If the source isn't pretty-printed and sits on one single line, there are no newlines to split on.
I have tried the iter function and applying xpaths to each node, but this often renders other results in Python than what I get when applying the xpath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file, and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot get Saxon to resolve (such as &nbsp;). I have no problem parsing files with lxml, so I tend to lean towards a cleaner Python solution.
lxml doesn't seem to have XPath support that can give me all nodes with text that contain one or more children containing text, and all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no whitespace, and multiple sentences/strings if the source had been pretty-printed. This doesn't feel right.
What are my options? Use XSLT 1.0 in Python, other parsers to get a better handle on where I am in the tree, ...
Just to reiterate the issue here; I am looking for a generic way to extract text, and the only rules to the XML source are that a sentence may be built from an element with child elements with text, but there won't be additional levels. The other possibility is the simple element, but this one cannot be included in a parent element with text since this is included in the first rule.
Help/thoughts are appreciated.

This is downright ugly code, a hasty hack with no real thought given to form, beauty or finesse. All I am after is one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me a better way.
The problem has been finding XPath expressions that return all elements with text content, and then acting on them depending on their context. All my XPath expressions have given me the correct nodes, but also a root or ancestor that pulled in a more or less complete string at the beginning, so I gave up on those. My XPath works as it should in XSLT, but not in Python; I don't know why.
I had to resort to regex to find nodes that contain strings that are not whitespace only.
Using lxml with xpath and tostring gives different results depending on how the source XML is formatted, so I had to work around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
</c> and back to b.</b>
</a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

NS = {"re": "http://exslt.org/regular-expressions"}
dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)  # Pass the parser in, otherwise it is never used
for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=NS):
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=NS)
    # The first two checks guard against an empty result and a None text; the third rejects whitespace-only text
    if tmp and tmp[0].text and tmp[0].text.strip():
        continue  # The parent already carries text, so this node belongs to the parent's sentence
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=NS):  # A child node also contains text
        # Flatten the subtree to text and collapse all unwanted whitespace
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())
        if line:
            print(line)
    else:  # Simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.
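For comparison, the rule described in the question (the outermost element that has text of its own becomes one sentence, and its descendants are not visited again) can also be written without XPath or regex. The following is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml, and the helper names has_own_text and extract_sentences are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def has_own_text(elem):
    """True if elem has non-whitespace text directly inside it (its text or a child's tail)."""
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)


def extract_sentences(root):
    """Return one whitespace-normalised string per outermost text-bearing element."""
    sentences = []

    def walk(elem):
        if has_own_text(elem):
            # Flatten the whole subtree into one string and do not descend further.
            sentences.append(re.sub(r"\s+", " ", "".join(elem.itertext())).strip())
        else:
            for child in elem:
                walk(child)

    walk(root)
    return sentences


# The single-line test document from above (comments are dropped by the default parser).
sample = (
    "<root><subroot>"
    "<a>Intro, element a: <b>Nested b to be included in a, "
    "<c>and yet another nested c-element</c> and back to b.</b></a>"
    "<!-- Comment -->"
    "<a>Simple element.</a>"
    "<a>Text with<b> 1st nested b</b>, back in a, "
    "<b>and yet another b-element</b>, before ending in a.</a>"
    "</subroot></root>"
)
result = extract_sentences(ET.fromstring(sample))
```

Because the decision is based on stripped text rather than on newlines, pretty-printed and single-line sources should produce the same list.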

Related

Handling an invalid XML attribute in an Excel document

I'm using openpyxl to read an Excel document. For reasons that I don't understand at all, two of the cell-style names have a ctrl-d in them in xl/styles.xml in the ZIP archive storing the spreadsheet:
<cellStyle name="^D" xfId="20" builtinId="53" customBuiltin="true"/>
<cellStyle name="^D 2" xfId="21" builtinId="53" customBuiltin="true"/>
(That's a ctrl-D in both names.) Openpyxl's load_workbook function quite reasonably chokes with the following error:
lxml.etree.XMLSyntaxError: invalid character in attribute value, line 2, column 11879
Approaches that I've considered:
Preprocess and replace styles.xml
Ignore styles altogether somehow
Manually remove the cell styles in oocalc (or Excel)
Any ideas/advice?
Shoot whoever or whatever produced the file because this is invalid XML! ;-) Submit a bug upstream.
If you can clean it up in MS Excel then that's going to be easier; otherwise you can write your own preprocessor using openpyxl's code: styles/stylesheet.py will let you read the source without having to worry about namespaces, but otherwise you should be able to change the elements in place. Stylesheets are almost never that big (although some libraries do produce massive ones with junk in them).
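The "preprocess and replace styles.xml" option from the question could look roughly like the sketch below. It does not use openpyxl's stylesheet code; it simply copies the ZIP archive and strips control bytes that XML 1.0 does not allow from one member. The function name clean_styles is made up, and the byte-level regex assumes that styles.xml is UTF-8 encoded.

```python
import io
import re
import zipfile

# Bytes that XML 1.0 forbids: everything below 0x20 except tab, LF and CR.
ILLEGAL_XML_CHARS = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_styles(src, dst, member="xl/styles.xml"):
    """Copy the archive src to dst, removing illegal control bytes from one member."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == member:
                data = ILLEGAL_XML_CHARS.sub(b"", data)
            # Reusing the ZipInfo keeps the member name and metadata unchanged.
            zout.writestr(item, data)
```

With hypothetical file names, clean_styles("broken.xlsx", "cleaned.xlsx") followed by openpyxl.load_workbook("cleaned.xlsx") would be the intended use. Note that the affected style names lose the control character, so the first one ends up empty.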

Switch Case Conditional output in Latex

I want to write a report which has a structure like this:
\begin{document}
\input[option=a]{class}
\input[option=b]{class}
\input[option=c]{class}
\input[option=d]{class}
\end{document}
class.tex has content like this:
here are some shared content
switch(option)
case a
some text a
case b
some text b
case c
some text c
case d
some text d
endswitch
Here maybe more shared content.
Is there any way to do this in Latex?
A simplified way of doing this is to use conditionals built on TeX's \if ... \else ... \fi logic.
At the top of the .tex file, set up a switch with
\newif\ifswitch
The default value will be false. To set the value to be true use
\switchtrue
Then in the text of the document use
\ifswitch
<<text to include if switch is true>>
\else
<<text to include if switch is false>>
\fi % ends the if statement
So for your particular question you could have a set of switches
\newif\ifConditionA
\newif\ifConditionB
\newif\ifConditionC
\newif\ifConditionD
This is not as elegant as using a switch statement, but allows conditions where you want text from A and C at the same time for example.
This is discussed further under: two versions of a document with 'if else' logic statements
You can use the following (crude) method of identifying the textual components between which you want to extract stuff from a file:
\documentclass{article}
\usepackage{filecontents}
\begin{filecontents*}{class.tex}
switch(option)
case a
some text a
case b
some text b
case c
some text c
case d
some text d
endswitch
\end{filecontents*}
\usepackage{catchfile}
% \inputclass{<from>}{<to>}
\newcommand{\inputclass}[2]{%
\CatchFileDef{\class}{class.tex}{}%
\long\def\classsegment##1#1 ##2 #2##3\relax{##2}%
\show\classsegment
\expandafter\classsegment\class\relax
}
\begin{document}
\inputclass{case c}{case d}
\inputclass{case a}{case b}
\inputclass{case d}{endswitch}
\inputclass{case b}{case c}
\end{document}
Related:
How to extract information between two unique words in a large text file
How to extract data between two different xml tags
\input only part of a file
The last one is a more adaptable approach using the catchfilebetweentags package. This requires the insertion of appropriate tags within your code, which might not be as helpful. You could also use listings to include specific lines of code from an external file.
As I understand it, what you want is to update the function for each different part of the text, while defining the function in just one place.
The easy way to do this is to renew a variable command at the start of each section.
At start:
\newcommand{\VARIABLENAME}{VARIABLE_1}
At section:
\renewcommand{\VARIABLENAME}{VARIABLE_2}
There are more advanced ways of doing this as well, involving defining variables but for all it is worth, this is more readable and simpler to implement.
Note: If you are planning to make something more dynamic than just a class, I recommend writing the LaTeX file from another language such as Python, as that usually gives a lot more room for modification.
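As a rough illustration of that last suggestion, a short Python script can assemble the repeated blocks before LaTeX ever sees them. The strings below are copied from the class.tex outline in the question; the function names render_class and render_document are made up, and a real report would also need a preamble.

```python
# Placeholder content taken from the class.tex outline in the question.
SHARED_HEAD = "here are some shared content"
SHARED_TAIL = "Here maybe more shared content."
CASES = {
    "a": "some text a",
    "b": "some text b",
    "c": "some text c",
    "d": "some text d",
}


def render_class(option):
    """Return the shared text with the branch for `option` in between."""
    return "\n".join([SHARED_HEAD, CASES[option], SHARED_TAIL])


def render_document(options):
    """Return a document body that repeats the class block once per option."""
    body = "\n\n".join(render_class(option) for option in options)
    return "\\begin{document}\n" + body + "\n\\end{document}\n"


report = render_document(["a", "b", "c", "d"])
```

Writing report to a .tex file then gives the same result as four \input calls with different options, with the switch logic kept in one dictionary.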

Comparing strings in python 2.7

This is my code:
for films in filmlist:
    with codecs.open('peliculas.txt', encoding='utf8', mode='r') as lfile:
        filmsDone = lfile.read()
    filmsDoneList = filmsDone.split(',')
    if films not in filmsDoneList:
        with codecs.open('peliculas.txt', encoding='utf8', mode='a+') as lfile:
            lfile.write(films.strip() + ',')
It will never recognize the last item of the list.
I have printed filmsDoneList, and the last item in PyCharm looks like this: u'X Men.Primera Generacion'. I have printed films, and each one looks like this: X Men.Primera Generacion'
So I have no idea where the problem is. Thanks in advance.
@Rafa, to better explain what I meant in the comments, I had to write a full answer so that I could attach code and screenshots.
Let's say the peliculas.txt file has the following format:
You can import such file in python according the following 3 commands:
fileIN=open('peliculas.txt','r')
filmsDoneList=fileIN.readlines()
fileIN.close()
So you basically open the file, import each line thanks to readlines() and then close the file because its contents are available in filmsDoneList. The latter has the following contents (in PyCharm):
Obviously this list is quite long and does not fit in my screen, but you get the point.
You can now get rid of that annoying newline tag '\r\n' by means of the following loop:
for id in range(len(filmsDoneList)):
    filmsDoneList[id]=filmsDoneList[id].strip()
and now filmsDoneList has the form:
much better now, innit?
Now, let's say you want to add the following films:
newFilms=['The Exorcist','Back to the Future','Aliens','Back to the Future']
To show that the code copes with repeated input, I have added Back to the Future twice. You can get rid of duplicates in newFilms by means of the set() function. This converts newFilms into a set with duplicates removed (the original order is not preserved), and we convert it back to a list with this command:
newFilms=list(set(newFilms))
and now newFilms has the form:
Now that everything has been sorted, it's time to check if items in newFilms already are in filmsDoneList which, recall, is the contents of peliculas.txt.
Reopen peliculas.txt as follows:
fileOUT=open('peliculas.txt','a')
the 'a' tag means "append", so basically everything you write will be added to the file without removing anything from it.
And the main loop goes:
for film in newFilms:
    if film in filmsDoneList:
        pass
    else:
        fileOUT.write(film+'\n')
The pass means "do nothing". The write command also appends the newline character to the movie title: this keeps the previous format of one title per line. At the end of this loop, close fileOUT with fileOUT.close().
The resulting peliculas.txt is
and, as you can see, Back to the Future was in newFilms but wasn't appended to the end of this file because it was already in it. The Exorcist and Aliens, on the other hand, have been appended at the bottom of the file.
If your file has titles separated by commas, this approach is still valid. However you must add
filmsDoneList=filmsDoneList[0].split(',')
after the first for loop. Also in the write function (in the last for loop) you might want to replace the newline value with a comma.
This approach is cleaner, and I reckon it will also fix the problem you've been having; it also avoids repeatedly opening and closing the file inside a loop. Hope this helps!
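Put together, the steps above could be wrapped in one function. This is a sketch under a few assumptions: the file holds one title per line and ends with a newline, the name append_new_films is made up, and io.open is used so that the same code reads UTF-8 on both Python 2.7 and Python 3. Unlike list(set(...)), it keeps the order of the new titles.

```python
import io


def append_new_films(path, new_films):
    """Append titles from new_films that are not yet in the file; return the titles added."""
    # Read the existing titles once, stripping the trailing newline of each line.
    with io.open(path, encoding="utf8") as handle:
        done = {line.strip() for line in handle if line.strip()}

    added = []
    with io.open(path, mode="a", encoding="utf8") as handle:
        for film in new_films:
            title = film.strip()  # compare and write the same, stripped value
            if title and title not in done:
                handle.write(title + u"\n")
                done.add(title)  # also guards against duplicates inside new_films
                added.append(title)
    return added
```

Comparing and writing the same stripped value matters here: the loop in the question tests films but writes films.strip(), so a title with stray whitespace would never match what was stored.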

issues while serializing to YAML file

I have started using the .NET API for YAML and it seems to be helpful. However, I have a few questions and am wondering if you can provide some samples or workarounds.
(1) I have an object consisting of 4 strings, and I would like to serialize a collection of them (List or String[]). I wrote a helper method that returns the strings in the format I want, but the serializer adds an extra single quote before and after each string. So I am getting
-'{str1: str2, str3: str4}'
-'{str5: str6, str7: str8}'
instead of
-{str1: str2, str3: str4}
-{str5: str6, str7: str8}
Can you suggest any workarounds?
(2) I am trying to insert xaml as a string in a yaml document. My xaml is well formed xml but when I serialize it, it cuts before 3rd last element. Any idea why?
Regarding the first question, if you are serializing an array of strings, then it is normal that each element is quoted because it starts with a '{'. In this case, you should be serializing the list of objects directly instead of converting them to string first.
Regarding the second question, you should add some code to the question to clarify what you are doing.

XPath innerText ignoring subchilds

I want to select an element using the XPath text() node test, given a structure like the one shown below.
<root>
<child>
<lowerchild>
<lowestchild>
My text
</lowestchild>
</lowerchild>
</child>
</root>
//child[contains(text(), 'My text')]
should return the child element, and
//lowerchild[contains(text(), 'My text')]
should return the lowerchild element.
I tried out the XPath-commands with HTMLAgilityPack, but they were not able to find those elements.
The final result of my little project is a small XPath searcher: the user gives the element name, the attribute and the value, so it would be great if you could give me a solution that uses only that information. The structure could be anything. If element names repeat, for example if there were two lowestchild elements, then I would like to pick the "lower" (more deeply nested) one. Hope you can help me.
Instead of
//child[contains(text(), 'My text')]
it looks like you want
//child[contains(., 'My text')]
The XPath expression text() (with the implicit child:: axis) selects any text node that is a child of the context node. In the above example, it selects only text nodes that are immediate children of the child element. In the XML you showed, the child element has two child text nodes, with the lowerchild element in between them. Both text nodes contain only whitespace, and for this reason they may be stripped by some processors, depending on settings.
If you pass a node-set or a sequence as the first parameter to contains(a, b), it takes the first node and converts it to a string. So your parameter is getting converted to a string containing only whitespace, or else an empty string (if the whitespace-only text nodes got stripped).
But if instead of text() you pass . as the first argument to contains(), then the context node (which is a child) gets converted to a string. This means concatenating the values of all text node descendants of child, not just immediate text node children. (It's sort of like DOM innerText, which your question title mentions, but does not include start/end tags of elements, nor attributes.) For this reason, //child[contains(., 'My text')] will return the child element.
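The difference can be mimicked in plain Python. The sketch below does not run an XPath engine (the standard library's xml.etree.ElementTree has no contains() function); the two hypothetical helpers only reproduce the strings that contains() would receive in each case, using the XML from the question.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>\n"
    "  <child>\n"
    "    <lowerchild>\n"
    "      <lowestchild>\n"
    "        My text\n"
    "      </lowestchild>\n"
    "    </lowerchild>\n"
    "  </child>\n"
    "</root>"
)


def first_text_child(elem):
    """Roughly what contains(text(), ...) sees: the first text node that is a direct child."""
    candidates = [elem.text] + [sub.tail for sub in elem]
    return next((text for text in candidates if text), "")


def string_value(elem):
    """Roughly what contains(., ...) sees: all descendant text nodes concatenated."""
    return "".join(elem.itertext())


child = doc.find("child")
found_via_text = "My text" in first_text_child(child)  # whitespace only, so no match
found_via_dot = "My text" in string_value(child)       # descendant text is included
```

Only lowestchild has "My text" as a direct text child, which is why the text() form matches that element alone.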
