XPath innerText ignoring subchilds

I want to select an element using XPath's text(), given a structure like the one shown below.
<root>
<child>
<lowerchild>
<lowestchild>
My text
</lowestchild>
</lowerchild>
</child>
</root>
//child[contains(text(), 'My text')]
should return the child element, and
//lowerchild[contains(text(), 'My text')]
should return the lowerchild element.
I tried these XPath expressions with HtmlAgilityPack, but they did not find those elements.
The end result of my little project is a small XPath searcher: the user supplies the element name, the attribute, and the value, so it would be great if you could give me a solution that uses only that information. The structure could be anything. If element names repeat, for example if there were two lowestchild elements, then I would like to pick the deeper ("lower") of the two. I hope you can help.

Instead of
//child[contains(text(), 'My text')]
it looks like you want
//child[contains(., 'My text')]
The XPath expression text() (with the implicit child:: axis) selects any text node that is a child of the context node. In the above example, it selects only text nodes that are immediate children of the child element. In the XML you showed, the child element has two child text nodes, with the lowerchild element in between them. Both text nodes contain only whitespace, and for this reason they may be stripped by some processors, depending on settings.
If you pass a node-set or a sequence as the first parameter to contains(a, b), it takes the first node and converts it to a string. So your parameter is getting converted to a string containing only whitespace, or else an empty string (if the whitespace-only text nodes got stripped).
But if instead of text() you pass . as the first argument to contains(), then the context node (here, the child element) gets converted to a string. This means concatenating the values of all text node descendants of child, not just its immediate text node children. (It is similar to DOM textContent, or innerText, which your question title mentions: element start/end tags and attributes are not included.) For this reason, //child[contains(., 'My text')] will return the child element.
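To make the difference concrete, here is a minimal sketch in Python using only the standard library. It does not run the XPath itself (ElementTree's XPath subset has no contains()); instead, the hypothetical helpers direct_text and string_value imitate what text() and . give you. Note that direct_text joins all immediate text nodes, whereas contains(text(), ...) in XPath 1.0 only looks at the first one.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<child>
<lowerchild>
<lowestchild>
My text
</lowestchild>
</lowerchild>
</child>
</root>"""

def direct_text(elem):
    """Join only the text nodes that are immediate children of elem (like text())."""
    parts = [elem.text or ""]
    parts.extend(sub.tail or "" for sub in elem)
    return "".join(parts)

def string_value(elem):
    """Join every descendant text node of elem (like string(.) in XPath)."""
    return "".join(elem.itertext())

root = ET.fromstring(XML)
child = root.find("child")
lowest = root.find(".//lowestchild")

print("My text" in direct_text(child))    # only whitespace around lowerchild
print("My text" in string_value(child))   # includes text of all descendants
print("My text" in direct_text(lowest))   # the text node lives here
```

With an XPath engine that supports contains(), the same distinction is what separates //child[contains(text(), 'My text')] from //child[contains(., 'My text')].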

Is there a way in pexpect to get the content of parenthesized sub-expressions?

I'm analyzing data coming from a program with pexpect. Is there a way to get an array with the parts matching parenthesized sub-expressions? E.g.:
p.expect(r"IP: *([0-9]*)\.([0-9]*)\.([0-9]*)\.([0-9]*) *\r\n")
I'd like to get a list or a tuple with the four fields of my IP.

Pexpect sets a match attribute on the spawn object, which is a plain old re.Match instance. So, after a successful p.expect call, look in the groups, i.e. p.match.groups(), for the matched octets.
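Since p.match is an ordinary match object, the behaviour can be sketched with the re module alone. The sample line below is made up for illustration; with pexpect you would call p.expect(...) and then read p.match.groups() instead of calling search() yourself.

```python
import re

# Same pattern as in the question, written as a raw string.
IP_LINE = re.compile(r"IP: *([0-9]*)\.([0-9]*)\.([0-9]*)\.([0-9]*) *\r\n")

sample = "IP: 192.168.0.1 \r\n"   # hypothetical program output
match = IP_LINE.search(sample)

octets = match.groups()           # one entry per parenthesized sub-expression
print(octets)
```

As far as I know, a spawn created without an encoding works on bytes, so in that case the groups come back as bytes objects rather than str.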

How to build text from mixed xml content using Python?

I have a situation in which an XML document has information in varying depth (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element that's containing text contains other elements that in turn contain text, I need to flatten/concatenate it into one string/sentence. The nested elements shall not be visited again if this is done.
Using Python's lxml library and applying the tostring function works fine if the source XML is pretty-printed, because I can then split the concatenated string on newlines to get each sentence. If the source is not pretty-printed and sits on one single line, there are no newlines to split on.
I have tried the iter function and applying xpaths to each node, but this often renders other results in Python than what I get when applying the xpath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file, and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot redirect Saxon to resolve (such as &nbsp;). I have no problem parsing files with lxml, so I tend to lean towards a cleaner Python solution.
lxml doesn't seem to have XPath support that can give me all nodes with text that contain one or more children containing text, and all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no white space, and multiple sentences/strings if the source had been pretty-printed. This doesn't feel right.
What are my options? Use XSLT 1.0 in Python, other parsers to get a better handle on where I am in the tree, ...
Just to reiterate the issue here; I am looking for a generic way to extract text, and the only rules to the XML source are that a sentence may be built from an element with child elements with text, but there won't be additional levels. The other possibility is the simple element, but this one cannot be included in a parent element with text since this is included in the first rule.
Help/thoughts are appreciated.

This is downright ugly code, a hasty hack with no real thought given to form, beauty or finesse. All I am after is one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me a better way.
The problem has been to write XPath expressions that get me all elements with text content, and then to act upon them depending on their context. All my XPath expressions have given me the correct nodes, but also a root or ancestor that pulled in a more or less complete string at the beginning, so I gave up on those. My XPath expressions work as they should in XSLT, but not in Python. I don't know why.
I had to resort to regex to find nodes that contain strings that are not white space only.
Using lxml with xpath and tostring gives different results depending on how the source XML is formatted, so I had to work around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
</c> and back to b.</b>
</a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

# EXSLT regular-expression namespace used by re:match() in the XPath expressions
NS = {"re": "http://exslt.org/regular-expressions"}

dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)  # pass the parser, otherwise its options are never used
for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=NS):
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=NS)
    # The first two checks can yield None; if there is something, check that it is not white space only
    if tmp and tmp[0].text and tmp[0].text.strip():
        continue  # The parent already carries text, so discard this node
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=NS):  # A child node also contains text
        # Flatten the element to text and collapse all unwanted whitespace
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())
        if line:
            print(line)
    else:  # Simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.
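For comparison, here is a minimal sketch of the same rule without XPath or regex-based node matching, using only the standard library. It assumes the rule stated in the question: the outermost element that has non-whitespace text of its own becomes one sentence, and nothing nested inside it is visited again. The names has_own_text and sentences are made up for this sketch, and the input is the single-line test document with the XML declaration left out.

```python
import re
import xml.etree.ElementTree as ET

XML = ("<root><subroot>"
       "<a>Intro, element a: <b>Nested b to be included in a, "
       "<c>and yet another nested c-element</c> and back to b.</b></a>"
       "<!-- Comment -->"
       "<a>Simple element.</a>"
       "<a>Text with<b> 1st nested b</b>, back in a, "
       "<b>and yet another b-element</b>, before ending in a.</a>"
       "</subroot></root>")

def has_own_text(elem):
    """True if a text node directly inside elem is not whitespace-only."""
    if elem.text and elem.text.strip():
        return True
    return any(sub.tail and sub.tail.strip() for sub in elem)

def sentences(elem):
    """Yield one flattened string per outermost text-bearing element."""
    if has_own_text(elem):
        # Flatten this element with everything nested in it, then stop descending.
        yield re.sub(r"\s+", " ", "".join(elem.itertext())).strip()
    else:
        for sub in elem:
            yield from sentences(sub)

for sentence in sentences(ET.fromstring(XML)):
    print(sentence)
```

Because whitespace is collapsed after flattening, the pretty-printed and single-line variants of a document should go through the same code path; the default ElementTree parser also drops comments, so they never reach the output.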

Program to generate chained lists in Python

I am developing a program in python and as part of it I have to link all the lists that have an element in common in a certain position, that is, there is an input element and an output element and I want to gather all those that follow the chain. For example, we have as input a list :
list_array = [[n_element, l_input, l_output], .....]
A concrete example would be
list_array = [[1,'a','b'],[2,'c','d'],[3,'e','f'],[4,'b','e'],[5,'d','f'],[6,'a','e'],[7,'b','c']]
The result of the program should be a list where the elements are linked by input and output.
res_array = [[1,4,3],[1,7,2,5],[6,3]]
If one chain is contained in another, the longer one prevails. My first thought was to use a tree structure with a depth-first or breadth-first search. I need ideas.

You could use a graph representation of this problem:
For each element [n_element, l_input, l_output], you add vertices l_input and l_output to your graph (if not already present) and add a directed edge labelled n_element from l_input to l_output.
Then, you look for paths through that graph. The resulting list is then given by the concatenation of edge labels.
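A minimal sketch of that idea, assuming the graph is acyclic and that a chain should start at a vertex with no incoming edge and run until it cannot be extended (so shorter chains contained in longer ones never appear). The function name chained_lists is made up for this sketch.

```python
from collections import defaultdict

def chained_lists(list_array):
    """Return the edge labels along every maximal path of the directed graph."""
    outgoing = defaultdict(list)     # vertex -> [(edge label, next vertex), ...]
    has_incoming = set()
    for label, l_input, l_output in list_array:
        outgoing[l_input].append((label, l_output))
        has_incoming.add(l_output)

    results = []

    def walk(vertex, labels, seen):
        extended = False
        for label, nxt in outgoing.get(vertex, []):
            if nxt in seen:          # assumption: never revisit a vertex (guards against cycles)
                continue
            extended = True
            walk(nxt, labels + [label], seen | {nxt})
        if not extended and labels:  # dead end reached: the chain is complete
            results.append(labels)

    # Depth-first search from every vertex that has no incoming edge
    for start in [v for v in list(outgoing) if v not in has_incoming]:
        walk(start, [], {start})
    return results

list_array = [[1, 'a', 'b'], [2, 'c', 'd'], [3, 'e', 'f'], [4, 'b', 'e'],
              [5, 'd', 'f'], [6, 'a', 'e'], [7, 'b', 'c']]
print(chained_lists(list_array))
```

For the example from the question this gives [[1, 4, 3], [1, 7, 2, 5], [6, 3]]. The order of the chains follows the order in which the edges appear in the input.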

How to get fields of a Julia object

Given a Julia object of composite type, how can one determine its fields?
I know one solution if you're working in the REPL: First you figure out the type of the object via a call to typeof, then enter help mode (?), and then look up the type. Is there a more programmatic way to achieve the same thing?

For v0.7+
Use fieldnames(x), where x is a DataType. For example, use fieldnames(Date), instead of fieldnames(today()), or else use fieldnames(typeof(today())).
This returns a tuple of Symbols listing the field names in order.
If a field name is myfield, then to retrieve the values in that field use either getfield(x, :myfield), or the shortcut syntax x.myfield.
Another useful and related function to play around with is dump(x).
Before v0.7
Use fieldnames(x), where x is either an instance of the composite type you are interested in, or else a DataType. That is, fieldnames(today()) and fieldnames(Date) are equally valid and have the same output.

Suppose the object is obj.
You can get information on all of its fields with the following code snippet:
T = typeof(obj)
for (name, typ) in zip(fieldnames(T), T.types)
    println("type of the fieldname $name is $typ")
end
Here, fieldnames(T) returns the field names and T.types returns the corresponding field types.

XSLT: XPath 1.0 substring

Using XPath 1.0 in XSLT on SharePoint 2013, I have two objectives:
(1) To extract 'Library Name' from:
/path/to/library/could/be/any/length/Library Name/file.extension
(2) To extract the document id QYZM2HKWQCSZ-3-3 from the following:
http://sharepoint01/sites/temp/_layouts/15/DocIdRedir.aspx?ID=QYZM2HKWQCSZ-3-3, QYZM2HKWQCSZ-3-3
How can I extract the desired strings?
On another note, for some reason the Document Id column returns the full path to the resource as opposed to the id only.
Any suggestions on how to get the id only (to avoid substring preprocessing)?

For (1), write a recursive template as follows. Pass the input string as a parameter.
(a) if not(contains(substring-after($input, '/'), '/')) then return substring-before($input, '/')
(b) otherwise, make a recursive call passing (substring-after($input, '/')) as the parameter.
(c) add some error-handling logic to make sure you terminate if the input string doesn't contain a '/'.
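The recursion in steps (a) and (b) can be sketched in Python to check the logic before writing the XSLT template. The helpers substring_before and substring_after are stand-ins that imitate the XPath 1.0 functions (both give an empty string when the separator is missing); parent_folder is a made-up name for the recursive template.

```python
def substring_before(s, sep):
    """Imitates XPath 1.0 substring-before(): '' when sep does not occur in s."""
    head, found, _ = s.partition(sep)
    return head if found else ""

def substring_after(s, sep):
    """Imitates XPath 1.0 substring-after(): '' when sep does not occur in s."""
    _, found, tail = s.partition(sep)
    return tail if found else ""

def parent_folder(path):
    """Return the path segment just before the last '/'."""
    rest = substring_after(path, "/")
    if "/" not in rest:
        # (a) at most one '/' is left, so the wanted segment precedes it
        return substring_before(path, "/")
    # (b) otherwise drop the first segment and recurse
    return parent_folder(rest)

print(parent_folder("/path/to/library/could/be/any/length/Library Name/file.extension"))
```

In this sketch an input without any '/' simply yields an empty string, which is one way to cover case (c).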

For (1): not possible in plain XPath 1.0 if you cannot find any further pattern in the path.
For (2): it seems the pattern ?ID= is fixed, and also the comma following the ID. If so, you can use substring-after(substring-before(., ','), '?ID='). Replace the context . by some XPath expression selecting the string.
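To see what the nested expression does with the value from the question, here is a small Python sketch. The two helpers imitate the XPath 1.0 functions of the same name and are not part of any library.

```python
def substring_before(s, sep):
    """Imitates XPath 1.0 substring-before(): '' when sep does not occur in s."""
    head, found, _ = s.partition(sep)
    return head if found else ""

def substring_after(s, sep):
    """Imitates XPath 1.0 substring-after(): '' when sep does not occur in s."""
    _, found, tail = s.partition(sep)
    return tail if found else ""

value = ("http://sharepoint01/sites/temp/_layouts/15/DocIdRedir.aspx"
         "?ID=QYZM2HKWQCSZ-3-3, QYZM2HKWQCSZ-3-3")

# Same nesting as substring-after(substring-before(., ','), '?ID=')
doc_id = substring_after(substring_before(value, ","), "?ID=")
print(doc_id)
```

The inner call cuts the value at the first comma, and the outer call keeps what follows ?ID=, so the result depends on both markers being present.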
