While trying to parse RSS feeds in Groovy, I found a GPath example using wildcards:
def text = """
<data>
<common-tables>
<table name="address"/>
<table name="phone"/>
</common-tables>
<special-tables>
<table name="person"/>
</special-tables>
<other-tables>
<table name="business"/>
</other-tables>
</data>
"""
def xml = new XmlParser().parse(new ByteArrayInputStream(text.getBytes()))
def tables = xml.'**'.table.findAll{ it.parent().name() ==
"special-tables" || it.parent().name
(from http://old.nabble.com/Q:-Avoiding-XPath---using-GPath-td19087210.html)
It looks like a funny use of the 'spread-dot' operator. I can't find any reference to this on the Groovy site, books, etc.
How does this work, and more importantly, how do you discover this? Is there any XPath to GPath 'Rosetta Stone' out there?
Well, as usual, the best place to find information is in the Groovy source itself.
The result of parsing is a groovy.util.slurpersupport.GPathResult object.
If you look at the source (a plain Java file), you'll see that the getProperty(String) method handles the following special names:
".." returns the parent
"*" returns all the children
"**" acts as a depth-first loop
"@" is used to access an attribute
anything else is the normal node accessor.
That's all, no other magic keywords for the moment.
All of those strings are treated as properties. None of them are actually operators.
The calls are routed through GPathResult#getProperty which specifically checks for the operators listed in gizmo's answer.
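To make the dispatch idea concrete, here is a minimal Python sketch of a property lookup that special-cases the same names. It is only an analogy, not Groovy's implementation: the PathResult class and its get_property method are hypothetical names, and the sketch assumes the attribute prefix is "@".

```python
import xml.etree.ElementTree as ET


class PathResult:
    """Hypothetical analogue of a GPath-style result; not the Groovy class."""

    def __init__(self, element, parent=None):
        self.element = element
        self.parent = parent

    def get_property(self, name):
        # Special names are checked first, exactly like plain string comparisons.
        if name == "..":
            return self.parent
        if name == "*":
            return [PathResult(child, self) for child in self.element]
        if name == "**":
            return list(self._depth_first())
        if name.startswith("@"):
            return self.element.get(name[1:])
        # Anything else is treated as a child element name.
        return [PathResult(c, self) for c in self.element if c.tag == name]

    def _depth_first(self):
        # Yield this node first, then every descendant in document order.
        yield self
        for child in self.element:
            yield from PathResult(child, self)._depth_first()
```

Because the special names are ordinary strings, nothing here needs new syntax; the lookup method simply compares the requested name before falling back to a child search.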
I have a situation in which an XML document has information in varying depth (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element that's containing text contains other elements that in turn contain text, I need to flatten/concatenate it into one string/sentence. The nested elements shall not be visited again if this is done.
Using Python's lxml library and applying the tostring function works OK if the source XML is pretty-printed, so that I can split the concatenated string on newlines to get each sentence. If the source isn't pretty-printed and sits on one single line, there won't be any newlines to split on.
I have tried the iter function and applying xpaths to each node, but this often renders other results in Python than what I get when applying the xpath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file, and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot redirect Saxon to resolve (such as &nbsp;). I have no problem parsing files with lxml, so I tend to lean towards a cleaner Python solution.
lxml doesn't seem to have XPath support that can give me all nodes with text that contain one or more children containing text, and all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no white space, and multiple sentences/strings if the source had been pretty-printed. This doesn't feel OK.
What are my options? Use XSLT 1.0 in Python, other parsers to get a better handle on where I am in the tree, ...
Just to reiterate the issue here; I am looking for a generic way to extract text, and the only rules to the XML source are that a sentence may be built from an element with child elements with text, but there won't be additional levels. The other possibility is the simple element, but this one cannot be included in a parent element with text since this is included in the first rule.
Help/thoughts are appreciated.
This is downright ugly code, a hasty hack with no real thought for form, beauty or finesse. All I am after is one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me a better way.
The problem has been to write XPath expressions that could get me all elements with text content, and then to act upon them depending on their context. All my XPath expressions have given me the correct nodes, but also a root or ancestor that has pulled a more or less complete string at the beginning, so I gave up on those. My XPath expressions work as they should in XSLT, but not in Python; I don't know why.
I had to resort to regex to find nodes that contain strings that are not white space only.
Using lxml with xpath and tostring gives different results depending on how the source XML is formatted, so I had to get around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
</c> and back to b.</b>
</a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

NS = {"re": "http://exslt.org/regular-expressions"}  # EXSLT regular-expression namespace

dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)  # Pass the parser; it was created but never used before
# Every element whose text contains at least one non-whitespace character
for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=NS):
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=NS)
    if tmp and tmp[0].text and tmp[0].text.strip():  # The first two checks can yield None; the last rules out whitespace-only text
        continue  # The parent carries text, so this node is already part of its sentence
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=NS):  # A child node also contains text
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())  # Collapse unwanted whitespace
        if line:
            print(line)
    else:  # Simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.
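As an alternative to the XPath and regex approach, the two rules can be sketched as a plain tree walk. This is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; the helper names has_text and sentences are made up for illustration, and it assumes that text following a child element (its tail) counts as text of the parent.

```python
import re
import xml.etree.ElementTree as ET


def has_text(elem):
    """True if the element itself carries non-whitespace text.

    Both the leading text and the tails of its children are checked,
    because a tail sits inside the parent element.
    """
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)


def sentences(root):
    """Yield one whitespace-normalized string per outermost text-bearing element."""
    if has_text(root):
        # Flatten the whole subtree into one sentence and do not descend,
        # so nested elements are never visited again.
        yield re.sub(r"\s+", " ", "".join(root.itertext())).strip()
        return
    for child in root:
        yield from sentences(child)
```

Calling list(sentences(ET.fromstring(xml_text))) should give the same list whether or not the source is pretty-printed, since whitespace is normalized after flattening. The default ElementTree parser drops comments, so they do not need special handling here.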
An XPath identified with Firebug is not working in Protractor. I created the XPath using Firebug. When I evaluate the XPath in the IDE, it works fine. However, when I use the same XPath in Protractor, it does not work. My element does not have an id or name, so XPath is the only option here.
Please find the below image for reference.
Here I need to verify whether that particular element has "IRCTC Attractions" text.
Could you please help me?
HTML code:
<div style="width:100%;" class="g_hedtext">IRCTC Attractions</div>
Find the element by text and assert it's present:
var elm = element(by.xpath("//div[. = 'IRCTC Attractions']"));
expect(browser.isElementPresent(elm)).toBe(true);
OK, looking at your error message (in the comment):
Exception loading: SyntaxError:
C:\Users\XXXX\AppData\Roaming\npm\TC_model2.js:7
var disclaimermessage = element(by.xpath('//*[@id='disclaimer-message']'));
^^^^^^^^^^ Unexpected identifier
(I'm guessing where the carets before "Unexpected identifier" were aligned. Is that right?)
The problem is that you've used single quotes both to delimit the string 'disclaimer-message', and to delimit the whole XPath expression '//*[@id='disclaimer-message']'. Thus it appears to the parser that your XPath expression is the stuff between the first two single quotes: '//*[@id=', and then the disclaimer-message is some other identifier without any comma or other operator to show what it's doing there.
The solution is to use double quotes inside the XPath expression. XPath accepts either single or double quotes; it doesn't care, as long as you match them with each other. So change the offending line to
var disclaimermessage = element(by.xpath('//*[@id="disclaimer-message"]'));
And you should be good to go.
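The same quoting rule can be tried outside Protractor. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in, with made-up markup that assumes an element carrying an id attribute of disclaimer-message; it shows that either quote style works inside the expression as long as it differs from the quotes delimiting the host-language string.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, only for illustration.
page = ET.fromstring(
    '<html><body><div id="disclaimer-message">Notice</div></body></html>'
)

# Single-quoted host string, double quotes inside the XPath expression.
inner_double = page.findall('.//*[@id="disclaimer-message"]')

# Double-quoted host string, single quotes inside the XPath expression.
inner_single = page.findall(".//*[@id='disclaimer-message']")
```

Both lookups select the same element; only the mixture of identical outer and inner quotes breaks the string literal.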
For future reference, this question would have been quicker and easier to answer if you had told us about the error message in the first place.
We would like to understand a couple of legacy job-dsl scripts but don't know what the "slash operator" means in this context (as it can't be division):
def command = (shells.first() / command)
We have tried to look it up in several Groovy books but only found the trivial solution that it means 'division'.
It's an XML Node operation, to return a sub-node of a XML node, or create it if it doesn't exist. Probably the command node under the first of your shells nodes here.
Groovy allows operator overloading, so it is the same "division" operator, just redefined somewhat. This is common (but also controversial) in other languages allowing operator overloading, but does allow for richer DSLs.
Having had a quick look at (an old copy of) the JobDSL source, it seems that they're doing it using a class NodeEnhancement, notably this JavaDoc:
/**
Add div and leftShift operators to Node.
div - Will return the first child that matches name, and if it doesn't exists, it creates
...
**/
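For readers more familiar with Python, the behaviour described in that JavaDoc can be imitated by overloading the division operator. This is only a sketch of the idea, not JobDSL's code: the Node wrapper class below is hypothetical and is built on the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical wrapper: node / "name" returns the first child, creating it if absent."""

    def __init__(self, element):
        self.element = element

    def __truediv__(self, name):
        # Look for the first direct child with the requested tag.
        child = self.element.find(name)
        if child is None:
            # No such child yet, so create it under this node.
            child = ET.SubElement(self.element, name)
        return Node(child)
```

With this in place, Node(shell) / "command" reads much like the Groovy expression in the question, which is the kind of DSL convenience operator overloading is used for.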
Using XPath 1.0 in XSLT SharePoint 2013, I have two objectives:
1. To extract 'Library Name' from:
/path/to/library/could/be/any/length/Library Name/file.extension
2. To extract document id QYZM2HKWQCSZ-3-3 from the following:
http://sharepoint01/sites/temp/_layouts/15/DocIdRedir.aspx?ID=QYZM2HKWQCSZ-3-3, QYZM2HKWQCSZ-3-3
How to extract the desired strings?
On another note, for some reason the Document Id column returns the full path to the resource rather than the Id only.
Any suggestions on how to get the Id only (to avoid substring preprocessing)?
For (1), write a recursive template as follows. Pass the input string as a parameter.
(a) if not(contains(substring-after($input, '/'), '/')) then return substring-before($input, '/')
(b) otherwise, make a recursive call passing (substring-after($input, '/')) as the parameter.
(c) add some error-handling logic to make sure you terminate if the input string doesn't contain a '/'.
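Steps (a) to (c) can be sketched in Python to check the logic before writing the XSLT template. The function name parent_folder is made up for illustration; str.partition plays the role of substring-before and substring-after.

```python
def parent_folder(path):
    """Return the segment just before the last '/', following steps (a) to (c)."""
    if "/" not in path:
        # Step (c): stop instead of recursing forever on malformed input.
        raise ValueError("input string does not contain '/'")
    # head is substring-before(path, '/'), rest is substring-after(path, '/').
    head, _, rest = path.partition("/")
    if "/" not in rest:
        # Step (a): only one '/' was left, so head is the folder name.
        return head
    # Step (b): drop the first segment and try again.
    return parent_folder(rest)
```

For the path in the question, each call strips one leading segment until only 'Library Name/file.extension' remains, at which point the part before the slash is returned.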
For (1): not possible in plain XPath 1.0 if you cannot find any further pattern in the path.
For (2): it seems the pattern ?ID= is fixed, and also the comma following the ID. If so, you can use substring-after(substring-before(., ','), '?ID='). Replace the context . by some XPath expression selecting the string.
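To see what that nested expression does to the sample value, here is a small Python sketch that mimics the two XPath 1.0 string functions. The helper names are hypothetical, and the sketch assumes a non-empty separator; like the XPath functions, both helpers return an empty string when the separator is missing.

```python
def substring_before(text, sep):
    """Mimic XPath substring-before: text up to the first sep, or '' if absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


def substring_after(text, sep):
    """Mimic XPath substring-after: text following the first sep, or '' if absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def document_id(value):
    # Same nesting as substring-after(substring-before(., ','), '?ID=').
    return substring_after(substring_before(value, ","), "?ID=")
```

The inner call cuts the value at the first comma, and the outer call then keeps only what follows ?ID=.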
I'm getting a garbled JSON string from a HTTP request, so I'm looking for a temp solution to select the JSON string only.
The request.params() returns this:
[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,
callback=jQuery1707229194729661704_1329793018352
I would like everything from the start of the '{' to the end of the '}'.
I found lots of examples of doing similar things with other languages, but the purpose of this is not to only solve the problem, but also to learn Scala. Will someone please show me how to select that {....} part?
Regexps should do the trick:
"\\{.*\\}".r.findFirstIn("your json string here")
As Jens said, a regular expression usually suffices for this. However, the syntax is a bit different:
"""\{.*\}""".r
creates an object of scala.util.matching.Regex, which provides the typical query methods you may want to do on a regular expression.
In your case, you are simply interested in the first occurrence in a sequence, which is done via findFirstIn:
scala> """\{.*\}""".r.findFirstIn("""[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,callback=jQuery1707229194729661704_1329793018352""")
res1: Option[String] = Some({"insured_initials":"Tt","insured_surname":"Test"})
Note that it returns an Option type, which you can easily use in a match to find out if the regexp was found successfully or not.
Edit: A final point to watch out for is that the regular expressions normally do not match over linebreaks, so if your JSON is not fully contained in the first line, you may want to think about eliminating the linebreaks first.
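The linebreak caveat is easy to reproduce. The sketch below is in Python rather than Scala, using a made-up two-line variant of the input; the inline flag (?s), which makes the dot match newlines as well, is also understood by the Java regex engine that Scala uses, so it is an alternative to stripping the linebreaks first.

```python
import re

# Hypothetical variant of the request string with the JSON split over two lines.
text = (
    '[{"insured_initials":"Tt",\n'
    '"insured_surname":"Test"}=, _=1329793147757,\n'
    "callback=jQuery1707229194729661704_1329793018352"
)

# Without (?s) the dot stops at the linebreak, so no match is found.
plain = re.search(r"\{.*\}", text)

# With (?s) the dot also matches newlines, so the braces are paired up.
dotall = re.search(r"(?s)\{.*\}", text)
json_part = dotall.group(0) if dotall else None
```

Here plain is None while json_part holds the text from the opening brace to the closing brace, linebreak included.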