Nested GPath expressions with XmlSlurper and findAll - groovy

I'm trying to analyse an XML tree using XmlSlurper and GPath, and the behaviour of the findAll method confuses me.
Say, for example, that you have the following XML tree:
<html>
<body>
<ul>
<li class="odd"><span>Element 1</span></li>
<li class="even"><span>Element 2</span></li>
<li class="odd"><span>Element 3</span></li>
<li class="even"><span>Element 4</span></li>
<li class="odd"><span>Element 5</span></li>
</ul>
</body>
</html>
Assuming that xml has been initialised through one of XmlSlurper's parse methods, the following code executes as one would expect:
// Prints:
// odd
// odd
// odd
xml.body.ul.li.findAll {it.@class == 'odd'}.@class.each {println it.text()}
On the other hand:
// Doesn't print anything.
xml.body.ul.li.findAll {it.@class == 'odd'}.span.each {println it.text()}
I'm struggling to understand why I can use the special @ property (as well as others, such as **), but not 'normal' ones.
I've looked at the API code, and what confuses me even more is that the getProperty implementation (found in GPathResult) seems to support what I'm trying to do.
What am I missing?

You need to iterate over every span, so you can use the spread-dot operator:
xml.body.ul.li.findAll {it.@class == 'odd'}*.span.each {println it.text()}

Dom Traversal with lxml into a graph database and pass on ID to establish a complete tree

I'm inserting hierarchical data from a DOM tree into a graph database, but I'm not able to establish the complete relationships between the nodes. While looping, I end up truncating the trees.
The code below traverses the DOM nodes, inserts the tags, and obtains the last inserted ID. The problem I'm having is how to properly connect the trees by passing along the ID obtained from the previous iteration.
I had the same issue when I used recursion.
How do I loop and pass the IDs so that all the nodes in the tree are connected?
Consider the following HTML:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Document</title>
</head>
<body>
<ul class="menu">
<div class="itm">home</div>
<div class="itm">About us</div>
<div class="itm">Contact us</div>
</ul>
<div id="idone" class="classone">
<li class="item1">First</li>
<li class="item2">Second</li>
<li class="item3">Third</li>
<div id="innerone"><h1>This Title</h1></div>
<div id="innertwo"><h2>Subheads</h2></div>
</div>
<div id="second" class="below">
<div class="inner">
<h1>welcome</h1>
<h1>another</h1>
<h2>third</h2>
</div>
</div>
</body>
</html>
With the current Python code, I end up with a truncated tree. I omitted the graph database driver in order to focus on the Cypher, since most graph databases accept almost the same Cypher queries.
import json
from lxml import etree
from itertools import tee
from lxml import html

# dom_tree, Cypher and ag come from the omitted parsing and driver setup
for n in dom_tree.iter():
    cursor = Cypher("CREATE (t:node {tag: %s} ) RETURN id(t)", params=(n.tag,))
    parent_id = cursor.fetchone()[0]  # get last inserted ID
    ag.commit()
    print(f"Parent:{n.tag}")
    for x in n.iterchildren():
        cursor = Cypher("CREATE (x:node {tag: %s} ) RETURN id(x)", params=(x.tag,))
        xid = cursor.fetchone()[0]  # get last inserted ID
        ag.commit()
        print(f"--------{x.tag}")
        cx = Cypher("MATCH (p:node),(k:node) WHERE id(p) = %s AND id(k) = %s CREATE (p)-[:connect {name: p.name+ '->'+k.name}]->(k)", params=(parent_id, xid,))
lxml provides a method, iter() (getiterator() is its older, deprecated name), to iterate over every element in the tree.
As you visit each node, you can also get its children by iterating over the element (element.getchildren() is the deprecated equivalent) or its parent with element.getparent().
from lxml import html

tree = html.parse("about.html")
for element in tree.iter():
    parent = element.getparent()
    if parent is not None:  # the root element has no parent
        print(f"The element {element.tag} with text {element.text} and attributes {element.attrib} is the child of the element {parent.tag}")
In your case you create tag nodes for the parent time and again; you need to pass the identifier of the parent node down to each level.
Either take a recursive approach, where you pass the ID of the parent into the function you call recursively,
or assign each element in your DOM tree an ID, e.g. based on level and position (sibling), to identify it uniquely.
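A minimal sketch of the looping variant, assuming it is enough to remember the ID created for each element in a dictionary instead of deriving it from level and position. The standard library's xml.etree.ElementTree stands in for lxml here (both support iter() and iterating over an element's children), and the nodes and edges lists are hypothetical stand-ins for the CREATE statements and the IDs the database would return:

```python
import xml.etree.ElementTree as ET

SAMPLE = "<html><body><ul><div/><div/></ul><div><h1/></div></body></html>"


def build_graph(root):
    """Create every element exactly once and link it to its parent."""
    nodes = []  # position in this list plays the role of the database ID
    edges = []  # (parent ID, child ID) pairs
    ids = {}    # element -> ID of the node already created for it

    nodes.append(root.tag)
    ids[root] = 0
    # iter() walks in document order, so each parent already has an ID
    # by the time it is visited.
    for parent in root.iter():
        for child in parent:
            nodes.append(child.tag)
            ids[child] = len(nodes) - 1
            edges.append((ids[parent], ids[child]))
    return nodes, edges


nodes, edges = build_graph(ET.fromstring(SAMPLE))
```

Because each element is created only when its parent is visited, no tag is inserted twice, which is what truncated the tree in the original loop.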
In Neo4j you can also use
MATCH (n) WHERE id(n) = $id
MERGE (n)-[:CHILD]->(m:Node {tag: $tag})
to create the child within the context of the parent, but that doesn't work with duplicate tags.
I'd go with the recursive approach.
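The recursive approach could look roughly like this. It is only a sketch: create_node and create_edge are hypothetical helpers that stand in for the two Cypher statements from the question (here they just record into an in-memory dictionary and list), and the standard library's xml.etree.ElementTree is used in place of lxml so the snippet runs without extra packages:

```python
import itertools
import xml.etree.ElementTree as ET

nodes = {}  # node ID -> tag, stand-in for CREATE (t:node {tag: ...})
edges = []  # (parent ID, child ID), stand-in for CREATE (p)-[:connect]->(k)
_next_id = itertools.count()


def create_node(tag):
    """Hypothetical helper: insert a node and return its new ID."""
    node_id = next(_next_id)
    nodes[node_id] = tag
    return node_id


def create_edge(parent_id, child_id):
    """Hypothetical helper: connect an existing parent to a child."""
    edges.append((parent_id, child_id))


def insert_subtree(element, parent_id=None):
    """Insert element once, link it to its parent, then recurse."""
    node_id = create_node(element.tag)
    if parent_id is not None:
        create_edge(parent_id, node_id)
    for child in element:
        # The ID just created becomes the parent ID one level down.
        insert_subtree(child, node_id)
    return node_id


doc = ET.fromstring(
    "<html><body><ul><div/><div/></ul><div><h1/></div></body></html>"
)
root_id = insert_subtree(doc)
```

Each call creates exactly one node, so a tree with N elements ends up with N nodes and N - 1 relationships.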

jquery / cheerio: how to select multiple elements?

I need to parse some markup similar to this one, from an html page:
<a href="#">
<i class="icon-location"></i>London
</a>
I need to get London.
I did try something like (using cheerio):
$('a', 'i[class="icon-location"]').text();
or
$('a > i[class="icon-location"]').text();
without success...
I'd like to avoid methods like next(), since the expression should be passed to a method which just extracts the text from the selector.
What expression should I use (if it's feasible) ?
There's a solution which is pretty unusual, but it works:
$("#foo")
    .clone()    // clone the element
    .children() // select all the children
    .remove()   // remove all the children
    .end()      // go back to the selected element
    .text();
Demo : https://jsfiddle.net/2r19xvep/
Or, you could surround your value by a new tag so you just select it:
<i class="icon-location"></i><span class="whatever">London</span>
Then
$('.whatever').text();
$('a').text();
will get text as 'London'.
$("a .icon-location").map(function(){
    return $(this).text();
}).get();

Proper equality of GPathResults

I need to traverse an XML document and distinguish elements based on their parents. I use Groovy and XmlSlurper.
I know that GPathResult implements equals() as equality of text() nodes only. Sadly, that's not usable in my case.
Comparing identity via is() seems pointless, since you get a new result object every time. I'm new to Groovy, so I don't feel like overriding the equals() method.
In this case I'd like to distinguish between those elements by their parent(). Let's say I have the GPathResult of element 'b' stored in a variable. How can I get the particular element 'a' that has that stored element 'b' as its NEAREST parent?
def xml = ''' <root>
<a type="1"/>
<a type="2"/>
<b>
<a type="1"/>
</b>
</root>
'''.trim()
def slurper = new XmlSlurper(false, false).parseText(xml)
def myParticularB = slurper.b
def wantedA = slurper.depthFirst().find { seg ->
    seg.name() == 'a' && seg.@type == '1' && seg.parent() == myParticularB
}
assert (wantedA.parent().name() == 'b') == true
I'm sorry if I overlooked something obvious.
//A corner case
<root>
<a type="1"/>
<a type="2"/>
<b>
<a type="1"/>
<b>
<a type="1"/>
<b>
<a type="1"/>
</b>
</b>
</b>
</root>

Jade mixin trouble

I'm using Jade's mixins and ran into some trouble.
Code:
mixin renderLink(linkName,linkUrl,linkClass,other)
  - var active = req.url==linkUrl?'active':''
  li(class=[active,linkClass])
    a(href=linkUrl) #{linkName}
      #{other}
....
.nav-collapse
  ul.nav
    +renderLink('HOME','/')
    +renderLink('CHAT','/chat',null,'span.badge.badge-warning 2')
What I want is:
li
  a(href="#")
    CHAT
    span.badge.badge-warning 2
How should I modify #{other} to get what I want?
Thanks.
Update: thanks, I used this:
mixin renderLink(linkName,linkUrl,linkClass)
  - var active = req.url==linkUrl?'active':''
  li(class=[active,linkClass])
    a(href=linkUrl) #{linkName}
      block
and got what I wanted:
<li class=" ">
消息<span class="badge badge-warning">2</span>
</li>
Well first of all, I'm assuming you want CHAT on the same line as a since you don't want a <chat></chat> element.
It's not documented (in the official docs), but what you want is to use a block. Try this:
mixin renderLink(linkName,linkUrl,linkClass,other)
  - var active = req.url==linkUrl?'active':''
  li(class=[active,linkClass])
    a(href=linkUrl) #{linkName}
      if block
        block
....
.nav-collapse
  ul.nav
    +renderLink('HOME','/')
    +renderLink('CHAT','/chat')
      span.badge.badge-warning 2
I'm not sure if the if block statement is necessary.

How I can access elements via a non-standard html property?

I'm trying to implement test automation with watir-webdriver. By the way, I'm new to watir-webdriver, Ruby and co.
All our HTML elements have a unique HTML attribute named "wicketpath". It is possible to access an element by "name", "id" and so on, but not by the attribute "wicketpath". So I tried it with XPath, but had no success.
Can anybody help me with a code snippet showing how I can access an element via the "wicketpath" attribute?
Thanks in advance.
R.
You should be able to use xpath.
For example, consider the following HTML
<ul class="ui-autocomplete" role="listbox">
<li class="ui-menu-item" role="menuitem" wicketpath="false">Value 1</li>
<li class="ui-menu-item" role="menuitem" wicketpath="false">Value 2</li>
<li class="ui-menu-item" role="menuitem" wicketpath="true">Value 3</li>
</ul>
The following XPath will give the text of the li that has wicketpath = true:
puts browser.li(:xpath, "//li[@wicketpath='true']").text
#=>Value 3
Update - Alternative solution - Adding To Locators:
If you use a lot of wicketpath, you could add it to the locators.
After you require watir-webdriver, add this:
# This allows using :wicketpath in locators
Watir::HTMLElement.attributes << :wicketpath
# This allows accessing the wicketpath attribute
class Watir::Element
  attribute(String, :wicketpath, 'wicketpath')
end
This will let you use 'wicketpath' as a locator:
p browser.li(:wicketpath, 'true').text
#=> "Value 3"
p browser.li(:text, 'Value 3').wicketpath
#=> true
Try this:
puts browser.li(:css, ".ui-autocomplete > .ui-menu-item[wicketpath='true']").text
Please let me know whether the above script works.
