Using XPath, select node without text sibling - python-3.x

I want to extract some HTML elements with python3 and the HTML parser provided by lxml.
Consider this HTML:
<!DOCTYPE html>
<html>
<body>
<span class="foo">
<span class="bar">bar</span>
foo
</span>
</body>
</html>
Consider this program:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))
In a browser, the query selector "span.bar" selects only the span element. This is what I desire. However, the above program produces:
[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo
It looks like my XPath does not actually behave like a query selector and the sibling text node is picked up next to the span element. How can I adjust the XPath to select only the bar element, but not the text "foo"?

Notice that the XML tree model in lxml (as well as in the standard xml.etree module) has the concept of a tail: a text node located after an element (i.e. its following-sibling text) is stored as the tail of that element. So your XPath correctly returns the span element, but according to the tree model, that element carries a tail holding the text 'foo'.
As a workaround, assuming that you don't need the tree model further, simply clear the tail before printing:
>>> bars[0].tail = ''
>>> print(html.tostring(bars[0], encoding="unicode"))
<span class="bar">bar</span>

Related

Dom Traversal with lxml into a graph database and pass on ID to establish a complete tree

I'm inserting hierarchical data made of a DOM Tree into a graph database but, I'm not able to establish the complete relationship between the nodes. I while looping I ended up truncating the trees
Below is the code that illustrates a traversing of DOM nodes, inserting the tags and obtaining the last inserted id. The problem I'm having is how to properly connect the trees by passing the ID obtained from the previous iteration.
I had the same issue when I used a recursion.
How do I loop and pass the IDs so they can be evenly connected to all the node trees?
Considering the following HTML
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Document</title>
</head>
<body>
<ul class="menu">
<div class="itm">home</div>
<div class="itm">About us</div>
<div class="itm">Contact us</div>
</ul>
<div id="idone" class="classone">
<li class="item1">First</li>
<li class="item2">Second</li>
<li class="item3">Third</li>
<div id="innerone"><h1>This Title</h1></div>
<div id="innertwo"><h2>Subheads</h2></div>
</div>
<div id="second" class="below">
<div class="inner">
<h1>welcome</h1>
<h1>another</h1>
<h2>third</h2>
</div>
</div>
</body>
</html>
With the current Python code, I end up with a truncated tree as illustrated. I omitted the graph database driver in order to focus on the Cypher, since most graph databases follow almost the same Cypher query syntax.
import json
from lxml import etree
from itertools import tee
from lxml import html
for n in dom_tree.iter():
    cursor = Cypher("CREATE (t:node {tag: %s} ) RETURN id(t)", params=(n.tag,))
    parent_id = cursor.fetchone()[0]  # get last inserted ID
    ag.commit()
    print(f"Parent:{n.tag}")
    for x in n.iterchildren():
        cursor = Cypher("CREATE (x:node {tag: %s} ) RETURN id(x)", params=(x.tag,))
        xid = cursor.fetchone()[0]  # get last inserted ID
        ag.commit()
        print(f"--------{x.tag}")
        cx = Cypher("MATCH (p:node),(k:node) WHERE id(p) = %s AND id(k) = %s CREATE (p)-[:connect {name: p.name+ '->'+k.name}]->(k)", params=(eid, xid,))
ElementTree provides a method, getiterator(), to iterate over every element in the tree.
As you visit each node, you can also get a list of its children with element.getchildren() or its parent with element.getparent().
from lxml import html
tree = html.parse("about.html")
for element in tree.getiterator():
    if parent := element.getparent():
        print(f"The element {element.tag} with text {element.text} and attributes {element.attrib} is the child of the element {parent.tag}")
In your case you create a node for the parent again and again; instead, you need to pass the identifier of the parent node down to each level.
Either you have to take a recursive approach where you pass in the id of the parent to the function you call recursively.
Or you need to assign the elements in your dom tree an id e.g. based on level and position (sibling) to identify them uniquely.
In Neo4j you can also use
MATCH (n) WHERE id(n) = $id
MERGE (n)-[:CHILD]->(m:Node {tag: $tag})
to create the child within the context of the parent, but that doesn't work with duplicate tags.
I'd go with the recursive approach.
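One way to see the recursive approach in action: the sketch below uses the standard xml.etree module (lxml elements iterate the same way) and replaces the database with an in-memory dict plus an edge list. create_node and connect are hypothetical stand-ins for the Cypher calls in the question; the key point is that each recursive call receives the id that was just created for its parent.

```python
import xml.etree.ElementTree as ET

# Stand-in for the database: each created node gets an id, and edges
# record (parent_id, child_id) pairs. Replace create_node / connect
# with your real Cypher statements.
nodes, edges = {}, []

def create_node(tag):
    # In real code: CREATE (t:node {tag: ...}) RETURN id(t)
    node_id = len(nodes)
    nodes[node_id] = tag
    return node_id

def connect(parent_id, child_id):
    # In real code: MATCH both ids and CREATE (p)-[:connect]->(k)
    edges.append((parent_id, child_id))

def insert_subtree(element, parent_id=None):
    # Create the node first, then recurse, so every child sees the id
    # of the node just created for its parent.
    node_id = create_node(element.tag)
    if parent_id is not None:
        connect(parent_id, node_id)
    for child in element:
        insert_subtree(child, node_id)

root = ET.fromstring(
    "<body><ul><div>home</div><div>About</div></ul>"
    "<div><h1>Title</h1></div></body>"
)
insert_subtree(root)
print(edges)  # [(0, 1), (1, 2), (1, 3), (0, 4), (4, 5)]
```

This removes the need for the two-level loop entirely: the recursion handles arbitrarily deep trees, and no ids are lost between iterations.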

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list, but not containing the style tag = the following value: "text-align: center"
test = soup.find_all("p")
for x in test:
    if not x.has_attr('style'):
        print(x)
Essentially, return me all items in list where style is not equal to: "text-align: center". Probably just a small error here, but is it possible to define the value of style in has_attr?
Just check if the specific style is present in the Tag's style attribute. style is not considered a multi-valued attribute, so the entire string inside the quotes is the value of the style attribute. Using x.get("style", '') instead of x['style'] also handles the case in which there is no style attribute at all, avoiding a KeyError.
for x in test:
    if 'text-align: center' not in x.get("style", ''):
        print(x)
You can also use list comprehension to skip a few lines.
test = [x for x in soup.find_all("p") if 'text-align: center' not in x.get("style", '')]
print(test)
If you wanted to consider a different approach you could use the :not selector
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

How to get substring from string using xpath 1.0 in lxml

This is the example HTML.
<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>
I am in a situation where I need to extract
Chamber of Secrets
Prisoners in Azkabahn
I am using lxml 4.2.1 in Python, which uses XPath 1.0.
I have tried to extract using XPath
'substring-after(//a/@href, "HarryPotter:")'
which returns only "Chamber of Secrets".
and with XPath
'//a/@href[substring-after(., "HarryPotter:")]'
which returns
'HarryPotter:Chamber of Secrets'
'HarryPotter:Prisoners in Azkabahn'
I have researched this and learned a few things, but didn't find a fix for my problem. I have tried various XPath expressions using substring-after().
In my research I learned that this could also be accomplished with regex; I tried that and failed.
I found that string manipulation is easy in XPath 2.0 and above using regex, and that regex can also be used in XPath 1.0 via XSLT extensions.
Can this be done with the substring-after() function? If yes, what is the XPath; if not, what is the best approach to get the desired output?
And how can we get the desired output using regex in XPath while sticking to lxml?
Try this approach to get both text values:
from lxml import html
raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>"""
source = html.fromstring(raw_source)
for link in source.xpath('//a'):
    print(link.xpath('substring-after(@href, "HarryPotter:")'))
If you want to use substring-after() and substring-before() together, here is an example:
from lxml import html
f_html = """<html><body><table><tbody><tr><td class="df9" width="20%">
<a class="nodec1" href="javascript:reqDl(1254);" onmouseout="status='';" onmouseover="return dspSt();">
<u>
2014-2
</u>
</a>
</td></tr></tbody></table></body></html>"""
tree_html = html.fromstring(f_html)
deal_id = tree_html.xpath("//td/a/@href")
print(tree_html.xpath('substring-after(//td/a/@href, "javascript:reqDl(")'))
print(tree_html.xpath('substring-before(//td/a/@href, ")")'))
print(tree_html.xpath('substring-after(substring-before(//td/a/@href, ")"), "javascript:reqDl(")'))
Result:
1254);
javascript:reqDl(1254
1254
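Since XPath 1.0 itself has no regex support, another pragmatic option is to drop down to Python's own re module and pull the titles straight from the markup. The pattern below assumes every href carries the literal "HarryPotter:" prefix, as in the question's HTML:

```python
import re

raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">text</a>
<a href="HarryPotter:Prisoners in Azkabahn">text</a>
</html>"""

# Capture everything after the fixed prefix up to the closing quote.
titles = re.findall(r'href="HarryPotter:([^"]+)"', raw_source)
print(titles)  # ['Chamber of Secrets', 'Prisoners in Azkabahn']
```

In practice it is usually more robust to extract the href attributes with XPath first and then apply the regex (or a plain str.split(":", 1)) to each value, so the pattern never has to parse HTML.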

How to get text from an anchor tag using Selenium in Python (print the text "helloworld")

<div class="someclass">
<p class="name"><a>helloworld</a></p>
</div>
I want to print the "helloworld" text from the anchor tag, using Python Selenium code.
You can do it using CSS:
.find_element_by_css_selector("p.name a")
or you can do it using xpath:
.find_element_by_xpath("//p[@class='name']/a")
Example:
element = self.browser.find_element_by_css_selector("p.name a")
print(element.get_attribute("text"))
I hope this helped, if not tell me :)
One step solution:
browser.find_element_by_xpath('//p[@class="name"]/a').get_attribute('text')
This gives you the text of the anchor tag.
To get the text from any HTML tag using Selenium in Python, you can simply use .get_attribute('text').
In this case:
a_tag = self.driver.find_element_by_css_selector("p.name a")
a_tag.get_attribute('text')

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the source code and print the location of each one, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string; I want to see the tag around that string.
Consider this example context* :
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want only element nodes (instead of the child text nodes), and text="1" to indicate that the element must have text content equal to "1" (or any other text you want to search for):
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: for future questions, try to give a context just like the example above. That makes your question more concrete and easier to answer, as people don't have to create the context on their own, which may turn out not to be relevant to the situation you actually have.
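If you also want the enclosing tag explicitly, you can match the text nodes themselves and step up to .parent (string= is the newer name for the text= argument in recent BeautifulSoup versions):

```python
from bs4 import BeautifulSoup

html = """<root>
<div id="yeahboy">1</div>
<span class="nested">1</span>
</root>"""

soup = BeautifulSoup(html, "html.parser")

# find_all(string="1") matches the NavigableString text nodes whose
# content is exactly "1"; .parent then gives the tag around each one.
for text_node in soup.find_all(string="1"):
    print(text_node.parent)
```

This makes the "tag around that string" relationship explicit instead of relying on the element's own text content.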