How to get substring from string using xpath 1.0 in lxml - python-3.x

This is the example HTML.
<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>
I am in a situation where I need to extract
Chamber of Secrets
Prisoners in Azkabahn
I am using lxml 4.2.1 in Python, which supports only XPath 1.0.
I have tried to extract using XPath
'substring-after(//a/@href, "HarryPotter:")'
which returns only "Chamber of Secrets".
and with XPath
'//a/@href[substring-after(.,"HarryPotter:")]'
which returns
'HarryPotter:Chamber of Secrets'
'HarryPotter:Prisoners in Azkabahn'
I have researched this and learned a few things, but I haven't found a fix for my problem. I have tried several different XPath expressions using substring-after.
In my research I learned that this could also be accomplished with a regex, but my attempts failed.
I found that it is easy to manipulate a string in XPath 2.0 and above using regex but we can also use regex in XPath 1.0 using XSLT extensions.
Can this be done with the substring-after function? If yes, what is the XPath? If not, what is the best approach to get the desired output?
And how can we get the desired output using regex in XPath while sticking to lxml?

Try this approach to get both text values:
from lxml import html
raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>"""
source = html.fromstring(raw_source)
for link in source.xpath('//a'):
    print(link.xpath('substring-after(@href, "HarryPotter:")'))
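Since the question also asks about regex: lxml exposes the EXSLT regular-expression functions inside XPath 1.0 via the http://exslt.org/regular-expressions namespace. A minimal sketch, assuming an lxml build with EXSLT support (the common case); the XPath variable $h is bound per href through a keyword argument:

```python
from lxml import html

raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">text</a>
<a href="HarryPotter:Prisoners in Azkabahn">text</a>
</html>"""

source = html.fromstring(raw_source)
ns = {"re": "http://exslt.org/regular-expressions"}

# re:replace(input, pattern, flags, replacement) strips the prefix
titles = [source.xpath('re:replace($h, "^HarryPotter:", "", "")',
                       h=h, namespaces=ns)
          for h in source.xpath("//a/@href")]
print(titles)  # ['Chamber of Secrets', 'Prisoners in Azkabahn']
```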

If you want to use substring-after() and substring-before() together, here is an example:
from lxml import html
f_html = """<html><body><table><tbody><tr><td class="df9" width="20%">
<a class="nodec1" href="javascript:reqDl(1254);" onmouseout="status='';" onmouseover="return dspSt();">
<u>
2014-2
</u>
</a>
</td></tr></tbody></table></body></html>"""
tree_html = html.fromstring(f_html)
deal_id = tree_html.xpath("//td/a/@href")
print(tree_html.xpath('substring-after(//td/a/@href, "javascript:reqDl(")'))
print(tree_html.xpath('substring-before(//td/a/@href, ")")'))
print(tree_html.xpath('substring-after(substring-before(//td/a/@href, ")"), "javascript:reqDl(")'))
Result:
1254);
javascript:reqDl(1254
1254
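One caveat worth noting: when an XPath 1.0 string function such as substring-after() receives a node-set, only the first node is converted to a string. If the page had several such links, you would evaluate the expression once per element; a short sketch with invented example hrefs:

```python
from lxml import html

f_html = """<html><body><table><tr>
<td><a href="javascript:reqDl(1254);">2014-2</a></td>
<td><a href="javascript:reqDl(9876);">2015-1</a></td>
</tr></table></body></html>"""

tree_html = html.fromstring(f_html)
# evaluate the nested string functions relative to each <a> element
deal_ids = [a.xpath('substring-after(substring-before(@href, ")"), "javascript:reqDl(")')
            for a in tree_html.xpath("//td/a")]
print(deal_ids)  # ['1254', '9876']
```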

Related

How to extract a field from xpath which is not present in an element tag?

<div class="info">
<span class="label">Establishment year</span>
"2008"
</div>
I want to extract 2008 using XPath, but the expression below just selects the "Establishment year" text.
driver.find_element_by_xpath("//*[text()='Establishment year']")
As the text 2008 is within a text node, to extract it you can use the following solution:
print(driver.execute_script('return arguments[0].lastChild.textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='info']/span[@class='label' and text()='Establishment year']/..")))).strip())
Unfortunately, WebDriver does not allow a find_element result to be a text node, so you will have to use the execute_script function, like:
driver.execute_script(
    "return document.evaluate(\"//div[@class='info']/node()[3]\", document, null, XPathResult.STRING_TYPE, null).stringValue;")
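Outside of a live browser session, the same idea can be shown with lxml (a sketch of the concept, not part of the original answers): the stray text node after the span is stored on the span's .tail attribute:

```python
from lxml import html

doc = html.fromstring("""<div class="info">
<span class="label">Establishment year</span>
"2008"
</div>""")

span = doc.xpath("//div[@class='info']/span[@class='label']")[0]
# the text node following the span lives in its .tail attribute
print(span.tail.strip())  # "2008" (with the quotes from the markup)
```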
More information:
XPath Tutorial
XPath Axes
XPath Operators & Functions

How can I click the third href link?

<ul id='pairSublinksLevel1' class='arial_14 bold newBigTabs'>...</ul>
<ul id='pairSublinksLevel2' class='arial_12 newBigTabs'>
<li>...</li>
<li>...</li>
<li>
<a href='/equities/...'> last data </a> #<-- HERE
</li>
<li>...</li>
My question is: how can I click the third li tag?
In my code
xpath = "//ul[@id='pairSublinksLevel2']"
element = driver.find_element_by_xpath(xpath)
actions = element.find_element_by_css_selector('a').click()
the code works partially, but it keeps clicking the second li tag instead of the third.
Try
driver.find_element_by_xpath("//ul[@id='pairSublinksLevel2']/li[3]/a").click()
EDIT:
Thanks @DebanjanB for the suggestion:
When you get the element with the XPath //ul[@id='pairSublinksLevel2'] and search for an a tag in its child elements, it will return the first match (in your case, it could be inside the second li tag). So you can use indexing as given above to get the specific numbered match. Please note that such indexing starts from 1, not 0.
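The 1-based indexing can be verified offline with lxml, independent of Selenium; a quick sketch with placeholder list items:

```python
from lxml import html

ul = html.fromstring("""<ul id='pairSublinksLevel2'>
<li><a href='/a'>first</a></li>
<li><a href='/b'>second</a></li>
<li><a href='/equities/x'>last data</a></li>
</ul>""")

# li[3] is the third item: XPath positions start at 1, not 0
print(ul.xpath("//ul[@id='pairSublinksLevel2']/li[3]/a/text()"))  # ['last data']
```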
As per the HTML you have shared you can use either of the following solutions:
Using link_text:
driver.find_element_by_link_text("last data").click()
Using partial_link_text:
driver.find_element_by_partial_link_text("last data").click()
Using css_selector:
driver.find_element_by_css_selector("ul.newBigTabs#pairSublinksLevel2 a[href*='equities']").click()
Using xpath:
driver.find_element_by_xpath("//ul[@class='arial_12 newBigTabs' and @id='pairSublinksLevel2']//a[contains(@href,'equities') and contains(.,'last data')]").click()
Reference: Official locator strategies for the webdriver

Finding out if data-sold-out="false" in html using beautifulsoup

data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null" alt="Tvywspp25q0" /></a>
<a class="" data-images="
This is the HTML code, and I'm trying to find out whether data-sold-out is "false" or "true" so I can then do something with it. How can I find out what data-sold-out is equal to and return it? I am using Python and Beautiful Soup.
Any help appreciated.
You are trying to find any tags with data-sold-out="false" or data-sold-out="true", right?
I think you can do this:
from bs4 import BeautifulSoup as bs

all_html = bs('''<a data-style-name="Gold" data-style-id="20316" data-sold-out="false" data-description="null" alt="Tvywspp25q0"></a>
<a class="" data-images=""></a>''', 'html.parser')
a_tag = all_html.findAll(attrs={"data-sold-out": "false"})
then you can extract any attribute inside them like this:
for item in a_tag:
    print(item['data-style-name'])
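Putting it together, a short sketch (assuming Beautiful Soup 4, with invented example tags) that reads the attribute value and branches on it:

```python
from bs4 import BeautifulSoup

html_doc = """<a data-style-name="Gold" data-sold-out="false"></a>
<a data-style-name="Silver" data-sold-out="true"></a>"""

soup = BeautifulSoup(html_doc, "html.parser")
# attrs={"data-sold-out": True} matches any tag that has the attribute at all
for tag in soup.find_all(attrs={"data-sold-out": True}):
    sold_out = tag["data-sold-out"] == "true"
    print(tag["data-style-name"], "sold out" if sold_out else "available")
```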

Using XPath, select node without text sibling

I want to extract some HTML elements with python3 and the HTML parser provided by lxml.
Consider this HTML:
<!DOCTYPE html>
<html>
<body>
<span class="foo">
<span class="bar">bar</span>
foo
</span>
</body>
</html>
Consider this program:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))
In a browser, the query selector "span.bar" selects only the span element. This is what I desire. However, the above program produces:
[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo
It looks like my XPath does not actually behave like a query selector and the sibling text node is picked up next to the span element. How can I adjust the XPath to select only the bar element, but not the text "foo"?
Notice that the XML tree model in lxml (as well as in the standard module xml.etree) has the concept of a tail: a text node located after an element (i.e. its following-sibling text) is stored as the tail of that element. So your XPath correctly returns the span element, but according to the tree model, it has a tail which holds the text 'foo'.
As a workaround, assuming that you don't want to use the tree model further, simply clear the tail before printing:
>>> bars[0].tail = ''
>>> print(html.tostring(bars[0], encoding="unicode"))
<span class="bar">bar</span>

How to get text from an anchor tag using Selenium in Python (I want to print the text helloworld)

<div class="someclass">
<p class="name">helloworld</p>
</div>
I want to print the helloworld text from the anchor tag, using Python Selenium code.
You can do it using CSS:
.find_element_by_css_selector("p.name a")
or you can do it using XPath:
.find_element_by_xpath("//p[@class='name']/a")
Example:
element = self.browser.find_element_by_css_selector("p.name a")
print(element.get_attribute("text"))
I hope this helped, if not tell me :)
One step solution:
browser.find_element_by_xpath('//p[#class="name"]/a').get_attribute('text')
This gives you the text of the anchor tag.
To get the text from any HTML tag using Selenium in Python, you can simply use .get_attribute('text').
In this case:
a_tag = self.driver.find_element_by_css_selector("p.name a")
a_tag.get_attribute('text')
