Incorrect number of results found by XPath - python-3.x

Actually, the situation is a little more complex.
I'm trying to get data from this example html:
<li itemprop="itemListElement">
<h4>
one
</h4>
</li>
<li itemprop="itemListElement">
<h4>
two
</h4>
</li>
<li itemprop="itemListElement">
<h4>
three
</h4>
</li>
<li itemprop="itemListElement">
<h4>
four
</h4>
</li>
For now, I'm using Python 3 with urllib and lxml.
For some reason, the following code doesn't work as expected (Please read the comments)
import urllib.request
from lxml import html

scan = []
example_url = "path/to/html"
page = html.fromstring(urllib.request.urlopen(example_url).read())
# Extracting the li elements from the html
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)
# At this point, the list 'scan' length is 4 (Nothing wrong)
for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))
So as you can see, the first step is to extract the 4 li elements and append them to a list, then scan each li element for an a element. The problem is that each li element in scan appears to be all four elements at once.
...Or so I thought.
After some quick debugging, I found that the scan list contains the four li elements correctly, so I came to one possible conclusion: there's something wrong with the for loop shown above.
for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))
    # Something is wrong here...
The only real problem is that I can't pinpoint the bug. What causes that?
PS: I know, there's an easier way to get the a elements from the list, but this is just an example html, the real one contains many more... things.

In your example, when the XPath starts with //, it will start searching from the root of the document (which is why it was matching all four of the anchor elements). If you want to search relative to the li element, then you would omit the leading slashes:
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)
for list_item in scan:
    print(len(list_item.xpath("h4/a")))
Of course you can also replace // with .// so that the search is relative as well:
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)
for list_item in scan:
    print(len(list_item.xpath(".//h4/a")))
Here is a relevant quote taken from the specification:
2.5 Abbreviated Syntax
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.
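The difference can be checked with a small self-contained sketch. The markup below is an assumed stand-in for the question's page (the anchors and their href values are made up for illustration), and it is parsed from a string instead of being fetched with urllib:

```python
from lxml import html

# Assumed sample markup: four list items, each with one link inside an h4.
SAMPLE = """
<html><body><ul>
<li itemprop="itemListElement"><h4><a href="/one">one</a></h4></li>
<li itemprop="itemListElement"><h4><a href="/two">two</a></h4></li>
<li itemprop="itemListElement"><h4><a href="/three">three</a></h4></li>
<li itemprop="itemListElement"><h4><a href="/four">four</a></h4></li>
</ul></body></html>
"""

page = html.fromstring(SAMPLE)
items = page.xpath("//li[@itemprop='itemListElement']")

# A leading // restarts from the document root, even when called on an li.
absolute_counts = [len(li.xpath("//h4/a")) for li in items]
# .// and a plain relative path both start at the li itself.
relative_counts = [len(li.xpath(".//h4/a")) for li in items]
child_counts = [len(li.xpath("h4/a")) for li in items]

print(absolute_counts)  # [4, 4, 4, 4]
print(relative_counts)  # [1, 1, 1, 1]
print(child_counts)     # [1, 1, 1, 1]
```

Note that h4/a only matches an h4 that is a direct child of the li, while .//h4/a matches an h4 at any depth below it.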

print(len(list_item.xpath(".//h4/a")))
// is short for /descendant-or-self::node()/.
Because it starts with /, the search begins at the root node of the document.
Use a leading . so that the context node is list_item, not the whole document.

Related

lxml.html XPATH expression for element when the test has to be applied to the text_content not the text

I have the following html
<html>
<body>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_1"></a>
<a name="bananabread"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananabread"></a>Ban</font> <font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">ana Bread</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">The Best You Ever Tasted</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">If you don't agree that this is the best banana bread you have ever eaten well I would suggest you see your doctor</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_2"></a>
<a name="bananapudding"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananapudding"></a>Banana</font>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Pudding</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Creamy and Satisfying</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">This is the same recipe your mother used when you were ten!</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
</body>
</html>
I am trying to write an xpath expression to identify Banana Bread - my initial efforts were successful -
b_tree.xpath('.//*[starts-with(text(),"Banana Bread")]')
but I noticed some error cases, and upon investigation they look like the html above: another element is added inside the content I am searching for. Sometimes it is like above, a possibly unneeded font element, sometimes it is an anchor.
I worked with this answer (Related) but have not been successful
I can check for elements that have text_content(), clean up the text_content and then string match to my ultimate goal, but I am hoping to learn to better apply xpath to these types of problems.
To be absolutely clear, I need the text_content of the p element. But sometimes I just need the text of a font element. My existing XPath expression works fine in the cases where there is no intervening element. When I open a page, I do not know in advance what structure was imposed on the document.
When the text() expression is applied to an element whose text content is interrupted by other elements, it returns a nodeset consisting of multiple text nodes, of which starts-with considers only the first. If you replace text() by ., you get the text value of the element, which is the concatenation of all text nodes, and that's what you want.
But there is still a problem with the spaces in an element like (attributes omitted, spaces are dots):
<p>
..<a></a>
..<a></a>
..<font>
....<a></a>Banana</font>
..<font>Pudding</font>
</p>
The text value of this element is _.._.._.._....Banana_..Pudding_ (underscores represent line feeds), therefore you must apply normalize-space, which normalizes this to Banana.Pudding, so that
.//*[starts-with(normalize-space(.),"Banana Pudding")]
finds this occurrence.
However, Banana Bread cannot be found, because it does not exist on the page. The element
<font>
..<a></a>Ban</font>.....<font>ana.Bread</font>
has a normalized text value of Ban.ana.Bread and you don't expect the space inside the word Banana. normalize-space removes spaces and line feeds that are invisible on the rendered page, but the two spaces in Ban.ana.Bread are both visible.
If there was no space between the two <font> elements,
.//*[starts-with(normalize-space(.),"Banana Bread")]
would detect 3 elements: the <html>, the <body> and the <p>, because "Banana Bread" are the first words in each of them. So it is better to use
.//p[starts-with(normalize-space(.),"Banana Bread")]
instead.
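As a rough check of the difference between text() and normalize-space(.), here is a minimal lxml sketch. The markup is a trimmed copy of the Banana Pudding paragraph with the style attributes removed, and the variable names are only illustrative:

```python
from lxml import html

# Trimmed version of the second recipe heading from the question.
SAMPLE = """
<html><body>
<p>
  <a name="bananapudding"></a>
  <font>
    <a name="bananapudding"></a>Banana</font>
  <font>Pudding</font>
</p>
<p>Creamy and Satisfying</p>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# text() yields the p element's own text nodes; the first one is only whitespace.
by_text = tree.xpath('.//p[starts-with(text(), "Banana Pudding")]')
# . is the concatenated text of all descendants; normalize-space collapses it.
by_value = tree.xpath('.//p[starts-with(normalize-space(.), "Banana Pudding")]')

print(len(by_text))   # 0
print(len(by_value))  # 1
print(" ".join(by_value[0].text_content().split()))  # Banana Pudding
```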

How can i click the third href link?

<ul id='pairSublinksLevel1' class='arial_14 bold newBigTabs'>...</ul>
<ul id='pairSublinksLevel2' class='arial_12 newBigTabs'>
<li>...</li>
<li>...</li>
<li>
<a href='/equities/...'> last data </a> #<-- HERE
</li>
<li>...</li>
The question is: how can I click the link inside the third li tag?
In my code
xpath = "//ul[@id='pairSublinksLevel2']"
element = driver.find_element_by_xpath(xpath)
actions = element.find_element_by_css_selector('a').click()
the code works partially, but I want to click the link in the third li tag.
The code keeps clicking the one in the second li tag.
Try
driver.find_element_by_xpath("//ul[@id='pairSublinksLevel2']/li[3]/a").click()
EDIT:
Thanks @DebanjanB for the suggestion:
When you get the element with the XPath //ul[@id='pairSublinksLevel2'] and search for an a tag among its descendants, the first match is returned (in your case, it is probably inside the second li tag). So you can use indexing as shown above to get a specific numbered match. Note that XPath indexing starts from 1, not 0.
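The first-match behaviour and the 1-based index can be illustrated without a browser. This sketch uses Python's xml.etree.ElementTree as a stand-in for Selenium's XPath engine; the markup, link texts and href values are hypothetical, and ElementTree supports only a subset of XPath, so treat it as an illustration and not as the Selenium call itself:

```python
import xml.etree.ElementTree as ET

# Hypothetical list: the first li has no link, so the first <a> sits in li[2].
SAMPLE = """
<div>
  <ul id="pairSublinksLevel2">
    <li>overview</li>
    <li><a href="/equities/chart">chart</a></li>
    <li><a href="/equities/last">last data</a></li>
    <li><a href="/equities/news">news</a></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# Searching for any <a> under the list returns the first one in document order.
first_link = root.find("./ul[@id='pairSublinksLevel2']//a")
# li[3] selects the third li child; XPath positions start at 1, not 0.
third_link = root.find("./ul[@id='pairSublinksLevel2']/li[3]/a")

print(first_link.text)  # chart
print(third_link.text)  # last data
```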
As per the HTML you have shared you can use either of the following solutions:
Using link_text:
driver.find_element_by_link_text("last data").click()
Using partial_link_text:
driver.find_element_by_partial_link_text("last data").click()
Using css_selector:
driver.find_element_by_css_selector("ul.newBigTabs#pairSublinksLevel2 a[href*='equities']").click()
Using xpath:
driver.find_element_by_xpath("//ul[@class='arial_12 newBigTabs' and @id='pairSublinksLevel2']//a[contains(@href,'equities') and contains(.,'last data')]").click()
Reference: Official locator strategies for the webdriver

Python Splinter Star Ratings

Given the star ratings under the "Recent Comments" section here,
I am trying to build a list of the star rating per comment shown on the page.
The trouble is that each star rating object does not have a value.
For example, I can get an individual star object via xpath like this:
from splinter import Browser
browser = Browser()
url = 'https://www.greatschools.org/texas/harker-heights/3978-Harker-Heights-Elementary-School/'
browser.visit(url)
astar = browser.find_by_xpath('/html/body/div[5]/div[4]/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div[3]/div/div[2]/div[1]/div[2]/span/span[1]')
The rub is that I cannot seem to access the value (filled in or not) for the object astar.
Here's the HTML:
<div class="answer">
<span class="five-stars">
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
</span>
</div>
UPDATE:
Some comments do not have star ratings at all, so I need to be able to determine if a particular comment has a star rating and, if so, what the rating is.
This seems helpful for at least getting a list of all stars. I used it to do this:
stars = browser.find_by_css('span[class="icon-star filled-star"]')
So if I can get a list showing the sequence of if a comment has a star rating (something like ratings = [1,0,1,1...]) and the sequence of all stars (i.e. ['Filled', 'Filled', 'Empty'...]), I think I can piece together the sequence.
One solution:
access the html attribute of each object like this:
# Get total number of comments
allcoms = len(browser.find_by_text('Overall experience'))
# Loop through all comments and gather into list
comments = []
# If pop-up box occurs, use div[4] instead of second div[5]
if browser.is_element_present_by_xpath('/html/body/div[5]/div[4]/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div[1]/div/div[2]'):
    use = '4'
else:
    use = '5'
for n in range(allcoms):  # sometimes the second div[5] was div[4]
    comments.append(browser.find_by_xpath('/html/body/div[5]/div['+use+']/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div['+str(n+1)+']/div/div[2]').value)
# Get all corresponding star ratings
# https://stackoverflow.com/questions/46468030/how-select-class-div-tag-in-splinter
ratingcode = []
ratings = browser.find_by_css('span[class="five-stars"]')
for a in range(len(comments)+2):  # Add 2 to skip over first 2 ratings
    if a < 2:  # skip first 2 and last 3 because these are other ratings; range(len(comments)+2) stops before the last 3
        pass
    else:
        ratingcode.append(ratings[a].html)
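Once the inner HTML of a rating is available, the number of filled stars can be read off by counting the span elements that carry the filled-star class. A minimal sketch using only the standard library follows; star_rating and FilledStarCounter are hypothetical helper names, and the empty-star class in the second sample is an assumption about how unfilled stars are marked, since the question only shows filled ones:

```python
from html.parser import HTMLParser


class FilledStarCounter(HTMLParser):
    """Counts <span> start tags whose class list contains 'filled-star'."""

    def __init__(self):
        super().__init__()
        self.filled = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "filled-star" in classes:
            self.filled += 1


def star_rating(rating_html):
    """Return the number of filled stars found in a snippet of rating HTML."""
    counter = FilledStarCounter()
    counter.feed(rating_html)
    return counter.filled


five = '<span class="icon-star filled-star"></span>' * 5
# Assumed markup for a 3-star rating: 'empty-star' is a guess, not from the page.
three = ('<span class="icon-star filled-star"></span>' * 3
         + '<span class="icon-star empty-star"></span>' * 2)

print(star_rating(five))   # 5
print(star_rating(three))  # 3
```

Applied to the list above, something like [star_rating(code) for code in ratingcode] would give one number per collected rating.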

jquery / cheerio: how to select multiple elements?

I need to parse some markup similar to this one, from an html page:
<a href="#">
<i class="icon-location"></i>London
</a>
I need to get London.
I did try something like (using cheerio):
$('a', 'i[class="icon-location"]').text();
or
$('a > i[class="icon-location"]').text();
without success...
I'd like to avoid methods like next(), since the expression should be passed to a method which just extracts the text from the selector.
What expression should I use (if it's feasible) ?
There's a solution, which is pretty unusual, but it works :
$("#foo")
.clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text();
Demo : https://jsfiddle.net/2r19xvep/
Or, you could surround your value by a new tag so you just select it:
<i class="icon-location"></i><span class="whatever">London</span>
Then
$('.whatever').text();
$('a').text();
will get text as 'London'.
$("a .icon-location").map(function(){
return $(this).text()
}).get();

Python 3 BeautifulSoup4 search for text in source page

I want to search for all '1' in the source code and print the location of that '1' ex: <div id="yeahboy">1</div> the '1' could be replaced by any other string. I want to see the tag around that string.
Consider this context for example * :
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True as the first parameter to indicate that you want only element nodes (instead of the child text nodes), and string="1" (called text in Beautiful Soup versions before 4.4.0) to indicate that the element's text content must equal "1", or any other text you want to search for:
for element1 in soup.find_all(True, string="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: for future questions, try to give a context, just like the above context example. That will make your question more concrete and easier to answer, as people don't have to create the context on their own, which may turn out not to be relevant to the situation that you actually have.

Resources