lxml.html XPATH expression for element when the test has to be applied to the text_content not the text - python-3.x

I have the following html
<html>
<body>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_1"></a>
<a name="bananabread"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananabread"></a>Ban</font> <font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">ana Bread</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">The Best You Ever Tasted</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">If you don't agree that this is the best banana bread you have ever eaten well I would suggest you see your doctor</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_2"></a>
<a name="bananapudding"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananapudding"></a>Banana</font>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Pudding</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Creamy and Satisfying</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">This is the same recipe your mother used when you were ten!</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
</body>
</html>
I am trying to write an XPath expression to identify Banana Bread. My initial efforts were successful:
b_tree.xpath('.//*[starts-with(text(),"Banana Bread")]')
but I noticed error cases, and upon investigation they look like the HTML above: another element is added inside the content I am searching for. Sometimes, as above, it is a possibly unneeded font element; sometimes it is an anchor.
I worked with a related answer but have not been successful.
I can check for elements that have text_content(), clean up the text_content and then string-match to reach my ultimate goal, but I am hoping to learn to apply XPath better to these types of problems.
To be absolutely clear, I need the text_content of the p element, but sometimes I just need the text of a font element. My existing XPath expression works fine in the cases where there is no intervening element. When I open the page, I do not know what structure was imposed on the document.

When the text() expression is applied to an element whose text content is interrupted by other elements, it returns a node-set consisting of multiple text nodes, of which starts-with considers only the first. If you replace text() with ., you get the string value of the element, which is the concatenation of all its descendant text nodes, and that's what you want.
But there is still a problem with the spaces in an element like (attributes omitted, spaces are dots):
<p>
..<a></a>
..<a></a>
..<font>
....<a></a>Banana</font>
..<font>Pudding</font>
</p>
The text value of this element is _.._.._.._....Banana_..Pudding_ (underscores represent line feeds), therefore you must apply normalize-space, which normalizes this to Banana.Pudding, so that
.//*[starts-with(normalize-space(.),"Banana Pudding")]
finds this occurrence.
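A minimal sketch of the difference, assuming lxml is installed. The snippet is a simplified copy of the "Banana Pudding" paragraph with the style attributes omitted; the wrapping div and the variable names are only for illustration.

```python
from lxml import html

# Simplified copy of the "Banana Pudding" paragraph (style attributes omitted).
SNIPPET = """<div>
<p>
  <a name="_marker_2"></a>
  <a name="bananapudding"></a>
  <font>
    <a name="bananapudding"></a>Banana</font>
  <font>Pudding</font>
</p>
<p>Creamy and Satisfying</p>
</div>"""

tree = html.fromstring(SNIPPET)

# text() selects only the direct text nodes of <p>, and starts-with()
# converts just the first of them to a string, so the words are never seen.
by_text = tree.xpath('.//p[starts-with(text(), "Banana Pudding")]')

# "." is the full string value of <p>; normalize-space() collapses the
# line feeds and indentation between the words into single spaces.
by_value = tree.xpath('.//p[starts-with(normalize-space(.), "Banana Pudding")]')

print(len(by_text))   # 0
print(len(by_value))  # 1
```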
However, Banana Bread cannot be found, because it does not exist on the page. The element
<font>
..<a></a>Ban</font>.....<font>ana.Bread</font>
has a normalized text value of Ban.ana.Bread and you don't expect the space inside the word Banana. normalize-space removes spaces and line feeds that are invisible on the rendered page, but the two spaces in Ban.ana.Bread are both visible.
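To see why the extra space survives, here is a rough pure-Python stand-in for normalize-space. The helper name normalize_space and the two sample strings are assumptions that mimic the string values described above; this is not lxml's own implementation.

```python
import re

def normalize_space(value: str) -> str:
    """Rough Python equivalent of XPath 1.0 normalize-space()."""
    # XPath treats only space, tab, carriage return and line feed as whitespace.
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" \t\r\n")

# String value of the "Banana Pudding" paragraph, as described above.
pudding = "\n  \n  \n  \n    Banana\n  Pudding\n"
# String value of the "Banana Bread" fonts: "Ban", some spaces, "ana Bread".
bread = "\n  Ban     ana Bread"

print(normalize_space(pudding))  # Banana Pudding
print(normalize_space(bread))    # Ban ana Bread
print(normalize_space(bread).startswith("Banana Bread"))  # False
```

Runs of whitespace are collapsed to one space, never to zero, so the gap between the two font elements stays inside the word.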
If there was no space between the two <font> elements,
.//*[starts-with(normalize-space(.),"Banana Bread")]
would detect 3 elements: the <html>, the <body> and the <p>, because "Banana Bread" are the first words in each of them. So it is better to use
.//p[starts-with(normalize-space(.),"Banana Bread")]
instead.
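The effect of the element name in the path can be sketched as follows, assuming lxml and a hypothetical variant of the page in which the two font elements are not separated by a space. Which ancestors the wildcard returns depends on the context node, so the sketch only prints the tag names.

```python
from lxml import html

# Hypothetical variant of the page: no space between the two <font> elements.
PAGE = """<html>
<body>
<p>
<a name="bananabread"></a>
<font><a name="bananabread"></a>Ban</font><font>ana Bread</font>
</p>
<p>The Best You Ever Tasted</p>
</body>
</html>"""

tree = html.fromstring(PAGE)

# The wildcard also matches ancestors whose string value starts with the words.
any_element = tree.xpath('.//*[starts-with(normalize-space(.), "Banana Bread")]')
# Restricting the step to <p> keeps only the paragraph itself.
only_p = tree.xpath('.//p[starts-with(normalize-space(.), "Banana Bread")]')

print([el.tag for el in any_element])
print([el.tag for el in only_p])
```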

Related

Getting text() element in <p> with VBA/Selenium

Using Excel 2019 VBA, I am trying to get data from a paragraph on a web page with this structure.
<p>
<strong>Release Date:</strong>
" May 30th 2022"
<br>
<strong>From:</strong>
<a href=URL>Title</a>
<br>
<strong>Performers:</strong>
<a href=URL1>Name1</a>,
<a href=URL2>Name2</a>,
<a href=URL3>Name3</a>
</p>
This is the xpath for the paragraph.
/html/body/div[11]/div/div/div[1]/div[1]/div/div/p[1]
To get the individual elements ("Release Date", "From" and "Performers"), I am having to parse the entire paragraph with "Instr"s or regular expressions.
Is there a way to directly reference these elements with XPath?
For example, the "Release Date" Xpath is:
/html/body/div[11]/div/div/div[1]/div[1]/div/div/p[1]/text()[1]
I have tried to get this directly with the following but none of them work.
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]/text()")(1) - Invalid Selector
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]").Attribute("text")(1) - returns nothing
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]")(1).Attribute("text") - returns nothing
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]").text(1) - invalid procedure call
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]")(1).text - returns entire paragraph
Any advice would be greatly appreciated.

Node - Cheerio - Find element that contains specific text

I am trying to get "text that I want" from the site with this structure of code:
<td class="x">
<h3 class="x"> number </h3>
<p>
text that I want;
</p>
</td>
If there were only one td with class "x", I would do this:
$('td.x > p > a').text()
and get the text that I want, but the problem is that this site has a lot of "td" and "h3" elements with the same class "x". The only difference is that the text in each "h3" element is a different number, and I know which number is in the "h3" element next to my link. For example:
<td class="x">
<h3 class="x"> **125** </h3>
<p>
text that I want;
</p>
</td>
The question is: is it possible to choose a selector based on the text inside an element? In my example I know that the code contains an h3 element with the text "125". Or maybe there is a better way to get the text from the "a" element in my case.
:contains is the selector you're looking for:
$('h3:contains("125")')
This selects the h3 that contains the text you want. Note that :contains matches substrings, so it would also select an h3 whose text is, say, 1250.

How to click on Web check box using Excel VBA?

How do I check the table checkbox?
I tried clicking.
ie.Document.getElementsByClassName("x-grid3-hd-checker").Checked = True
<div class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" unselectable="on" style="">
<a class="x-grid3-hd-btn" href="#"></a>
<div class="x-grid3-hd-checker"> </div>
<img class="x-grid3-sort-icon" src="/javascript/extjs/resources/images/default/s.gif">
</div>
I can't see a checkbox in the HTML code. But you are using getElementsByClassName() the wrong way for your case. getElementsByClassName() returns a node collection. If you need a specific node, you must get it by its index in the node collection. The first element has index 0.
Please note that the div tag with class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" is also included in the node collection, because x-grid3-hd-checker is one of its space-separated class names.
If you want to check this:
<div class="x-grid3-hd-checker"> </div>
Your code needs the second index of the node collection:
ie.Document.getElementsByClassName("x-grid3-hd-checker")(1).Checked = True
But if there are more tags with the class name "x-grid3-hd-checker", the line above won't work. I can't say any more until you post more HTML and VBA code. The best would be a link to the site.

Incorrect number of results found by XPath

Actually, the situation is a little more complex.
I'm trying to get data from this example html:
<li itemprop="itemListElement">
<h4>
one
</h4>
</li>
<li itemprop="itemListElement">
<h4>
two
</h4>
</li>
<li itemprop="itemListElement">
<h4>
three
</h4>
</li>
<li itemprop="itemListElement">
<h4>
four
</h4>
</li>
For now, I'm using Python 3 with urllib and lxml.
For some reason, the following code doesn't work as expected (Please read the comments)
import urllib.request
from lxml import html

scan = []
example_url = "path/to/html"
page = html.fromstring(urllib.request.urlopen(example_url).read())
# Extracting the li elements from the html
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)
# At this point, the list 'scan' length is 4 (Nothing wrong)
for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))
So as you can see, the first step is to extract the 4 li elements and append them to a list, then scan each li element for an a element. The problem is that each li element in scan seems to contain all four elements.
...Or so I thought.
After some quick debugging, I found that the scan list contains the four li elements correctly, so I came to one possible conclusion: there's something wrong with this for loop:
for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))
    # Something is wrong here...
The only real problem is that I can't pinpoint the bug. What causes that?
PS: I know, there's an easier way to get the a elements from the list, but this is just an example html, the real one contains many more... things.
In your example, when the XPath starts with //, it will start searching from the root of the document (which is why it was matching all four of the anchor elements). If you want to search relative to the li element, then you would omit the leading slashes:
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)
for list_item in scan:
    print(len(list_item.xpath("h4/a")))
Of course you can also replace // with .// so that the search is relative as well:
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)
for list_item in scan:
    print(len(list_item.xpath(".//h4/a")))
Here is a relevant quote taken from the specification:
2.5 Abbreviated Syntax
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.
print(len(list_item.xpath(".//h4/a")))
// is short for /descendant-or-self::node()/. Because it starts with /, the search begins at the root node of the document.
Use a leading . so that the context node is list_item, not the whole document.
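The three path forms can be compared side by side in a short sketch, assuming lxml. The question's sample HTML shows no a elements, so each h4 is given a hypothetical a child here to match the h4/a path used in the question.

```python
from lxml import html

# Assumed markup: each <h4> gets a hypothetical <a> child, because the
# question's XPath targets h4/a while its sample HTML shows none.
SNIPPET = """<ul>
<li itemprop="itemListElement"><h4><a href="#1">one</a></h4></li>
<li itemprop="itemListElement"><h4><a href="#2">two</a></h4></li>
<li itemprop="itemListElement"><h4><a href="#3">three</a></h4></li>
<li itemprop="itemListElement"><h4><a href="#4">four</a></h4></li>
</ul>"""

page = html.fromstring(SNIPPET)
items = page.xpath("//li[@itemprop='itemListElement']")

# A leading // restarts at the document root, whatever element it is called on.
absolute = [len(li.xpath("//h4/a")) for li in items]
# A leading . keeps the search inside the current <li>.
relative = [len(li.xpath(".//h4/a")) for li in items]
# A plain child path is also relative to the current <li>.
child_path = [len(li.xpath("h4/a")) for li in items]

print(absolute)    # [4, 4, 4, 4]
print(relative)    # [1, 1, 1, 1]
print(child_path)  # [1, 1, 1, 1]
```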

Heading and paragraph text together without space

I have -
<p>some text as 'intro'</p>
<h1>Big Text</h1>
<p>some text as 'outro'</p>
I have this set out on a background image. I have styled the margins and fitted the text inside properly, but I want to bunch up ALL the text so there is little gap. line-height would ruin it, and I have tried separate div tags but no luck. What is the best CSS method for this?
Thanks!
If you require the use of those elements you could use negative margins:
<p>some text as 'intro'</p>
<h1 style="margin: -15px 0 -15px">Big Text</h1>
<p>some text as 'outro'</p>
A better way is probably to separate the different lines by line breaks and to style the 'header' line, like so:
<p>some text as 'intro'<br />
<span style="font-size: 200%; font-weight: bold;">Big Text</span><br />
some text as 'outro'</p>
