Using Excel 2019 VBA, I am trying to get data from a paragraph on a web page with this structure.
<p>
<strong>Release Date:</strong>
" May 30th 2022"
<br>
<strong>From:</strong>
<a href=URL>Title</a>
<br>
<strong>Performers:</strong>
<a href=URL1>Name1</a>,
<a href=URL2>Name2</a>,
<a href=URL3>Name3</a>
</p>
This is the xpath for the paragraph.
/html/body/div[11]/div/div/div[1]/div[1]/div/div/p[1]
To get the individual elements ("Release Date", "From" and "Performers"), I currently have to parse the entire paragraph with InStr calls or regular expressions.
Is there a way to directly reference these elements with XPath?
For example, the "Release Date" Xpath is:
/html/body/div[11]/div/div/div[1]/div[1]/div/div/p[1]/text()[1]
I have tried to get this directly with the following but none of them work.
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]/text()")(1) - Invalid Selector
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]").Attribute("text")(1) - returns nothing
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]")(1).Attribute("text") - returns nothing
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]").text(1) - invalid procedure call
webdriver.FindElementsByXPath("//div[11]/div/div/div[1]/div[1]/div/div/p[1]")(1).text - returns entire paragraph
Any advice would be greatly appreciated.
I have the following html
<html>
<body>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_1"></a>
<a name="bananabread"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananabread"></a>Ban</font> <font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">ana Bread</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">The Best You Ever Tasted</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">If you don't agree that this is the best banana bread you have ever eaten well I would suggest you see your doctor</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_2"></a>
<a name="bananapudding"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananapudding"></a>Banana</font>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Pudding</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Creamy and Satisfying</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">This is the same recipe your mother used when you were ten!</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
</body>
</html>
I am trying to write an xpath expression to identify Banana Bread - my initial efforts were successful -
b_tree.xpath('.//*[starts-with(text(),"Banana Bread")]')
but I noticed some error cases, and on investigation they look like the HTML above: another element is added inside the content I am searching for. Sometimes, as above, it is a possibly unneeded font element; sometimes it is an anchor.
I worked with this answer (Related) but have not been successful
I can check for elements that have text_content() - clean up the text_content and then string match to my ultimate goal but I am hoping to learn to better apply xpath to these types of problems.
To be absolutely clear, I need the text_content of the p element, but sometimes I just need the text of a font element. My existing XPath expression works fine in the cases where there is no intervening element. When I open a page, I do not know which structure was imposed on the document.
When the text() expression is applied to an element whose text content is interrupted by other elements, it returns a nodeset consisting of multiple text nodes, of which starts-with considers only the first. If you replace text() by ., you get the text value of the element, which is the concatenation of all text nodes, and that's what you want.
But there is still a problem with the spaces in an element like (attributes omitted, spaces are dots):
<p>
..<a></a>
..<a></a>
..<font>
....<a></a>Banana</font>
..<font>Pudding</font>
</p>
The text value of this element is _.._.._.._....Banana_..Pudding_ (underscores represent line feeds), therefore you must apply normalize-space, which normalizes this to Banana.Pudding, so that
.//*[starts-with(normalize-space(.),"Banana Pudding")]
finds this occurrence.
However, Banana Bread cannot be found, because it does not exist on the page. The element
<font>
..<a></a>Ban</font>.....<font>ana.Bread</font>
has a normalized text value of Ban.ana.Bread and you don't expect the space inside the word Banana. normalize-space removes spaces and line feeds that are invisible on the rendered page, but the two spaces in Ban.ana.Bread are both visible.
If there was no space between the two <font> elements,
.//*[starts-with(normalize-space(.),"Banana Bread")]
would detect 3 elements: the <html>, the <body> and the <p>, because "Banana Bread" are the first words in each of them. So it is better to use
.//p[starts-with(normalize-space(.),"Banana Bread")]
instead.
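The difference between the first text node and the normalized text value can be reproduced outside an XPath engine. The following is only an illustrative sketch using Python's standard library (xml.etree.ElementTree) on a cut-down, well-formed copy of the Banana Pudding paragraph. The helper names string_value and normalize_space are made up for this example; they mimic XPath's string(.) and normalize-space().

```python
import re
import xml.etree.ElementTree as ET

# Cut-down, well-formed copy of the markup (attributes omitted).
SNIPPET = """<body>
<p>
  <a></a>
  <a></a>
  <font>
    <a></a>Banana</font>
  <font>Pudding</font>
</p>
<p>Creamy and Satisfying</p>
</body>"""


def string_value(elem):
    """Concatenate all descendant text nodes, like XPath's string(.)."""
    return "".join(elem.itertext())


def normalize_space(text):
    """Mimic XPath normalize-space(): trim, then collapse runs of
    space, tab, CR and LF into a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


root = ET.fromstring(SNIPPET)
first_p = root.find("p")

# The first text node of <p> is only the whitespace before the first <a>,
# which is all that starts-with(text(), ...) gets to look at.
first_text_node = first_p.text
print(repr(first_text_node))

# The normalized string value is what starts-with(normalize-space(.), ...) sees.
normalized = normalize_space(string_value(first_p))
print(repr(normalized))

# Equivalent of .//p[starts-with(normalize-space(.), "Banana Pudding")]
matches = [
    p for p in root.iter("p")
    if normalize_space(string_value(p)).startswith("Banana Pudding")
]
print(len(matches))
```

With lxml, the XPath expression from the answer does this work directly. The sketch only makes the intermediate values visible.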
So, essentially I want to get the text from the site and print it onto console.
This is the HTML snippet:
<div class="inc-vat">
<p class="price">
<span class="smaller currency-symbol">£</span>
1,500.00
<span class="vat-text"> inc. vat</span>
</p>
</div>
Here is an image of the DOM properties:
How would I go about retrieving the '1,500.00'? I have tried to use self.browser.find_element_by_xpath('//*[@id="main-content"]/div/div[3]/div[1]/div[1]/text()') but that throws an error which says The result of the xpath expression is: [object Text]. It should be an element. I have also used other methods like .text, but they either print only the '£' symbol, print a blank, or throw the same error.
You can use below css :
p.price
sample code :-
elem = driver.find_element_by_css_selector("p.price").text.split(' ')[1]
print(elem)
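Splitting on a single space depends on exactly how the browser renders the whitespace between the spans. A slightly less brittle option is to pull the number out of the element's text with a regular expression. This is a minimal sketch: extract_price is a hypothetical helper, and sample_text is an assumed value standing in for whatever driver.find_element_by_css_selector("p.price").text returns on the real page.

```python
import re
from decimal import Decimal

# One digit, then any mix of digits and thousands separators,
# then an optional decimal part.
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_price(element_text):
    """Return the first number found in the text as a Decimal, or None."""
    match = PRICE_RE.search(element_text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


# Assumed example of the text of <p class="price">; the real value
# depends on the page and the browser.
sample_text = "£ 1,500.00 inc. vat"
price = extract_price(sample_text)
print(price)
```

Because the pattern ignores the surrounding words, it also copes with line breaks or extra spaces between the currency symbol, the amount and the VAT note.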
How do I check the table checkbox?
I tried clicking.
ie.Document.getElementsByClassName("x-grid3-hd-checker").Checked = True
<div class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" unselectable="on" style="">
<a class="x-grid3-hd-btn" href="#"></a>
<div class="x-grid3-hd-checker"> </div>
<img class="x-grid3-sort-icon" src="/javascript/extjs/resources/images/default/s.gif">
</div>
I can't see a checkbox in the HTML code, but you are using getElementsByClassName() incorrectly for your case. getElementsByClassName() returns a node collection. If you need a specific node, you must get it by its index in the collection. The first element has index 0.
Please note that the div tag with class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" is also included in the node collection, because "x-grid3-hd-checker" is one of the space-separated names in its class attribute. The position of the name within the attribute does not matter.
If you want to check this:
<div class="x-grid3-hd-checker"> </div>
Your code needs the second element of the node collection, which has index 1:
ie.Document.getElementsByClassName("x-grid3-hd-checker")(1).Checked = True
But if there are more tags with the class name "x-grid3-hd-checker", the line above won't work. I can't say more until you post more HTML and VBA code. A link to the site would be best.
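To see why the collection contains two nodes, it can help to reproduce the class-name matching outside the browser. The sketch below is only an illustration in Python using the standard html.parser module, not the browser's implementation. ClassCollector and find_by_class are hypothetical names, and the matching rule (whole, space-separated class names) is how getElementsByClassName is specified to behave.

```python
from html.parser import HTMLParser

HTML = """<div class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" unselectable="on" style="">
<a class="x-grid3-hd-btn" href="#"></a>
<div class="x-grid3-hd-checker"> </div>
<img class="x-grid3-sort-icon" src="/javascript/extjs/resources/images/default/s.gif">
</div>"""


class ClassCollector(HTMLParser):
    """Collect (tag, class list) for every start tag carrying a given class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of names;
        # a match requires one whole name to be equal to the wanted one.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.matches.append((tag, classes))


def find_by_class(html, wanted):
    collector = ClassCollector(wanted)
    collector.feed(html)
    return collector.matches


matches = find_by_class(HTML, "x-grid3-hd-checker")
for index, (tag, classes) in enumerate(matches):
    print(index, tag, classes)
```

The outer div is found at index 0 and the inner div at index 1, in document order. A partial name such as "x-grid3-hd" matches nothing, because only complete class names count.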
This is a piece of HTML from which I'd like to extract information:
<li>
<p><strong class="more-details-section-header">Provenance</strong></p>
<p>Galerie Max Hetzler, Berlin<br>Acquired from the above by the present owner</p>
</li>
I'd like to have an XPath expression which extracts the content of the 2nd <p> ... </p>, but only if it is preceded by a sibling <p> ... Provenance ... </p>.
This is how far I have got:
if "Provenance" in response.xpath('//strong[@class="more-details-section-header"]/text()').extract():
    print("provenance = yes")
But how do I get to Galerie Max Hetzler, Berlin<br>Acquired from the above by the present owner ?
I tried
if "Provenance" in response.xpath('//strong[@class="more-details-section-header"]/text()').extract():
    print("provenance = yes ", response.xpath('//strong[@class="more-details-section-header"]/following-sibling::p').extract())
But am getting []
You should use
//p[preceding-sibling::p[1]/strong='Provenance']/text()
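What the expression selects can be mirrored in plain Python, which may make the sibling logic easier to follow. This is only a sketch with the standard library: xml.etree.ElementTree has no preceding-sibling axis, so the check is done by hand. The snippet writes the line break as <br/> so that it parses as XML, and the helper names text_nodes and provenance_lines are made up for this example.

```python
import xml.etree.ElementTree as ET

# Same structure as in the question, with <br/> closed so it parses as XML.
SNIPPET = """<li>
<p><strong class="more-details-section-header">Provenance</strong></p>
<p>Galerie Max Hetzler, Berlin<br/>Acquired from the above by the present owner</p>
</li>"""


def text_nodes(elem):
    """Direct text-node children of elem, like the XPath step text()."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [text for text in nodes if text]


def provenance_lines(li):
    """Text nodes of the <p> whose nearest preceding <p> sibling
    has a <strong> child equal to 'Provenance'."""
    paragraphs = li.findall("p")
    for previous, current in zip(paragraphs, paragraphs[1:]):
        labels = ["".join(s.itertext()) for s in previous.findall("strong")]
        if "Provenance" in labels:
            return text_nodes(current)
    return []


li = ET.fromstring(SNIPPET)
lines = provenance_lines(li)
print(lines)
```

The paragraph yields two text nodes, one on each side of the line break, which is also why the XPath with /text() returns two strings rather than one.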
I am extracting data from HTML using VBScript. This is the HTML code from which I am trying to extract the data.
<dl id="overview">
<dt id="overview-summary-current-title" class="summary-current" style="display:block">
Current
</dt>
<dd class="summary-current" style="display:block">
<ul class="current">
<li>
Software Engineer
<span class="at">at </span>
<a class="company-profile-public" href="/company/ABC Systems?trk=ppro_cprof">
<span class="org summary">ABC Systems</span></a>
</li>
</ul>
</dd>
In my previous question, I asked about a similar problem. The link is Excel getElementById extract the span class information.
However, in that case, I wanted to extract the information corresponding to the dl id, and it also had a span id. In this case, I need to extract the information corresponding to the dt id.
In my VBScript, I tried something like this.
Dim openedpage as String
openedpage = iedoc1.getElementById("overview").getElementById("overview-summary-current-title").innerHTML
However, I am getting no output.
I want the output as Software Engineer at ABC systems.
Kindly help me out.
The object returned by getElementById() doesn't have a method .getElementById(), so the following line fails:
.getElementById("overview").getElementById("overview-summary-current-title")
If you don't get any output, not even an error message, you probably have On Error Resume Next somewhere in your script. Please don't use that unless you know exactly what you're doing and have sensible error handling code in place.
Also, the element with the ID "overview-summary-current-title" is this:
<dt id="overview-summary-current-title" class="summary-current" style="display:block">
Current
</dt>
So you couldn't possibly extract the text "Software Engineer at ABC systems" from that element.
Try selecting the first <ul> tag from the element with the ID "overview", and then use the innerText property instead of the innerHTML property:
Set ie = CreateObject("InternetExplorer.Application")
ie.Navigate "..."
While ie.Busy : WScript.Sleep 100 : Wend
Set e1 = ie.document.getElementById("overview")
Set e2 = e1.getElementsByTagName("ul")(0)
WScript.Echo e2.innerText