Select specific text from specific span - python-3.x

I am trying to use Python to scrape TripAdvisor and pull the text from one specific span, <span>138<span>, without the "Excellent" label:
<label for="taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_5">
<div class="row_label">Excellent</div>
<span class="row_bar">
<span class="row_fill" style="width:65%;"></span>
</span>
<span>138<span>
</span></span></label>
This is my code thus far:
for rating_all in moresoup.findAll('div',{'class':'col rating '}):
    for record in rating_all.findAll('li'):
        for rate1 in record.findAll('label',{'for':"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_1"}):
            print(rate1.find('div',{'class':"row_label"}).text + ",\t")
            print(rate1.findAll('span'))
I tried using a subscript, but it wouldn't let me. When I add .text after the span call, it says there is no text, and when I change findAll to find, it only returns the first span.

findAll (or more commonly find_all - which does the same thing) returns a list of all Tag objects matching your filters. Even if there is only one matching Tag you will still get a one-item list: [Tag].
Once you have the list of tags, you can get a single tag by indexing, e.g.:
soup.find_all('span')[0]
and you can get the text of one of your tags with the .text attribute:
soup.find_all('span')[0].text
In your particular case, I was able to get the text '138\n' with:
rate1.findAll('span')[2].text
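For reference, here is a minimal, self-contained sketch of the indexing approach above. It assumes the markup is exactly as pasted in the question and that Beautiful Soup 4 is installed; the variable names label, spans and count are only for illustration.

```python
from bs4 import BeautifulSoup

# Markup copied from the question. The tag right after 138 is an opening
# <span>, so the parser nests an extra, nearly empty span inside it.
html = """
<label for="taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_5">
<div class="row_label">Excellent</div>
<span class="row_bar">
<span class="row_fill" style="width:65%;"></span>
</span>
<span>138<span>
</span></span></label>
"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("label")

# find_all returns a list-like result in document order:
# row_bar, row_fill, the span holding 138, and the span nested inside it.
spans = label.find_all("span")

# Index into the list first, then read the text of that single tag.
count = spans[2].get_text(strip=True)
print(count)
```

Using get_text(strip=True) instead of .text drops the trailing newline. If the page layout changes, the index 2 may no longer point at the right span, so treat it as tied to this particular snippet.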

Related

Xpath or CSS Selector to select all child nodes that come before a specific child node

I have this following html data :
<span id="description">
<p>description</p>
<p>description</p>
<p></p>
<h3>title1</h3>
some text. <br>
<br>
some text.
<h3>title2</h3>
<p></p>
<p></p>
<div>data</div>
<h3>title3</h3>
<strong>data</strong>
<br>
some text.
<br>
<br>
some text.
<p></p>
</span>
I need to get all the p tags up to first h3 tag.
I tried the XPath //span[@id="description"], which returns all the children of the span tag, which I don't need.
I also tried //span[@id="description"]/h3[1]/preceding-sibling::p, which only returned the first preceding p tag. Selecting individual p nodes and then combining them is not feasible either, since different pages have a different number of p nodes before the first h3.
Then I tried CSS selectors with the remove function, $('#description').clone().children('div,h3').remove().end().html().trim(), which didn't work well, since I can't select text nodes with it.
Is there any way I can split the data at these h3 tags?
Your expression
//span[@id="description"]/h3[1]/preceding-sibling::p
should work.
A similar expression
//span[@id="description"]/h3[1]//preceding-sibling::p
should also work.
Also try this one:
//span[@id="description"]/p[following-sibling::h3[contains(text(),"title1")]]
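To make the selection concrete, here is a rough Python sketch of what h3[1]/preceding-sibling::p picks out. It is not an XPath evaluation: the standard library's ElementTree only supports a subset of XPath without sibling axes, so the same selection is written as a plain loop. The markup is a shortened, well-formed version of the question's HTML (the br is self-closed), and paragraphs_before_first is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the question's markup (assumption:
# the real page would need an HTML parser, since <br> is not closed there).
markup = """
<span id="description">
<p>description</p>
<p>description</p>
<p></p>
<h3>title1</h3>
some text. <br/>
<h3>title2</h3>
<p></p>
<div>data</div>
</span>
"""

root = ET.fromstring(markup.strip())


def paragraphs_before_first(parent, stop_tag="h3"):
    """Return the direct <p> children that come before the first stop_tag child."""
    found = []
    for child in parent:
        if child.tag == stop_tag:
            break  # everything from the first h3 onwards is ignored
        if child.tag == "p":
            found.append(child)
    return found


leading = paragraphs_before_first(root)
print(len(leading))
```

The loop stops at the first h3, so the p after title2 is never collected, no matter how many p tags a given page has before that heading.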

Validating text after <br> in Atata

I have this table cell on my page:
<td>
<strong>Some Text</strong>
<br>random description
</td>
I'd like to validate the text after the <br> and have it returned as Text<T>, so that I can use the Atata assertions (like Should.Equal("random description")) separately from the text inside the <strong>. So far I have only managed to get that text as a string, by finding the td via XPath and calling .Split("\r\n") on its Value. Is there a way to get just this text?
You can do that by adding the [ContentSource(ContentSource.LastChildTextNode)] attribute to your text control.
Alternatively, you can invoke the TextControl.GetContent(ContentSource.LastChildTextNode) method.

Extract multiple image links from a single div tag that has no class name

Hi, I have this HTML code from a website. I want to be able to extract multiple images; in some cases there are 3-4 images. How would I go about doing that?
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
My code:
chakras1 = soup.find_all("div")[42].img['src']
print(chakras1)
Result:
https://naruto-arena.net/images/energy/energy_2.gif
I only get the first image but not the second one.
To extract multiple images from within a tag, I personally stick to for loops. Once you have found the div tag you want to look within (say you call it chakras1), I would write the following:
for img in chakras1.find_all("img"):
    print(img)
For me the steps go roughly as follows:
find the specific tags you are looking for (img, with their src)
see which tag those tags sit inside (here, the div)
navigate through the HTML to that tag (using Beautiful Soup's .find or .find_all, depending on what you need)
once you have navigated to that tag, search within it for the tags you are really after.
One quick note: with this method, if you are looking through multiple div tags, you will need to loop through those as well.
I hope this makes sense!
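Putting those steps together, a small sketch might look like this. It assumes Beautiful Soup 4 is installed and uses only the div from the question, so the div is found with soup.find("div") here rather than with the page-specific soup.find_all("div")[42]; the name srcs is just for illustration.

```python
from bs4 import BeautifulSoup

# Only the relevant div from the question, not the whole page.
html = """
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate to the containing tag first (assumption: on the real page this
# would be the div the question selects by index).
chakras1 = soup.find("div")

# Then search inside it and read the src attribute of every img found.
srcs = [img["src"] for img in chakras1.find_all("img")]
print(srcs)
```

The key difference from the question's code is that .img only ever returns the first img child, while find_all("img") returns all of them.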

Selenium Webdriver: find a span that has an inner span with specific text

Using Selenium Webdriver in Python, I want to be able to find My Text from the following html code:
<span class="first">My Text
<span class="second">(Total)</span>
</span>
The inner span has a unique word in it which is (Total).
The outer span has the text that I am looking for, which is My Text. How can I find this text?
I have tried to use something like:
driver.find_elements_by_xpath("//*[contains(text(), '(Total)')]/prededing-sibling::span")
But it was not successful.
I appreciate any help on this.
Try the solution below to get the required text:
span = driver.find_element_by_xpath('//span[span="(Total)"]')
required_text = driver.execute_script('return arguments[0].childNodes[0].textContent', span).strip()
Output of print(required_text)
'My Text'
Note that the span with the text "My Text" is not a sibling of the span with the text "(Total)" but its parent, so you cannot fetch it with preceding-sibling::span.
You are using preceding-sibling, which is wrong here: the text you are looking for is in the parent node.
So use this:
"//span[text()='(Total)']/.."
This XPath locates the parent node; you then use your programming language to extract the particular string.
I don't know Python, but in Ruby I would do it the following way:
a='(Total)'
puts b.element(xpath: "//span[text()='#{a}']/..").text.chomp(a).strip
=> My Text
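The parent/child relationship both answers rely on can be checked without a browser. The sketch below is not Selenium code: it uses the standard library's ElementTree as a stand-in, wraps the question's markup in a div (an assumption, so that the outer span can be searched for), and reads the text that precedes the first child element, which plays the role of childNodes[0] in the first answer.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <div> for this illustration.
markup = """
<div>
<span class="first">My Text
<span class="second">(Total)</span>
</span>
</div>
"""

root = ET.fromstring(markup.strip())

# Select the span that has a child span whose text is "(Total)",
# i.e. the parent of the inner span, not a sibling of it.
outer = root.find(".//span[span='(Total)']")

# .text holds only the text before the first child element.
required_text = outer.text.strip()
print(required_text)
```

Because the lookup goes from the inner span's text to its parent, no sibling axis is involved at any point.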

Excel getElementById extract the span class information

I need to extract certain information from HTML using VBA.
This is the HTML from which I am trying to extract the location information alone.
<dl id="headline" class="demographic-info adr">
<dt>Location</dt>
<dd>
<span class="locality">
Dallas/Fort Worth Area
</span>
</dd>
<dt>Industry</dt>
<dd class="industry">
Higher Education
</dd>
In my excel VBA, after opening the web page, I am using the following code to extract the information.
Dim openedpage as String
openedpage = iedoc1.getElementById("headline").innerText
However, I am getting the information as,
Location Dallas/Fort Worth Area Industry Higher Education
I just need to extract,
Dallas/Fort Worth Area as the output.
Try: iedoc1.getElementById("headline").getElementsByTagName("span")(0).innerText
You're getting all the extra text because that is what you asked for: the innerText of the parent element, which is everything inside it.
The above code gets the "headline" element, then finds all "span" tags inside it. From the returned list, it takes the first item and returns its innerText.
Update
I always seem to get the index base wrong; the 1 in my example should have been a 0.
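The difference between the parent's text and the first span's text can also be sketched outside of VBA. This is only an illustration in Python with the standard library's ElementTree, not the Internet Explorer DOM; it assumes the snippet from the question with a closing </dl> added so that it parses.

```python
import xml.etree.ElementTree as ET

# Snippet from the question, with a closing </dl> added (assumption).
markup = """
<dl id="headline" class="demographic-info adr">
<dt>Location</dt>
<dd>
<span class="locality">
Dallas/Fort Worth Area
</span>
</dd>
<dt>Industry</dt>
<dd class="industry">
Higher Education
</dd>
</dl>
"""

headline = ET.fromstring(markup.strip())

# Text of the whole element: everything inside it, whitespace collapsed.
all_text = " ".join("".join(headline.itertext()).split())

# Text of the first span inside it only (index 0, as in the corrected answer).
location = headline.findall(".//span")[0].text.strip()

print(all_text)
print(location)
```

The first value corresponds to what innerText on the headline element returned, and the second to what the narrowed-down lookup returns.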
