Given the star ratings under the "Recent Comments" section here,
I am trying to build a list of the star rating per comment shown on the page.
The trouble is that each star rating object does not have a value.
For example, I can get an individual star object via xpath like this:
from splinter import Browser
url = 'https://www.greatschools.org/texas/harker-heights/3978-Harker-Heights-Elementary-School/'
browser = Browser()
browser.visit(url)
astar = browser.find_by_xpath('/html/body/div[5]/div[4]/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div[3]/div/div[2]/div[1]/div[2]/span/span[1]')
The rub is that I cannot seem to access the value (filled in or not) for the object astar.
Here's the HTML:
<div class="answer">
<span class="five-stars">
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
</span>
</div>
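As an aside, splinter elements expose HTML attributes via item access, so one way to read that state (assuming astar points at a single icon-star span, as above) is to check its class attribute; a sketch:

#'filled-star' in the class attribute means the star is filled
is_filled = 'filled-star' in astar['class']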
UPDATE:
Some comments do not have star ratings at all, so I need to be able to determine if a particular comment has a star rating and, if so, what the rating is.
This seems helpful for at least getting a list of all stars. I used it to do this:
stars = browser.find_by_css('span[class="icon-star filled-star"]')
So if I can get a list showing whether each comment has a star rating (something like ratings = [1,0,1,1...]) and the sequence of all stars (i.e. ['Filled', 'Filled', 'Empty'...]), I think I can piece the ratings together.
One solution: access the html attribute of each object, like this:
#Get total number of comments
allcoms = len(browser.find_by_text('Overall experience'))
#Loop through all comments and gather into list
comments = []
#If pop-up box occurs, use div[4] instead of second div[5]
if browser.is_element_present_by_xpath('/html/body/div[5]/div[4]/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div[1]/div/div[2]'):
    use = '4'
else:
    use = '5'
for n in range(allcoms): #sometimes the second div[5] was div[4]
    comments.append(browser.find_by_xpath('/html/body/div[5]/div['+use+']/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div['+str(n+1)+']/div/div[2]').value)
#Get all corresponding star ratings
#https://stackoverflow.com/questions/46468030/how-select-class-div-tag-in-splinter
ratingcode = []
ratings = browser.find_by_css('span[class="five-stars"]')
for a in range(len(comments) + 2): #add 2 so the loop covers the 2 leading non-comment ratings plus one rating per comment
    if a < 2: #skip the first 2 five-star widgets (other ratings); stopping at len(comments)+2 also leaves out the last 3
        pass
    else:
        ratingcode.append(ratings[a].html)
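From there, turning each snippet into a numeric rating is just counting. A minimal sketch, assuming ratingcode holds the inner HTML of each five-stars span gathered above, where filled stars carry the filled-star class:

starcounts = [snippet.count('filled-star') for snippet in ratingcode]
#e.g. [5, 4, 5, ...] - one count per comment; 0 would mean no filled stars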
I am trying to scrape data from an e-commerce site for a certain product. On the result page, there are 50 products listed. Some products have their original prices under them, while others have discounted prices with the original price struck out. The HTML code for that is
For non-discounted products:
<div class="class-1">
<span>
Rs. 7999
</span>
</div>
For discounted products:
<div class="class-1">
<span>
<span class="class-2">
Rs. 11621
</span>
<span class="class-3">
Rs. 15495
</span>
</span>
<span class="class-4">
(25% OFF)
</span>
</div>
What the result should be?
I want code that scrolls through the list of products and extracts data from the div[class='class-1']/span tag for non-discounted products; where a child span[class='class-2'] is present, it should extract data only from that tag and not from the span[class='class-3'] tag.
Please help!!
If I understand you clearly, first you need to get a list of products with:
products = driver.find_elements_by_xpath('//div[@class="class-1"]')
Now you can iterate through the list of products and grab the prices as follows:
prices = []
for product in products:
    discount_price = product.find_elements_by_xpath('.//span[@class="class-2"]')
    if discount_price:
        prices.append(discount_price[0].text)
    else:
        prices.append(product.find_element_by_xpath('./span').text)
Explanation:
For each product I check for the existence of a .//span[@class="class-2"] child element, as you defined.
If there is such an element, product.find_elements_by_xpath('.//span[@class="class-2"]') will return a non-empty list of web elements. A non-empty list is truthy in Python, so the if branch runs.
Otherwise the list is empty (falsy) and the else branch runs.
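Note: the find_element(s)_by_* helpers used above come from the older Selenium API and were removed in Selenium 4. A sketch of the same logic with the current API, assuming driver is already on the results page:

from selenium.webdriver.common.by import By

products = driver.find_elements(By.XPATH, '//div[@class="class-1"]')
prices = []
for product in products:
    #non-empty list -> discounted product, take only the class-2 price
    discount_price = product.find_elements(By.XPATH, './/span[@class="class-2"]')
    if discount_price:
        prices.append(discount_price[0].text)
    else:
        prices.append(product.find_element(By.XPATH, './span').text)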
I am trying to pull the span (let's call it AAA) that comes before a specific span, BBB. This BBB span only shows up certain times on the page, and I only want the AAAs which directly precede the BBBs.
Is there a way to select only the AAAs that are directly followed by a BBB? Or, to get to my proposed question, how can you use find_previous when you're running a select query? I am successful if I just use select_one -
AAA = selsoup.select_one('span.BBB').find_previous().text
but when I try to use select to pull all entries I get an error message (You're probably treating a list of elements like a single element).
I've tried applying .find_previous in a for loop but that doesn't work either. Any suggestions?
Sorry, I probably should have added this before:
Adding code from the page -
<tr class="tree">
<th class="AAA">What I want right here<span class="BBB">(Aba: The New Look)</span></th>
Instead of .find_previous() you can use + in your CSS selector:
from bs4 import BeautifulSoup
html_doc = """
<span class="ccc"">txt</span>
<span class="aaa"">This I don't Want</span>
<span class="bbb"">txt</span>
<span class="aaa"">* This I Want *</span>
<span class="ccc"">txt</span>
<span class="aaa"">This I don't Want</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for aaa in soup.select(".bbb + .aaa"):
    print(aaa.text)
Prints:
* This I Want *
EDIT: Based on your edit:
bbb = soup.select_one(".AAA .BBB")
print(bbb.text)
Prints:
(Aba: The New Look)
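And if what you are actually after is the th's own text ("What I want right here") rather than the nested BBB span, one option is to take the tag's first direct string child. A sketch against the snippet from the question:

from bs4 import BeautifulSoup

html_doc = '<th class="AAA">What I want right here<span class="BBB">(Aba: The New Look)</span></th>'
soup = BeautifulSoup(html_doc, "html.parser")

th = soup.select_one("th.AAA")
#first direct string child of the th, skipping the nested BBB span
print(th.find(string=True, recursive=False).strip())

Prints:
What I want right here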
One of the final steps in my project is to get the price of a product. I got everything I need except the price.
Source:
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need to get is the text that comes after the
==">
I don't know if there is some protection on the encoded part, but the closest I get is returning this: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
I don't know if it's relevant, but I'm using "html.parser" for the parsing.
P.S. I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if I get no price when parsing the text, can the other methods get it without switching to a different parser?
EDIT 2:
This is my code:
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div", {"id": "stationList"})
links = pricebox[0].findAll("a")
det = links[0].findAll("div")
det[7].text
# or
det[7].get_text()
The result is ''.
With Regex
I suppose there are ways to do this using BeautifulSoup; anyway, here is one approach using regex.
import regex  # the third-party 'regex' module; unlike the built-in 're', it supports the variable-length lookbehind used below

# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']

# or, as floats
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching for: \d means digits, \. denotes the period, and the two combined in square brackets with the + means we want to find at least one digit/period.
The parenthesized lookarounds before/after further specify what has to precede/follow a potential match.
(?<=data\-price[\=\"\w]+\>) means that right before any potential match there must be data-price...>, where ... is at least one of the symbols A-z0-9=".
Finally, (?=\<\/div) means that any match must be followed by </div.
With lxml
Here is an approach using the lxml module:
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. As you are able to get this <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own that means you only need prices now and for that you can use get_text() which is an inbuilt function present in BeautifulSoup.
This function returns whatever the text is in between the tags.
Syntax of get_text() :tag_name.get_text()
Solution to your problem :
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Searching for all the div tags with class:encoded
a = soup.findAll('div', {'class': 'encoded'})
# Using list comprehension to get the price out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)
I'm trying to use Scrapy with a CSS path to get the text in the fields of a number of span items. The HTML looks like this:
<div class="announcement">
<span title="Name">Homer Simpson</span>
<span title="Date">2018-09-19</span>
<span title="Type">House</span>
</div>
I have tried with this:
response.css("div.announcement span::attr(title)").extract()
# ['Name', 'Date', 'Type']
response.css("div.announcement span::text").extract()
# ['Homer Simpson', '2018-09-19', 'House']
But that only gives me a repeated list of the span titles, or all of the values; I just want one at a time. What I would like to have is something like:
response.css("div.announcement <SomeMagicHere>('Name')").extract()
# ['Homer Simpson']
How can I get a list of only the content of each of the title items, separately?
You can use "contains" attribute:
response.css("div.announcement span[title*='Name']::text").extract()
Actually, the situation is a little more complex.
I'm trying to get data from this example html:
<li itemprop="itemListElement">
<h4>
one
</h4>
</li>
<li itemprop="itemListElement">
<h4>
two
</h4>
</li>
<li itemprop="itemListElement">
<h4>
three
</h4>
</li>
<li itemprop="itemListElement">
<h4>
four
</h4>
</li>
For now, I'm using Python 3 with urllib and lxml.
For some reason, the following code doesn't work as expected (Please read the comments)
import urllib.request
from lxml import html

scan = []
example_url = "path/to/html"
page = html.fromstring(urllib.request.urlopen(example_url).read())
# Extracting the li elements from the html
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

# At this point, the list 'scan' length is 4 (Nothing wrong)
for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))
So as you can see, the first move is to extract the 4 li elements and append them to a list, then scan each li element for a elements; but the problem is that the query on each li element in scan actually matches all four anchors.
...Or so I thought.
Doing some quick debugging, I found that the scan list contains the four li elements correctly, so I came to one possible conclusion: there's something wrong with the for loop above.
for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))
    # Something is wrong here...
The only real problem is that I can't pinpoint the bug. What causes that?
PS: I know, there's an easier way to get the a elements from the list, but this is just an example html, the real one contains many more... things.
In your example, when the XPath starts with //, it will start searching from the root of the document (which is why it was matching all four of the anchor elements). If you want to search relative to the li element, then you would omit the leading slashes:
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

for list_item in scan:
    print(len(list_item.xpath("h4/a")))
Of course you can also replace // with .// so that the search is relative as well:
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

for list_item in scan:
    print(len(list_item.xpath(".//h4/a")))
Here is a relevant quote taken from the specification:
2.5 Abbreviated Syntax
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.
print(len(list_item.xpath(".//h4/a")))
// means /descendant-or-self::node(); since the expression starts with /, it searches from the root node of the document.
Use . so that the current context node is list_item, not the whole document.
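A self-contained check of the difference, using a hypothetical two-item fragment (assumed markup, not the asker's real page):

from lxml import html

doc = html.fromstring("""
<ul>
<li itemprop="itemListElement"><h4><a>one</a></h4></li>
<li itemprop="itemListElement"><h4><a>two</a></h4></li>
</ul>
""")

for li in doc.xpath("//li[@itemprop='itemListElement']"):
    # '//h4/a' searches the whole document: prints 2 for every li
    # './/h4/a' searches only under this li: prints 1
    print(len(li.xpath("//h4/a")), len(li.xpath(".//h4/a")))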