Scraping data when a parent tag has a child for some element only - python-3.x

I am trying to scrape data from an e-commerce site for a certain product. The results page lists 50 products. Some products show only the original price, while others show a discounted price with the original price struck out. The HTML for each case is below.
For non-discounted products:
<div class="class-1">
<span>
Rs. 7999
</span>
</div>
For discounted products:
<div class="class-1">
<span>
<span class="class-2">
Rs. 11621
</span>
<span class="class-3">
Rs. 15495
</span>
</span>
<span class="class-4">
(25% OFF)
</span>
</div>
What should the result be?
I want code that iterates through the list of products and extracts the text from the div[@class='class-1']/span tag for non-discounted products. Where a child span[@class='class-2'] is present, it should extract the text from that tag only, and not from the span[@class='class-3'] tag.
Please help!!

If I understand you correctly, first you need to get the list of products with:
products = driver.find_elements_by_xpath('//div[@class="class-1"]')
Now you can iterate through the list of products and grab the prices as follows:
prices = []
for product in products:
    # find_elements (plural) returns an empty list when nothing matches
    discount_price = product.find_elements_by_xpath('.//span[@class="class-2"]')
    if discount_price:
        prices.append(discount_price[0].text)
    else:
        prices.append(product.find_element_by_xpath('./span').text)
Explanation:
For each product, I check for the existence of a .//span[@class="class-2"] child element, as you described.
If such an element exists, product.find_elements_by_xpath('.//span[@class="class-2"]') returns a non-empty list of web elements. A non-empty list is truthy in Python, so the if branch runs.
Otherwise the list is empty and the else branch runs.
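The same branching logic can be tried without a browser. The sketch below runs it against the two HTML snippets from the question using Python's standard-library xml.etree.ElementTree, which supports the small XPath subset needed here. The <root> wrapper and the extract_prices name are assumptions made for illustration, and this is not a replacement for the Selenium calls above.

```python
import xml.etree.ElementTree as ET

# The two product snippets from the question, wrapped in a hypothetical
# <root> element so the markup parses as a single XML document.
PAGE = """<root>
  <div class="class-1">
    <span>Rs. 7999</span>
  </div>
  <div class="class-1">
    <span>
      <span class="class-2">Rs. 11621</span>
      <span class="class-3">Rs. 15495</span>
    </span>
    <span class="class-4">(25% OFF)</span>
  </div>
</root>"""


def extract_prices(markup):
    """Return one price per product, preferring the class-2 span if present."""
    root = ET.fromstring(markup)
    prices = []
    for product in root.findall(".//div[@class='class-1']"):
        discounted = product.findall(".//span[@class='class-2']")
        if discounted:
            # Non-empty list: the product has a discounted price.
            prices.append(discounted[0].text.strip())
        else:
            # No class-2 span: take the first direct child span.
            prices.append(product.find("./span").text.strip())
    return prices


prices = extract_prices(PAGE)
print(prices)  # ['Rs. 7999', 'Rs. 11621']
```

The struck-out price in class-3 is never read, because the discounted branch only looks at the class-2 match.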

Related

Is it possible to make checkout page by yourself?

I tried to make it so that clicking Buy goes to the Stripe checkout page, but I need the price to change when I switch the option to "Exlusive", while the title and everything else stay the same.
<section id="prodetails" class="section-p1">
<div class="single-pro-image">
<img src="img/Cover Arts/Autumn.png" width="100%" id="MainImage" alt="">
</div>
<div class="single-pro-details"> <!--Fix in css-->
<h6>Home / Beats</h6>
<h4>Melodic Pop Beat = "Autumn"</h4>
<h2 id="price">$5</h2>
<select id="select">
<option>Select Licence</option>
<option>MP3</option>
<option>Tagged Wav</option>
<option>Un-Tagged Wav</option>
<option>Stems</option>
<option>Exlusive</option>
</select>
<script>
let select = document.getElementById('select');
let price = document.getElementById('price');
// Prices
let prices = {
"Select Licence": '$5',
"MP3": '$5',
"Tagged Wav": '$7',
"Un-Tagged Wav": '$10',
"Stems": '$15',
"Exlusive": '$50'
}
// When the select value changes, this event listener fires and sets the price text to the corresponding value from the prices object
select.addEventListener('change', () => {
price.textContent = prices[select.value];
});
</script>
<!--<h2 id="sproduct-price">$25</h2>-->
<button class="normal">Add to Card</button>
<h4>Product Details</h4>
<span>Melodic Pop Beat - Autumn,Pop Beat in G Minor and BPM of 130,Beat is simple and melodic,It has the vibe of Dua Lipa and Weeknd Beat,The prices is great;just $5 for an tagged and mastered MP3,$7 tagged and unmastered Wav,$10 for un-tagged unmasterd and $15 for Stems,Exlusive are $50</span>
</div>
<div id="audio">
<audio controls style="width:100%;">
<source src="Audio/Dua lipa 130 x Gmin.mp3" type="audio/mpeg">
</audio>
</div>
</section>
Your button is set to go to a Stripe Payment Link, which is a reusable link that will always charge someone for the same thing.
If you want people to be able to purchase different things you need to create different Payment Links for the different options, then adjust your page so it sends them to the right Payment Link based on what they want to buy.

XPath: how to get the last first-level child when the number of children is not always the same

With the following code:
data = driver.find_elements(By.XPATH, '//div[@class="postInfo desktop"]/span[@class="nameBlock"]')
I got the HTML below:
<span class="nameBlock">
<span class="name">Anonymous</span>
<span class="posteruid id_RDS8pJvL">(ID:
<span class="hand" title="Highlight posts by this ID" style="background-color: rgb(228, 51, 138); color: white;">RDS8pJvL</span>)</span>
<span title="United States" class="flag flag-us"></span>
</span>
And
<span class="nameBlock">
<span class="name">Pierre</span>
<span class="postertrip">!AYZrMZsavE</span>
<span class="posteruid id_y5EgihFc">(ID:
<span class="hand" title="Highlight posts by this ID"
style="background-color: rgb(136, 179, 155); color: black;">y5EgihFc</span>)</span>
<span title="Australia" class="flag flag-au"></span>
</span>
Now I need to get the countries: "United States" and "Australia".
With the whole dataset (more than 120k entries), I was doing:
for i in data:
    country = i.find_element(By.XPATH, './/span[contains(@class,"flag")]').get_attribute('title')
But after a while I got empty entries, and I figured out that sometimes the class of the country span changes completely, from "flag something" to "bf something" or "cd something".
This is why I decided to go with the last child of each element:
for i in data:
    country = i.find_element(By.XPATH, './/span[3]').get_attribute('title')
But after a while I got errors again, because sometimes a <span class="postertrip">BLABLA</span> appears, moving the country to span[4].
So I changed it to the following:
for i in data:
    country = i.find_element(By.XPATH, './/span[last()]').get_attribute('title')
But this last one always gives me the second-level child (the child of posteruid):
<span class="hand" title="Highlight posts by this ID"
style="background-color: rgb(136, 179, 155); color: black;">y5EgihFc</span>)
One thing I'm certain of: the country is ALWAYS the last child (span) of the first level of children.
I'm out of ideas, which is why I'm asking this question.
Use the following XPath to always identify the last child of the parent.
(//span[@class='nameBlock']//span[@title])[last()]
Code block:
for country in driver.find_elements(By.XPATH, "(//span[@class='nameBlock']//span[@title])[last()]"):
    print(country.get_attribute("title"))
For this particular case, you can get the titles without counting the child nodes. Just keep the nameBlock as the root and write the XPath to point to the child whose class marks the element carrying the title (flag, in this case), like this:
//span[@class='nameBlock']/span[contains(@class,'flag')]
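The question asks for the last first-level child of each nameBlock. A minimal sketch of that idea follows, using the relative expression ./span[last()] evaluated from each block. A single slash restricts the match to direct children, whereas .//span[last()] also matches nested spans that are last among their own siblings. This is not the expression from either answer above. It runs the sample markup through Python's standard-library xml.etree.ElementTree (which supports this XPath subset) so it works without a browser. The <root> wrapper and the trimmed style attributes are assumptions for illustration. The same relative path should work with Selenium as i.find_element(By.XPATH, './span[last()]').

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <root> element
# and with the style attributes dropped for brevity.
MARKUP = """<root>
  <span class="nameBlock">
    <span class="name">Anonymous</span>
    <span class="posteruid id_RDS8pJvL">(ID:
      <span class="hand" title="Highlight posts by this ID">RDS8pJvL</span>)</span>
    <span title="United States" class="flag flag-us"></span>
  </span>
  <span class="nameBlock">
    <span class="name">Pierre</span>
    <span class="postertrip">!AYZrMZsavE</span>
    <span class="posteruid id_y5EgihFc">(ID:
      <span class="hand" title="Highlight posts by this ID">y5EgihFc</span>)</span>
    <span title="Australia" class="flag flag-au"></span>
  </span>
</root>"""

root = ET.fromstring(MARKUP)
countries = []
for block in root.findall(".//span[@class='nameBlock']"):
    # "./span" looks only at direct children of the block, so the nested
    # "hand" span inside posteruid is never a candidate.
    last_child = block.find("./span[last()]")
    countries.append(last_child.get("title"))

print(countries)  # ['United States', 'Australia']
```

Because the path does not depend on the class name or on a fixed position, it is unaffected by the extra postertrip span in the second block.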

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from one span tag among several span tags with similar names. I'm using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify the specific gross-amount span element.
The span tag has name="nv" and a data-value attribute, but the votes span has those too. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text of the span that follows the text-muted span containing "Gross:".
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701

Python Splinter Star Ratings

Given the star ratings under the "Recent Comments" section here,
I am trying to build a list of the star rating per comment shown on the page.
The trouble is that the star rating objects do not have a value.
For example, I can get an individual star object via xpath like this:
from splinter import Browser
browser = Browser()  # assumed: the original snippet did not show how the browser was created
url = 'https://www.greatschools.org/texas/harker-heights/3978-Harker-Heights-Elementary-School/'
browser.visit(url)
astar = browser.find_by_xpath('/html/body/div[5]/div[4]/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div[3]/div/div[2]/div[1]/div[2]/span/span[1]')
The rub is that I cannot seem to access the value (filled in or not) for the object astar.
Here's the HTML:
<div class="answer">
<span class="five-stars">
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
</span>
</div>
UPDATE:
Some comments do not have star ratings at all, so I need to be able to determine if a particular comment has a star rating and, if so, what the rating is.
This seems helpful for at least getting a list of all stars. I used it to do this:
stars = browser.find_by_css('span[class="icon-star filled-star"]')
So if I can get one list showing whether each comment has a star rating (something like ratings = [1,0,1,1...]) and another with the sequence of all stars (i.e. ['Filled', 'Filled', 'Empty'...]), I think I can piece the ratings together.
One solution:
Access the html attribute of each object like this:
# Get the total number of comments
allcoms = len(browser.find_by_text('Overall experience'))
# Loop through all comments and gather them into a list
comments = []
# If a pop-up box occurs, use div[4] instead of the second div[5]
if browser.is_element_present_by_xpath('/html/body/div[5]/div[4]/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div[1]/div/div[2]'):
    use = '4'
else:
    use = '5'
for n in range(allcoms):  # sometimes the second div[5] was div[4]
    comments.append(browser.find_by_xpath('/html/body/div[5]/div['+use+']/div[2]/div[11]/div/div/div[2]/div/div/div[2]/div/div[2]/div['+str(n+1)+']/div/div[2]').value)
# Get all corresponding star ratings
# https://stackoverflow.com/questions/46468030/how-select-class-div-tag-in-splinter
ratingcode = []
ratings = browser.find_by_css('span[class="five-stars"]')
# Skip the first 2 ratings, and stop after len(comments) items so the last 3
# are skipped too, because those belong to other ratings on the page
for a in range(2, len(comments) + 2):
    ratingcode.append(ratings[a].html)
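Each entry in ratingcode is the inner HTML of one five-stars span, so a numeric rating can be derived by counting the child spans that carry the filled-star class. The sketch below is one way to do that with a regular expression. count_filled_stars is a hypothetical helper, and the empty-star class in the sample is an assumption, since the question only shows a fully filled rating.

```python
import re


def count_filled_stars(rating_html):
    """Count elements whose class list contains the token 'filled-star'."""
    class_values = re.findall(r'class="([^"]*)"', rating_html)
    # Split each class attribute into tokens so that a class such as
    # "unfilled-star" would not be counted by accident.
    return sum("filled-star" in value.split() for value in class_values)


# Hypothetical inner HTML of one five-stars span: three filled stars
# followed by two stars with an assumed "empty-star" class.
sample = '''<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star filled-star"></span>
<span class="icon-star empty-star"></span>
<span class="icon-star empty-star"></span>'''

rating = count_filled_stars(sample)
print(rating)  # 3
```

Applied to the list built above, this would be something like [count_filled_stars(code) for code in ratingcode].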

Expression Engine, filter relationship's entry by category id

I have two channels, Market and Family. Both use the same ExpressionEngine category group.
I want to print out all the entries of the Market channel with category XY, and for each market I want to print ONLY the first related Family entry of category XY.
In my solution, it seems that the category parameter inside the relationship field "market-families" doesn't work. Here is the code:
{exp:channel:entries channel="Market" category="{segment_2_category_id}" orderby="title" sort="asc"}
{if "{url_title}" == "{segment_3}"}
<li class="active">
{if:else}
<li>
{/if}
{market-families orderby="title" sort="asc" category="{segment_2_category_id}" limit="1"}
{title}
{/market-families}
</li>
{/exp:channel:entries}
Legend:
{segment_2_category_id} -> plugin to get the category id from a segment.
market-families -> Multiple relationship field inside channel Market
Thank you for any help :)
Have you tried manually entering the category id in the parameter instead of using the plugin just to verify that it's not the plugin?
I couldn't find any specific reference to the relationship field being able to use the category parameter in ExpressionEngine's documentation: http://ellislab.com/expressionengine/user-guide/modules/channel/relationships.html

Resources