How can I find the element in this page for scraping? - python-3.x

The website is basically a colour swatch as can be seen here:
https://prnt.sc/sux913
Once I get to this page I need to go through all the colours and have the program report the stock quantity for a size I've specified. (Sometimes somebody wants to know the available quantity of every colour in a large.) At this point I'm pretty lost, as I can't find the element I'm meant to be referencing in my code. If the specified size is 'L', I need to go through each colour and report the qty for L. For example: Black L - 9, Navy L - 23, Red L - 334
<div class="prod__options">
<span class="attr_text">1.Pick a Color</span>
<p id="prod_color_swatch_area_id" class ="prod_color_swatch_area">
<a id="prod_color_box_Black" href ="javaScript: categoryDisplayJS.displayColorSelected('BK ', 'Black'); categoryDisplayJS.displayPDPErrorSection(null,null,null); setCurrentId('prod_color_box_Black'); if(submitRequest()){ cursor_wait();
wc.render.updateContext('ProductPageMatrixDisplay_Context',{'productId':'497940','colorSelected':'Black'});}" title ="Black">
<img src="https://a248.e.akamai.net/f/248/9086/10h/origin-d5.scene7.com/is/image/Hanesbrands/HBI_498P_Black_sw?$productSwatch$" border="1" onclick="setOptions('Black', true)" />
</a>
<a id="prod_color_box_Smoke Gray" href ="javaScript: categoryDisplayJS.displayColorSelected('8Q', 'Smoke Gray'); categoryDisplayJS.displayPDPErrorSection(null,null,null); setCurrentId('prod_color_box_Smoke Gray'); if(submitRequest()){ cursor_wait();
wc.render.updateContext('ProductPageMatrixDisplay_Context',{'productId':'497940','colorSelected':'Smoke Gray'});}" title ="Smoke Gray">
<img src="https://a248.e.akamai.net/f/248/9086/10h/origin-d5.scene7.com/is/image/Hanesbrands/HBI_498P_SmokeGray_sw?$productSwatch$" border="1" onclick="setOptions('SmokeGray', true)" />
</a>
<a id="prod_color_box_Charcoal Heather" href ="javaScript: categoryDisplayJS.displayColorSelected('HL', 'Charcoal Heather'); categoryDisplayJS.displayPDPErrorSection(null,null,null); setCurrentId('prod_color_box_Charcoal Heather'); if(submitRequest()){ cursor_wait();
wc.render.updateContext('ProductPageMatrixDisplay_Context',{'productId':'497940','colorSelected':'Charcoal Heather'});}" title ="Charcoal Heather">
<img src="https://a248.e.akamai.net/f/248/9086/10h/origin-d5.scene7.com/is/image/Hanesbrands/HBI_498P_CharcoalHeather_sw?$productSwatch$" border="1" onclick="setOptions('CharcoalHeather', true)" />
</a>
<a id="prod_color_box_Navy" href ="javaScript: categoryDisplayJS.displayColorSelected('NY', 'Navy'); categoryDisplayJS.displayPDPErrorSection(null,null,null); setCurrentId('prod_color_box_Navy'); if(submitRequest()){ cursor_wait();
wc.render.updateContext('ProductPageMatrixDisplay_Context',{'productId':'497940','colorSelected':'Navy'});}" title ="Navy">
<img src="https://a248.e.akamai.net/f/248/9086/10h/origin-d5.scene7.com/is/image/Hanesbrands/HBI_498P_Navy_sw?$productSwatch$" border="1" onclick="setOptions('Navy', true)" />
The last code I tried was as below:
import bs4
import requests

# browser is the existing Selenium WebDriver driving the page
html_content = requests.get(browser.current_url)
print(browser.current_url)  # just to check what the URL is
print(html_content.raise_for_status())  # raise_for_status() returns None on success
soup = bs4.BeautifulSoup(html_content.text, 'html.parser')
ele_color_swatch = soup.select('#prod_color_swatch_area_id')
print(ele_color_swatch)
However, that just gave:
https://www... (some long url)
None
[]
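A likely cause of the empty result is that the swatch area is rendered by JavaScript after the page loads (note the wc.render.updateContext calls in the markup), so a plain requests.get of the same URL returns HTML without it. Since a Selenium browser is already driving the page, one option is to parse the live DOM from browser.page_source instead. A minimal sketch, assuming browser is the existing WebDriver:

import bs4

# Parse the DOM as the browser currently sees it, after JavaScript has run,
# instead of re-fetching the URL with requests.
soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')

swatch_area = soup.select_one('#prod_color_swatch_area_id')
if swatch_area is not None:
    # Each swatch is an <a id="prod_color_box_..."> whose title is the colour name.
    for anchor in swatch_area.select('a[id^="prod_color_box_"]'):
        print(anchor.get('title'))
else:
    print('Swatch area not found - the page may not have finished rendering yet')

Reading the per-size quantities would still mean clicking each swatch, since the stock matrix is refreshed via wc.render.updateContext for the selected colour.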

Related

How would I scrape these nested img tags?

I was scraping this site for titles and was also trying to scrape the images that go with each title. It turns out that, when scraped, the following data was returned:
<div itemscope itemtype="https://schema.org/ItemList" class="group card-8-group-1 clearfix">
<meta itemprop="itemListOrder" content="https://schema.org/ItemListOrderDescending" />
<article itemprop="itemListElement" itemscope itemtype="https://schema.org/Article" class="card card-1 news-card-1 card-type-article type-article" data-sponsorship-type="card" data-sponsorship-article-id="1qo8sz0z1kaqb1dpj038v8658h" data-sponsorship-article-type="article" data-sponsorship-primary-tag="1pgecmpab62ei1akyb084izq3o" data-sponsorship-secondary-tag="22doj4sgsocqpxw45h607udje">
<a data-side="link" href="/en/news/spurs-investigation-aurier-appears-break-lockdown-protocols/1qo8sz0z1kaqb1dpj038v8658h" itemprop="url" data-sponsorship-slot="card" data-sponsorship-slot-id="front" class="type-article">
<div class="picture article-image" data-module="responsive-picture">
<img class="picture__image picture__image--lazyload" data-srcset="&quality=60&w=640 320w,&quality=60&w=560 480w,&quality=60&w=690 740w,&quality=60&w=800 980w,&quality=60&w=970 1580w" />
<noscript class="picture__polyfill"> <img src="https://images.daznservices.com/di/library/GOAL/5f/da/serge-aurier_191f5i34z69us1fausrs9k0mjk.jpg?t=1445827096&quality=60&h=170" alt="Serge Aurier" /> </noscript>
</div>
<div class="title">
<h3 title="Spurs launch investigation as Aurier appears to break lockdown protocols for a third time" itemprop="headline">Aurier appears to break lockdown protocols for a third time</h3>
<div class="image" data-sponsorship-slot="card" data-sponsorship-slot-id="image"></div>
</div>
It appears the page is using lazy loading. My question is: how can I extract the img at its full size?
To get the full-scale image, just replace w=55 with w=970 (or bigger) in the image URL.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.goal.com/en/premier-league/2kwbbcootiqqgmrzs6o5inle5'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for title, image in zip(soup.select('.card-type-article h3'),
                        soup.select('.card-type-article img')):
    title = title.get_text(strip=True)
    full_img_url = image['src'].replace('w=55', 'w=970')
    print('{:<70}{}'.format(title, full_img_url))
Prints:
Wenger calls for FFP reform amid Newcastle takeover talk https://images.daznservices.com/di/library/GOAL/63/cd/arsene-wenger-2019_13luew9ltpa2g1l1r6ziuxpwbw.jpg?t=1363081390&quality=60&w=970
'Special Havertz is half-Ozil, half-Ballack & would thrive in PL' https://images.daznservices.com/di/library/GOAL/cc/18/kai-havertz_7sugon9o7ljy1fg2xzkv1mqcm.jpg?t=-1186202400&quality=60&w=970
Solskjaer: I'd rather a hole in my squad than an asshole https://images.daznservices.com/di/library/GOAL/78/f2/ole-gunnar-solskjaer-manchester-united-2019-20_1vfk6liknrjlx1r8aumegh4cxe.jpg?t=-749345265&quality=60&w=970
Maguire praises Man Utd's 'safe' training return https://images.daznservices.com/di/library/GOAL/5d/e8/harry-maguire-man-utd_13ewrih27ahmb13i1zxfjrhrp8.jpg?t=-444094625&quality=60&w=970
Jorginho's agent opens door for Juve move https://images.daznservices.com/di/library/GOAL/69/da/jorginho-chelsea-2019-20_15zh5m3ojefx0zl1ei7qsyc14.jpg?t=-1675997073&quality=60&w=970
Premier League clubs near approval for contact training https://images.daznservices.com/di/library/GOAL/79/ce/mohamed-salah-dejan-lovren-liverpool-training_7zq70upa8l1618svdzls077xn.jpg?t=143669454&quality=60&w=970
Ceballos reiterates desire to succeed at Real Madrid https://images.daznservices.com/di/library/GOAL/97/c6/dani-ceballos-arsenal_1sywf8w828w4b193xoz5c82uuf.jpg?t=-1552361252&quality=60&w=970
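Note that the snippet above relies on each <img> having a src attribute; for cards that are still lazy-loaded, the real URL may only exist in data-srcset or in the <noscript> fallback shown in the question's HTML. A hedged sketch of reading that fallback (reusing the soup object from above):

# Lazy-loaded cards may lack 'src' on the main <img>; the <noscript>
# fallback carries a plain <img> whose src is a complete URL.
for noscript_img in soup.select('.card-type-article noscript img'):
    full_img_url = noscript_img['src']
    # The sizing query parameters (quality, w, h) can be rewritten as before.
    print(full_img_url)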

How can I get texts with certain criteria in python with selenium? (texts with certain siblings)

It's a really tricky one for me, so I'll describe the question in as much detail as possible.
First, let me show you some example of html.
....
....
<div class="lawcon">
<p>
<span class="b1">
<label> No.1 </label>
</span>
</p>
<p>
"I Want to get 'No.1' label in span if the div[#class='lawcon'] has a certain <a> tags with "bb" title, and with a string of 'Law' in the text of it."
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Law Power</a>
</p>
</div>
<div class="lawcon">
<p>
<span class="b1">
<label> No.2 </label>
</p>
<p>
"But I don't want to get No.2 label because, although it has <a> tag with "bb" title, but it doesn't have a text of law in it"
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Just Power</a>
</p>
</div>
<div class="lawcon">
<p>
<span class="b1">
<label> No.3 </label>
</p>
<p>
"If there are multiple <a> tags with the right criteria in a single div, I want to get span(No.3) for each of those" <a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Lawyer</a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">By the Law</a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">But not this one</a>
...
...
...
So, here is the thing. I want to extract the text of the label (e.g. No.1) in div[@class='lawcon'] only if the div has an <a> tag with a 'bb' title and with the string 'Law' in it.
If inside the div there isn't any <a> tag with a 'bb' title, or no string 'Law' in it, the span should not be collected.
What I tried was:
div_list = [div.text for div in driver.find_elements_by_xpath('//span[following-sibling::a[@title="bb"]]')]
But the problem is, when a single div has multiple tags matching the criteria, it returns just one entry for that div.
What I want to have is a list (or tuple) pairing the span locations (the label numbers) with the text of those tags.
So it should be like
[[No.1 - Law Power], [No.3 - Lawyer], [No.3 - By the Law]]
I'm not sure I have explained it well enough. Thank you for your interest and, hopefully, you can enlighten me with your knowledge! I really appreciate it in advance.
Here is a simple Python script to get your desired output.
links = driver.find_elements_by_xpath("//a[@title='bb' and contains(.,'Law')]")
linkData = []
for link in links:
    currentList = []
    currentList.append(link.find_element_by_xpath("./ancestor::div[@class='lawcon']//label").text + '-' + link.text)
    linkData.append(currentList)
print(linkData)
Output:
[['No.1-Law Power'], ['No.3-Lawyer'], ['No.3-By the Law']]
I am not sure why you want the output in that format. I would prefer the approach below, so that you know how many divs have matching links, and you can then access the links in the output on a per-div basis. Just a thought.
divs = driver.find_elements_by_xpath("//a[@title='bb' and contains(.,'Law')]//ancestor::div[@class='lawcon']")
linkData = []
for div in divs:
    currentList = []
    for link in div.find_elements_by_xpath(".//a[@title='bb' and contains(.,'Law')]"):
        currentList.append(div.find_element_by_xpath(".//label").text + '-' + link.text)
    linkData.append(currentList)
print(linkData)
Output:
[['No.1-Law Power'], ['No.3-Lawyer', 'No.3-By the Law']]
As your requirement is to extract the texts No.1 and so on, which are within <label> tags, you have to induce WebDriverWait for visibility_of_all_elements_located(); you will have only 2 matches (against your expectation of 3), and you can use the following locator strategy:
Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='lawcon']//a[@title='bb' and contains(.,'Law')]//preceding::label[1]")))])
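Note that the one-liner assumes the usual Selenium imports, and that the find_elements_by_* helpers used above are deprecated (and removed in Selenium 4) in favour of the By API. A minimal sketch of the equivalent modern call:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4 style: driver.find_elements(By.XPATH, ...) replaces
# the deprecated driver.find_elements_by_xpath(...)
labels = WebDriverWait(driver, 5).until(
    EC.visibility_of_all_elements_located((
        By.XPATH,
        "//div[@class='lawcon']//a[@title='bb' and contains(.,'Law')]//preceding::label[1]",
    ))
)
print([label.get_attribute('innerHTML') for label in labels])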

Python web scraping style content

I just want to pull data from HTML using Python. (I need data = 20%.)
Any help on this would be greatly appreciated.
<div class="ratings-container">
<div class="ratings">
<div class="ratings active" style="width: 20%"></div>
</div>
</div>
I don't know how to get the style content. The following similar code's result is NULL:
mratingNew = tag.findAll('div', attrs={"class": "ratings active"})
for i in range(len(muserName)):
    print(mratingNew[i]['style'])
You can get the width using find and then split it on ':':
from bs4 import BeautifulSoup

html = '''<div class="ratings-container">
<div class="ratings">
<div class="ratings active" style="width: 20%"></div>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
finddiv = soup.find('div', attrs={'class': 'ratings active'})
style = finddiv['style']
style = style.split(':', 1)[-1].strip()
print(style)
OUTPUT :
20%
If you have more than one width with the same class name, like:
html = '''<div class="ratings-container">
<div class="ratings">
<div class="ratings active" style="width: 20%"></div>
<div class="ratings active" style="width: 40%"></div>
<div class="ratings active" style="width: 30%"></div>
</div>
</div>'''
You need to use findAll and split them one by one:
find_last_div = soup.findAll('div', attrs={'class': 'ratings active'})
for width_value in find_last_div:
    width_Get = width_value['style'].split(':', 1)[-1].strip()
    print(width_Get)
OUTPUT :
20%
40%
30%
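If the style attribute ever carries more than one declaration (e.g. color plus width), splitting on the first ':' breaks. A small sketch using a regex keyed to the width property, under that assumption:

import re
from bs4 import BeautifulSoup

html = '<div class="ratings active" style="color: red; width: 20%"></div>'
soup = BeautifulSoup(html, 'html.parser')

for div in soup.findAll('div', attrs={'class': 'ratings active'}):
    # Pull just the width value out of the inline style.
    match = re.search(r'width:\s*([\d.]+%)', div['style'])
    if match:
        print(match.group(1))  # -> 20%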

Scraping multiple similar lines with python

Using a simple request, I'm trying to get some information stored in "alt" attributes on this HTML page. The problem is that, within each instance, the information is split across multiple lines that start with "img"; when I try to access it, I can only read the first "img" and not the rest, and I'm not sure how to fix that. Here's the HTML text:
<div class="archetype-tile-description-wrapper">
<div class="archetype-tile-description">
<h2>
<span class="deck-price-online">
Golgari Midrange
</span>
<span class="deck-price-paper">
Golgari Midrange
</span>
</h2>
<div class="manacost-container">
<span class="manacost">
<img alt="b" class="common-manaCost-manaSymbol sprite-mana_symbols_b" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
<img alt="g" class="common-manaCost-manaSymbol sprite-mana_symbols_g" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
</span>
</div>
<ul>
<li>Jadelight Ranger</li>
<li>Merfolk Branchwalker</li>
<li>Vraska's Contempt</li>
</ul>
</div>
</div>
Having said that, what I'm looking to get from this is both "b" and "g", stored in a single variable.
You can probably grab those <img> elements with the class "common-manaCost-manaSymbol" like this:
imgs = soup.find_all("img",{"class":"common-manaCost-manaSymbol"})
and then you can iterate over each <img> and grab the alt property of it.
alts = []
for i in imgs:
    alts.append(i['alt'])
or with a list comprehension
alts = [i['alt'] for i in imgs]
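Put together, a self-contained version (using a trimmed copy of the question's markup, and joining the alts into one variable as the question asks) might look like this:

from bs4 import BeautifulSoup

html = '''<span class="manacost">
<img alt="b" class="common-manaCost-manaSymbol sprite-mana_symbols_b" src="x.gif" />
<img alt="g" class="common-manaCost-manaSymbol sprite-mana_symbols_g" src="x.gif" />
</span>'''

soup = BeautifulSoup(html, 'html.parser')
imgs = soup.find_all('img', {'class': 'common-manaCost-manaSymbol'})

# Join both alt values into a single variable, e.g. 'bg'
mana = ''.join(i['alt'] for i in imgs)
print(mana)  # -> bg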

How to select only divs with specific children span with xpath python

I am currently trying to scrape information from a particular ecommerce site, and I only want to get product information like product name, price, colour and sizes of those products whose prices have been slashed.
I am currently using XPath.
This is my Python scraping code:
from lxml import html
import requests

class CategoryCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.items = set()

    def __str__(self):
        return 'All Items: {}'.format(self.items)

    def crawl(self):
        self.get_item_from_link(self.starting_url)
        return

    def get_item_from_link(self, link):
        start_page = requests.get(link)
        tree = html.fromstring(start_page.text)
        names = tree.xpath('//span[@class="name"][@dir="ltr"]/text()')
        print(names)
Note this is not the original URL
crawler = CategoryCrawler('https://www.myfavoriteecommercesite.com/')
crawler.crawl()
When the program is run, this is the HTML content returned from the e-commerce site.
Div of Products With Price Slash
div class="products-info">
<h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
<div class="price-container clearfix">
<span class="sale-flag-percent">-22%</span>
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="388990">388,990</span>
</span>
<span class="price -old ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="500000">500,000</span>
</span>
</span>
</div>
</div>
Div of Products with No Price Slash
div class="products-info">
<h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
<div class="price-container clearfix">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="388990">388,990</span>
</span>
</span>
</div>
</div>
Now this is my exact question:
I want to know how to select only the parent divs, i.e.
<div class="price-container clearfix">, that also contain either of these child span classes:
<span class="price -old "> or
<span class="sale-flag-percent">
Thank you all.
One solution would be to get all <div class="price-container clearfix"> elements and iterate over them, checking the text of the whole element for your keywords.
But a better solution is to use conditionals with XPath:
from lxml import html

htmlst = 'your html'
tree = html.fromstring(htmlst)
divs = tree.xpath('//div[@class="price-container clearfix" and .//span[@class = "price -old " or @class = "sale-flag-percent"]]')
print(divs)
This gets all divs with class="price-container clearfix" and then checks whether they contain a span with one of the searched classes.
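Note that @class = "price -old " is an exact string comparison (trailing space included), so it breaks silently if the site tweaks its class list. A looser sketch using contains(), at the cost of possible false positives on similar class names:

from lxml import html

htmlst = '''<div class="price-container clearfix">
<span class="sale-flag-percent">-22%</span>
<span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
</div>'''

tree = html.fromstring(htmlst)
# contains() tolerates extra classes and stray whitespace around the names.
divs = tree.xpath(
    '//div[contains(@class, "price-container")'
    ' and .//span[contains(@class, "price -old") or contains(@class, "sale-flag-percent")]]'
)
print(divs)  # one element per slashed-price product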
