I am trying to learn how to scrape a website. I am using Python 3 and BS4, and I am stuck on a specific problem.
Example: http://www2.hm.com/en_in/productpage.0648256001.html
I am unable to scrape the "Sizes" available in the dropdown menu on the above page, or whether they are sold out. I went through the whole source code but couldn't figure out under which tags the data exists. I am guessing it must be a hidden variable or something?
Okay, so I tracked the XHR requests the website makes and wrote the code below. Basically, it uses Selenium to get the value of the productArticleDetails variable and the URL of the availability endpoint (I could have hardcoded it, but I found the variable it's in, so why not use it).
from itertools import chain
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'http://www2.hm.com/en_in/productpage.0648256002.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
browser = webdriver.Chrome()
browser.get(url)
details = browser.execute_script('return productArticleDetails;')
availability_url = browser.execute_script('return hm.options.product.productAvailabilityServiceUrl;')
browser.quit()
variants = {}  # e.g. one product can be available in different colors
for key, value in details.items():
    # there is a lot of information in the details, not only product variants
    try:
        if 'whitePrice' in value:
            variants[key] = value
    except (AttributeError, TypeError):
        pass
# 'http://www2.hm.com/en_in/getAvailability?variants=0648256001,0648256002,0648256003,0648256006,0648256007,0648256008'
payload = {'variants': ','.join(variants.keys())}
r = requests.get(urljoin(url, availability_url), params=payload)
available_sizes = r.json()['availability']
# r.json() contains:
# availability: ["0648256001001", "0648256001002", "0648256001007",…]
# fewPieceLeft: []
sizes = chain.from_iterable(variant['sizes'] for variant in variants.values())
for size in sizes:
    availability = size['sizeCode'] in available_sizes
    size['available'] = availability  # True/False, feel free to implement handling "fewPieceLeft"
# Output
for variant in variants.values():
    print(f'Variant: {variant["name"]}')  # color in that case
    print('\tsizes:')
    for size in variant['sizes']:
        print(f'\t\t{size["name"]} -> {"Available" if size["available"] else "Sold out"}')
Output:
Variant: Light beige/Patterned
    sizes:
        32 -> Available
        34 -> Available
        36 -> Sold out
        ...
Variant: Orange
    sizes:
        32 -> Available
        ...
The advantage of this approach is that you gain access to a lot of details, such as 'whitePrice': 'Rs. 1,299', 'careInstructions': ['Machine wash at 30°'], 'composition': ['Viscose 100%'], the description and more. You can take a look yourself:
import pprint
pprint.pprint(variants)
The disadvantage is that you need to use Selenium and download a driver. But to be fair, I used Selenium only to get the variables, since extracting this nested JS object with a regex seems impossible to me (correct me if I'm wrong), and browser.execute_script('return productArticleDetails;') is very clear and concise.
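For completeness, the closest Selenium-free fallback I can think of is to slice the object literal out of the page source by counting braces and hope it parses as JSON. This is only a sketch under strong assumptions (the literal must be valid JSON, with no functions, comments, or braces inside string values), and I haven't verified it against H&M's markup:
import json

import requests

def extract_js_object(page_text, var_name):
    # Find the first '{' after the variable name and walk the braces
    # until they balance out; assumes no '{'/'}' inside string values.
    start = page_text.index('{', page_text.index(var_name))
    depth = 0
    for i, ch in enumerate(page_text[start:], start):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                # only works if the object literal happens to be valid JSON
                return json.loads(page_text[start:i + 1])
    raise ValueError(f'unbalanced braces after {var_name}')

page = requests.get('http://www2.hm.com/en_in/productpage.0648256002.html').text
details = extract_js_object(page, 'productArticleDetails')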
There are no hidden variables, and it's totally possible to get the sizes with BeautifulSoup; each size is a <li>:
<li class="item" data-code="0648256001002">
<div class="picker-option"><button type="button" class="option"><span class="value">34</span></button></div>
</li>
You need to match the data-code attribute of the size to the data-articlecode attribute of the "product variant":
<li class="list-item">
<a title="Light beige/Patterned" data-color="Light beige/Patterned"
data-articlecode="0648256001">
...
</a>
</li>
I encourage you to implement this yourself, but I'll try to code it up this evening/tomorrow to make the answer complete. Note, however, that the website is rendered with JavaScript, so in the response to the GET request you won't get the entire HTML you see in the Elements tab of the DevTools. You can use Selenium for that, but personally I'd use Requests-HTML, as in the sketch below.
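A minimal sketch of the Requests-HTML route, assuming the li.item / data-code markup shown above is what the rendered page actually contains (untested against the live site):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://www2.hm.com/en_in/productpage.0648256001.html')
r.html.render()  # downloads Chromium on first use and executes the page's JS

for li in r.html.find('li.item'):
    code = li.attrs.get('data-code')          # e.g. "0648256001002"
    size = li.find('span.value', first=True)
    if code and size:
        # the first 10 digits should match a variant's data-articlecode
        print(code[:10], size.text)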
Related
As the title mentions, I'm trying to create a dictionary, similar to
article name: link
I use BS4 to dive into the HTML and obtain the stuff I need (as it's a different class every time, I'm using a range to get the first 5 and looping through):
import requests
from bs4 import BeautifulSoup as BS

data = requests.get("https://www.marketingdive.com")
soup = BS(data.content, 'html5lib')
top_story = []
for i in range(6):
    items = soup.find("a", {"class": f"analytics t-dash-top-{i}"})
    # print(items.get('href'))
    top_story.append(items)
print(top_story)
The end result is the following:
[None, <a class="analytics t-dash-top-1" href="/news/youtube-shorts-revenue-sharing-creator-economy-TikTok/632272/">
YouTube brings revenue sharing to Shorts as battle for creator talent intensifies
</a>, <a class="analytics t-dash-top-2" href="/news/Walmart-TikTok-Snapchat-Gen-Z-retail-commerce-ads/632191/">
Walmart weds data to popular apps like TikTok in latest ad play
</a>, <a class="analytics t-dash-top-3" href="/news/retail-media-global-ad-spend-groupm/632269/">
Retail media makes up 11% of global ad spend, GroupM says
</a>, <a class="analytics t-dash-top-4" href="/news/mike-hard-lemonade-gen-z-pto/632267/">
Mike’s Hard Lemonade pays consumers to take PTO
</a>, <a class="analytics t-dash-top-5" href="/news/samsung-nbcuniversal-tonight-show-metaverse-fortnite/632194/">
Samsung, NBCUniversal bring Rockefeller Center to the metaverse
</a>]
I have tried splitting the strings, obtaining only the href (as per the docs), and other solutions on here, but I am at a loss; the only thing I can think of is that I have missed a step somewhere. Any answers and comments as to where I can fix this would be appreciated.
from bs4 import BeautifulSoup
import requests
from pprint import pp
from urllib.parse import urljoin
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    # one attribute selector matches every story link whose class starts
    # with "analytics t-dash-top", so no hardcoded range is needed
    goal = {x.get_text(strip=True): urljoin(url, x['href'])
            for x in soup.select('a[class^="analytics t-dash-top"]')}
    pp(goal)

main('https://www.marketingdive.com/')
I'm trying to get the course information on http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext.
In my code, I tried to first click on each course, then get the description in the bubble, and then close the bubble, as it may overlay other course links.
My problem is that I couldn't get the description in the bubble, and some course links were still skipped even though I tried to avoid that by closing the bubble.
Any idea about how to do this? Thanks in advance!
import time

from selenium import webdriver

info = []
driver = webdriver.Chrome()
driver.get('http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext')
for i in range(1, 3):
    for j in range(2, 46):
        try:
            driver.find_element_by_xpath('//*[@id="programrequirementstextcontainer"]/table[' + str(i) + ']/tbody/tr[' + str(j) + ']/td[1]/a').click()
            info.append(driver.find_elements_by_xpath('/html/body/div[8]/div[3]/div/div')[0].text)
            driver.find_element_by_xpath('//*[@id="lfjsbubbleclose"]').click()
            time.sleep(3)
        except:
            pass
I'm not sure why you put static ranges in your for loops, especially since most combinations of the i and j indexes in your XPath don't match any element on the page.
I would suggest finding all the course elements on the page with a single locator and looping through them to get the descriptions from the bubbles.
Use the code below:
course_list = driver.find_elements_by_css_selector("table.sc_courselist a.bubblelink.code")
wait = WebDriverWait(driver, 20)
for course in course_list:
    try:
        print("grabbing info of course : ", course.text)
        course.click()
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.courseblockdesc")))
        info.append(driver.find_element_by_css_selector('div.courseblockdesc>p').text)
        wait.until(EC.visibility_of_element_located((By.ID, "lfjsbubbleclose")))
        driver.find_element_by_id('lfjsbubbleclose').click()
    except:
        print("error while grabbing info")

print(info)
As it takes some time to load the content in the bubble, you should introduce an explicit wait in your script until the bubble content is completely visible, and only then grab it.
Import the packages below to use the waits in the code above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Please note, this code grabs every course description from the bubbles. Let me know if you are looking for something specific rather than all of them.
To load the bubble, the website makes an ajax call.
import requests
from bs4 import BeautifulSoup
def course(course_code):
    params = {"page": "getcourse.rjs", "code": course_code}
    res = requests.get("http://bulletin.iit.edu/ribbit/index.cgi", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    result = {}
    result["description"] = soup.find("div", class_="courseblockdesc").text.strip()
    result["title"] = soup.find("div", class_="coursetitle").text.strip()
    return result
Output for course("CS 522")
{'description': 'Continued exploration of data mining algorithms. More sophisticated algorithms such as support vector machines will be studied in detail. Students will continuously study new contributions to the field. A large project will be required that encourages students to push the limits of existing data mining techniques.',
 'title': 'Advanced Data Mining'}
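To cover the whole program page, you could first collect the course codes with BeautifulSoup and feed them to course(). A sketch, assuming the codes sit in the table.sc_courselist a.bubblelink.code anchors mentioned in the other answer, and that the site separates subject and number with a non-breaking space:
import requests
from bs4 import BeautifulSoup

url = "http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/"
page = BeautifulSoup(requests.get(url).text, "lxml")

# anchor text looks like "CS 522"; the '\xa0' (non-breaking space) is an assumption
codes = {a.get_text(strip=True).replace('\xa0', ' ')
         for a in page.select("table.sc_courselist a.bubblelink.code")}
catalog = {code: course(code) for code in sorted(codes)}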
Hi, I am trying to scrape this website. I originally was using BS4, and that was fine for certain elements (sector, name, etc.), but I am not able to use it to get the financial data. Below I have copied some of the page_source; the "—" should in this case be 0.0663. I believe I am trying to scrape JavaScript-rendered content, and I have looked around but none of the solutions I have seen have worked for me. I was wondering if someone could help me crack this.
Although I will be grateful if someone can post some working code, I would also really appreciate it if you could point me in the right direction to understand what to look for in the HTML that shows me what I need to do and how to get it.
URL: https://www.tradingview.com/symbols/LSE-TSCO/
HTML:
<span class="tv-widget-fundamentals__label apply-overflow-tooltip">
Return on Equity (TTM)
</span>
<span class="tv-widget-fundamentals__value apply-overflow-tooltip">
—
</span>
Python Code:
url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
html = driver.page_source
To get the equity value, induce WebDriverWait() for visibility_of_element_located() with the XPath below.
driver.get(url)
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]"))).text)
You need to import the libraries below.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
You can get the return on equity using XPath:
equity = driver.find_element_by_xpath('/html/body/div[2]/div[4]/div/div/div/div/div/div[2]/div[2]/div[2]/div/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/span[2]').text
print(equity)
The issue here is not whether the element is present, but the time the page takes to load. The page looks very heavy with all those dynamic graphs. Even before the page is fully loaded, the DOM starts to get created and the default values take their place.
WebDriverWait with find_element_* works when the element is not yet present but will appear after a certain time. In your context it is present from the start, so adding a wait won't do much. This is also why you get '—': the element is present, but with its default value.
To fix this, or at least reduce the issue, you can add code that waits until the document readyState is complete.
Something like this can be used:
def wait_for_page_ready_state(driver):
    wait = WebDriverWait(driver, 20)

    def _ready_state_script(driver):
        return driver.execute_async_script(
            """
            var callback = arguments[arguments.length - 1];
            callback(document.readyState);
            """) == 'complete'

    wait.until(_ready_state_script)

wait_for_page_ready_state(driver)
Then, since you brought bs4 into play, this is where I would use it:
import re

from bs4 import BeautifulSoup

financials = {}
for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
    try:
        key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                            "apply-overflow-tooltip"}).text.strip())
        value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
        financials[key] = value
    except AttributeError:
        pass
This will give you every value you need from the financial card.
You can now print the value you need:
print(financials['Return on Equity (TTM)'])
Output:
'0.0663'
Of course you can do the above with Selenium as well, but I wanted to build on what you had started to work with.
Note that this does not guarantee the proper value will always be returned. It did in my case, but since you know the default value, you could add a while loop that runs until the default changes.
[EDIT]
After running my code in a loop, I was hitting the default value about 1 in 5 times. One way to work around it is to create a method that loops until a threshold is reached. In my testing, when the card had loaded, ~90% of the values were updated with digits; when it failed with the default value, all the other values were also at '—'. So one option is to use a threshold (e.g. 50%) and only return the values once it is reached.
def get_financial_card_values(default_value='—', threshold=.5):
    financials = {}
    while True:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                    "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        updated_values = [value for value in financials.values() if value != default_value]
        if financials and len(updated_values) / len(financials) > threshold:
            return financials
With this method, I was always able to retrieve the value you are expecting. Note that if the values never change (a site issue), you will loop forever, so you might want to use a timer instead of while True. I just want to point this out; I don't think it will happen.
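A minimal sketch of that safety valve, factoring the parsing into a helper and giving up after a deadline (the timeout parameter and helper name are mine, not part of the code above):
import re
import time

from bs4 import BeautifulSoup

def parse_financials(page_source):
    # same parsing as above, factored out so the loop stays readable
    financials = {}
    for el in BeautifulSoup(page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
        try:
            key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                "apply-overflow-tooltip"}).text.strip())
            value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
            financials[key] = value
        except AttributeError:
            pass
    return financials

def get_financial_card_values(driver, default_value='—', threshold=.5, timeout=30):
    deadline = time.monotonic() + timeout
    financials = {}
    while time.monotonic() < deadline:
        financials = parse_financials(driver.page_source)
        updated = [v for v in financials.values() if v != default_value]
        if financials and len(updated) / len(financials) > threshold:
            break  # enough real values have replaced the '—' placeholders
    return financials  # possibly still partial if the deadline hit first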
Web URL: https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times
I want to parse the HTML as below:
I want to get all the hrefs within the <li> elements and the highlighted text. I tried the code below:
elementList = driver.find_element_by_class_name('block-wysiwyg').find_elements_by_tag_name("li")
for i in range(len(elementList)):
    driver.find_element_by_class_name('blcokwysiwyg').find_elements_by_tag_name("li").get_attribute("href")
But the block returned None.
Can anyone please help me with the above code?
I suppose it will fetch you the required content.
import requests
from bs4 import BeautifulSoup
link = 'https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times'
r = requests.get(link)
soup = BeautifulSoup(r.text,"html.parser")
for item in soup.select(".block-wysiwyg li"):
    item_text = item.get_text(strip=True)
    item_link = item.select_one("a[href]").get("href")
    print(item_text, item_link)
Try it this way:
coronas = driver.find_element_by_xpath("//div[@class='block-wysiwyg']/ul/li")
hr = coronas.find_element_by_xpath('./a')
print(coronas.text)
print(hr.get_attribute('href'))
Output:
The coronavirus is touching the lives of all Americans, but race, age, and income play a big role in the exact ways the virus — and the stalled economy — are affecting people. Here's what that means.
https://www.ipsos.com/en-us/america-under-coronavirus
There is a website: https://www.zhihu.com/people/weizhi-xiazhi/followers
When I use
import urllib.request

from scrapy.selector import Selector

url = 'https://www.zhihu.com/people/weizhi-xiazhi/followers'
content = urllib.request.urlopen(url).read()
content = content.decode('utf-8')
Selector(text=content).xpath('//div[@class="ContentItem-head"]//a[@class="UserLink-link" and @target="_blank"]').extract()[0]
to extract the information, I only get a list of 3 elements, when there should be many more. I wonder why. Thanks in advance!
The website loads more followers via JavaScript after the first request. You could look into Selenium for rendering the JavaScript; something like this if you are using PhantomJS:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('https://www.zhihu.com/people/weizhi-xiazhi/followers')
driver.implicitly_wait(10) #wait some time to load
elements = driver.find_elements_by_xpath('//*[#class="UserItem-title"]/descendant::a')
for e in elements:
    print(e.get_attribute("href"))
Note that I opted for a less "restrictive" expression so that it is less sensitive to small website changes.
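PhantomJS has since been deprecated and newer Selenium releases dropped support for it, so here is the same idea sketched with headless Chrome instead (assuming chromedriver is installed and on your PATH):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.zhihu.com/people/weizhi-xiazhi/followers')
driver.implicitly_wait(10)  # give the follower list some time to load
for e in driver.find_elements_by_xpath('//*[@class="UserItem-title"]/descendant::a'):
    print(e.get_attribute("href"))
driver.quit()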