Scraping using selenium - python-3.x

Hi, I am trying to scrape this website. I was originally using BS4, which was fine for getting certain elements (sector, name, etc.), but I am not able to use it to get the financial data. Below I have copied some of the page_source; the "—" should in this case be 0.0663. I believe I am trying to scrape JavaScript-rendered content, and none of the solutions I have found have worked for me. I was wondering if someone could help me crack this.
Although I would be grateful if someone could post some working code, I would also really appreciate being pointed in the right direction: what should I look for in the HTML to understand what I need to do and how to get it?
URL: https://www.tradingview.com/symbols/LSE-TSCO/
HTML:
<span class="tv-widget-fundamentals__label apply-overflow-tooltip">
Return on Equity (TTM)
</span>
<span class="tv-widget-fundamentals__value apply-overflow-tooltip">
—
</span>
Python Code:
url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
html = driver.page_source

To get the equity value, induce WebDriverWait(), wait for visibility_of_element_located(), and use the XPath below:
driver.get(url)
print(WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.XPATH,"//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]"))).text)
You need to import the libraries below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
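Putting it together with the driver setup from the question, a minimal end-to-end sketch could look like this (assumptions: headless Chrome and the ChromeDriverManager setup from the question, and that the value renders within the 10-second wait):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

driver.get(url)
# wait for the value next to the "Return on Equity (TTM)" label to become visible
roe = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located(
        (By.XPATH, "//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]")))
print(roe.text)
driver.quit()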

You can get the return on equity using an XPath:
equity = driver.find_element_by_xpath('/html/body/div[2]/div[4]/div/div/div/div/div/div[2]/div[2]/div[2]/div/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/span[2]').text
print(equity)

The issue here is not whether the element is present, but the time the page takes to load. The page looks very heavy with all those dynamic graphs. Even before the page is fully loaded, the DOM starts to get created and the default values take their place.
WebDriverWait with find_element_* helps when the element is not yet present but will take some time to appear. In your case the element is present from the start, so adding a wait won't do much. This is also why you get '—': the element is there, but still holds its default value.
To fix this, or at least reduce the issue, you can add code that waits until the document's readyState is 'complete'.
Something like this can be used:
def wait_for_page_ready_state(driver):
    wait = WebDriverWait(driver, 20)

    def _ready_state_script(driver):
        return driver.execute_async_script(
            """
            var callback = arguments[arguments.length - 1];
            callback(document.readyState);
            """) == 'complete'

    wait.until(_ready_state_script)

wait_for_page_ready_state(driver)
Then, since you brought bs4 into play, this is where I would use it:
import re
from bs4 import BeautifulSoup

financials = {}
for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
    try:
        key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                            "apply-overflow-tooltip"}).text.strip())
        value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
        financials[key] = value
    except AttributeError:
        pass
This will give you every value you need from the financial card.
You can now print the value you need:
print(financials['Return on Equity (TTM)'])
Output:
'0.0663'
Of course, you could do the above with Selenium as well, but I wanted to build on what you had already started with.
Note that this is not guaranteed to always return the proper value. It did in my case, but since you know the default value, you could add a while loop that runs until the default value changes.
[EDIT]
After running my code in a loop, I was hitting the default value about one time in five. One way to work around this is to create a method that loops until a threshold is reached. In my experience, ~90% of the values get updated with digits; when it fails and shows the default, all of the other values were also '—'. So one approach is to use a threshold (e.g. 50%) and only return the values once it is reached.
def get_financial_card_values(default_value='—', threshold=.5):
    financials = {}
    while True:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                    "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        number_of_updated_values = [value for value in financials.values() if value != default_value]
        if len(number_of_updated_values) / len(financials) > threshold:
            return financials
With this method, I was always able to retrieve the value you are expecting. Note that if none of the values ever change (a site issue), you will be stuck in the loop forever, so you might want to use a timer instead of while True. I just want to point this out, but I don't think it will happen.
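If you would rather have a hard stop than while True, a small variation of the same method (a sketch only, reusing the driver, BeautifulSoup and re from above) is to pass a deadline and return whatever was collected once it is exceeded:
import time

def get_financial_card_values_with_timeout(default_value='—', threshold=.5, timeout=30):
    # same logic as above, but give up after `timeout` seconds instead of looping forever
    deadline = time.time() + timeout
    financials = {}
    while time.time() < deadline:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                    "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        updated = [v for v in financials.values() if v != default_value]
        if financials and len(updated) / len(financials) > threshold:
            break
    return financials  # may still contain defaults if the deadline was hit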

Related

Scrape Product Image with BeautifulSoup (Error)

I need your help. I'm working on a Telegram bot which sends me all the sales from Amazon.
It works well, but this function doesn't work properly. I always get the same error, which blocks the script:
imgs_str = img_div.img.get('data-a-dynamic-image') # a string in Json format
AttributeError: 'NoneType' object has no attribute 'img'
def take_image(soup):
    img_div = soup.find(id="imgTagWrapperId")
    imgs_str = img_div.img.get('data-a-dynamic-image')  # a string in JSON format
    # convert to a dictionary
    imgs_dict = json.loads(imgs_str)
    # each key in the dictionary is a link of an image, and the value shows the size (print the whole dictionary to inspect)
    num_element = 0
    first_link = list(imgs_dict.keys())[num_element]
    return first_link
I still don't understand how to solve this issue.
Thanks to all!
From the looks of the error, soup.find didn't find anything.
Have you tried using images = soup.findAll("img", {"id": "imgTagWrapperId"})?
This will return a list.
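Whichever locator you use, it's worth guarding against a missing element before chaining attributes, e.g. (a sketch built on the question's take_image logic):
# guard against soup.find() returning None before accessing .img
img_div = soup.find(id="imgTagWrapperId")
if img_div is None or img_div.img is None:
    print("image block not found - the page content may not be loaded yet")
else:
    imgs_str = img_div.img.get('data-a-dynamic-image')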
Images are not embedded in the HTML page, they are linked from it, so you need to wait until they are loaded. Here I will give you two options:
1-) (not recommended, since there is a margin of error) simply wait until the image is loaded (you can use time.sleep() for this);
2-) (recommended) use the Selenium WebDriver. You also have to wait when you use Selenium, but the good thing is that Selenium has a dedicated function for this job.
I will show how to do it with Selenium:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome()
browser.get("url")
delay = 3  # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'imgTagWrapperId')))  # wait for the element you want to find
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
More documentation
Code example for way 1
Q/A for way 2
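Once the wait succeeds, you can hand the rendered page back to the original take_image() helper, for example (a sketch; take_image() is the function from the question):
import json  # used inside take_image()
from bs4 import BeautifulSoup

# parse the source only after WebDriverWait has confirmed the element is present
soup = BeautifulSoup(browser.page_source, "lxml")
first_link = take_image(soup)
print(first_link)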

Stuck in loop - Code doesn't want to pull anything except row 1

I am stuck in a loop and I don't know what to change to make my code work normally...
The problem is with the CSV file: my file contains a list of domains (freedommortgage.com, google.com, amd.com, etc.). When I run the code, everything is fine at the start, but then it keeps sending me the same result over and over:
the monthly total visits to freedommortgage.com is 1.10M
So here is my code:
import csv
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib
from captcha2upload import CaptchaUpload
import time


# setting the firefox driver
def init_driver():
    driver = webdriver.Firefox(executable_path=r'C:\Users\muki\Desktop\similarweb_scrapper-master\geckodriver.exe')
    driver.implicitly_wait(10)
    return driver


# solving the captcha (with 2captcha.com)
def captcha_solver(driver):
    captcha_src = driver.find_element_by_id('recaptcha_challenge_image').get_attribute("src")
    urllib.urlretrieve(captcha_src, "captcha.jpg")
    captcha = CaptchaUpload("4cfd308fd703d40291a7e250d743ca84")  # 2captcha API KEY
    captcha_answer = captcha.solve("captcha.jpg")
    wait = WebDriverWait(driver, 10)
    captcha_input_box = wait.until(
        EC.presence_of_element_located((By.ID, "recaptcha_response_field")))
    captcha_input_box.send_keys(captcha_answer)
    driver.implicitly_wait(10)
    captcha_input_box.submit()


# inputting the domain in the similarweb search box and finding the necessary values
def lookup(driver, domain, short_method):
    # short method - inputting the domain in the url
    if short_method:
        driver.get("https://www.similarweb.com/website/" + domain)
    else:
        driver.get("https://www.similarweb.com")
    attempt = 0
    # trying 3 times before quitting (due to a second refresh by the website that clears the search box)
    while attempt < 1:
        try:
            captcha_body_page = driver.find_elements_by_class_name("block-page")
            driver.implicitly_wait(10)
            if captcha_body_page:
                print("Captcha ahead, solving the captcha, it may take a few seconds")
                captcha_solver(driver)
                print("Captcha solved! the program will continue shortly")
                time.sleep(20)  # to prevent the second refresh affecting the upcoming element finding after the captcha is solved
            # for the normal method, inputting the domain in the search box instead of the url
            if not short_method:
                input_element = driver.find_element_by_id("js-swSearch-input")
                input_element.click()
                input_element.send_keys(domain)
                input_element.submit()
            wait = WebDriverWait(driver, 10)
            time.sleep(10)
            total_visits = wait.until(
                EC.presence_of_element_located((By.XPATH, "//span[@class='engagementInfo-valueNumber js-countValue']")))
            total_visits_line = "the monthly total visits to %s is %s" % (domain, total_visits.text)
            time.sleep(10)
            print('\n' + total_visits_line)
        except TimeoutException:
            print("Box or Button or Element not found in similarweb while checking %s" % domain)
            attempt += 1
            print("attempt number %d... trying again" % attempt)


# main
if __name__ == "__main__":
    with open('bigdomains.csv', 'rt') as f:
        reader = csv.reader(f)
        driver = init_driver()
        for row in reader:
            domain = row[0]
            lookup(driver, domain, True)  # the user needs to pass True or False as a parameter; True activates the
                                          # short method, False takes the normal method
(Sorry for the long block of code, but I have to present everything, even though the focus is on the LAST PART of the code.)
My question is simple:
Why does it keep taking the domain from row 1 and ignoring row 2, row 3, row 4, etc.?
The delay has to be 10 seconds or more to avoid the captcha issue on this website.
If anyone wants to try running this, you have to edit the name of the CSV file and put a few domains in it, in the format google.com (not www.google.com), of course.
Looks like you're always accessing the same index every time with:
domain = row[0]
Index 0 is the first item, which is why you keep getting the same value.
This post explains an alternative way to use a for loop in Python:
Accessing the index in 'for' loops?
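As a quick sanity check (a sketch using the enumerate pattern from the linked post, with the file name taken from the question), you can print each row as it is read to see exactly which value gets passed to lookup():
import csv

with open('bigdomains.csv', 'rt') as f:
    for i, row in enumerate(csv.reader(f)):
        # row is one CSV line; row[0] is its first column
        print(i, row[0])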

Clicking links on the website to get contents in the bubbles with selenium

I'm trying to get the course information on http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext.
In my code, I tried to first click each course, then get the description in the bubble, and then close the bubble, since it may overlay other course links.
My problem is that I couldn't get the description in the bubble, and some course links were still skipped even though I tried to avoid that by closing the bubble.
Any idea about how to do this? Thanks in advance!
info = []
driver = webdriver.Chrome()
driver.get('http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext')
for i in range(1, 3):
    for j in range(2, 46):
        try:
            driver.find_element_by_xpath('//*[@id="programrequirementstextcontainer"]/table[' + str(i) + ']/tbody/tr[' + str(j) + ']/td[1]/a').click()
            info.append(driver.find_elements_by_xpath('/html/body/div[8]/div[3]/div/div')[0].text)
            driver.find_element_by_xpath('//*[@id="lfjsbubbleclose"]').click()
            time.sleep(3)
        except:
            pass
Not sure why you have put a static range in the for loop, even though many of the i and j index combinations in your XPath don't find any element on the page.
I would suggest finding all the course elements on the webpage with a single locator and looping through them to get the descriptions from the bubbles.
Use the code below:
course_list = driver.find_elements_by_css_selector("table.sc_courselist a.bubblelink.code")
wait = WebDriverWait(driver, 20)
for course in course_list:
    try:
        print("grabbing info of course : ", course.text)
        course.click()
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.courseblockdesc")))
        info.append(driver.find_element_by_css_selector('div.courseblockdesc>p').text)
        wait.until(EC.visibility_of_element_located((By.ID, "lfjsbubbleclose")))
        driver.find_element_by_id('lfjsbubbleclose').click()
    except:
        print("error while grabbing info")
print(info)
As it takes some time to load the content in the bubble, you should introduce an explicit wait in your script until the bubble content is completely visible, and then grab it.
Import the packages below to use the waits in the above code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Please note, this code grabs all of the course descriptions from the bubbles. Let me know if you are looking for something specific rather than all of them.
To load the bubble, the website makes an AJAX call, so you can also query that endpoint directly:
import requests
from bs4 import BeautifulSoup

def course(course_code):
    data = {"page": "getcourse.rjs", "code": course_code}
    res = requests.get("http://bulletin.iit.edu/ribbit/index.cgi", data=data)
    soup = BeautifulSoup(res.text, "lxml")
    result = {}
    result["description"] = soup.find("div", class_="courseblockdesc").text.strip()
    result["title"] = soup.find("div", class_="coursetitle").text.strip()
    return result
Output for course("CS 522")
{'description': 'Continued exploration of data mining algorithms. More sophisticated algorithms such as support vector machines will be studied in detail. Students will continuously study new contributions to the field. A large project will be required that encourages students to push the limits of existing data mining techniques.',
'title': 'Advanced Data Mining'}
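Hypothetically, you could combine this with the course codes scraped from the bulletin page itself (the same sc_courselist links used in the Selenium answer above) to collect every description without a browser; a sketch:
# sketch: pull the course codes from the page, then fetch each bubble via the AJAX endpoint
page = requests.get("http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/")
page_soup = BeautifulSoup(page.text, "lxml")
codes = [a.get_text(strip=True).replace('\xa0', ' ') for a in page_soup.select("table.sc_courselist a.bubblelink.code")]
info = {code: course(code) for code in codes}
print(len(info))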

How to get all comments in youtube with selenium?

The webpage shows that there are 702 comments.
target youtube sample
I wrote a function get_total_youtube_comments(url); much of the code is copied from a project on GitHub.
project on github
def get_total_youtube_comments(url):
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import time

    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options, executable_path='/usr/bin/chromedriver')
    wait = WebDriverWait(driver, 60)
    driver.get(url)

    SCROLL_PAUSE_TIME = 2
    CYCLES = 7
    html = driver.find_element_by_tag_name('html')
    html.send_keys(Keys.PAGE_DOWN)
    html.send_keys(Keys.PAGE_DOWN)
    time.sleep(SCROLL_PAUSE_TIME * 3)
    for i in range(CYCLES):
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    all_comments = [elem.text for elem in comment_elems]
    return all_comments
Trying to parse all the comments on a sample webpage, https://www.youtube.com/watch?v=N0lxfilGfak:
url = 'https://www.youtube.com/watch?v=N0lxfilGfak'
list = get_total_youtube_comments(url)
It gets some comments, but only a small part of them all.
len(list)
60
60 is much less than 702. How do I get all the comments on YouTube with Selenium?
@supputuri, I can extract all comments with your code.
comments_list = driver.find_elements_by_xpath("//*[@id='content-text']")
len(comments_list)
709
print(driver.find_element_by_xpath("//h2[@id='count']").text)
717 Comments
comments_list[-1].text
'mistake at 23:11 \nin NOT it should return false if x is true.'
comments_list[0].text
'Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Python Course curriculum, Visit our Website: Use code "YOUTUBE20" to get Flat 20% off on this training.'
Why is the number of comments 709 instead of the 717 shown on the page?
You are getting a limited number of comments because YouTube loads them as you keep scrolling down. There are around 394 comments left on that video; you first have to make sure all the comments are loaded, and then also expand all the View Replies links, so that you reach the maximum comment count.
Note: I was able to get 700 comments using the below lines of code.
# get the last comment
lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
# scroll to the last comment currently loaded
lastEle.location_once_scrolled_into_view
# wait until the comments loading is done
WebDriverWait(driver, 30).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# load all comments
while lastEle != driver.find_element_by_xpath("(//*[@id='content-text'])[last()]"):
    lastEle = driver.find_element_by_xpath("(//*[@id='content-text'])[last()]")
    driver.find_element_by_xpath("(//*[@id='content-text'])[last()]").location_once_scrolled_into_view
    time.sleep(2)
    WebDriverWait(driver, 30).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# open all replies
for reply in driver.find_elements_by_xpath("//*[@id='replies']//paper-button[@class='style-scope ytd-button-renderer'][contains(.,'View')]"):
    reply.location_once_scrolled_into_view
    driver.execute_script("arguments[0].click()", reply)
    time.sleep(5)
    WebDriverWait(driver, 30).until(
        EC.invisibility_of_element((By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")))
# print the total number of comments
print(len(driver.find_elements_by_xpath("//*[@id='content-text']")))
There are a couple of things to note:
The WebElements within the website https://www.youtube.com/ are dynamic, so the comments are rendered dynamically as well.
Within the webpage https://www.youtube.com/watch?v=N0lxfilGfak, the comments don't render unless the user scrolls the relevant element into the viewport.
The comments are within:
<!--css-build:shady-->
which means the Polymer CSS Builder is used to apply Polymer's CSS mixin shim and ShadyDOM scoping, so some runtime work is still done to convert CSS selectors under the default settings.
Considering the factors mentioned above, here's a solution to retrieve all the comments:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException, WebDriverException
import time

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=N0lxfilGfak')
driver.execute_script("return scrollBy(0, 400);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);", subscribe)
comments = []
my_length = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']"))))
while True:
    try:
        driver.execute_script("window.scrollBy(0,800)")
        time.sleep(5)
        comments.append([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']")))])
    except TimeoutException:
        driver.quit()
        break
print(comments)
If you don't have to use Selenium, I would recommend looking at the Google/YouTube API:
https://developers.google.com/youtube/v3/getting-started
Example:
https://www.googleapis.com/youtube/v3/commentThreads?key=YourAPIKey&textFormat=plainText&part=snippet&videoId=N0lxfilGfak&maxResults=100
This gives you the first 100 results and a token that you can append to the next request to get the next 100 results.
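A minimal sketch of that pagination with requests (YOUR_API_KEY is a placeholder; the field names follow the v3 commentThreads response format):
import requests

def fetch_all_comments(video_id, api_key):
    url = "https://www.googleapis.com/youtube/v3/commentThreads"
    params = {"key": api_key, "textFormat": "plainText", "part": "snippet",
              "videoId": video_id, "maxResults": 100}
    comments = []
    while True:
        data = requests.get(url, params=params).json()
        # each item is a top-level comment thread; replies need a separate comments.list call
        for item in data.get("items", []):
            comments.append(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])
        if "nextPageToken" not in data:
            return comments
        params["pageToken"] = data["nextPageToken"]

print(len(fetch_all_comments("N0lxfilGfak", "YOUR_API_KEY")))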
I'm not familiar with Python, but I'll tell you the steps I would take to get all the comments.
First of all, in your code I think the main issue is with
CYCLES = 7
With this you will be scrolling seven times, pausing 2 seconds each time. Since you are successfully grabbing 60 comments, fixing the above condition should solve your issue.
I assume you don't have any issue finding elements on a website using locators.
You need to get the total comment count into a variable as an int (in your case, let's say it's COMMENTS = 715).
Define another variable called VISIBLECOUNTS = 0.
Then use a while loop to scroll while COMMENTS > VISIBLECOUNTS.
The code might look like this (really sorry if there are syntax issues):
# python - selenium commands to get all comments
COMMENTS = 715  # 715 is just a sample value; it will change with the total comment count
VISIBLECOUNTS = 0
SCROLL_PAUSE_TIME = 2
while VISIBLECOUNTS < COMMENTS:
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)
    VISIBLECOUNTS = len(driver.find_elements_by_xpath('//ytm-comment-thread-renderer'))
With this, you will keep scrolling down until COMMENTS == VISIBLECOUNTS. Then you can grab all the comments, as all of them share the same element attributes, such as ytm-comment-thread-renderer.
Since I'm not familiar with Python, I'll add the commands to get the comment counts in JS; you can try them in your browser and convert them into Python/Selenium commands.
Run the queries below in your console and check.
To get the total comments count:
var comments = document.querySelector(".comment-section-header-text").innerText.split(" ")
// We can get the text value "Comments • 715", split it by spaces and take the last value
Number(comments[comments.length - 1])
// Then convert the string "715" to an int; you just need to do the same in Python/Selenium
To get the currently visible comments count:
$x("//ytm-comment-thread-renderer").length
Note: if it's hard to extract the values, you can still use the Selenium JS executor and do the scrolling with JS until all the comments are visible. But I guess it's not hard to do in Python, since the logic is the same.
I'm really sorry about not being able to add the solution in Python, but I hope this helped.
Cheers.
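For what it's worth, a rough Python/Selenium translation of the JS above might look like this (a sketch only; it reuses the driver, Keys and time imports from the question's function, and the selectors are copied verbatim from this answer, so they assume the same markup):
# convert the "Comments • 715" header text to an int (selector copied from the JS above)
header_text = driver.find_element_by_css_selector(".comment-section-header-text").text
COMMENTS = int(header_text.split(" ")[-1].replace(",", ""))

VISIBLECOUNTS = 0
SCROLL_PAUSE_TIME = 2
html = driver.find_element_by_tag_name('html')
while VISIBLECOUNTS < COMMENTS:
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)
    # count the comment threads currently rendered in the DOM
    VISIBLECOUNTS = len(driver.find_elements_by_xpath('//ytm-comment-thread-renderer'))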
The first thing you need to do is scroll down the video page to load all comments:
$actualHeight = 0;
$nextHeight = 0;
while (true) {
    try {
        $nextHeight += 10;
        $actualHeight = $this->driver->executeScript('return document.documentElement.scrollHeight;');
        if ($nextHeight >= ($actualHeight - 50)) break;
        $this->driver->executeScript("window.scrollTo(0, $nextHeight);");
        $this->driver->manage()->timeouts()->implicitlyWait = 10;
    } catch (Exception $e) {
        break;
    }
}
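This last snippet is php-webdriver; a rough Python/Selenium equivalent of the same scroll-until-bottom loop might look like this (a sketch only; the step size and pause are arbitrary choices):
import time

next_height = 0
while True:
    try:
        next_height += 500  # scroll down in steps
        actual_height = driver.execute_script('return document.documentElement.scrollHeight;')
        if next_height >= actual_height - 50:
            break
        driver.execute_script("window.scrollTo(0, %d);" % next_height)
        time.sleep(0.5)  # give the page a moment to load more comments
    except Exception:
        break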

Scraping a hidden variable from a website (Bs4)

I am trying to learn how to scrape a website. I am using Python3 and BS4.
I am stuck on a specific problem.
Example: http://www2.hm.com/en_in/productpage.0648256001.html
I am unable to scrape the "Sizes" available in the dropdown menu at the above link, and whether they are sold out. I went through the whole source code but couldn't figure out which tags the data lives under. I am guessing it must be a hidden variable or something?
Okay, so I tracked the XHR requests the website makes and wrote the code below. Basically, it uses Selenium to get the value of the productArticleDetails variable and the URL of the availability endpoint (I could have hardcoded it, but I found the variable it's in, so why not use it).
from itertools import chain
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www2.hm.com/en_in/productpage.0648256002.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

browser = webdriver.Chrome()
browser.get(url)
details = browser.execute_script('return productArticleDetails;')
availability_url = browser.execute_script('return hm.options.product.productAvailabilityServiceUrl;')
browser.quit()

variants = {}  # e.g. one product can be available in different colors
for key, value in details.items():
    # there is a lot of information in the details, not only product variants
    try:
        if 'whitePrice' in value:
            variants[key] = value
    except AttributeError:
        pass

# 'http://www2.hm.com/en_in/getAvailability?variants=0648256001,0648256002,0648256003,0648256006,0648256007,0648256008'
payload = {'variants': ','.join(variants.keys())}
r = requests.get(urljoin(url, availability_url), params=payload)
available_sizes = r.json()['availability']
# r.json() contains:
# availability: ["0648256001001", "0648256001002", "0648256001007", …]
# fewPieceLeft: []

sizes = chain.from_iterable(variant['sizes'] for variant in variants.values())
for size in sizes:
    availability = size['sizeCode'] in available_sizes
    size['available'] = availability  # True/False, feel free to implement handling "fewPieceLeft"

# Output
for variant in variants.values():
    print(f'Variant: {variant["name"]}')  # color in this case
    print('\tsizes:')
    for size in variant['sizes']:
        print(f'\t\t{size["name"]} -> {"Available" if size["available"] else "Sold out"}')
Output:
Variant: Light beige/Patterned
sizes:
32 -> Available
34 -> Available
36 -> Sold out
...
Variant: Orange
sizes:
32 -> Available
...
The advantage of this approach is that you gain access to a lot of details, such as 'whitePrice': 'Rs. 1,299', 'careInstructions': ['Machine wash at 30°'], 'composition': ['Viscose 100%'], the description, and more. You can take a look yourself:
import pprint
pprint.pprint(variants)
The disadvantage is that you need to use Selenium and download a driver, but to be fair I used Selenium only to get the variables, since extracting this nested JS object with regex seems impossible to me (correct me if I'm wrong) and browser.execute_script('return productArticleDetails;') is very clear and concise.
There are no hidden variables, and it's totally possible to get the sizes with BeautifulSoup; each size is a <li>:
<li class="item" data-code="0648256001002">
<div class="picker-option"><button type="button" class="option"><span class="value">34</span></button></div>
</li>
You need to match the data-code attribute of the size to the data-articlecode attribute of the "product variant":
<li class="list-item">
<a title="Light beige/Patterned" data-color="Light beige/Patterned"
data-articlecode="0648256001">
...
</a>
</li>
I encourage you to implement this yourself, but I'll try to code it this evening/tomorrow to make the answer complete. However, the website is rendered with JavaScript, so in the response to a plain GET request you won't get the entire HTML you see in the Elements tab of the DevTools. You can use Selenium to do that, but personally, I'd use Requests-HTML.
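A rough sketch of that matching (assuming the rendered HTML comes from Selenium's page_source, since a plain GET won't contain it; the attribute names are taken from the snippets above):
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml')

# map each variant's article code to its colour name
variant_names = {a['data-articlecode']: a['data-color']
                 for a in soup.select('li.list-item a[data-articlecode]')}

# each size <li> has a data-code that starts with the article code of its variant
for li in soup.select('li.item[data-code]'):
    code = li['data-code']
    size = li.select_one('span.value').get_text(strip=True)
    variant = next((name for article, name in variant_names.items() if code.startswith(article)), None)
    print(variant, size, code)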
