In a project I am doing, I am telling Selenium to go and scrape the data on the next page, which has the exact same URL.
My code:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances")
iframe1 = driver.find_element_by_id('tokeholdersiframe')
driver.switch_to.frame(iframe1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
token_holders = soup.find_all('tr')
driver.find_element_by_link_text('Next').click()
time.sleep(10)
token_holders2 = soup.find_all('tr') #I get the data from previous page (exact same as token_holder) rather than the new data.
However, Selenium doesn't update and I still get the same data from the previous page.
I tried using an implicit wait after the click:
driver.implicitly_wait(30)
but it doesn't work. I also tried re-creating the soup from driver.page_source, as well as making the driver re-find the iframe using driver.find_element_by_id("id"), but neither works.
From the question I assume Selenium is not waiting for the next page to load. One way of ensuring this happens (while not the most elegant) is to pick an element on the current page that you know will change and wait for that change to happen after clicking. You can use an explicit wait for this; see https://selenium-python.readthedocs.io/waits.html for details.
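As a rough sketch of that idea (browser-free, so the "page" reading is faked here; with Selenium the callable would read something from the live page, e.g. the text of the first table row):

```python
import time

def wait_for_change(get_value, old_value, timeout=10, poll=0.5):
    """Poll get_value() until it returns something different from old_value."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = get_value()
        if current != old_value:
            return current
        time.sleep(poll)
    raise TimeoutError(f"page did not change within {timeout} seconds")

# With Selenium, get_value would read from the live page, e.g.
#   lambda: driver.find_elements_by_tag_name("tr")[1].text
# Here a fake sequence of readings stands in so the sketch runs without a browser:
reads = iter(["row-1", "row-1", "row-51"])
result = wait_for_change(lambda: next(reads), "row-1", timeout=5, poll=0.01)
print(result)  # row-51
```

After the click, you would call this with the old value of the chosen element and only re-parse the page once it returns.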
Alternatively, you can add a fixed sleep after your click, i.e.
from time import sleep
...
element.click()
sleep(0.5)  # wait for half a second
# Scrape the page
After you create a soup, it won't dynamically update to reflect changes in driver.page_source. You need to create a new BeautifulSoup instance and pass it the updated page source.
token_holders = soup.find_all('tr')
driver.find_element_by_link_text('Next').click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
token_holders2 = soup.find_all('tr')
>>> token_holders[1:]
[<tr><td>1</td><td><span>0xd35a2d8c651f3eba4f0a044db961b5b0ccf68a2d</span></td><td>310847219.011683</td><td>31.0847%</td></tr>,
<tr><td>2</td><td><span>0xe17c20292b2f1b0ff887dc32a73c259fae25f03b</span></td><td>200000001</td><td>20.0000%</td></tr>,
...
]
>>> token_holders2[1:]
[<tr><td>51</td><td><span>0x5473621d6d5f68561c4d3c6a8e23f705c8db7661</span></td><td>687442.69121294</td><td>0.0687%</td></tr>,
<tr><td>52</td><td><span>0xbc14ca2a57ea383a94281cc158f34870be345eb6</span></td><td>619772.39698</td><td>0.0620%</td></tr>,
...
]
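The same point in a browser-free form, with two HTML strings standing in for driver.page_source before and after the click:

```python
from bs4 import BeautifulSoup

# Two snapshots standing in for driver.page_source before and after the click:
page_before = "<table><tr><td>1</td></tr></table>"
page_after = "<table><tr><td>51</td></tr></table>"

soup = BeautifulSoup(page_before, "html.parser")
rows_before = [tr.td.text for tr in soup.find_all("tr")]

# The page "changes", but the old soup object still holds the first snapshot:
rows_stale = [tr.td.text for tr in soup.find_all("tr")]

# Only re-parsing the new source sees the change:
soup = BeautifulSoup(page_after, "html.parser")
rows_after = [tr.td.text for tr in soup.find_all("tr")]

print(rows_before, rows_stale, rows_after)  # ['1'] ['1'] ['51']
```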
Related
I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click 'Next' button without actually opening the page in a browser?
At a high level this is possible, you will need to use requests or selenium in addition to beautifulsoup.
Here is an example of locating an element by XPath and clicking the button:
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
sleep(1) # Time in seconds.
ele = driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()
Yes, you can do what you described. Although you didn't post an actual URL, here is an example to help you get started.
import requests
from bs4 import BeautifulSoup
url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
res = requests.get(url,headers = {"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
for page in range(7):
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        if "ctl00$Contenido$GoPag" in item.get("name"):
            formdata[item.get("name")] = page
        else:
            formdata[item.get("name")] = item.get("value")

    req = requests.post(url, data=formdata)
    soup = BeautifulSoup(req.text, "lxml")
    for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)
I am trying to extract the entire review of a product (the remaining half of each review is displayed only after clicking READ MORE), but I am still not able to do so. It does not display the full content of a review, which gets displayed after clicking the READ MORE option. Below is the code, which clicks the READ MORE option and also gets data from the website:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
response = requests.get("https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page=2&pid=MOBF85V7A6PXETAX")
data = BeautifulSoup(response.content, 'lxml')
chromepath = r"C:\Users\Mohammed\Downloads\chromedriver.exe"
driver=webdriver.Chrome(chromepath)
driver.get("https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page=2&pid=MOBF85V7A6PXETAX")
d = driver.find_element_by_class_name("_1EPkIx")
d.click()
title = data.find_all("p",{"class" : "_2xg6Ul"})
text1 = data.find_all("div",{"class" : "qwjRop"})
name = data.find_all("p",{"class" : "_3LYOAd _3sxSiS"})
for t2, t, t1 in zip(title, text1, name):
    print(t2.text, '\n', t.text, '\n', t1.text)
To get the full reviews, it is necessary to click on those READ MORE buttons to unwrap the rest. As you have already used selenium in combination with BeautifulSoup, I've modified the script to follow that logic. The script will first click on those READ MORE buttons. Once that is done, it will parse all the titles and reviews from there. You can now get the titles and reviews from multiple pages (up to 4 pages).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
link = "https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page={}&pid=MOBF85V7A6PXETAX"
driver = webdriver.Chrome() #If necessary, define the chrome path explicitly
for page_num in range(1, 5):
    driver.get(link.format(page_num))
    for item in driver.find_elements_by_class_name("_1EPkIx"):
        item.click()
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("._3DCdKt"):
        title = items.select_one("p._2xg6Ul").text
        review = ' '.join(items.select_one(".qwjRop div:nth-of-type(2)").text.split())
        print(f'{title}\n{review}\n')

driver.quit()
I am trying to write a bot to add items to my cart then purchase them for me because I need to make very regular purchases and it becomes tedious to purchase them myself.
from bs4 import BeautifulSoup
import requests
import numpy as np
page = requests.get("http://www.onlinestore.com/shop")
soup = BeautifulSoup(page.content, 'html.parser')
try:
    for i in soup.find_all('a'):
        if "shop" in i['href']:
            shop_page = requests.get("http://www.onlinestore.com" + i['href'])
            item_page = BeautifulSoup(shop_page.content, 'html.parser')
            for h in item_page.find_all('form', class_="add"):
                print(h['action'])
                try:
                    shop_page = requests.get("http://www.online.com" + h['action'])
                except:
                    print("None left")
            for h in item_page.find_all('h1', class_="protect"):
                print(h.getText())
except:
    print("either ended or error occurred")

checkout_page = requests.get("http://www.onlinestore.com/checkout")
checkout = BeautifulSoup(checkout_page.content, 'html.parser')
for j in checkout.find_all('strong', id="total"):
    print(j)
I was having some trouble checking out the products because the items don't carry over. Is there a way that I can implement cookies so that it keeps track my items I have added to cart?
Thanks
Even though requests is not a browser, it can still persist headers and cookies across multiple requests if you use a Session:
with requests.Session() as session:
    page = session.get("http://www.onlinestore.com/shop")
    # use "session" instead of "requests" later on
Note that you would also get a performance boost for free because of the persistent connection:
if you're making several requests to the same host, the underlying TCP
connection will be reused, which can result in a significant
performance increase
Depending on how the cart and checkout on this specific online store are implemented this may or may not work. But, at least, this is a step forward.
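To see the cookie persistence without hitting a real shop (the URL and the cookie name below are made up for illustration), you can inspect what a Session would attach to a follow-up request:

```python
import requests

with requests.Session() as session:
    # A real shop would set this via a Set-Cookie response header on
    # session.get(...); here it is planted by hand to keep the sketch offline.
    session.cookies.set("cart_id", "abc123")

    # Every request prepared through the session automatically carries it:
    prepared = session.prepare_request(
        requests.Request("GET", "http://www.onlinestore.com/checkout"))
    print(prepared.headers["Cookie"])  # cart_id=abc123
```

Any cookies the store sets when you add items to the cart would be sent back automatically on the checkout request, which is exactly what the cart needs to carry over.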
I'm trying to get the content from this web page "http://www.fibalivestats.com/u/ACBS/333409/pbp.html" with this code:
r = requests.get("http://www.fibalivestats.com/u/ACBS/333409/pbp.html")
if r.status_code != 200:
print("Error!!!")
html = r.content
soup = BeautifulSoup(html, "html.parser")
print(soup)
And I get the template of the page but not the data associated to each tag.
How can I get the data? I'm new in Python.
In this case the JavaScript is never triggered, so the elements are never filled in: requests only fetches the raw HTML and does not execute scripts. I'd suggest using a webdriver such as Selenium, as exemplified here.
It will mimic a browser and the JavaScript will be executed. An example below.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://www.fibalivestats.com/u/ACBS/333409/pbp.html")
html_source = browser.page_source
soup = BeautifulSoup(html_source, "html.parser")
I have a page whose source I need to get for use with BS4, but the middle of the page takes about a second to load its content, and requests.get captures the source before that section loads. How can I wait a second before getting the data?
r = requests.get(URL + self.search, headers=USER_AGENT, timeout=5 )
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('section', 'wrapper')
The page
<section class="wrapper" id="resultado_busca">
It doesn't look like a problem of waiting; it looks like the element is being created by JavaScript, and requests can't handle elements generated dynamically by JavaScript. A suggestion is to use selenium together with PhantomJS to get the page source; then you can use BeautifulSoup for your parsing. The code shown below will do exactly that:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://legendas.tv/busca/walking%20dead%20s03e02"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('section', 'wrapper')
Also, there's no need to use .find_all if you are only looking for one element; .find returns the first match.
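For reference, a quick standalone illustration of the difference:

```python
from bs4 import BeautifulSoup

html = '<section class="wrapper">first</section><section class="wrapper">second</section>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("section", "wrapper").text)                   # first (first match only)
print([s.text for s in soup.find_all("section", "wrapper")])  # ['first', 'second']
```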
I had the same problem, and none of the submitted answers really worked for me.
But after long research, I found a solution:
from requests_html import HTMLSession
s = HTMLSession()
response = s.get(url)
response.html.render()
print(response)
# prints out the content of the fully loaded page
# response can be parsed with for example bs4
The requests_html package (docs) was created by the author of the requests library. It has some additional JavaScript capabilities, like for example the ability to wait until the JS of a page has finished loading.
The package only supports Python Version 3.6 and above at the moment, so it might not work with another version.
I found a way to do that!
r = requests.get('https://github.com', timeout=(3.05, 27))
Here, timeout takes two values: the first is the connect timeout and the second is the read timeout, i.e. how long requests will wait for the server to send the response. Note that this only bounds how long requests waits for the server; requests never executes JavaScript, so it cannot wait for client-side content to populate.
Selenium is a good way to solve this, but the accepted answer is quite deprecated. As @Seth mentioned in the comments, the headless mode of Firefox/Chrome (or possibly other browsers) should be used instead of PhantomJS.
First of all, you need to download the specific driver:
Geckodriver for Firefox
ChromeDriver for Chrome
Next, you can add the path to the downloaded driver to your system PATH variable. That's not necessary, though; you can also specify in code where the executable lies.
Firefox:
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()
Similarly for Chrome:
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Chrome(options=options, executable_path='YOUR_PATH/chromedriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()
It's good to remember to call browser.quit() to avoid hanging processes after code execution. If you worry that your code may fail before the browser is disposed, you can wrap it in a try...except block and put browser.quit() in the finally part to ensure it will be called.
Additionally, if part of source is still not loaded using that method, you can ask selenium to wait till specific element is present:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')
try:
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
timeout_in_seconds = 10
WebDriverWait(browser, timeout_in_seconds).until(ec.presence_of_element_located((By.ID, 'resultado_busca')))
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
except TimeoutException:
print("I give up...")
finally:
browser.quit()
If you're interested in other drivers than Firefox or Chrome check docs.
In Python 3, using the urllib module can in practice work better for loading dynamic webpages than the requests module, e.g.
import urllib.request
import urllib.error

try:
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')  # use whatever encoding the webpage declares
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(f"{url} is not found")
    elif e.code == 503:
        print(f'{url} base webservices are not available')
        ## can add authentication here
    else:
        print('http error', e)
Just to list my way of doing it, maybe it can be of value for someone:
import time
import requests

max_retries = 5   # some int
retry_delay = 2   # some int

n = 1
ready = False
while n < max_retries:
    try:
        response = requests.get('https://github.com')
        if response.ok:
            ready = True
            break
    except requests.exceptions.RequestException:
        print("Website not available...")
    n += 1
    time.sleep(retry_delay)

if not ready:
    print("Problem")
else:
    print("All good")