Scraping an ajax website using Python requests - python-3.x

I'm trying to scrape a webpage that loads its content about 5 seconds after the initial request.
I want to use the requests library.
Is there a way to make the request wait for that content?
import requests
from bs4 import BeautifulSoup as soup
from time import sleep

link = 'https://www.off---white.com'

while True:
    try:
        r = requests.get(link, stream=False, timeout=8)
        break
    except:
        if r.status_code == 404:
            print("Client error")
            r.raise_for_status()
        sleep(1)

page = soup(r.text, "html.parser")
products = page.findAll('article', class_='product')
titles = page.findAll('span', class_='prod-title')[0].text.strip()
images = page.findAll('img', class_="js-scroll-gallery-snap-target")

for product in products:
    print(product)

I answered a question like this before, but the asker found a better solution with cfscrape; on this site cfscrape works better than Selenium. By the way, that question seems to have been closed, I don't know why.
import cfscrape
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.off---white.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20180101 Firefox/47.0",
    "Referer": url
}

session = requests.session()
scraper = cfscrape.create_scraper(sess=session)

link = 'https://www.off---white.com'
r = scraper.get(link, headers=headers)
page = soup(r.text, "html.parser")
Update 15/4/2020:
Since Off-White updated its protection, cfscrape is no longer a good option for now; please try Selenium instead (a minimal sketch follows below).
For this kind of question I can't give an answer that works forever, because they keep updating their protection!
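A minimal Selenium sketch, assuming Chrome and a matching chromedriver are installed; the article.product selector is taken from the snippet above and may be outdated, since the site keeps changing:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.off---white.com")
    # Wait up to 15 seconds for the AJAX-rendered products instead of sleeping a fixed time
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.product"))
    )
    page = soup(driver.page_source, "html.parser")  # fully rendered HTML
    for product in page.findAll('article', class_='product'):
        print(product)
finally:
    driver.quit()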

No, the content that is received will always be the same; you have to prerender the page yourself to fetch the final version of the webpage.
You have to use a headless browser to execute the JavaScript on the webpage.
Prerender.IO offers pretty much what you need; you can check it out, the setup is pretty simple.
const prerender = require('prerender');
const server = prerender();
server.start();
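Once the prerender server above is running, you can, as a rough sketch, point requests at it and parse the already-rendered HTML it returns; this assumes the default port 3000 and the /render endpoint of the open-source prerender project:
import requests
from bs4 import BeautifulSoup

target = "https://www.off---white.com"
# Ask the local prerender server to execute the JavaScript and return the final HTML
r = requests.get("http://localhost:3000/render", params={"url": target}, timeout=60)
page = BeautifulSoup(r.text, "html.parser")
print(page.title)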

Related

How to get the request headers for particular url using selenium [duplicate]

https://www.sahibinden.com/en
If you open it in an incognito window and check the headers in Fiddler, these are the two main headers you get.
When I click the last one and check the request headers, this is what I get.
I want to get these headers in Python. Is there any way I can get them using Selenium? I'm a bit clueless here.
You can use Selenium Wire. It is a Selenium extension which has been developed for this exact purpose.
https://pypi.org/project/selenium-wire/
An example after pip install:
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver

## Get the URL
driver = webdriver.Chrome("my/path/to/driver")
driver.get("https://my.test.url.com")

## Print request headers
for request in driver.requests:
    print(request.url)               # <--------------- Request url
    print(request.headers)           # <----------- Request headers
    print(request.response.headers)  # <-- Response headers
You can run a JS command like this:
var req = new XMLHttpRequest()
req.open('GET', document.location, false)
req.send(null)
return req.getAllResponseHeaders()
In Python:
driver.get("https://t.me/codeksiyon")
headers = driver.execute_script("var req = new XMLHttpRequest();req.open('GET', document.location, false);req.send(null);return req.getAllResponseHeaders()")
# type(headers) == str
headers = headers.splitlines()
The bottom line is, No, you can't retrieve the request headers using Selenium.
Details
Adding WebDriver methods to read the HTTP status code and headers from an HTTP response had long been demanded by Selenium users. Implementing this feature has been discussed at length in the thread WebDriver lacks HTTP response header and status code methods.
However, Jason Leyba (Selenium contributor) stated plainly in his comment:
We will not be adding this feature to the WebDriver API as it falls outside of our current scope (emulating user actions).
Ashley Leyba further added that attempting to make WebDriver the ideal web testing tool would hurt its overall quality, since driver.get(url) blocks until the browser has loaded the page and returns the response for the final loaded page. So in the case of a login redirect, the status code and headers will always end up as 200 instead of the 302 you're looking for.
Finally, Simon M Stewart (WebDriver creator) in his comment concluded that:
This feature isn't going to happen. The recommended approach is to either extend the HtmlUnitDriver to access the information you require or to make use of an external proxy that exposes this information such as the BrowserMob Proxy
So it's not possible to get the headers using Selenium itself.
However, you can use other libraries such as requests to read the request and response headers.
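A minimal sketch with requests (the URL is the one from the question; the User-Agent value is only an illustrative placeholder):
import requests

r = requests.get(
    "https://www.sahibinden.com/en",
    headers={"User-Agent": "Mozilla/5.0"},  # many sites block the default python-requests agent
)
print(r.request.headers)  # headers that were actually sent
print(r.headers)          # headers the server returned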
Maybe you can use BrowserMob Proxy for this. Here is an example:
import settings
from browsermobproxy import Server
from selenium.webdriver import DesiredCapabilities

config = settings.Config

server = Server(config.BROWSERMOB_PATH)
server.start()
proxy = server.create_proxy()

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % proxy.proxy)
chrome_options.add_argument('--headless')

capabilities = DesiredCapabilities.CHROME.copy()
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True

driver = webdriver.Chrome(options=chrome_options,
                          desired_capabilities=capabilities,
                          executable_path=config.CHROME_PATH)

proxy.new_har("sahibinden", options={'captureHeaders': True})
driver.get("https://www.sahibinden.com/en")

entries = proxy.har['log']["entries"]
for entry in entries:
    if 'request' in entry.keys():
        print(entry['request']['url'])
        print(entry['request']['headers'])
        print('\n')

proxy.close()
driver.quit()
You can also collect the current page's response headers into a dict by issuing a HEAD request from the page itself with execute_script:
js_headers = '''
const _xhr = new XMLHttpRequest();
_xhr.open("HEAD", document.location, false);
_xhr.send(null);
const _headers = {};
_xhr.getAllResponseHeaders().trim().split(/[\\r\\n]+/).map((value) => value.split(/: /)).forEach((keyValue) => {
    _headers[keyValue[0].trim()] = keyValue[1].trim();
});
return _headers;
'''

page_headers = driver.execute_script(js_headers)
type(page_headers)  # -> dict
You can use https://pypi.org/project/selenium-wire/, a drop-in replacement for webdriver that adds request/response inspection, even for HTTPS, by using its own local SSL certificate.
from seleniumwire import webdriver
d = webdriver.Chrome() # make sure chrome/chromedriver is in path
d.get('https://en.wikipedia.org')
vars(d.requests[-1].headers)
will list the headers of the last entry in the requests list:
{'policy': Compat32(), '_headers': [('content-length', '1361'),
('content-type', 'application/json'), ('sec-fetch-site', 'none'),
('sec-fetch-mode', 'no-cors'), ('sec-fetch-dest', 'empty'),
('user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36'),
('accept-encoding', 'gzip, deflate, br')],
'_unixfrom': None, '_payload': None, '_charset': None,
'preamble': None, 'epilogue': None, 'defects': [], '_default_type': 'text/plain'}
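If you only need one header rather than the whole object, the headers attribute also supports dict-style access (a small sketch, assuming a reasonably recent selenium-wire version):
last = d.requests[-1]
print(last.headers.get("user-agent"))                  # a single request header
if last.response is not None:
    print(last.response.headers.get("content-type"))  # a single response header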

Download Specific file from Website with BeautifulSoup

Following the documentation of BeautifulSoup, I am trying to download a specific file from a webpage. First, I try to find the link that contains the file name:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find("a", text=re.compile("TASA_DOLAR_REFERENCIA_MC.xls"))
path = link.get('href')
print(f"{path}")
But with no success. Then trying to print every link on that page, I get no links:
import re
import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")

for link in parsed.find_all('a'):
    print(link.get('href'))
It looks like the URL of the file is dynamic: it appends a ?v=123456789 parameter to the end of the URL, like a file version, which is why I need to locate the file by its name.
(Eg https://cdn.bancentral.gov.do/documents/estadisticas/mercado-cambiario/documents/TASA_DOLAR_REFERENCIA_MC.xls?v=1612902983415)
Thanks.
Actually, you are dealing with a dynamic JavaScript page whose content is fully loaded via an XHR request once the page loads.
Below is a direct call to the back-end API, which identifies the content by page id (2538); from the returned HTML we can then locate and download your desired file.
import requests
from bs4 import BeautifulSoup


def main(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0'
    }
    with requests.Session() as req:
        req.headers.update(headers)
        data = {
            "id": "2538",
            "languageName": "es"
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.json()['result']['article']['content'], 'lxml')
        target = soup.select_one('a[href*=TASA_DOLAR_REFERENCIA_MC]')['href']
        r = req.get(target)
        with open('data.xls', 'wb') as f:
            f.write(r.content)


if __name__ == "__main__":
    main('https://www.bancentral.gov.do/Home/GetContentForRender')

I want to open the first link that appears when I do a search on Google

I want to get the first link from the HTML parser, but I'm not getting anything (I tried to print it).
Also, when I inspect the page in the browser, the links are under class='r'.
But when I print soup.prettify() and analyse it closely, I find there is no class='r'; instead there is class="BNeawe UPmit AP7Wnd".
Please help, thanks in advance!
import requests
import sys
import bs4
import webbrowser


def open_web(query):
    res = requests.get('https://google.com/search?q=' + query)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_elements = soup.select('.r a')
    link_to_open = min(1, len(link_elements))
    for i in range(link_to_open):
        webbrowser.open('https://google.com' + link_elements[i].get('href'))


open_web('youtube')
The problem is that Google serves different HTML when you don't specify a User-Agent in the headers. To add a User-Agent to your request, pass it via the headers= argument:
import requests
import bs4


def open_web(query):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
    res = requests.get('https://google.com/search?q=' + query, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_elements = soup.select('.r a')
    print(link_elements)


open_web('youtube')
Prints:
[<a href="https://www.youtube.com/?gl=EE&hl=et" onmousedown="return rwt(this,'','','','1','AOvVaw2lWnw7oOhIzXdoFGYhvwv_','','2ahUKEwjove3h7onkAhXmkYsKHbWPAUYQFjAAegQIBhAC','','',event)"><h3 class="LC20lb">
... and so on.
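To actually open the first result, as the original snippet intended, you can append something like this at the end of open_web (link_elements is the list built above; note that the .r selector may change as Google updates its markup):
import webbrowser

# Sketch: open the first search result, guarding against an empty list
if link_elements:
    webbrowser.open('https://google.com' + link_elements[0].get('href'))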
You received completely different HTML, with different elements and selectors, which is why the output is empty. The reason Google blocks your request is that the default requests user-agent is python-requests; Google recognizes it and blocks it. Check what your user-agent is.
The user-agent identifies the browser, its version number, and its host operating system, representing the person (browser) in a web context, and it lets servers and network peers tell whether the client is a bot or not.
Without it, you can sometimes receive different HTML with different selectors.
You can pass URL params as a dict(), which is more readable, and requests does everything for you automatically (the same goes for adding a user-agent to the headers):
params = {
    "q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
If you want to get the very first link then use select_one() instead.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from the JSON string rather than figuring out how to extract data, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] means first index of search results
link = results['organic_results'][0]['link']
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Disclaimer, I work for SerpApi.

404 error using urllib but URL works fine in browser AND the full web page is returned in the error

I'm trying to open a web page in Python using urllib (to scrape it). The web page looks fine in a browser, but I get a 404 error with urlopen. However, if I look at the text returned with the error, it actually contains the full web page.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen('http://www.enduroworldseries.com/series-rankings')
except HTTPError as e:
    err = e.read()
    code = e.getcode()
    print(err)
When I run the code, the exception is caught and 'code' is '404'. The err variable has the complete html that shows up if you look at the page in a browser. So why am I getting an error?
Not sure if it matters but other pages on the same domain load fine with urlopen.
I found a solution without learning what the initial problem was: I simply replaced urllib with the requests library.
req = Request('http://www.enduroworldseries.com/series-rankings', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")

Became

response = requests.get('http://www.enduroworldseries.com/series-rankings', headers={'User-Agent': 'Mozilla/5.0'})
bsObj = BeautifulSoup(response.content, "html.parser")
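If you want to stay with urllib, the question itself notes that the HTTPError carries the full page body, so as a small sketch you can parse it straight from the exception:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen('http://www.enduroworldseries.com/series-rankings').read()
except HTTPError as e:
    html = e.read()  # the 404 response still contains the full HTML

bsObj = BeautifulSoup(html, "html.parser")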

soup.select('.r a') in 'https://www.google.com/#q=vigilante+mic' gives empty list in python BeautifulSoup

I am using BeautifulSoup to extract all links from google search results page.
here's the snippet of the code:
import requests,bs4
res = requests.get('https://www.google.com/#q=vigilante+mic')
soup = bs4.BeautifulSoup(res.text)
linkElem = soup.select('.r a')
But soup.select('.r a') is returning an empty list
Thanks
That's because of the URL you are using:
https://www.google.com/#q=vigilante+mic
It is a JavaScript version of the search. If you curl it, you will see there are no results in the HTML. This happens because the results are fetched through JavaScript, and requests doesn't handle that.
Try this other URL instead (which is not JavaScript based):
https://www.google.com/search?q=vigilante+mic
Now it works:
import requests,bs4
res = requests.get('https://www.google.com/search?q=vigilante+mic')
soup = bs4.BeautifulSoup(res.text)
linkElem = soup.select('.r a')
Besides changing #q= to ?q=, another reason the result is empty is that no user-agent is specified, so Google blocks your request. Check what your user-agent is.
Code and example in the online IDE that scrapes more:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)

----------
'''
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisa.gov/cybersecurity
https://onlinedegrees.und.edu/blog/types-of-cyber-security-threats/
https://digitalguardian.com/blog/what-cyber-security
https://staysafeonline.org/
'''
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you only need to iterate over the JSON string without figuring out how to extract data or find CSS selectors that work.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "cyber security",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)

-----------------
'''
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisa.gov/cybersecurity
https://onlinedegrees.und.edu/blog/types-of-cyber-security-threats/
https://digitalguardian.com/blog/what-cyber-security
https://staysafeonline.org/
'''
P.S - I wrote a bit more detailed blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
