I am using requests and beautifulsoup to go through the popular comic store comixology in order to make a list of all comic titles and issues and release date for all of them, so I am requesting a massive amount of web pages. Unfortunately, partway through i will get the error:
you do not have access to (URL) on this server
I tried using a function that recursively tries the request. but this isn't working
Im not putting the whole code in because it is very long.
def getUrl(url):
try:
page = requests.get(url)
except:
getUrl(url)
return page
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is to set a legitimate user-agent. Therefore add headers to emulate a browser. :
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('http://example.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
Related
i am trying to scrape a website "https://coinatmradar.com/" . I am using requests, beautifulsoup and selenium (wherever required) to scrape data. But after a while, my ip got blocked by the website as it was using cloudflare protection.
country_url = "https://coinatmradar.com/country/226/bitcoin-atm-united-states/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response=requests.get(country_url, headers=headers)
soup=BeautifulSoup(response.content,'lxml')
This is the part of code that i am using. I am getting response 403. Is there other way around to make it work with requests and selenium both?
Try to set your headers like that:
headers = {'Cookie':'_gcar_id=0696b46733edeac962b24561ce67970199ee8668', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
I want to send http request using protobuf as params on python. I copied the protobuf data from charles proxy (web debugging proxy tool).
the protobuf text request data was:1 { 1: "2345654456765" }
i tried this but not working:
import requests
r = requests.post('https://api.website.com/version/auth/login?locale=en',data={1:{1:'2345654456765'}},headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4181.9 Safari/537.36','platform': 'web',})
print(r.content)
I have no idea of how can I put this as a param. I always worked with json data. Is there anyone who knows the solution?
i actually work in python. I need to collect the following information from this website
http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317
I tried using requests
from bs4 import BeautifulSoup
import requests
import urllib.request
r=requests.post('http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317')
print(r.url)
the output is
http://appweb2.cndh.org.mx/SNA/inicio.asp
always take the home page. i need help. thanks
The cookie has to be passed with the request to get the proper response as the server identifies the request page with cookie only.
Refer the code below.
import requests
url = 'http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317'
headers={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','Accept-Encoding':'gzip, deflate','Accept-Language':'en-US,en;q=0.9','Cache-Control':'max-age=0','Connection':'keep-alive','Cookie':'ASPSESSIONIDASTTTRTA=DIKNJIGADBLCIMCOKEMONJCH','Host':'appweb2.cndh.org.mx','Upgrade-Insecure-Requests':'1','User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
page = requests.get(url, headers=headers).content
print(page)
As we know, the general request headers are always like "User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36", or "Accept:application/json, text/plain".
But this time, I find a request header: "41d9251ae3b6e89193fe:1b237fc9847ec56e144031e03cc72d704777ef4167026f236eb3dd8d2c5b15ad837a20c8ce14459ae9f5c36e581b1322229b548178cce6cdf07ebbea7765f2df", and I have never seen a request header like that.
What does it mean?
Is there a way to log which browser/OS/etc. the user is using from my Node.js app?
Thanks
I believe you want the information stored in the Request Header "User-Agent"
var useragent = request.headers['User-Agent']
My user-agent is: "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.220 Safari/535.1" for chrome