I am trying to scrape a website's content and am getting a 403 Forbidden status. I have tried solutions such as using sessions for cookies and mimicking a browser through a 'User-Agent' header. Here is the code I have been using:
import requests
session = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}
page = session.get('https://www.sizeofficial.nl/product/zwart-new-balance-992/343646_sizenl/', headers = headers)
Note that this approach works on other websites; it is only this one that fails. I have also tried sending the other headers that my browser sends, without success. Another approach I tried was to first create a session cookie and then pass that cookie to session.get, which still doesn't work. Is scraping this website not allowed, or am I still missing something?
I am using Python 3.8 with requests for this.
I'm using node-fetch and https-proxy-agent to make a request through a proxy. However, I get a 400 error code from the site I'm scraping, but only when I pass the agent; without it, everything works fine.
import fetch from 'node-fetch';
import Proxy from 'https-proxy-agent';
const ip = PROXIES[Math.floor(Math.random() * PROXIES.length)]; // PROXIES is a list of ips
const proxyAgent = Proxy(`http://${ip}`);
fetch(url, {
agent: proxyAgent,
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36'
}
}).then(res => res.text()).then(console.log)
This results in a 400 error code.
I have absolutely no idea why this is happening. If you want to reproduce the issue, I'm scraping https://azlyrics.com. Please let me know what is wrong.
The issue has been fixed. I had not noticed that I was making a request to an HTTPS site through an HTTP proxy: the site uses HTTPS, but the proxies were HTTP only. Switching to HTTPS proxies works. Thank you.
I am using axios with an API (the CoWIN API, https://apisetu.gov.in/public/marketplace/api/cowin/cowin-public-v2) that has fairly strong protection against automated web requests.
When I got a 403 error on my dev machine (Windows), I solved it by adding a 'User-Agent' header.
After deploying to Heroku, however, I still get the same error.
const { data } = await axios.get(url, {
headers: {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
},
})
Using a fake user-agent in your headers can help with this problem, but there are other variables you may want to consider.
For example, if you are making multiple HTTP requests, you may want to keep several fake user-agents and pick one at random for every request. This can help limit the chances of your scraper being detected.
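A minimal sketch of that rotation in Python. The pool below simply reuses user-agent strings quoted elsewhere in this thread, and the helper name `random_headers` is made up for illustration:

```python
import random

# Pool of user-agent strings to rotate through (taken from the posts above).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
]


def random_headers():
    """Return a headers dict with a user-agent picked at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

With requests, this would be called once per request, for example `requests.get(url, headers=random_headers())`. Whether rotation is enough depends on the site.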
If that still doesn't work, you may want to optimize your headers further. Besides sending a randomized user-agent, you can imitate a browser's request headers more closely by adding headers beyond "user-agent", and then making sure that the selected user-agent is consistent with the information sent in the rest of the headers.
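One way to keep things consistent is to store whole header sets per browser rather than loose user-agent strings. The sketch below is only an illustration: the profile contents, `BROWSER_PROFILES`, `pick_profile` and `is_consistent` are assumptions, not something a particular site is known to require.

```python
import random

# Each profile is a complete, internally consistent set of headers.
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA-Platform": '"Windows"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA-Platform": '"macOS"',
    },
]


def pick_profile():
    """Return a copy of one randomly chosen header profile."""
    return dict(random.choice(BROWSER_PROFILES))


def is_consistent(headers):
    """Check that the platform hint agrees with the platform named in the user-agent."""
    user_agent = headers["User-Agent"]
    platform = headers["Sec-CH-UA-Platform"].strip('"')
    if "Windows" in user_agent:
        return platform == "Windows"
    if "Macintosh" in user_agent:
        return platform == "macOS"
    return False
```

Picking a whole profile avoids combinations such as a Windows user-agent sent together with a macOS platform hint.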
You can check out here for more information.
That site not only explains how to keep your headers consistent with the user-agent, but also offers further solutions in case the above is still unsuccessful.
In my situation, I had to bypass Cloudflare. You can tell whether this applies to you by logging the error to the terminal and checking whether the "server" key says "cloudflare". If so, you can use this documentation for further assistance.
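The "server" check can be sketched in Python as follows. The helper name `served_by_cloudflare` is hypothetical; it only inspects a mapping of response headers, such as `response.headers` from requests:

```python
def served_by_cloudflare(headers):
    """Return True if the Server response header mentions Cloudflare."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "server":
            return "cloudflare" in value.lower()
    return False
```

For example, after `response = requests.get(url)` you could call `served_by_cloudflare(response.headers)` before deciding which workaround to try.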
from requests import get
get('http://www.fb.com')
<Response [200]>
get('http://www.subscene.com')
<Response [403]>
I'm trying to build a web scraper to scrape and download subtitles, but I'm unable to request any subtitle pages because they return a 403 response code.
HTTP Status Code 403 Forbidden means:
the server understood the request, but is refusing to fulfill it. Source
The server most likely identified your script as something other than a standard browser (Chrome, Firefox, etc.) and is refusing to "speak" with it. Sites commonly do this to block scrapers, which is exactly what you're trying to build...
A workaround is to set a user-agent in your headers, like so:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
url = "http://www.subscene.com"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response) # <Response [200]>
But I advise you to look for a site that provides some sort of API; relying on scraping isn't the best approach.
I want to scrape data from a website; however I keep getting the HTTP: Error 405: Not Allowed. What am I doing wrong?
(I have looked at the documentation, and tried their code, with only my url in place of the example's; I still have the same error.)
Here's the code:
import requests, urllib
from urllib.request import Request, urlopen
list_url= ["http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm"]
for url in list_url:
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(req).read()
If I skip the user-agent term, I get HTTP Error 403: Forbidden.
In the past, I have successfully scraped data (from another website) using the following:
for url in list_url:
    raw_html = urllib.request.urlopen(url).read()
    soup = None
    soup = BeautifulSoup(raw_html, "lxml")
Ideally, I would like to keep a similar structure, that is, pass the content of the fetched url to BeautifulSoup.
Thanks!
The error you are getting is "Pardon our Interruption. something about your browser made us think you were a bot". This implies that scraping isn't permitted and that the site uses anti-bot protection on its pages.
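If you want your script to notice this kind of block page rather than parse it as real content, a rough check could look like the sketch below. The marker string comes from the message quoted above; the helper name `looks_blocked` and the choice of status codes (the 403 and 405 mentioned in the question) are assumptions for illustration:

```python
# Text fragments that suggest the response is an anti-bot interstitial.
BLOCK_MARKERS = ("pardon our interruption",)

# Status codes the question reports seeing when the request is rejected.
BLOCK_STATUS_CODES = (403, 405)


def looks_blocked(status_code, body):
    """Return True if the status code or page text suggests a bot block."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)
```

With requests, this could be called as `looks_blocked(web_page.status_code, web_page.text)`.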
Try imitating a browser. See "How to use Python requests to fake a browser visit?" for how to make requests this way.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm'
web_page = requests.get(url,headers=headers)
I tried this and found that their page is loaded via JS, so you might want to use a headless browser (Selenium / PhantomJS) and scrape the rendered HTML pages. Hope it helps.
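A minimal sketch of the headless-browser route, assuming Selenium 4 with Chrome and a matching driver available on the machine. The function name `fetch_rendered_html` is made up for illustration, and nothing here guarantees the site will not still block the request:

```python
def fetch_rendered_html(url):
    """Load a page in headless Chrome and return the HTML after JS has run."""
    # Imported here so this module can be loaded even without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned string can be passed to BeautifulSoup in the same way as the raw HTML in the question, e.g. `BeautifulSoup(fetch_rendered_html(url), "lxml")`.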
I'm not sure of the exact reason for the issue, but try this code; it works for me:
import http.client
connection = http.client.HTTPSConnection("www.glassdoor.com")
connection.request("GET", "/Reviews/WhiteWave-Reviews-E9768.htm")
res = connection.getresponse()
data = res.read()
I'd like to know how to use cookies with request (https://github.com/mikeal/request).
I need to set a cookie that is sent to every subdomain, with a domain
something like
*.examples.com
and a path that covers every page, something like
/
so that the server side can read the cookie data correctly, something like
test=1234
I found that cookies set from a response work fine.
I added a custom jar to save the cookies, something like
var theJar = request.jar();
var theRequest = request.defaults({
headers: {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36'
}
, jar: theJar
});
but the cookies that I set from the request side are only sent to the same domain,
and I can't find a way to set a cookie with more options.
For now, if I want one cookie that is sent to three subdomains,
I have to set it up like this:
theJar.setCookie('test=1234', 'http://www.examples.com/', {"ignoreError":true});
theJar.setCookie('test=1234', 'http://member.examples.com/', {"ignoreError":true});
theJar.setCookie('test=1234', 'http://api.examples.com/', {"ignoreError":true});
Is there a better way to set a cookie from request
so that it is sent to every subdomain?
I just found the solution:
theJar.setCookie('test=1234; path=/; domain=examples.com', 'http://examples.com/');
Hm... I have to say, the documentation for request is not so good, lol.
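For comparison only, the same idea can be sketched with Python's requests library, which is a different library from the Node.js request module used above. Here the cookie is set once with a domain and path, and the hypothetical helper `cookie_header_for` shows which Cookie header the jar would produce for a given URL:

```python
import requests
from requests.cookies import get_cookie_header

session = requests.Session()
# One cookie scoped to the parent domain and the root path.
session.cookies.set("test", "1234", domain=".examples.com", path="/")


def cookie_header_for(url):
    """Return the Cookie header the session's jar would send to url, or None."""
    prepared = requests.Request("GET", url).prepare()
    return get_cookie_header(session.cookies, prepared)
```

Calling `cookie_header_for("http://api.examples.com/")` or the same for `www` and `member` should yield the cookie, while an unrelated host should get nothing. No network request is made by this check.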