from requests import get
get('http://www.fb.com')
<Response [200]>
get('http://www.subscene.com')
<Response [403]>
I'm trying to build a web scraper to scrape and download subtitles. But I'm unable to request any subtitle pages as they are returning a response code 403.
HTTP Status Code 403 Forbidden means:
the server understood the request, but is refusing to fulfill it. Source
The server has most likely identified your script as something other than a regular browser (Chrome, Firefox, etc.), typically from its default User-Agent header, and is refusing to serve it. Sites commonly do this to keep scrapers out, which is exactly what you're trying to build.
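To see why a server can tell a script from a browser, it helps to look at the User-Agent a Python client announces by default. The sketch below uses the standard library's urllib rather than requests (requests behaves similarly and announces itself as python-requests/<version>); the helper name urllib_default_user_agent is made up for illustration.

```python
import urllib.request


def urllib_default_user_agent() -> str:
    """Return the User-Agent that urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request.
    return dict(opener.addheaders)["User-agent"]


# Prints a value of the form "Python-urllib/<major>.<minor>".
print(urllib_default_user_agent())
```

A value like this is trivial for a server to filter on, which is why replacing it with a browser-style string often changes the response.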
A workaround is to set a user-agent in your headers, like so:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
url = "http://www.subscene.com"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response) # <Response [200]>
That said, I'd advise looking for a site that provides some sort of API; relying on scraping isn't the best approach.
Related
I'm using node-fetch and https-proxy-agent to make a request through a proxy. However, I get a 400 error code from the site I'm scraping, but only when I pass the agent; without it, everything works fine.
import fetch from 'node-fetch';
import Proxy from 'https-proxy-agent';
const ip = PROXIES[Math.floor(Math.random() * PROXIES.length)]; // PROXIES is a list of ips
const proxyAgent = Proxy(`http://${ip}`);
fetch(url, {
  agent: proxyAgent,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36'
  }
}).then(res => res.text()).then(console.log)
This results in a 400 error code.
I have absolutely no idea why this is happening. If you want to reproduce the issue, I'm scraping https://azlyrics.com. Please let me know what is wrong.
The issue has been fixed. I had not noticed that I was making a request to an HTTPS site through an HTTP proxy: the site uses the https protocol, but the proxies were http only. Switching to HTTPS proxies works. Thank you.
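The same scheme mismatch can be guarded against in Python, where both requests (its proxies argument) and urllib.request.ProxyHandler take a mapping from URL scheme to proxy URL. This is a minimal sketch, assuming a hypothetical pool of proxies grouped by the scheme they can carry; the helper name pick_proxy and the addresses are placeholders, not values from the question.

```python
import random
from urllib.parse import urlsplit


def pick_proxy(target_url, proxies_by_scheme, rng=random):
    """Pick a proxy that can carry the target URL's scheme (http or https)."""
    scheme = urlsplit(target_url).scheme
    candidates = proxies_by_scheme.get(scheme, [])
    if not candidates:
        # Fail early instead of sending an https request through an http-only proxy.
        raise ValueError(f"no proxy available for {scheme!r} URLs")
    return {scheme: rng.choice(candidates)}


# Hypothetical pool: placeholder addresses, not real proxies.
PROXY_POOL = {
    "http": ["http://10.0.0.1:8080"],
    "https": ["https://10.0.0.2:8443"],
}

print(pick_proxy("https://azlyrics.com", PROXY_POOL))
```

The returned dictionary can be passed as proxies= to requests.get or wrapped in urllib.request.ProxyHandler. Raising on an empty candidate list turns a confusing 400 from the remote site into an explicit local error.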
I am trying to scrape the website "https://coinatmradar.com/". I am using requests, BeautifulSoup and Selenium (wherever required) to scrape data. But after a while, my IP got blocked by the website, as it is using Cloudflare protection.
import requests
from bs4 import BeautifulSoup

country_url = "https://coinatmradar.com/country/226/bitcoin-atm-united-states/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(country_url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
This is the part of the code that I am using. I am getting a 403 response. Is there another way to make it work with both requests and Selenium?
Try setting your headers like this:
headers = {'Cookie':'_gcar_id=0696b46733edeac962b24561ce67970199ee8668', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
I am trying to scrape a website's content and am getting a 403 Forbidden status. I have tried solutions like using sessions for cookies and mimicking a browser through a 'User-Agent' header. Here is the code I have been using:
import requests

session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}
page = session.get('https://www.sizeofficial.nl/product/zwart-new-balance-992/343646_sizenl/', headers=headers)
Note that this approach works on other websites; it is just this one that does not seem to work. I have even tried sending the other headers that my browser sends, and that does not work either. Another approach I have tried is to first create a session cookie and then pass that cookie to session.get, but it still doesn't work for me. Is scraping this website not allowed, or am I still missing something?
I am using Python 3.8 with requests for this.
I work in Python. I need to collect the information from this page:
http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317
I tried using requests
from bs4 import BeautifulSoup
import requests
import urllib.request
r=requests.post('http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317')
print(r.url)
the output is
http://appweb2.cndh.org.mx/SNA/inicio.asp
It always takes me to the home page. I need help. Thanks.
The cookie has to be sent with the request to get the proper response, because the server uses the session cookie to decide which page to serve.
Refer to the code below.
import requests
url = 'http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317'
headers={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','Accept-Encoding':'gzip, deflate','Accept-Language':'en-US,en;q=0.9','Cache-Control':'max-age=0','Connection':'keep-alive','Cookie':'ASPSESSIONIDASTTTRTA=DIKNJIGADBLCIMCOKEMONJCH','Host':'appweb2.cndh.org.mx','Upgrade-Insecure-Requests':'1','User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
page = requests.get(url, headers=headers).content
print(page)
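A hard-coded ASPSESSIONID value like the one above stops working once the server-side session expires. One alternative, sketched here with the standard library, is to let a cookie jar pick up a fresh session cookie before requesting the target page (requests.Session keeps cookies across requests in a similar way). This assumes the server issues the cookie on its start page, the inicio.asp URL shown in the question's output; the helper names are illustrative.

```python
import urllib.request
from http.cookiejar import CookieJar

START_URL = "http://appweb2.cndh.org.mx/SNA/inicio.asp"
TARGET_URL = (
    "http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp"
    "?Id_Aut=1063&Id_Estado=10&valorEF=317"
)


def build_session_opener(user_agent):
    """Create an opener that stores and resends cookies automatically."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_with_fresh_cookie(opener, start_url, target_url, timeout=10.0):
    """Visit start_url first so the jar holds a session cookie, then fetch target_url."""
    opener.open(start_url, timeout=timeout).close()
    with opener.open(target_url, timeout=timeout) as response:
        return response.read()
```

Typical use would be opener, jar = build_session_opener("Mozilla/5.0 ...") followed by fetch_with_fresh_cookie(opener, START_URL, TARGET_URL). Whether the site accepts a session created this way is an assumption that needs checking against the live server.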
I want to scrape data from a website; however I keep getting the HTTP: Error 405: Not Allowed. What am I doing wrong?
(I have looked at the documentation, and tried their code, with only my url in place of the example's; I still have the same error.)
Here's the code:
import requests, urllib
from urllib.request import Request, urlopen
list_url= ["http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm"]
for url in list_url:
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(req).read()
If I skip the user-agent term, I get HTTP Error 403: Forbidden.
In the past, I have successfully scraped data (from another website) using the following:
for url in list_url:
    raw_html = urllib.request.urlopen(url).read()
    soup = None
    soup = BeautifulSoup(raw_html, "lxml")
Ideally, I would like to keep a similar structure, that is, pass the content of the fetched url to BeautifulSoup.
Thanks!
The error page you are getting says "Pardon our Interruption. something about your browser made us think you were a bot". This implies that scraping isn't permitted and that the site has anti-bot protection on its pages.
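With urllib, that message sits in the body of the HTTPError rather than in a normal response, so it is easy to miss. A minimal sketch of how one might surface it, assuming the block page contains the "Pardon our Interruption" text quoted above; the function names are illustrative and nothing here is specific to Glassdoor's actual markup.

```python
import urllib.error
import urllib.request

BLOCK_MARKER = "Pardon our Interruption"


def is_bot_block(status, body):
    """Heuristic: a 403/405 whose body carries the anti-bot message."""
    return status in (403, 405) and BLOCK_MARKER.lower() in body.lower()


def fetch_status_and_body(url, user_agent, timeout=10.0):
    """Return (status, text) even when the server answers with an HTTP error."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError is also a response object: its body can be read.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Passing the result of fetch_status_and_body to is_bot_block distinguishes a deliberate block from a genuine method or permission error.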
Try mimicking a real browser; see "How to use Python requests to fake a browser visit?" for how to send requests with browser-like headers.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm'
web_page = requests.get(url, headers=headers)
I tried this and found that their page is loaded via JavaScript, so you will probably need a headless browser (Selenium or PhantomJS) to scrape the rendered HTML. Hope it helps.
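Before reaching for a headless browser, it can be worth confirming that the content really is missing from the HTML the server returns. The sketch below uses only the standard library to check whether a phrase appears in the visible text of a page, ignoring anything inside script or style tags; the names visible_text and content_in_static_html are hypothetical, and the sample markup is invented for the demonstration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def content_in_static_html(html, expected_phrase):
    """True if the phrase is present without running any JavaScript."""
    return expected_phrase.lower() in visible_text(html).lower()


# Invented sample: the heading is static, the review text exists only in a script.
SAMPLE = (
    "<html><body><h1>WhiteWave Reviews</h1>"
    "<script>var review = 'Great place';</script></body></html>"
)
print(content_in_static_html(SAMPLE, "WhiteWave Reviews"))
print(content_in_static_html(SAMPLE, "Great place"))
```

If a phrase you can see in the browser is absent from the static HTML, the page is probably filling it in with JavaScript, and a rendering tool is the more promising route.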
I'm not sure of the exact cause of the issue, but try this code; it works for me:
import http.client
connection = http.client.HTTPSConnection("www.glassdoor.com")
connection.request("GET", "/Reviews/WhiteWave-Reviews-E9768.htm")
res = connection.getresponse()
data = res.read()