Request Returns Response 447 - python-3.x

I'm trying to scrape a website using requests and BeautifulSoup. When I run the code to obtain the tags of the webpage, the soup object is blank. I printed out the request object to see whether the request was successful, and it was not. The printed result shows response 447. I can't find what 447 means as an HTTP status code. Does anyone know how I can successfully connect and scrape the site?
Code:
r = requests.get('https://foobar')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.get_text())
Output:
''
When I print the request object:
print(r)
Output:
<Response [447]>

Most likely the site has detected your activity and is blocking your access. You can fix this problem by including browser-like headers in your request to the site.
import bs4
import requests

url = 'https://foobar'  # the URL from the question
session = requests.session()
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}
req = session.get(url, headers=headers)
soup = bs4.BeautifulSoup(req.text, 'html.parser')

Sounds like they have browser-detection software and they don't like your browser (meaning they don't like your lack of a browser).
While 447 is not a standard HTTP status code, it is occasionally used in SMTP to mean too many requests.
Without knowing what particular website you are looking at, it's not likely anyone will be able to give you more information. Chances are you just need to add headers.
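If you want to confirm that headers are what the site is filtering on, a quick hedged check (the URL is a placeholder, since the site in question wasn't named) is to compare the status code with and without a browser-like User-Agent:
import requests

url = 'https://foobar'  # placeholder; substitute the real site
bare = requests.get(url)
disguised = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
# If only the second request comes back 200, the server is rejecting
# requests that identify themselves as python-requests.
print(bare.status_code, disguised.status_code)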

Related

Unwanted terminal HTTP request output in Python module Azure Identity 1.7.1

I fetch Azure client details using a Python script from VS Code. The output gets piled up with HTTP status lines, so I can't identify the original output among the status codes logged in a loop. How can I keep these HTTP statuses out of the terminal? (The code is running in a normal terminal, not in debug mode.)
As you need a specific value from the output, you can follow the steps below in your Python code to get the content from the response.
Import the requests module and fetch the URL into a response object; after that you can pull the content with r.text.
Below are a few samples of how to achieve it:
import requests

r = requests.get('URL')
r.status_code              # get the status code
r.headers                  # all response headers
r.headers['Content-Type']  # a single header
r.text                     # get the content
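If the goal is simply to keep Azure Identity's HTTP log lines out of the terminal, a minimal sketch using Python's standard logging module may help; it assumes the SDK emits those lines through loggers under the azure namespace, which current versions of azure-identity do:
import logging
from azure.identity import DefaultAzureCredential

# Raise the threshold for all Azure SDK loggers so per-request
# HTTP status lines below WARNING are no longer printed.
logging.getLogger('azure').setLevel(logging.WARNING)

credential = DefaultAzureCredential()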

Setting headers in requests.session() creates timeout issue (socket.timeout: The read operation timed out)

I am working on Python 3.7 and trying to scrape data from a website; below is my code:
import requests

session = requests.session()
session.headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46'}
base_url = "any_website"
res = session.get(base_url, timeout=25)
The above code raises a socket.timeout: The read operation timed out exception.
However, after removing the headers, the same code works.
Can anyone help me with the issue?
I'm getting a 200 response when I copy and paste your code. I tried e.g. https://educk.io as a test. What site did you use for your request? Might it be a server-side issue after all?
(Sorry for not putting this into a comment. It seems I have not earned the reputation for commenting yet.)
Just to make sure: I'm also on Python 3.7.
Strange, but removing the User-Agent solved the issue.
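If a particular User-Agent makes the server stall or withhold its response, one hedged workaround is to pass the headers per request and catch the timeout explicitly (the URL is a placeholder):
import requests

url = 'https://example.com'  # placeholder
try:
    res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=25)
    print(res.status_code)
except requests.exceptions.Timeout:
    # The server accepted the connection but never finished responding;
    # try a different User-Agent, or none at all.
    print('request timed out')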

Heroku web scraping app (usually but not always) gets 403 error on most websites

I have a web scraping app hosted on Heroku that I use to scrape about 40 company web pages. 27 of them will almost always give me 403 errors on Heroku, but every page works fine if I run the code locally.
After about 25 minutes of running the app and getting 403 errors (the timeframe varies a lot), all of the pages magically start working, but will return 403s again if the app restarts.
How can I prevent these 403 errors from happening at all? Relevant code as follows:
from bs4 import BeautifulSoup as soup
import urllib.request as ureq
from urllib.error import HTTPError
import time

def scraper(url):
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'
    ufile0 = ureq.Request(url, headers={'User-Agent': user_agent,
                                        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                                        'Referer': 'https://www.google.com/'})
    try:
        ufile1 = ureq.urlopen(ufile0)
    except HTTPError as err:
        if err.code == 504:
            print('504, need to drop it right now')
            return
        elif err.code == 403:
            print('403ed oof')
            return
        else:
            print('unknown http error')
            raise
    text = ufile1.read()
    ufile1.close()
    psoup = soup(text, "html.parser")

while 1:
    url = 'http://ir.nektar.com/press-releases?page=0'
    scraper(url)
    time.sleep(7)
I had a similar problem. My Django web application was working fine locally, but after deploying to Heroku it was returning nothing. I fixed it by using a background worker.
I found this in the Heroku documentation:
Another possibility is that you are trying to do some sort of long-running task inside your web process, such as:
Sending an email
Accessing a remote API (posting to Twitter, querying Flickr, etc.)
Web scraping / crawling
Rendering an image or PDF
Heavy computation (computing a fibonacci sequence, etc.)
Heavy database usage (slow or numerous queries, N+1 queries)
If so, you should move this heavy lifting into a background job which can run asynchronously from your web request. See Worker Dynos, Background Jobs and Queueing for details.
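For illustration, a minimal sketch of what that background job might look like with RQ on Heroku; the module path tasks.scraper, the REDIS_URL config var, and the choice of RQ itself are assumptions, not part of the answer above:
import os
from redis import Redis
from rq import Queue

# Enqueue the scrape so the web dyno returns immediately;
# a separate worker dyno picks up and runs scraper(url).
q = Queue(connection=Redis.from_url(os.environ['REDIS_URL']))
job = q.enqueue('tasks.scraper', 'http://ir.nektar.com/press-releases?page=0')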
You may need to run this behind a proxy. Fixie is available as a Heroku add-on.
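A hedged sketch of routing the question's urllib calls through Fixie; it assumes the add-on has set the FIXIE_URL config var, which Fixie does on install:
import os
import urllib.request as ureq

# Route all subsequent ureq.urlopen() calls through the Fixie proxy,
# so requests leave Heroku from a stable, whitelisted IP.
proxy = ureq.ProxyHandler({'http': os.environ['FIXIE_URL'],
                           'https': os.environ['FIXIE_URL']})
ureq.install_opener(ureq.build_opener(proxy))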

PRIVACY Card API Authorization

I have recently been working with APIs, but I am stuck on one thing and it's been holding me back for a few days.
I am trying to work with Privacy's API and I do not understand the authentication/authorization process. When I enter the URL in a browser I get the error "message": "Please provide API key in Authorization header", even when I use the correct format of authorization. I also get an error when I make a request in Python. The format I'm using for the URL is https://api.privacy.com/v1/card with "Authorization: api-key:".
If someone could explain how to work with this, or simply give an example of how I would make a request through Python 3, I'd appreciate it. The API information is in the link below.
Thank you in advance.
https://developer.privacy.com/docs
This is the code I am using in Python. After I run it, I receive a 401 status code.
import requests
headers = {'Authorization': 'api-key:200e6036-6894-xxxx-xxxx-xxxx'}
url = 'https://api.privacy.com/v1/card'
r = requests.get(url)
print("Status code:", r.status_code)
You need to add the authentication header to the get call. It isn't enough to define it in a headers variable; you need to actually pass those headers to requests:
import requests
response = requests.get('https://api.privacy.com/v1/card', headers={'Authorization': 'api-key 65a9566c-XXXXXXXXXXXX'})
print(response.json())
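Note that this header uses 'api-key' followed by a space and the key, not a colon, which appears to match the format in Privacy's documentation. For completeness, a hedged variant that checks the status before decoding (the key is a placeholder):
import requests

headers = {'Authorization': 'api-key 65a9566c-XXXXXXXXXXXX'}  # placeholder key
response = requests.get('https://api.privacy.com/v1/card', headers=headers)
if response.ok:
    print(response.json())
else:
    # A 401 here usually means the key or the header format is wrong.
    print('Request failed:', response.status_code, response.text)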

urlopen(url) 403 Forbidden error

I'm using Python to open a URL with the following code, and sometimes I get this error:
from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read()
error:'\n\n403 Forbidden\n\nForbidden\nYou don\'t have permission to access /files/2554/2554.txt\non this server.\n\nApache Server at www.gutenberg.org Port 80\n\n'
What is this?
Thank you
This is the web page blocking Python's access: the request goes out with urllib's default 'User-Agent' header, which the server rejects.
To get around this, use the 'urllib2' module (part of the Python 2 standard library) and this code:
req = urllib2.Request(url, headers ={'User-Agent':'Chrome'})
raw = urllib2.urlopen(req).read()
You are now accessing the site with the User-Agent 'Chrome' and should no longer be forbidden (I tried it myself and it worked).
Hope this helps.
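On Python 3, where urllib2 no longer exists, the equivalent sketch uses urllib.request:
import urllib.request

url = "http://www.gutenberg.org/files/2554/2554.txt"
# Supply a custom User-Agent so urllib's default one isn't sent.
req = urllib.request.Request(url, headers={'User-Agent': 'Chrome'})
raw = urllib.request.urlopen(req).read()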
