Being blocked when I want to scrape a website - python-3.x

I am trying to scrape a website, but I ran into a 403 Forbidden error (which means they blocked me). How can I solve this problem?
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.headless = True
driver = webdriver.Chrome(options=options)
# url: the website I want to scrape
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
print(soup)
I got this error message:
<html><head><title>You have been blocked</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><script async="" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=749975105" type="text/javascript"></script><script>var dd={'cid':'AHrlqAAAAAMAcW1trsuCoDEAXu-3KQ==','hsh':'53505CB4534F4422CC81E4A9499234','t':'fe'}</script><script src="https://ct.datado.me/c.js"></script><iframe border="0" frameborder="0" height="100%" scrolling="yes" src="https://c.datado.me/captcha/?initialCid=AHrlqAAAAAMAcW1trsuCoDEAXu-3KQ%3D%3D&hash=53505CB4534F4422CC81E4A9499234&cid=09ccOuPGIGlqdUvFNJgB7GzPDCFBmdMIU8Ng~E~1M6.&t=fe" style="height:100vh;" width="100%"></iframe><script type="text/javascript">
//<![CDATA[
(function() {
var _analytics_scr = document.createElement('script');
_analytics_scr.type = 'text/javascript'; _analytics_scr.async = true; _analytics_scr.src = '/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=749975105';
var _analytics_elem = document.getElementsByTagName('script')[0]; _analytics_elem.parentNode.insertBefore(_analytics_scr, _analytics_elem);
})();
// ]]>
</script>
</body></html>

403 Forbidden
The HTTP 403 Forbidden client error status response code indicates that the server has received the request but the client is not authorized and does not have access rights to the content.
This status is similar to 401, but in this case, re-authenticating will make no difference. The access is permanently forbidden and tied to the application logic, such as insufficient rights to a resource.
Example response
HTTP/1.1 403 Forbidden
Date: Sun, 16 Jun 2019 07:28:00 GMT
Reason
There are many ways for a headless Chrome browser to get detected, and some of the main factors include:
User agent
Plugins
Languages
WebGL
Browser features
Missing image
You can find a detailed discussion in Selenium and non-headless browser keeps asking for Captcha
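As an illustration, here is a minimal sketch (assuming Chrome/ChromeDriver) of tweaks that address a few of the factors above, such as the user agent, languages, and window size; they reduce the headless fingerprint but do not guarantee you won't be detected:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.headless = True
# The default headless user agent contains "HeadlessChrome"; override it
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
# Set an explicit language, which headless Chrome may otherwise omit
options.add_argument("--lang=en-US")
# Use a realistic window size instead of the small headless default
options.add_argument("--window-size=1920,1080")
driver = webdriver.Chrome(options=options)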
Solution
A generic solution would be to use a proxy or rotating proxies from the Free Proxy List.
You can find a detailed discussion in Change proxy in chromedriver for scraping purposes
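As a sketch of the single-proxy case (the address below is a placeholder, not a live proxy), Chrome accepts a proxy through the --proxy-server switch:
from selenium import webdriver

PROXY = "11.22.33.44:8080"  # placeholder; substitute a live proxy from the Free Proxy List

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://%s" % PROXY)
driver = webdriver.Chrome(options=options)
driver.get(url)  # url: the website you want to scrape
For rotating proxies you would pick a fresh address from the list and restart the driver whenever you get blocked.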

Related

WebScraping HTTP 403/No Content

I've been trying to learn web scraping by reading information from this website: https://parade.com/936820/parade/good-morning-quotes/.
I was able to read in the info a few days ago using:
URL = "https://parade.com/936820/parade/good-morning-quotes/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
but now I only get:
<html><head><title>parade.com</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMA9vD7hgJkQVEAPUSGag==','hsh':'2AC20A4365547ED96AE26618B66966','t':'fe','s':41010,'e':'0d3bf75fe73f118f69c86dfcd13d3a3bb5f4e364f75eb6dc0c26a7b793e181dc','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I've tried adding headers, but that gives me an HTTP 403 error.

How to catch exception info for the wrong url when to open it with selenium?

When you input http://www.python.or/ (intentionally using a wrong URL) in Firefox or other browsers, the browser shows something like this:
The connection was reset
The connection to the server was reset while the page was loading.
The site could be temporarily unavailable or too busy. Try again in a few moments.
If you are unable to load any pages, check your computer’s network connection.
If your computer or network is protected by a firewall or proxy, make sure that Firefox is permitted to access the Web.
Now let's do the same task with selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("http://www.python.or")
When you execute the above code in the console, there is no error info. How can I catch the exception with Selenium, the way Firefox does?
I suggest you try plain HTTP requests.
First, request the URL:
import requests
r = requests.get(your_url)
Now you need to get the status_code from your request:
print(r.status_code)
r.status_code contains the HTTP status code of the response.
Here is a list of all the HTTP status codes
Try this and you'll know when something went wrong (e.g. a 404 error).
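As a complement, depending on the browser and driver version, Selenium itself may raise a WebDriverException (e.g. for net::ERR_NAME_NOT_RESOLVED) when navigation fails, so wrapping get() in a try/except is worth trying as well; a minimal sketch:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

browser = webdriver.Chrome()
try:
    browser.get("http://www.python.or")
except WebDriverException as e:
    # Catches navigation-level failures such as DNS resolution errors
    print("Could not open the URL:", e)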

Web page doesn't load when opened through a bot or requested through any other method in python code

I am trying to scrape https://www.hyatt.com. It is not for illegal use; I just want to make a simple script to find a hotel that matches my search.
But the problem is I am unable to even load the webpage using any bot. It simply does not load.
Here are some approaches I have already tried:
1 - used Selenium
2 - used the Scrapy framework to get the data
3 - used the Python requests library
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.hyatt.com")
driver.close()
I just want the page to load. I will take care of the rest.
I took your code, added a few tweaks, and ran the same test at my end:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.hyatt.com")
WebDriverWait(driver, 20).until(EC.title_contains("Hyatt"))
print(driver.title)
driver.quit()
Eventually I ran into the same issue. Using Selenium I was also unable to even load the web page. But when I inspected the console errors within google-chrome-devtools, it clearly showed that:
Failed to load resource: the server responded with a status of 404 () https://www.hyatt.com/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint
404 Not Found
The HTTP 404 Not Found client error response code indicates that the server can't find the requested resource. Links which lead to a 404 page are often called broken or dead links, and can be subject to link rot.
A 404 status code does not indicate whether the resource is temporarily or permanently missing. But if a resource is permanently removed, ideally a 410 (Gone) should be used instead of a 404 status.
Moving ahead, while inspecting the HTML DOM of https://www.hyatt.com/ it was observed that some of the <script> and <noscript> tags refer to akam:
<script type="text/javascript" src="https://www.hyatt.com/akam/10/28f56097" defer=""></script>
<noscript><img src="https://www.hyatt.com/akam/10/pixel_28f56097?a=dD02NDllZTZmNzg1NmNmYmIyYjVmOGFiOGYwMWI5YWMwZmM4MzcyZGY5JmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript>
This is a clear indication that the website is protected by the Bot Management service provider Akamai Bot Manager, and that navigation by a WebDriver-driven browser client gets detected and subsequently blocked.
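For completeness, here is a sketch of commonly suggested ChromeOptions tweaks that remove the most obvious automation markers (such as the navigator.webdriver flag); note that a service like Akamai Bot Manager may still detect the browser through other signals:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Drop the "Chrome is being controlled by automated test software" infobar
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Stop Blink from exposing navigator.webdriver = true
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get("https://www.hyatt.com")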
Outro
You can find some more relevant discussions in:
Unable to use Selenium to automate Chase site login
How does recaptcha 3 know I'm using selenium/chromedriver?
Selenium and non-headless browser keeps asking for Captcha

Request Returns Response 447

I'm trying to scrape a website using requests and BeautifulSoup. When I run the code to obtain the tags of the web page, the soup object is blank. I printed the request object to see whether the request was successful, and it was not. The printed result shows response 447. I can't find what 447 means as an HTTP status code. Does anyone know how I can successfully connect and scrape the site?
Code:
r = requests.get('https://foobar')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.get_text())
Output:
''
When I print request object:
print(r)
Output:
<Response [447]>
Most likely the site recognizes your activity as automated and is blocking your access. You can fix this problem by including headers in your request to the site:
import bs4
import requests

session = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}
req = session.get(url, headers=headers)  # url: the page you are scraping
soup = bs4.BeautifulSoup(req.text, "html.parser")
Sounds like they have browser detection software and they don't like your browser (meaning they don't like your lack of a browser).
While 447 is not a standard HTTP status code, it is occasionally used in SMTP to mean too many requests.
Without knowing what particular website you are looking at, it's not likely anyone will be able to give you more information. Chances are you just need to add headers.

urlopen(url) 403 Forbidden error

I'm using Python to open a URL with the following code, and sometimes I get this error:
from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read()
error:'\n\n403 Forbidden\n\nForbidden\nYou don\'t have permission to access /files/2554/2554.txt\non this server.\n\nApache Server at www.gutenberg.org Port 80\n\n'
What is this?
Thank you
This is the web page blocking Python access, because the default 'User-Agent' header identifies the request as a script rather than a browser.
To get around this, use the 'urllib2' module (included with Python 2) and this code:
import urllib2

req = urllib2.Request(url, headers={'User-Agent': 'Chrome'})
raw = urllib2.urlopen(req).read()
You are now accessing the site with the User-Agent 'Chrome' and should no longer be forbidden (I tried it myself and it worked).
Hope this helps.
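Since the question is tagged python-3.x, note that urllib2 exists only on Python 2; the Python 3 equivalent of the snippet above uses urllib.request:
from urllib.request import Request, urlopen

url = "http://www.gutenberg.org/files/2554/2554.txt"
req = Request(url, headers={'User-Agent': 'Chrome'})
raw = urlopen(req).read()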
