I've been trying to learn web scraping and to read information from this website: https://parade.com/936820/parade/good-morning-quotes/.
I was able to read in the info a few days ago using:
import requests
from bs4 import BeautifulSoup

URL = "https://parade.com/936820/parade/good-morning-quotes/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
But now I only get:
<html><head><title>parade.com</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMA9vD7hgJkQVEAPUSGag==','hsh':'2AC20A4365547ED96AE26618B66966','t':'fe','s':41010,'e':'0d3bf75fe73f118f69c86dfcd13d3a3bb5f4e364f75eb6dc0c26a7b793e181dc','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I've tried adding headers to the request, but that still gives me an HTTP 403 error.
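Something along these lines (the User-Agent value here is just a placeholder for what I actually sent):
# placeholder browser string, not the exact header I used
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
page = requests.get(URL, headers=headers)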
Related
How to scrape a link such as http://bitly.is/heretohelp in a tweet and then open the link automatically?
Tweepy gives you access to the Tweet text and conveniently exposes the URLs and hashtags. For example, to get the URLs within a Tweet:
# api is an authenticated tweepy.API instance
tweet = api.get_status(id='000001')
print(tweet.entities['urls'])
for url in tweet.entities['urls']:
    # shortened t.co URL
    print(url['url'])
    # original (expanded) URL
    print(url['expanded_url'])
Once you have the URLs, you can do whatever you want with them (scrape the target URL, or open it in a browser tab if you have a web app, for example).
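As a minimal sketch, opening each expanded URL in a local browser tab (reusing the tweet object from above) could look like this:
import webbrowser

# open each expanded URL in a new browser tab
for url in tweet.entities['urls']:
    webbrowser.open_new_tab(url['expanded_url'])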
I am trying to make a web scraper. I would like to get the destination URL from a query URL. But it redirects many times.
This is my URL:
https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO
Destination url should be:
https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO
But I am getting https://www.jw.org/ins/library/videos/?item=pub-jwb_201812_16_VIDEO&appLanguage=INS as the redirected URL instead.
I tried this code:
import requests
url = 'https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO'
s = requests.get(url)
print(s.url)
The redirect is made using JavaScript. It is not a server-side redirect, so requests does not follow it. You can get the final URL using Selenium:
from selenium import webdriver
import time

browser = webdriver.Chrome()
url = 'https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO'
browser.get(url)
# give the page's JavaScript time to perform the redirect
time.sleep(5)
print(browser.current_url)
browser.quit()
Outputs
https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO
If you are building a scraper, I would suggest checking out scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash) or requests-html (https://github.com/psf/requests-html).
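For example, a minimal sketch with requests-html, which renders the page's JavaScript in a headless Chromium (a starting point only; whether the JavaScript redirect is reflected in the rendered result may vary):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO')
# run the page's JavaScript; sleep gives the redirect time to fire
r.html.render(sleep=5)
print(r.html.html)  # the rendered page source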
For a plain server-side redirect, you can do this pretty easily using requests:
import requests
destination = requests.get("http://doi.org/10.1080/07435800.2020.1713802")
# this link redirects the user to the research paper for the given DOI code
print(destination.url)
# this prints "https://www.tandfonline.com/doi/full/10.1080/07435800.2020.1713802", the redirect target of the initial doi.org link
I am trying to scrape a website, but I get a 403 Forbidden error (which means they blocked me). How can I solve this problem?
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.headless = True
driver = webdriver.Chrome(options=options)
# url: the website I want to scrape
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
print(soup)
I got this error message:
<pre><html><head><title>You have been blocked</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><script async="" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=749975105" type="text/javascript"></script><script>var dd={'cid':'AHrlqAAAAAMAcW1trsuCoDEAXu-3KQ==','hsh':'53505CB4534F4422CC81E4A9499234','t':'fe'}</script><script src="https://ct.datado.me/c.js"></script><iframe border="0" frameborder="0" height="100%" scrolling="yes" src="https://c.datado.me/captcha/?initialCid=AHrlqAAAAAMAcW1trsuCoDEAXu-3KQ%3D%3D&hash=53505CB4534F4422CC81E4A9499234&cid=09ccOuPGIGlqdUvFNJgB7GzPDCFBmdMIU8Ng~E~1M6.&t=fe " style="height:100vh;" width="100%"></iframe><script type="text/javascript">
//<![CDATA[
(function() {
var _analytics_scr = document.createElement('script');
_analytics_scr.type = 'text/javascript'; _analytics_scr.async = true; _analytics_scr.src = '/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=749975105';
var _analytics_elem = document.getElementsByTagName('script')[0]; _analytics_elem.parentNode.insertBefore(_analytics_scr, _analytics_elem);
})();
// ]]>
</script>
</body></html>
</pre>
403 Forbidden
The HTTP 403 Forbidden client error status response code indicates that the server has received the request, but the client is not authorized and does not have access rights to the content.
This status is similar to 401, but in this case re-authenticating will make no difference: access is permanently forbidden and tied to the application logic, such as insufficient rights to a resource.
Example response
HTTP/1.1 403 Forbidden
Date: Sun, 16 Jun 2019 07:28:00 GMT
Reason
There are many ways for a headless Chrome browser to get detected, and some of the main factors include:
User agent
Plugins
Languages
WebGL
Browser features
Missing image
You can find a detailed discussion in Selenium and non-headless browser keeps asking for Captcha
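For instance, one of those signals, the user agent, can be overridden when launching headless Chrome (a sketch only; overriding a single signal will not defeat every check):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# replace the default HeadlessChrome user agent with a regular browser string
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
driver = webdriver.Chrome(options=options)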
Solution
A generic solution is to use a proxy, or rotating proxies, from the Free Proxy List.
You can find a detailed discussion in Change proxy in chromedriver for scraping purposes
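As a minimal sketch, routing chromedriver through a single proxy could look like this (the proxy address below is a placeholder, not a working endpoint):
from selenium import webdriver

options = webdriver.ChromeOptions()
# placeholder proxy address; substitute one from your proxy list
options.add_argument('--proxy-server=http://203.0.113.5:8080')
driver = webdriver.Chrome(options=options)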
I'm trying to scrape a website using requests and BeautifulSoup. When I run the code to obtain the tags of the webpage, the soup object is blank. I printed out the request object to see whether the request was successful, and it was not: the printed result shows response 447. I can't find what 447 means as an HTTP status code. Does anyone know how I can successfully connect and scrape the site?
Code:
r = requests.get('https://foobar')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.get_text())
Output:
''
When I print request object:
print(r)
Output:
<Response [447]>
Most likely the site recognizes your activity as automated and is blocking your access. You can fix this problem by including headers in your request to the site:
import bs4
import requests

session = requests.session()
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
# url: the site you want to scrape
req = session.get(url, headers=headers)
soup = bs4.BeautifulSoup(req.text, "html.parser")
Sounds like they have browser-detection software and they don't like your browser (meaning they don't like your lack of a browser).
While 447 is not a standard HTTP status code, it is occasionally used in SMTP to mean "too many requests".
Without knowing what particular website you are looking at, it's not likely anyone will be able to give you more information. Chances are you just need to add headers.
I'm using Python to open a URL with the following code, and sometimes I get this error:
from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read()
error:'\n\n403 Forbidden\n\nForbidden\nYou don\'t have permission to access /files/2554/2554.txt\non this server.\n\nApache Server at www.gutenberg.org Port 80\n\n'
What is this?
Thank you
This is the web page blocking Python access, because the default 'User-Agent' header Python sends identifies the request as a script.
To get around this, use the 'urllib2' module (part of the Python 2 standard library) and this code:
import urllib2

req = urllib2.Request(url, headers={'User-Agent': 'Chrome'})
raw = urllib2.urlopen(req).read()
You are now accessing the site with the User-Agent 'Chrome' and should no longer be forbidden (I tried it myself and it worked).
Hope this helps.
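For reference, a sketch of the Python 3 equivalent (urllib2 was merged into urllib.request in Python 3):
from urllib.request import Request, urlopen

url = "http://www.gutenberg.org/files/2554/2554.txt"
# a custom User-Agent keeps the request from being identified as a script
req = Request(url, headers={'User-Agent': 'Chrome'})
raw = urlopen(req).read()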