Python 3: received 403 Forbidden error when using requests

Either of the following two commands produces HTTP Error 403: Forbidden:
requests.get('http://www.allareacodes.com')
urllib.request.urlopen('http://www.allareacodes.com')
However, I am able to browse this website in Chrome and view its source. wget in my Cygwin installation can also fetch the HTML source.
Does anyone know how to fetch the source of this website using only Python packages?

Your requests snippet is incomplete. It should be:
import requests
r = requests.get('http://www.allareacodes.com')
print(r.text)
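In your case, however, the server most likely rejects the default User-Agent that Python HTTP clients send (requests identifies itself as python-requests/<version>), so a script gets 403 where a browser gets the page. As a workaround, send a browser-like User-Agent header so that the website treats the request as coming from an ordinary browser.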
Example:
import requests
r = requests.get('http://www.allareacodes.com', headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
print(r.text)
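The question also tried urllib.request.urlopen. A minimal sketch of the same idea with only the standard library follows, assuming the User-Agent header is what the server objects to. The names USER_AGENT, build_request and fetch are made up for this example.

```python
import urllib.request

# Browser-like User-Agent, copied from the requests example above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
)


def build_request(url):
    """Return a urllib Request that carries the browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download the page and return it as text (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch('http://www.allareacodes.com') should then behave like the requests version, although the site may have changed its checks since this was written.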

Related

Trackjs: ignore rules by token in user agent

In TrackJS, some user agents are parsed as normal browsers, e.g.:
Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)
TrackJS parses this as: Chrome Mobile 59.0.3071
I tried to exclude these with Ignore Rules in the settings, but that doesn't work.
So I need to filter errors by a token in the user agent.
Is it possible to do this without JS?
More similar user agents: https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
The TrackJS UI doesn't allow you to create Ignore Rules against the raw UserAgent, only the parsed browser and operating system. Instead, use the client-side ignore capability with the onError function.
Build your function to detect the tokens you want to exclude, and return false from the function if you don't want it to be sent.

Web Scraping from Oddschecker using BeautifulSoup

I was previously able to scrape data from https://www.oddschecker.com/ using BeautifulSoup, however, now all I am getting is the following 403 status:
import requests
import bs4
result = requests.get("https://www.oddschecker.com/")
result.text
Output:
<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n
I want to know if this is the same for all users on this website or if there is a way to navigate around this (via another web scraping package or other code) and access the actual data visible on the site.
Just add a User-Agent header. The site appears to treat clients that do not look like a normal, JavaScript-enabled browser as bots.
import requests
url = 'https://www.oddschecker.com/'
# Send a browser-like User-Agent instead of the default python-requests one
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.text)
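To see why the header makes a difference, here is a self-contained sketch that needs no external site. It starts a toy local server that returns 403 unless the User-Agent starts with "Mozilla/", then requests it twice. The filtering rule, PickyHandler, status_for and run_demo are all assumptions made for illustration; how oddschecker actually decides is not known.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)

# Ignore any system proxy so the local demo server is reached directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class PickyHandler(BaseHTTPRequestHandler):
    """Toy server: answers 403 unless the User-Agent looks like a browser."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        allowed = agent.startswith("Mozilla/")
        body = b"<html>ok</html>" if allowed else b"<html>403 Forbidden</html>"
        self.send_response(200 if allowed else 403)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def status_for(url, user_agent=None):
    """Return the HTTP status of a GET, optionally overriding User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with _opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; the code is on the error


def run_demo():
    """Request the toy server with the default and a browser-like User-Agent."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        return status_for(url), status_for(url, BROWSER_UA)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The first status comes from urllib's default Python-urllib/<version> User-Agent and the second from the browser-like one, so the demo should report 403 followed by 200.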
You can also use Selenium.
from selenium import webdriver
driver = webdriver.Chrome()  # assumes Chrome and a matching driver are installed
driver.get("https://www.oddschecker.com/")
print(driver.page_source)
driver.quit()

Python: Fail to request html from https site

I am trying to request the HTML from the website shown below, but it raises the following error:
'Connection aborted.', OSError("(54, 'ECONNRESET')"
I have also tried supplying the certificate, but that raises the following error:
Error: [('x509 certificate routines', 'X509_load_cert_crl_file', 'no certificate or crl found')]
The certificate was exported from Chrome.
Python Code:
import requests
from bs4 import BeautifulSoup
url ='https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
html=requests.get(url, verify=False)
#html=requests.get(url, verify="/Users/xxx/Documents/Python/Go Daddy Root Certificate Authority - G2.cer")
Can you try this?
I could not reproduce your exact environment, but when I requested the site from my PC it also failed at first. After I added a User-Agent header, the request worked.
I can't guarantee that it will behave the same way on your machine.
import requests
url = 'https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
html = requests.get(url, headers=headers)
print(html.text)

Python Web data collecting

I have 100 BOL (bill of lading) numbers that I need to look up on the website below. However, I can't find a URL into which I can substitute each number to automate the search. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests
url = 'https://www.msc.com/track-a-shipment'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}
form_data = {'first': 'true',
             'pn': '1',
             'kd': 'python'}
def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)
getJobs()
tldr: You'll likely need a browser automation tool such as Selenium (optionally running headless) to go to the page, input the code and click the search button.
The URL to retrieve is generated by the JavaScript that runs when you click search.
The search button posts the link to their server, so when it redirects you to that link the server knows which response to give you.
To generate the link automatically, you would have to analyze the JavaScript and understand how it builds the code, then build the code yourself, post it to their server, and make a subsequent GET request to retrieve the results, as the ASP.NET framework does.
Alternatively, you can use Selenium to go to the page, input the code and click the search button. After the browser navigates to the results, you can parse them from there.
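A rough sketch of the Selenium route for a whole list of codes, under loud assumptions: the three CSS selectors below are placeholders, not the real selectors of msc.com, and must be replaced after inspecting the page. track_all and lookup_with_selenium are names invented for this example.

```python
# PLACEHOLDER selectors (assumptions): inspect the real page and replace them.
SELECTORS = {
    "input": "#REPLACE-ME-tracking-input",
    "button": "#REPLACE-ME-search-button",
    "result": "#REPLACE-ME-results",
}


def track_all(codes, lookup):
    """Call lookup(code) for every code and collect the results by code."""
    results = {}
    for code in codes:
        try:
            results[code] = lookup(code)
        except Exception as exc:  # keep going if a single lookup fails
            results[code] = "lookup failed: %s" % exc
    return results


def lookup_with_selenium(driver, code, selectors=SELECTORS):
    """Type one code into the search form and return the result text."""
    # Imported here so track_all can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get("https://www.msc.com/track-a-shipment")
    wait = WebDriverWait(driver, 15)
    box = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selectors["input"]))
    )
    box.clear()
    box.send_keys(code)
    driver.find_element(By.CSS_SELECTOR, selectors["button"]).click()
    result = wait.until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, selectors["result"]))
    )
    return result.text


# Usage (needs Selenium, Chrome and real selectors):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       codes = ["MSCUZH129687", "MSCUJZ365758"]
#       print(track_all(codes, lambda c: lookup_with_selenium(driver, c)))
#   finally:
#       driver.quit()
```

Keeping the loop separate from the browser code means the same track_all can later be pointed at a plain HTTP lookup if the underlying request is ever worked out.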

Getting content from page which checks for js

I am using the "request" module to get page contents with the following headers:
var headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0'
};
Still, the page I am trying to fetch somehow returns different content than View > Source in the browser (it looks like it checks for JavaScript support). Before diving into PhantomJS (which I want to avoid because of its performance limitations), is there any way to get the HTML as it appears in the browser?
Thanks
