I was previously able to scrape data from https://www.oddschecker.com/ using BeautifulSoup. However, all I get now is the following 403 response:
import requests
import bs4
result = requests.get("https://www.oddschecker.com/")
result.text
Output:
<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n
I want to know if this is the same for all users on this website or if there is a way to navigate around this (via another web scraping package or other code) and access the actual data visible on the site.
Just add a User-Agent header. The site appears to treat requests without a browser-like user agent as coming from a bot and answers them with 403.
url = 'https://www.oddschecker.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.text)
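If the user agent alone is not enough, a slightly fuller set of browser-like headers sometimes helps. The following is a minimal sketch, not a guaranteed fix: the helper names `browser_headers` and `fetch_html` are made up for illustration, the `Accept` and `Accept-Language` values are assumptions about what a typical browser sends, and the site may still block the request.

```python
# Default user agent copied from the answer above.
DEFAULT_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)


def browser_headers(user_agent=DEFAULT_UA):
    """Return a header dict that resembles what a desktop browser sends.

    The Accept and Accept-Language values are illustrative assumptions.
    """
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


def fetch_html(url, timeout=10):
    """Fetch a page with browser-like headers and fail loudly on 4xx/5xx."""
    import requests  # third-party; imported here so the helper above has no dependencies

    response = requests.get(url, headers=browser_headers(), timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 403
    return response.text
```

Calling `fetch_html("https://www.oddschecker.com/")` then either returns the page source or raises an exception, instead of silently handing back the 403 page.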
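You can also use Selenium.
from selenium import webdriver
driver = webdriver.Chrome()  # assumes Chrome and a matching chromedriver are available
driver.get("https://www.oddschecker.com/")
print(driver.page_source)
driver.quit()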
For instance, I have a script that navigates to a website and fills in a form using Chrome.
If I convert this script into an executable and give it to someone, then (to my knowledge) they must already have an up-to-date Chrome installed. If they don't have Chrome and use Firefox instead, the script won't work.
To spare them from installing anything or doing any extra setup, can the executable run on its own, bundling everything it needs?
So far I've searched for how to do this and haven't found anything that suits what I need.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import time
options = Options()
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36')
#options.add_argument("--headless")
options.add_argument("--user-data-dir=C:\\Users\\User\\AppData\\Local\\Google\\Chrome\\User Data\\")
driver = webdriver.Chrome("C:\\Users\\User\\Downloads\\chromedriver_win32\\chromedriver.exe", options=options)
def login():
    url = "https://www.instagram.com/"
    driver.get(url)
    time.sleep(3)
    username = driver.find_element_by_name("username")
    username.send_keys("testusername")
    password = driver.find_element_by_name("password")
    password.send_keys("testpass")
    time.sleep(1)
    submit = driver.find_element_by_css_selector("button.sqdOP.L3NKy.y3zKF")
    submit.click()
login()
I am trying to request the HTML from the website shown below, but it raises the following error:
'Connection aborted.', OSError("(54, 'ECONNRESET')"
I have also tried passing the certificate, but that raises the following error:
Error: [('x509 certificate routines', 'X509_load_cert_crl_file', 'no certificate or crl found')]
The certificate is exported from Chrome.
Python Code:
import requests
from bs4 import BeautifulSoup
url ='https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
html=requests.get(url, verify=False)
#html=requests.get(url, verify="/Users/xxx/Documents/Python/Go Daddy Root Certificate Authority - G2.cer")
Can you try this?
I could not reproduce your exact environment, but when I requested the site from my PC it also failed at first. After I added a user-agent header, it worked fine.
I don't know whether it will work on your PC.
import requests
url ='https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
html = requests.get(url, headers=headers)
print(html.text)
Some webpages I encounter have links that are generated by JavaScript, and I can only access them with PhantomJS, as in the code below.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"
driverpjs = webdriver.PhantomJS("/Users/xx/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs",desired_capabilities=dcap)
with contextlib.closing(driverpjs) as browser:
    browser.get(link)
    links = browser.find_elements_by_xpath('.//a')
How do I do this with Chrome? Right now I am trying the code below:
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver", chrome_options=options)
with contextlib.closing(driver) as browser:
    browser.get(link)
    # GET ALL LINKS
    #links = browser.find_elements_by_css_selector("a")
    links = browser.find_elements_by_xpath('.//a')
To get all the links on a page with Chrome, emulating the PhantomJS approach with contextlib, you can use the following solution:
Code Block:
from contextlib import closing
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')
driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe', chrome_options=options)
with closing(driver) as browser:
    browser.get("https://www.google.com/")
    # get all the elements with name as q
    print(browser.find_elements_by_name('q'))
Console Output:
[<selenium.webdriver.remote.webelement.WebElement (session="ab581b3b679b521ffa5bf2220f801fcf", element="0.39081088826075705-1")>]
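The code above uses the Selenium 3 API (`executable_path`, `chrome_options`, `find_elements_by_*`), which newer Selenium 4 releases no longer accept. As a rough sketch of the same idea with the Selenium 4 style calls, the following collects the `href` of every anchor on a page. The function names `absolute_links` and `collect_links` are made up for illustration, and the sketch assumes Selenium 4 with a Chrome driver it can locate on its own.

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url, dropping empty values and duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield None or ""
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def collect_links(url, user_agent):
    """Open url in headless Chrome and return the links found on the rendered page."""
    # Third-party imports kept local so absolute_links works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(f"--user-agent={user_agent}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        hrefs = [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()  # ends the browser session even if the page load fails
    return absolute_links(hrefs, url)
```

`collect_links(link, user_agent)` would replace the `with closing(driver)` block. The sketch calls `driver.quit()` rather than relying on `closing`, since `closing` only calls `close()` on the driver.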
I have 100 bill of lading (BOL) numbers that I need to look up on the website below. However, I can't find a URL pattern that I could fill in automatically to run the searches in a loop. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests
url = 'https://www.msc.com/track-a-shipment'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}
form_data = {'first': 'true',
             'pn': '1',
             'kd': 'python'}
def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)
getJobs()
tl;dr: You'll likely need a browser automation tool like Selenium (optionally driving a headless browser) to go to the page, enter the code, and click the search button.
The URL to retrieve is generated by the JavaScript that runs when you click search.
The search button posts that link to their server, so when it redirects you to the link, the server knows what response to give you.
To generate the link yourself, you'd have to analyze the JavaScript and understand how it builds the code, post that code to their server, and then make a subsequent GET request to retrieve the results, as the site's ASP.NET framework does.
Alternatively, you can let Selenium go to the page, enter the code, and click the search button. After the browser navigates to the results, you can parse them from there.
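The Selenium route described above could be sketched roughly as follows. This is only an outline under stated assumptions: the three CSS selectors are hypothetical placeholders (the real ones have to be read from the page in the browser's developer tools), and the function names `track_all` and `make_selenium_lookup` are made up for illustration.

```python
import time


def track_all(codes, lookup, delay=1.0):
    """Call lookup(code) for every code and keep going when one of them fails.

    Returns (results, failures): results maps code -> lookup output,
    failures maps code -> error message.
    """
    results, failures = {}, {}
    for code in codes:
        try:
            results[code] = lookup(code)
        except Exception as exc:  # one bad code should not stop the batch
            failures[code] = str(exc)
        time.sleep(delay)  # small pause between searches
    return results, failures


def make_selenium_lookup(driver, timeout=15):
    """Build a lookup(code) function that searches one code through the browser."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # HYPOTHETICAL selectors: replace with the real ones from the page.
    search_input = "input#hypothetical-tracking-input"
    search_button = "button#hypothetical-search-button"
    result_area = "div.hypothetical-results"

    def lookup(code):
        driver.get("https://www.msc.com/track-a-shipment")
        wait = WebDriverWait(driver, timeout)
        box = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, search_input))
        )
        box.clear()
        box.send_keys(code)
        driver.find_element(By.CSS_SELECTOR, search_button).click()
        area = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, result_area))
        )
        return area.text  # raw text of the results; parse as needed

    return lookup
```

With a driver created as `driver = webdriver.Chrome()`, the batch would run as `results, failures = track_all(["MSCUZH129687", "MSCUJZ365758"], make_selenium_lookup(driver))`, followed by `driver.quit()`. Keeping the loop separate from the browser code lets failed codes be retried later without rerunning the whole list.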
HTTP Error 403: Forbidden is raised by either of the following two commands:
requests.get('http://www.allareacodes.com')
urllib.request.urlopen('http://www.allareacodes.com')
However, I am able to browse this website in Chrome and view its source. In addition, wget in my Cygwin installation can also fetch the HTML source.
Does anyone know how to fetch the source of this website using only Python packages?
You have errors in your code for requests. It should be:
import requests
r = requests.get('http://www.allareacodes.com')
print(r.text)
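In your case, however, the website appears to reject requests whose User-Agent does not look like a browser, which is a common way for servers to turn scripts away with a 403. As a solution, send a browser-like User-Agent header so that the website treats the request as coming from an actual user.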
Example:
import requests
r = requests.get('http://www.allareacodes.com', headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
print(r.text)
print(r.text)
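Since the question also tried `urllib.request.urlopen`, the same header can be set with the standard library alone. This is a minimal sketch: the helper names `build_request` and `fetch_source` are made up for illustration, and whether the site accepts the request still depends on its server-side checks.

```python
import urllib.request

# User agent copied from the requests example above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_source(url, timeout=10):
    """Download the page and decode it using the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

`print(fetch_source('http://www.allareacodes.com'))` would then play the role of the `requests` call above. Passing a bare URL string to `urlopen` sends urllib's default `Python-urllib/x.y` user agent, which is likely what the server was rejecting.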