Python requests with the same headers as Chrome but a different response (shows Cloudflare) - python-3.x

I'm trying to make a GET request with the requests library, but I don't know why I get a different response from this URL (udemy.com). Could the problem be the certificate, the cipher, or the protocol?
import requests

headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
}
req1 = requests.get('https://www.udemy.com/join/signup-popup/?displayType=ajax&display_type=popup&returnUrlAfterLogin=https&showSkipButton=1', headers=headers)
print(req1.text)
Output:
<Response [403]>

Cloudflare can detect pretty much any plain script, but if you use the browser itself with Selenium there should be no problem.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

browser = None
try:
    options = Options()
    options.headless = True  # run in the background without a visible window
    browser = webdriver.Firefox(options=options)  # requires geckodriver: https://github.com/mozilla/geckodriver/releases
    browser.get('https://www.udemy.com/join/signup-popup/?displayType=ajax&display_type=popup&returnUrlAfterLogin=https&showSkipButton=1')
    print(browser.find_element(By.ID, "auth-to-udemy-title").text)  # prints "Sign Up and Start Learning!"
finally:
    if browser:
        browser.quit()  # shut down the browser and driver to avoid leaking processes
The selenium package is required, and for Firefox the geckodriver binary as well.

The site likely does not want scripts to access it, and is probably using some sort of bot detection to block them. Trying to work around this would be unethical, and possibly illegal. The ethical way to proceed is to ask the site owner for permission to use your script, and have them give you some sort of bypass token for that purpose.

Related

Web Scraping from Oddschecker using BeautifulSoup

I was previously able to scrape data from https://www.oddschecker.com/ using BeautifulSoup; however, now all I am getting is the following 403 status:
import requests
import bs4
result = requests.get("https://www.oddschecker.com/")
result.text
Output:
<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n
I want to know if this is the same for all users on this website or if there is a way to navigate around this (via another web scraping package or other code) and access the actual data visible on the site.
Just add a user agent. The site detects that you're a bot partly because requests sends a non-browser User-Agent and doesn't run JavaScript.
import requests

url = 'https://www.oddschecker.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.text)
You can also use selenium.
from selenium import webdriver

driver = webdriver.Firefox()  # create a browser instance first
driver.get("https://www.oddschecker.com/")
print(driver.page_source)
driver.quit()

Python: Fail to request html from https site

I am trying to request the HTML data from the website shown below, but it raises the following error:
'Connection aborted.', OSError("(54, 'ECONNRESET')"
I have tried to add the certificate as well, but that raises another error:
Error: [('x509 certificate routines', 'X509_load_cert_crl_file', 'no certificate or crl found')]
The certificate is exported from Chrome.
Python Code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
html = requests.get(url, verify=False)
#html = requests.get(url, verify="/Users/xxx/Documents/Python/Go Daddy Root Certificate Authority - G2.cer")
Can you try this?
I couldn't reproduce your environment exactly, but when I tried to access the site from my PC it also failed; after adding a user-agent to the headers it worked fine.
I don't know whether it will work on your PC, though.
import requests

url = 'https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
html = requests.get(url, headers=headers)
print(html.text)
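As an aside on the certificate error: OpenSSL's "no certificate or crl found" message usually means the file is not PEM-encoded. Chrome typically exports certificates in binary DER format, while requests' verify parameter expects a PEM bundle. A minimal sketch, assuming the .cer file has first been converted to PEM (the file path below is a hypothetical placeholder):

import requests

url = 'https://www.openrice.com/zh/hongkong/restaurants/type/%E5%BF%AB%E9%A4%90%E5%BA%97?page=1'
# verify expects a PEM-encoded CA bundle; a binary DER .cer file fails to load.
# Convert first, e.g.: openssl x509 -inform der -in exported.cer -out exported.pem
html = requests.get(url, verify='/Users/xxx/Documents/Python/exported.pem')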

Python web data collecting

I have 100 BOL (bill of lading) numbers to search on the site below. However, I can't find a URL that I can automatically substitute the codes into to keep searching. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests

url = 'https://www.msc.com/track-a-shipment'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}
form_data = {'first': 'true',
             'pn': '1',
             'kd': 'python'}

def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)

getJobs()
tl;dr: You'll likely need to use a headless browser like Selenium to go to the page, input the code, and click the search button.
The URL to retrieve is generated by the JavaScript that runs when you click search.
The search button posts the code to their server, so when the site redirects you to the results link, the server knows what response to give you.
To auto-generate the link you'd have to analyze the JavaScript, understand how it builds the request, post the code to their server yourself, and then make a subsequent GET request to retrieve the results, just as the ASP.NET framework does.
Alternatively, you can use a headless browser like Selenium to go to the page, input the code, and click the search button, as in the sketch below. After the headless browser navigates to the results, you can parse them from there.
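A minimal sketch of that approach; the input selector below is a hypothetical placeholder, so inspect the page to find the real search element:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

codes = ['MSCUZH129687', 'MSCUJZ365758']

options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
try:
    for code in codes:
        driver.get('https://www.msc.com/track-a-shipment')
        box = driver.find_element(By.CSS_SELECTOR, 'input[type="search"]')  # hypothetical selector
        box.clear()
        box.send_keys(code)   # type the tracking code into the search field
        box.submit()          # submit the surrounding form, like clicking search
        print(driver.page_source)  # parse the results from here, e.g. with BeautifulSoup
finally:
    driver.quit()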

How to check if my Python proxy rotator and user agent spoofing work?

I have code for proxy IP rotation and user-agent spoofing to use in scraping. But because the code was provided as an example, I don't know if it really works when I add it to my code.
I am a beginner in Python. I just added it to my .py file (after the code that does the scraping). When I add it and start scraping, it runs and gets all the data, but I don't know whether the rotation is actually happening.
Do I have to create another file for these parts (user-agent spoofing and IP rotation)?
And how can I know whether they are working when I scrape?
Does it matter that they have their own hard-coded URLs?
Proxy Rotation:
from itertools import cycle
import requests

proxies = ['121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080']
proxy_pool = cycle(proxies)  # round-robin iterator over the proxy list
url = 'https://httpbin.org/ip'
for i in range(1, 11):
    proxy = next(proxy_pool)
    print("Request #%d" % i)
    try:
        # httpbin.org/ip echoes back the IP the request came from
        response = requests.get(url, proxies={"http": proxy, "https": proxy})
        print(response.json())
    except requests.exceptions.RequestException:
        print("Skipping. Connection error")
User Agent Spoofing:
import requests
import random

user_agent_list = [
    # Chrome
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer (Trident)
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
url = 'https://httpbin.org/user-agent'
# Let's make 5 requests and see what user agents are used
for i in range(1, 6):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    # Make the request
    response = requests.get(url, headers=headers)
    print("Request #%d\nUser-Agent Sent: %s\nUser-Agent Received by HTTPBin:" % (i, user_agent))
    print(response.content)
    print("-------------------\n\n")
If you want to check whether your proxy and user agent are rotating, go to a request bin website, create an endpoint, and use that endpoint in your Python code in place of the URL previously requested.
Then examine the request bin and read the User-Agent and IP address recorded for each GET request after executing your Python code, as sketched below.
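A minimal self-check along those lines, using httpbin's echo endpoints (which the snippets above already hit) instead of a request bin, and reusing the proxies list and user_agent_list defined above:

import random
from itertools import cycle
import requests

proxy_pool = cycle(proxies)  # the proxy list defined above
for i in range(1, 6):
    proxy = next(proxy_pool)
    headers = {'User-Agent': random.choice(user_agent_list)}
    try:
        # httpbin echoes back the origin IP and headers it saw, so the output
        # shows which proxy and user agent actually went out on each request.
        ip = requests.get('https://httpbin.org/ip',
                          proxies={'http': proxy, 'https': proxy},
                          headers=headers, timeout=10).json()['origin']
        ua = requests.get('https://httpbin.org/user-agent',
                          proxies={'http': proxy, 'https': proxy},
                          headers=headers, timeout=10).json()['user-agent']
        print("Request #%d: IP seen = %s, User-Agent seen = %s" % (i, ip, ua))
    except requests.exceptions.RequestException:
        print("Request #%d: proxy %s failed" % (i, proxy))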
I would suggest running a large number of requests and then visualizing the distribution of IPs you're getting. You can do this easily in your console with a for loop and a background curl command; see https://weautomate.org/articles/load-testing-ip-rotation-proxy/
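The same idea as a Python sketch, assuming the proxy_pool from the rotation snippet above: fire off many requests and tally which exit IP the server saw each time.

from collections import Counter
import requests

seen = Counter()
for _ in range(100):  # more requests give a clearer picture of the distribution
    proxy = next(proxy_pool)
    try:
        r = requests.get('https://httpbin.org/ip',
                         proxies={'http': proxy, 'https': proxy}, timeout=10)
        seen[r.json()['origin']] += 1  # tally the IP the server saw
    except requests.exceptions.RequestException:
        seen['failed'] += 1
print(seen.most_common())  # a healthy rotator spreads the counts across IPs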

Python 3: received 403 Forbidden error when using requests

HTTP Error 403: Forbidden is generated by either one of the following two commands:
requests.get('http://www.allareacodes.com')
urllib.request.urlopen('http://www.allareacodes.com')
However, I am able to browse this website in Chrome and view its source. Besides, wget in my Cygwin is also capable of grabbing the HTML source.
Does anyone know how to grab the source of this website using Python packages alone?
You have errors in your code for requests. It should be:
import requests
r = requests.get('http://www.allareacodes.com')
print(r.text)
In your case, however, the website appears to block requests that identify themselves as scripts, which stops them from getting the raw HTML data. As a workaround, simply fake your headers so that the website thinks you're an actual user.
Example:
import requests

r = requests.get('http://www.allareacodes.com', headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
print(r.text)
