Scraping ajax xmlhttprequest using python - python-3.x

I want to scrape the school name, address, phone, and email for UK schools from https://www.isc.co.uk/schools/ by calling the site's XMLHttpRequest endpoint directly, but my request returns error 500.
import requests

headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Content-Type": "application/json;charset=UTF-8",
    "Cookie": "_ga=GA1.3.1302518161.1584461820; _hjid=5f23e8d2-23c6-4c87-9cc0-ca216b587ae1; cookie_preference=false; iscFilterStates=%7B%22locationLatitude%22%3Anull%2C%22locationLongitude%22%3Anull%2C%22distanceInMiles%22%3A0%2C%22residencyTypes%22%3A%5B%5D%2C%22genderGroup%22%3Anull%2C%22ageRange%22%3Anull%2C%22religiousAffiliation%22%3Anull%2C%22financialAssistances%22%3A%5B%5D%2C%22examinations%22%3A%5B%5D%2C%22specialNeeds%22%3Afalse%2C%22scholarshipsAndBurseries%22%3Afalse%2C%22latitudeSW%22%3A47.823214345168694%2C%22longitudeSW%22%3A-18.049563984375%2C%22latitudeNE%22%3A59.385618287793505%2C%22longitudeNE%22%3A12.953853984375021%2C%22contactCountyID%22%3A0%2C%22contactCountryID%22%3A0%2C%22londonBoroughID%22%3A0%2C%22filterByBounds%22%3Atrue%2C%22savedBounds%22%3Atrue%2C%22zoom%22%3A5%2C%22center%22%3A%7B%22lat%22%3A54.00366%2C%22lng%22%3A-2.547855%7D%7D; _gid=GA1.3.1000954634.1584850972; _gat=1; __atuvc=11%7C12%2C4%7C13; __atuvs=5e773c3c593ef6aa000; __atssc=google%3B7",
    "Host": "www.isc.co.uk",
    "Origin": "https://www.isc.co.uk",
    "Referer": "https://www.isc.co.uk/schools/",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3927.0 Safari/537.36",
}
response = requests.post(
    "https://www.isc.co.uk/Umbraco/Api/FindSchoolApi/FindSchoolListResults?skip=20&take=20",
    headers=headers)
response.status_code

The website renders its data dynamically with JavaScript once the page loads.
The requests library cannot render JavaScript on the fly, so you could use selenium or requests_html; indeed, there are a lot of modules that can do that.
But we do have another option on the table: track where the data is rendered from. I was able to locate the XHR request that retrieves the data from the back-end API and renders it on the client side.
You can find the XHR request by opening Developer Tools, going to the Network tab, and inspecting the XHR/JS requests made, depending on the type of call (such as fetch).
import requests
import csv

# The same filter payload the site itself sends in its XHR call.
data = {'locationLatitude': None, 'locationLongitude': None, 'distanceInMiles': 0,
        'residencyTypes': [], 'genderGroup': None, 'ageRange': None,
        'religiousAffiliation': None, 'financialAssistances': [], 'examinations': [],
        'specialNeeds': False, 'scholarshipsAndBurseries': False,
        'latitudeSW': 47.823214345168694, 'longitudeSW': -18.049563984375,
        'latitudeNE': 59.385618287793505, 'longitudeNE': 12.953853984375021,
        'contactCountyID': 0, 'contactCountryID': 0, 'londonBoroughID': 0,
        'filterByBounds': True, 'savedBounds': True, 'zoom': 5,
        'center': {'lat': 54.00366, 'lng': -2.547855}}

# POST the payload as JSON; the endpoint answers with a JSON list of schools.
r = requests.post(
    "https://www.isc.co.uk/Umbraco/Api/FindSchoolApi/FindSchoolListResults?skip=0&take=20",
    json=data).json()

with open("data.csv", 'w', newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "Phone", "Email"])
    for item in r:
        writer.writerow(
            [item["Name"], item["FullAddress"], item["TelephoneNumber"], item["EmailAddress"]])
print("Done")
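Since the endpoint accepts skip and take query parameters, you can also page through the full result set instead of stopping at the first 20 rows. A minimal sketch, reusing the data payload from above and assuming (unverified) that the API returns an empty list once the results are exhausted:

import requests
import csv

all_rows = []
skip, take = 0, 20
while True:
    # Same filter payload as above; only the paging parameters change.
    r = requests.post(
        f"https://www.isc.co.uk/Umbraco/Api/FindSchoolApi/FindSchoolListResults?skip={skip}&take={take}",
        json=data).json()
    if not r:
        break  # assumed stop condition: an empty page means no more results
    all_rows.extend(r)
    skip += take

with open("data.csv", 'w', newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "Phone", "Email"])
    for item in all_rows:
        writer.writerow(
            [item["Name"], item["FullAddress"], item["TelephoneNumber"], item["EmailAddress"]])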

Related

How to bypass Cloudflare security using httpx request?

OK, I'm trying to get the HTML body of a site that sits behind Cloudflare security.
I wrote the following code:
def reqja3():
    """Get request"""
    import ssl, httpx
    ssl_ctx = ssl.SSLContext(protocol=ssl.PROTOCOL_TLSv1_2)
    ssl_ctx.set_alpn_protocols(["h2", "http/1.1"])
    ssl_ctx.set_ecdh_curve("prime256v1")
    ssl_ctx.set_ciphers(
        "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:"
        "TLS_AES_128_GCM_SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:"
        "ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:"
        "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:"
        "DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:"
        "ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:"
        "ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:"
        "DHE-RSA-AES256-SHA256:ECDHE-ECDSA-AES128-SHA256:"
        "ECDHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA256:"
        "ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:"
        "DHE-RSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:"
        "ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:"
        "RSA-PSK-AES256-GCM-SHA384:DHE-PSK-AES256-GCM-SHA384:"
        "RSA-PSK-CHACHA20-POLY1305:DHE-PSK-CHACHA20-POLY1305:"
        "ECDHE-PSK-CHACHA20-POLY1305:AES256-GCM-SHA384:"
        "PSK-AES256-GCM-SHA384:PSK-CHACHA20-POLY1305:"
        "RSA-PSK-AES128-GCM-SHA256:DHE-PSK-AES128-GCM-SHA256:"
        "AES128-GCM-SHA256:PSK-AES128-GCM-SHA256:AES256-SHA256:"
        "AES128-SHA256:ECDHE-PSK-AES256-CBC-SHA384:"
        "ECDHE-PSK-AES256-CBC-SHA:SRP-RSA-AES-256-CBC-SHA:"
        "SRP-AES-256-CBC-SHA:RSA-PSK-AES256-CBC-SHA384:"
        "DHE-PSK-AES256-CBC-SHA384:RSA-PSK-AES256-CBC-SHA:"
        "DHE-PSK-AES256-CBC-SHA:AES256-SHA:PSK-AES256-CBC-SHA384:"
        "PSK-AES256-CBC-SHA:ECDHE-PSK-AES128-CBC-SHA256:ECDHE-PSK-AES128-CBC-SHA:"
        "SRP-RSA-AES-128-CBC-SHA:SRP-AES-128-CBC-SHA:RSA-PSK-AES128-CBC-SHA256:"
        "DHE-PSK-AES128-CBC-SHA256:RSA-PSK-AES128-CBC-SHA:"
        "DHE-PSK-AES128-CBC-SHA:AES128-SHA:PSK-AES128-CBC-SHA256:PSK-AES128-CBC-SHA"
    )
    client = httpx.Client(http2=True, verify=ssl_ctx)
    print(
        client.get(
            "https://betway.com/en/sports",
            headers={
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "en-US,en;q=0.5",
                "Connection": "keep-alive",
                "Host": "betway.com",
                "Upgrade-Insecure-Requests": "1",
                "Sec-Fetch-Dest": "document",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Site": "none",
                "Sec-Fetch-User": "?1",
                "TE": "trailers",
                "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
                "Cookie": "bw_BrowserId=76501619282281851930349603208014741540; _ga=GA1.2.1003986219.1650841035; _fbp=fb.1.1650841035798.971215073; COOKIE_POLICY_ACCEPTED=true; TrackingVisitId=67e26f62-e357-443d-be0c-83223d7ab902; hash=67e26f62-e357-443d-be0c-83223d7ab902; bw_SessionId=47d27eaa-623f-4434-a03b-9716d4b829a0; StaticResourcesVersion=12.43.0.1; ssc_btag=67e26f62-e357-443d-be0c-83223d7ab902; SpinSportVisitId=d369d188-39c6-41c6-8058-5aa297dd50c0; userLanguage=en; TimezoneOffset=120; _gid=GA1.2.381640013.1652975492; _gat_UA-1515961-1=1; ens_firstPageView=true; _gat=1; AMCVS_74756B615BE2FD4A0A495EB8%40AdobeOrg=1",
            },
        ).text
    )

reqja3()
Cloudflare can sometimes be bypassed with the "right" request, so you don't need to execute JavaScript.
The main thing is to make the request look like the one a browser makes.
I set the SSL parameters, the TLS protocol, and HTTP/2, and it was working until today.
Now I'm trying to understand what I am doing wrong.
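If you want to check what was actually negotiated while debugging this kind of fingerprint-based blocking, httpx exposes the protocol on the response object. A small sketch, reusing the client built above (the interpretation of the 403 is an assumption, not something the thread confirms):

# Reusing the `client` from the snippet above.
resp = client.get("https://betway.com/en/sports")
print(resp.http_version)  # "HTTP/2" if the h2 ALPN offer was accepted
print(resp.status_code)   # a 403 here usually suggests the TLS/HTTP fingerprint was flagged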

return results via express with lighthouse npm library

https://googlechrome.github.io/lighthouse/viewer/?psiurl=https%3A%2F%2Fwww.zfcakademi.com%2F&strategy=mobile&category=performance&category=accessibility&category=best-practices&category=seo&category=pwa&utm_source=lh-chrome-ext&output=json
For the queries we make against the address
https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://run-fix.com/&strategy=mobile&utm_source=lh-chrome-ext&category=performance&category=accessibility&category=best-practices&category=seo&category=pwa
the results are returned to us through the API. However, when using this API a token is sent via the GitHub page, and access is granted by that token. To send a request without a token, you have to refresh the page so that the cookies created after it opens are available; otherwise it returns a 403 permission error.
axios.get(`https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://www.zfcakademi.com/&strategy=desktop&utm_source=lh-chrome-ext&category=performance&category=accessibility&category=best-practices&category=seo&category=pwa`, {
    "headers": {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-language": "tr,en-US;q=0.9,en;q=0.8,tr-TR;q=0.7,ru;q=0.6",
        "cache-control": "max-age=0",
        "sec-ch-dpr": "1.5",
        "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"101\", \"Google Chrome\";v=\"101\"",
        "sec-ch-ua-arch": "\"x86\"",
        "sec-ch-ua-bitness": "\"64\"",
        "sec-ch-ua-full-version": "\"101.0.4951.64\"",
        "sec-ch-ua-full-version-list": "\" Not A;Brand\";v=\"99.0.0.0\", \"Chromium\";v=\"101.0.4951.64\", \"Google Chrome\";v=\"101.0.4951.64\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-model": "\"\"",
        "sec-ch-ua-platform": "\"Windows\"",
        "sec-ch-ua-platform-version": "\"10.0.0\"",
        "sec-ch-ua-wow64": "?0",
        "sec-ch-viewport-width": "853",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "x-client-data": "CIS2yQEIorbJAQjEtskBCKmdygEI9eHKAQiTocsBCNvvywEInvnLAQjmhMwBCJmazAEI2qnMAQiJq8wBCOurzAEIwqzMARirqcoB"
    },
    "referrerPolicy": "origin",
    "body": null,
    "method": "GET",
    "mode": "cors",
    "credentials": "include"
})
.then(resp => {
    resolve(resp.data)
})
When I send a request this way, I get a 403 error and cannot access the API. They have set up a cookie system. The Lighthouse npm library's documentation only shows how to get data from the command line. I'm trying to build an API with Express, but I couldn't find any resource on using Lighthouse with Express; everyone uses it via the command line. Can you please help? How can I combine Express and Lighthouse? This SEO data is very important for my client.
https://www.npmjs.com/package/lighthouse?activeTab=readme
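A commonly suggested workaround (an assumption on my part; it is not confirmed in this thread) is to create an API key for the PageSpeed Insights API in the Google Cloud console and pass it as the key query parameter, which avoids the cookie/token handshake entirely. A minimal sketch in Python; the same query parameter should work the same way from axios:

import requests

# "YOUR_API_KEY" is a placeholder: create a real key in the Google Cloud console.
params = {
    "url": "https://www.zfcakademi.com/",
    "strategy": "desktop",
    "category": ["performance", "accessibility", "best-practices", "seo", "pwa"],
    "key": "YOUR_API_KEY",
}
resp = requests.get("https://www.googleapis.com/pagespeedonline/v5/runPagespeed", params=params)
resp.raise_for_status()
# The Lighthouse scores live under lighthouseResult -> categories.
print(resp.json()["lighthouseResult"]["categories"])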

Understanding Bearer Authorization for web scraping using python 3.8 and requests

So I am looking to scrape the following site:
https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
What I am running into with the Python Requests library is that the request requires an Authorization header bearing a token of some kind. While I can get this to work if I manually go to the page, copy and paste the token, and then run my program, I am wondering how I could bypass this step (after all, what is the point of running a scraper if I still have to visit the actual site manually and retrieve the authorization token?).
I am new to authorization/bearer headers and am hoping someone can clarify how the browser generates a token to retrieve this information and how I can simulate that. Here is my code:
import requests
import json
import datetime

today = datetime.datetime.today()
url = "https://hyland.csod.com/services/x/career-site/v1/search"
# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
headers = {
    'authority': 'hyland.csod.com',
    'origin': 'https://hyland.csod.com',
    'authorization': 'Bearer eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCIsImNsaWQiOiI0bDhnbnFhbGk3NjgifQ.eyJzdWIiOi0xMDMsImF1ZCI6IjRxNTFzeG5oY25yazRhNXB1eXZ1eGh6eCIsImNvcnAiOiJoeWxhbmQiLCJjdWlkIjoxLCJ0emlkIjoxNCwibmJkIjoiMjAxOTEyMzEyMTE0MTU5MzQiLCJleHAiOiIyMDE5MTIzMTIyMTUxNTkzNCIsImlhdCI6IjIwMTkxMjMxMjExNDE1OTM0In0.PlNdWXtb1uNoMuGIhI093ZbheRN_DwENTlkNoVr0j7Zah6JHd5cukudVFnZEiQmgBZ_nlDU4C-9JO_2We380Vg',
    'content-type': 'application/json',
    'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'csod-accept-language': 'en-US',
    'referer': 'https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland',
    'accept-encoding': 'gzip, deflate, br',
    'cookie': 'CYBERU_lastculture=en-US; ASP.NET_SessionId=4q51sxnhcnrk4a5puyvuxhzx; cscx=hyland^|-103^|1^|14^|KumB4VhzYXML22MnMxjtTB9SKgHiWW0tFg0HbHnOek4=; c-s=expires=1577909201~access=/clientimg/hyland/*^!/content/hyland/*~md5=78cd5252d2efff6eb77d2e6bf0ce3127',
}
data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
        '{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']

def hyland(url, data):
    dirty = requests.post(url, headers=headers, data=data).text
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
    clean = json.loads(dirty)
    cleaner = json.dumps(clean, indent=4)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    for i in range(0, 60):
        try:
            print(clean["data"]["requisitions"][i]["displayJobTitle"])
            print("")
            print("")
        except:
            print("{} Openings at Hyland".format(i))
            break

for datum in data:
    hyland(url, data=datum)
So basically what my code is doing is sending a post request to the url above along with the headers and necessary data to retrieve what I want. This scraper works for a short period of time, but if I leave and come back after a few hours it no longer works due to authorization (at least that is what I have concluded).
Any help/ clarification on how all this works would be greatly appreciated.
Your code has a few problems:

As you noted, you have to get the bearer token programmatically.
You have to send your requests through requests.session() (this webpage seems to pay attention to the cookies you send).
Optional: your headers contained a lot of unnecessary entries that could be removed.

All in all, here below is the working code:
import requests
import json
import datetime

today = datetime.datetime.today()
session = requests.session()

# First fetch the career-site page itself to obtain a fresh token.
url = "https://hyland.csod.com:443/ux/ats/careersite/4/home?c=hyland"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
raw = session.get(url, headers=headers).text
token = raw[raw.index("token")+8:]
token = token[:token.index("\"")]
bearer_token = f"Bearer {token}"

# Then hit the search API with that token.
url = "https://hyland.csod.com/services/x/career-site/v1/search"
# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0", "Authorization": bearer_token}
data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
        '{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']

def hyland(url, data, session=session):
    dirty = session.post(url, headers=headers, data=data).text
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
    clean = json.loads(dirty)
    cleaner = json.dumps(clean, indent=4)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    for i in range(0, 60):
        try:
            print(clean["data"]["requisitions"][i]["displayJobTitle"])
            print("")
            print("")
        except:
            print("{} Openings at Hyland".format(i))
            break

for datum in data:
    hyland(url, data=datum, session=session)
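One fragile spot above is the string slicing used to pull the token out of the page. A slightly more robust sketch using a regular expression (the exact JSON shape embedded in the page, "token":"<value>", is an assumption inferred from the slicing offsets):

import re

# Assumes the page embeds something like "token":"<value>" in a script tag.
match = re.search(r'"token"\s*:\s*"([^"]+)"', raw)
token = match.group(1) if match else None
bearer_token = f"Bearer {token}" if token else None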
Hope this helps.

Not able to scrape the data, and the links are not changing while clicking on pagination, using Python

I want to scrape the data from each block and move through the pages, but I'm not able to do that. Can someone help me crack this?
I tried to crawl the data using headers and form data, but failed.
Below is my code.
from bs4 import BeautifulSoup
import requests

url = 'http://www.msmemart.com/msme/listings/company-list/agriculture-product-stocks/1/585/Supplier'
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "unique_visitor=49.35.36.33; __utma=189548087.1864205671.1549441624.1553842230.1553856136.3; __utmc=189548087; __utmz=189548087.1553856136.3.3.utmcsr=nsic.co.in|utmccn=(referral)|utmcmd=referral|utmcct=/; csrf=d665df0941bbf3bce09d1ee4bd2b079e; ci_session=ac6adsb1eb2lcoogn58qsvbjkfa1skhv; __utmt=1; __utmb=189548087.26.10.1553856136",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json, text/javascript, */*; q=0.01",
}
data = {
    'catalog': 'Supplier',
    'category': '1',
    'subcategory': '585',
    'type': 'company-list',
    'csrf': '0db0757db9473e8e5169031b7164f2a4'
}
r = requests.get(url, data=data, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')  # was BeautifulSoup(html, ...): `html` was never defined
div = soup.find('div', {"id": "listings_result"})
# find_all() returns a list, so take each label's next sibling individually
for label in div.find_all('b', string='Products/Services:'):
    print(label.next_sibling)
getting "ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host" 2-3 times, i want to crawl all text details in a block.plz someone help me to found this.

I want to crawl a website with Python, but I've hit a problem: the requests library works, but Scrapy gets a 400. Code below

The requests version below works fine, but the Scrapy version gets a 400 response.
import requests

urls = "https://pan.baidu.com/s/1sj1JLJv"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    'Content-Length': '0',
    "Connection": "keep-alive"
}
print(str((requests.get(urls, headers=headers)).content, 'utf-8'))  # was headers=header: NameError

import scrapy
from scrapy_redis.spiders import RedisCrawlSpider

class baiduuSpider(RedisCrawlSpider):
    ...
    ...
    ...
    # inside the spider's request-generating method:
    urls = "https://pan.baidu.com/s/1sj1JLJv"
    yield scrapy.Request(url=urls, headers=headers, callback=self.first_parse)

    def first_parse(self, response):
        print(response.body.decode('utf-8'))
How do I fix this?
I'm sorry, but you won't succeed this way, because the page loads dynamically.
You need to execute the JavaScript on the fly: Selenium or Splash.
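For completeness, a minimal Selenium sketch (assuming chromedriver is on your PATH; this is illustrative, not a tested solution for this particular site):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Let a real browser execute the page's JavaScript, then read the rendered DOM.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://pan.baidu.com/s/1sj1JLJv")
    html = driver.page_source  # HTML after JavaScript has run
    print(html[:500])
finally:
    driver.quit()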
