return results via express with lighthouse npm library - node.js

https://googlechrome.github.io/lighthouse/viewer/?psiurl=https%3A%2F%2Fwww.zfcakademi.com%2F&strategy=mobile&category=performance&category=accessibility&category=best-practices&category=seo&category=pwa&utm_source=lh-chrome-ext&output=json
In inquiries we made over the address
https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://run-fix.com/&strategy=mobile&utm_source=lh-chrome-ext&category=performance&category=accessibility&category=best-practices&category=seo&category=pwa
The results are returned to us by using the API. However, when using this api, a token is sent via the github page. Access is provided by this token. In order to send a request without a token, you need to refresh the page with the help of cookies created after opening the page. Otherwise, it gives 403 access permissions error.
axios.get(`https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://www.zfcakademi.com/&strategy=desktop&utm_source=lh-chrome-ext&category=performance&category=accessibility&category=best-practices&category=seo&category=pwa`, {
"headers": {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-language": "tr,en-US;q=0.9,en;q=0.8,tr-TR;q=0.7,ru;q=0.6",
"cache-control": "max-age=0",
"sec-ch-dpr": "1.5",
"sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"101\", \"Google Chrome\";v=\"101\"",
"sec-ch-ua-arch": "\"x86\"",
"sec-ch-ua-bitness": "\"64\"",
"sec-ch-ua-full-version": "\"101.0.4951.64\"",
"sec-ch-ua-full-version-list": "\" Not A;Brand\";v=\"99.0.0.0\", \"Chromium\";v=\"101.0.4951.64\", \"Google Chrome\";v=\"101.0.4951.64\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-model": "\"\"",
"sec-ch-ua-platform": "\"Windows\"",
"sec-ch-ua-platform-version": "\"10.0.0\"",
"sec-ch-ua-wow64": "?0",
"sec-ch-viewport-width": "853",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"x-client-data": "CIS2yQEIorbJAQjEtskBCKmdygEI9eHKAQiTocsBCNvvywEInvnLAQjmhMwBCJmazAEI2qnMAQiJq8wBCOurzAEIwqzMARirqcoB"
},
"referrerPolicy": "origin",
"body": null,
"method": "GET",
"mode": "cors",
"credentials": "include"
})
.then(resp => {
resolve(resp.data)
})
When I send a request in this way, I get a 403 error as in the picture below and I cannot access it. They have set up a cookie system. In the Lighthouse npm library, only getting data with commands is shown. I'm trying to create an api with express, but I couldn't find the slightest resource how to use it with lighthouse express. Everyone has used it with the help of command line. Can you please help? How can I combine express and lighthouse? For my client, this seo data is very important.
https://www.npmjs.com/package/lighthouse?activeTab=readme

Related

How to bypass Cloudflare security using httpx request?

Ok, I`m trying to get html body from one site, with CloudFlare security.
I wrote the following code:
def reqja3():
"""Get request"""
import ssl, httpx
ssl_ctx = ssl.SSLContext(protocol=ssl.PROTOCOL_TLSv1_2)
ssl_ctx.set_alpn_protocols(["h2", "http/1.1"])
ssl_ctx.set_ecdh_curve("prime256v1")
ssl_ctx.set_ciphers(
"TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:"
"TLS_AES_128_GCM_SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:"
"ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:"
"ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:"
"DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:"
"ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:"
"ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:"
"DHE-RSA-AES256-SHA256:ECDHE-ECDSA-AES128-SHA256:"
"ECDHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA256:"
"ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:"
"DHE-RSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:"
"ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:"
"RSA-PSK-AES256-GCM-SHA384:DHE-PSK-AES256-GCM-SHA384:"
"RSA-PSK-CHACHA20-POLY1305:DHE-PSK-CHACHA20-POLY1305:"
"ECDHE-PSK-CHACHA20-POLY1305:AES256-GCM-SHA384:"
"PSK-AES256-GCM-SHA384:PSK-CHACHA20-POLY1305:"
"RSA-PSK-AES128-GCM-SHA256:DHE-PSK-AES128-GCM-SHA256:"
"AES128-GCM-SHA256:PSK-AES128-GCM-SHA256:AES256-SHA256:"
"AES128-SHA256:ECDHE-PSK-AES256-CBC-SHA384:"
"ECDHE-PSK-AES256-CBC-SHA:SRP-RSA-AES-256-CBC-SHA:"
"SRP-AES-256-CBC-SHA:RSA-PSK-AES256-CBC-SHA384:"
"DHE-PSK-AES256-CBC-SHA384:RSA-PSK-AES256-CBC-SHA:"
"DHE-PSK-AES256-CBC-SHA:AES256-SHA:PSK-AES256-CBC-SHA384:"
"PSK-AES256-CBC-SHA:ECDHE-PSK-AES128-CBC-SHA256:ECDHE-PSK-AES128-CBC-SHA:"
"SRP-RSA-AES-128-CBC-SHA:SRP-AES-128-CBC-SHA:RSA-PSK-AES128-CBC-SHA256:"
"DHE-PSK-AES128-CBC-SHA256:RSA-PSK-AES128-CBC-SHA:"
"DHE-PSK-AES128-CBC-SHA:AES128-SHA:PSK-AES128-CBC-SHA256:PSK-AES128-CBC-SHA"
)
client = httpx.Client(http2=True, verify=ssl_ctx)
print(
client.get(
"https://betway.com/en/sports",
headers={
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Host": "betway.com",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"TE": "trailers",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
"Cookie": "bw_BrowserId=76501619282281851930349603208014741540; _ga=GA1.2.1003986219.1650841035; _fbp=fb.1.1650841035798.971215073; COOKIE_POLICY_ACCEPTED=true; TrackingVisitId=67e26f62-e357-443d-be0c-83223d7ab902; hash=67e26f62-e357-443d-be0c-83223d7ab902; bw_SessionId=47d27eaa-623f-4434-a03b-9716d4b829a0; StaticResourcesVersion=12.43.0.1; ssc_btag=67e26f62-e357-443d-be0c-83223d7ab902; SpinSportVisitId=d369d188-39c6-41c6-8058-5aa297dd50c0; userLanguage=en; TimezoneOffset=120; _gid=GA1.2.381640013.1652975492; _gat_UA-1515961-1=1; ens_firstPageView=true; _gat=1; AMCVS_74756B615BE2FD4A0A495EB8%40AdobeOrg=1",
},
).text
)
reqja3()
Cloudflare can bypass with "right request" so, u don't need to use JS.
The main thing, it's to make request like browser do.
I set SSL parameters, TLS protocol and http2 ver. and it was working until today.
Now I'm trying to understand what I do wrong.

XMLHttpRequest from VBA returns error 406 "Not Accepted"

I'm trying to do some scraping using Excel VBA, specifically, to get a JSON from a server.
I'm using exactly the same request headers that show on the browsers dev tools, but when I do this request from Excel I always get a 406 ("Not accepted") answer from the server.
What am I missing?
Sub getJSON()
Dim XHR As New MSXML2.ServerXMLHTTP60
XHR.Open "GET", "https://www.magazineluiza.com.br/produto/calculo-frete/08226021/615254900/frigelar.json", False
XHR.setRequestHeader "Host", "www.magazineluiza.com.br"
XHR.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
XHR.setRequestHeader "Accept", "application/json, text/javascript, */*; q=0.01"
XHR.setRequestHeader "Accept-Language", "pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3"
XHR.setRequestHeader "Accept-Encoding", "gzip, deflate, br"
XHR.setRequestHeader "X-Requested-With", "XMLHttpRequest"
XHR.setRequestHeader "Connection", "keep-alive"
XHR.setRequestHeader "Referer", "https://www.magazineluiza.com.br/refrigerador-expositor-para-bebidas-metalfrio-com-controlador-eletronico-497-litros-vb52r-220v/p/6152549/pi/expb/"
XHR.setRequestHeader "TE", "Trailers"
XHR.send
Debug.Print XHR.responseText
End Sub
This is the answer I get on the browser:
{"address":{"address":"18 DE ABRIL","area":"CIDADE ANTONIO ESTEVAO DE CARVALHO","city":"SAO PAULO","state":"SP","zip_code":"08226021"},"delivery":[{"description":"Entrega padrĂ£o","distribution_center":0,"is_deadline":true,"price":"263.91","time":13,"zip_code_restriction":false}],"is_delivery_unavailable":false,"zip_code":"08226021"}
Edit: This JSON is retrieved by the browser when I add the zip code "08226021" on the right side of this page and click "Ok".
Try this to get the required response:
Sub FetchJsonResponse()
Const Url$ = "https://www.magazineluiza.com.br/produto/calculo-frete/08226021/615254900/frigelar.json"
With CreateObject("WinHttp.WinHttpRequest.5.1")
.Open "GET", Url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
.setRequestHeader "Referer", "https://www.magazineluiza.com.br/refrigerador-expositor-para-bebidas-metalfrio-com-controlador-eletronico-497-litros-vb52r-220v/p/6152549/pi/expb/"
.setRequestHeader "x-requested-with", "XMLHttpRequest"
.send
Debug.Print .responseText
End With
End Sub
This is the output it produces:
{"address": {"address": "18 DE ABRIL", "area": "CIDADE ANTONIO ESTEVAO DE CARVALHO", "city": "SAO PAULO", "state": "SP", "zip_code": "08226021"}, "delivery": [{"description": "Entrega padr\u00e3o", "distribution_center": 0, "is_deadline": true, "price": "263.91", "time": 13, "zip_code_restriction": false}], "is_delivery_unavailable": false, "zip_code": "08226021"}

Scraping ajax xmlhttprequest using python

I want to scrape school name, address, phone, email in UK from https://www.isc.co.uk/schools/ this website using xmlhttpeequest. It returns error 500.
import requests
headers = {"Accept": "application/json, text/plain, */*","Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Type": "application/json;charset=UTF-8",
"Cookie": "_ga=GA1.3.1302518161.1584461820; _hjid=5f23e8d2-23c6-4c87-9cc0-ca216b587ae1; cookie_preference=false; iscFilterStates=%7B%22locationLatitude%22%3Anull%2C%22locationLongitude%22%3Anull%2C%22distanceInMiles%22%3A0%2C%22residencyTypes%22%3A%5B%5D%2C%22genderGroup%22%3Anull%2C%22ageRange%22%3Anull%2C%22religiousAffiliation%22%3Anull%2C%22financialAssistances%22%3A%5B%5D%2C%22examinations%22%3A%5B%5D%2C%22specialNeeds%22%3Afalse%2C%22scholarshipsAndBurseries%22%3Afalse%2C%22latitudeSW%22%3A47.823214345168694%2C%22longitudeSW%22%3A-18.049563984375%2C%22latitudeNE%22%3A59.385618287793505%2C%22longitudeNE%22%3A12.953853984375021%2C%22contactCountyID%22%3A0%2C%22contactCountryID%22%3A0%2C%22londonBoroughID%22%3A0%2C%22filterByBounds%22%3Atrue%2C%22savedBounds%22%3Atrue%2C%22zoom%22%3A5%2C%22center%22%3A%7B%22lat%22%3A54.00366%2C%22lng%22%3A-2.547855%7D%7D; _gid=GA1.3.1000954634.1584850972; _gat=1; __atuvc=11%7C12%2C4%7C13; __atuvs=5e773c3c593ef6aa000; __atssc=google%3B7",
"Host": "www.isc.co.uk",
"Origin": "https://www.isc.co.uk",
"Referer": "https://www.isc.co.uk/schools/",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3927.0 Safari/537.36"}
response = requests.post('https://www.isc.co.uk/Umbraco/Api/FindSchoolApi/FindSchoolListResults?skip=20&take=20', headers = headers)
response.status_code
The website is loaded with JavaScript event which render it's data dynamically once the page loads.
requests library will not be able to render JavaScript on the fly. so you can use selenium or requests_html. and indeed there's a lot of modules which can do that.
Now, we do have another option on the table, to track from where the data is rendered. I were able to locate the XHR request which is used to retrieve the data from the back-end API and render it to the users side.
You can get the XHR request by open Developer-Tools and check Network and check XHR/JS requests made depending of the type of call such as fetch
import requests
import csv
data = {'locationLatitude': None, 'locationLongitude': None, 'distanceInMiles':
0, 'residencyTypes': [], 'genderGroup': None, 'ageRange': None, 'religiousAffiliation': None, 'financialAssistances': [], 'examinations': [], 'specialNeeds': False, 'scholarshipsAndBurseries': False, 'latitudeSW': 47.823214345168694, 'longitudeSW': -18.049563984375, 'latitudeNE': 59.385618287793505, 'longitudeNE': 12.953853984375021, 'contactCountyID': 0, 'contactCountryID': 0, 'londonBoroughID': 0, 'filterByBounds': True, 'savedBounds': True, 'zoom': 5, 'center': {'lat': 54.00366, 'lng': -2.547855}}
r = requests.post(
"https://www.isc.co.uk/Umbraco/Api/FindSchoolApi/FindSchoolListResults?skip=0&take=20", json=data).json()
with open("data.csv", 'w', newline="") as f:
writer = csv.writer(f)
writer.writerow(["Name", "Address", "Phone", "Email"])
for item in r:
writer.writerow(
[item["Name"], item["FullAddress"], item["TelephoneNumber"], item["EmailAddress"]])
print("Done")
Output : View-Online

Problems with session in requests on python

Im trying to login to Plus500 with in python. Everething is ok, with status code 200, and response from server. But the server will not accept my requuest.
I did every single step that the webBrowser thoes. Headers like web browser. Always the same result.
url = "https://trade.plus500.com/AppInitiatedImm/WebTrader2/?webvisitid=" + self.tokesession+ "&page=login&isInTradeContext=false"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
"Host": "trade.plus500.com",
"Connection": "keep-alive",
"Referer": "https://app.plus500.com/trade",
"Cookie":"webvisitid=" + self.cookiesession + ";"+\
"IP="+self.hasip}
param = "ClientType=WebTrader2&machineID=33b5db48501c9b0e5552ea135722b2c6&PrimaryMachineId=33b5db48501c9b0e5552ea135722b2c6&hl=en&cl=en-GB&AppVersion=87858&refurl=https%3A%2F%2Fwww.plus500.co.uk%2F&SessionID=0&SubSessionID=0"
response = self.session.request(method="POST",
url=url,
params=param,
headers=header,stream=True)
The code above is the initialization of the web app. after that do the login. But it always come up with JSON reply : AppSessionRquired. I think I already try everinthing that i can think of. If some one as idea.

not able to scrape the data and even the links are not changing while clciking on pagination using python

want to scrape data from of each block and want to change the pages, but not able to do that,help me someone to crack this.
i tried to crawl data using headers and form data , but fail to do that.
below is my code.
from bs4 import BeautifulSoup
import requests
url='http://www.msmemart.com/msme/listings/company-list/agriculture-product-stocks/1/585/Supplier'
headers={
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie": "unique_visitor=49.35.36.33; __utma=189548087.1864205671.1549441624.1553842230.1553856136.3; __utmc=189548087; __utmz=189548087.1553856136.3.3.utmcsr=nsic.co.in|utmccn=(referral)|utmcmd=referral|utmcct=/; csrf=d665df0941bbf3bce09d1ee4bd2b079e; ci_session=ac6adsb1eb2lcoogn58qsvbjkfa1skhv; __utmt=1; __utmb=189548087.26.10.1553856136",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"Accept": "application/json, text/javascript, */*; q=0.01",
}
data ={
'catalog': 'Supplier',
'category':'1',
'subcategory':'585',
'type': 'company-list',
'csrf': '0db0757db9473e8e5169031b7164f2a4'
}
r = requests.get(url,data=data,headers=headers)
soup = BeautifulSoup(html,'html.parser')
div = soup.find('div',{"id":"listings_result"})
for prod in div.find_all('b',string='Products/Services:').next_sibling:
print(prod)
getting "ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host" 2-3 times, i want to crawl all text details in a block.plz someone help me to found this.

Resources