I've been trying to extract some data from a website, but the only way I'm able to get something useful is through PowerShell.
The script I'm running from PowerShell is:
Invoke-WebRequest -Uri "https://www.pelispedia.tv/api/iframes.php?id=18471?nocache" -Headers @{"method"="GET"; "authority"="www.pelispedia.tv"; "scheme"="https"; "path"="/api/iframes.php?id=18471?nocache"; "upgrade-insecure-requests"="1"; "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"; "accept"="text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"; "referer"="https://www.pelispedia.tv/pelicula/el-nino-que-domo-el-viento/"; "accept-encoding"="gzip, deflate, br"; "accept-language"="es,en;q=0.9"} | Select-Object -Expand Content
I got it from Chrome's Network tab inside the DevTools while watching this site load: https://www.pelispedia.tv/pelicula/el-nino-que-domo-el-viento/
DevTools screenshot - it also includes the cURL and fetch versions of the request
The response is a full HTML page, which I want to use later.
The fetch script is:
fetch("https://www.pelispedia.tv/api/iframes.php?id=18471?nocache", {
"credentials": "include",
"headers": {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "es,en;q=0.9",
"upgrade-insecure-requests": "1"
},
"referrer": "https://www.pelispedia.tv/pelicula/el-nino-que-domo-el-viento/",
"referrerPolicy": "no-referrer-when-downgrade",
"body": null,
"method": "GET",
"mode": "cors"
})
.then(res => res.text())
.then(body => console.log(body));
I tried using multiple Node.js packages like node-fetch, axios and request to get the same result as in PowerShell, but I simply get an HTML page containing the line "ERROR".
This approach does not work in Node.js, but if I run it from within Chrome's console while I'm on the site, it works.
I would like to know what PowerShell is doing to get the correct response and how to recreate it in Node or any other language/runtime (Java, Python, PHP...).
Using fetch from Chrome DevTools and using fetch from Node (or PowerShell) are completely different things.
fetch from Chrome DevTools has all the headers, cookies and other things the browser normally attaches to a request, so as far as the website's server is concerned it is essentially your browser making the request.
But in the case of PowerShell or a Node.js request/fetch, all of those headers, the referer and many other things are missing by default, so the server rejects the request, considering you a bot.
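One way to recreate this in Node is to send the same headers the browser (and the PowerShell command above) sends. A minimal sketch with node-fetch, reusing the headers from the question (whether this particular site also checks cookies or anything else is not guaranteed):
const fetch = require('node-fetch');

const url = 'https://www.pelispedia.tv/api/iframes.php?id=18471?nocache';

fetch(url, {
    method: 'GET',
    headers: {
        // Same headers the browser / PowerShell request sends
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-language': 'es,en;q=0.9',
        'referer': 'https://www.pelispedia.tv/pelicula/el-nino-que-domo-el-viento/',
        'upgrade-insecure-requests': '1'
        // If the server also checks a session cookie, copy it from DevTools and add it here,
        // e.g. 'cookie': 'name=value' (name and value here are placeholders)
    }
})
    .then(res => res.text())
    .then(body => console.log(body))
    .catch(err => console.error(err));
If the site still answers "ERROR" with all of these headers in place, the check is probably on something harder to replicate from Node (a cookie set by an earlier page load, TLS fingerprinting, etc.).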
Related
I'm using node-fetch and https-proxy-agent to make a request through a proxy; however, I get a 400 error code from the site I'm scraping only when I send the agent. Without it, everything works fine.
import fetch from 'node-fetch';
import Proxy from 'https-proxy-agent';
const ip = PROXIES[Math.floor(Math.random() * PROXIES.length)]; // PROXIES is a list of ips
const proxyAgent = Proxy(`http://${ip}`);
fetch(url, {
    agent: proxyAgent,
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36'
    }
}).then(res => res.text()).then(console.log)
This results in a 400 error code from the site.
I have absolutely no idea why this is happening. If you want to reproduce the issue, I'm scraping https://azlyrics.com. Please let me know what is wrong.
The issue has been fixed. I did not notice I was making a request to an HTTPS site through an HTTP-only proxy. The site was using the HTTPS protocol but the proxies were HTTP only. Switching to HTTPS proxies works. Thank you.
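In code, the only change is the scheme of the proxy URL handed to https-proxy-agent; a minimal sketch of the corrected setup (the proxy IPs below are placeholders and must point at proxies that actually support HTTPS):
import fetch from 'node-fetch';
import Proxy from 'https-proxy-agent';

const PROXIES = ['11.22.33.44:8080', '55.66.77.88:3128']; // placeholder list of HTTPS-capable proxies
const url = 'https://azlyrics.com'; // the HTTPS site from the question

const ip = PROXIES[Math.floor(Math.random() * PROXIES.length)];
const proxyAgent = Proxy(`https://${ip}`); // was `http://${ip}` - the scheme must match what the proxies support

fetch(url, {
    agent: proxyAgent,
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36'
    }
}).then(res => res.text()).then(console.log);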
I am trying to make a web app that notifies me when new vaccine slots appear on the government portal, using the provided public APIs.
What I need is to call the API every minute and check if new slots have been added to the database. But the response I am getting is stale: the new sessions detected by my app (and also in Chrome) were about 5 minutes old. I know this because some Telegram channels show the update earlier than my app does.
Also, when I hit the same API with Postman, the response I get is fresh.
The issue is that the response in Chrome/my app does not reflect the updated database, while Postman shows the updated data; Chrome gets the updated response about 5 minutes after it shows up in Postman.
Public API: https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/calendarByDistrict?district_id=141&date=06-07-2021
let response = await fetch(`https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/calendarByDistrict?district_id=${id}&date=${today}`, {
    method: 'GET',
    headers: {
        'Content-Type': 'application/json',
        'Connection': 'keep-alive',
    },
})
Do I need to change some headers or anything else in my GET requests?
Please help me fix it.
So, a couple of things.
First, use the Find by District API instead of the Calendar by District API. That's more accurate.
https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/findByDistrict?district_id=512&date=31-03-2021
Second, pass a user agent. This example is in PHP, but you can always port it to another language.
$header = array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Pragma: no-cache",
    "Cache-Control: no-cache",
    "Accept-Language: en-us",
    "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Upgrade-Insecure-Requests: 1"
);
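Roughly the same headers in the fetch call from the question would look like this (a sketch only; it assumes the same id and today variables and runs inside an async function, and note that a browser will not let fetch override User-Agent, so that header only takes effect for server-side requests such as node-fetch):
let response = await fetch(`https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/findByDistrict?district_id=${id}&date=${today}`, {
    method: 'GET',
    headers: {
        // Mirror the PHP header set above
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'Accept-Language': 'en-us',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    },
})
let data = await response.json()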
I am using axios and an API (the CoWIN API, https://apisetu.gov.in/public/marketplace/api/cowin/cowin-public-v2) which has some kind of strong protection against automated web requests.
When I was getting a 403 error on my dev machine (Windows), I solved it by just adding a 'User-Agent' header.
But after deploying it to Heroku I am still getting the same error.
const { data } = await axios.get(url, {
    headers: {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    },
})
Using a fake user-agent in your headers can help with this problem, but there are other variables you may want to consider.
For example, if you are making multiple HTTP requests you may want to have several fake user-agents and then randomize the user-agent for every request made. This can help limit the chances of your scraper being detected.
If that still doesn't work, you may want to consider optimizing your headers further. Other than sending HTTP requests with a randomized user-agent, you can further imitate a browser's request headers by adding more headers than just the user-agent, and then ensuring that the chosen user-agent is consistent with the information sent in the rest of the headers.
You can check out this site for more information: it not only explains how to keep your headers consistent with the user-agent, but also provides more solutions in case the above was still unsuccessful.
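For example, a minimal sketch of that idea with axios (the user-agent pool below is a made-up placeholder; in practice you would keep a larger list and matching accept/language headers for each entry):
const axios = require('axios');

// Hypothetical pool of user-agent strings to rotate through
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
];

async function getWithRandomUA(url) {
    // Pick a random user-agent for every request
    const userAgent = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
    const { data } = await axios.get(url, {
        headers: {
            'User-Agent': userAgent,
            // Extra browser-like headers; keep them consistent with the chosen user-agent
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Upgrade-Insecure-Requests': '1',
        },
    });
    return data;
}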
In my situation, I had to get past Cloudflare. You can determine whether this applies to you as well by logging the error to the terminal and checking whether the "server" key says "cloudflare"; in that case you can use Cloudflare-specific documentation for further assistance.
I'm trying to build a web scraping service that gets some data for use in my application. I'm using Node.js with Axios. The problem is that I'm having particular difficulty with one web service and I'm wondering if I'm doing anything wrong: the web service never seems to return any data to my application and the request just hangs.
I've used axios-curlirize to debug the issue by getting the cURL command and running that same command both in the terminal and in Postman, and in both cases the request returns almost instantly. My request is not in a loop, so I don't think I'm getting hit with anti-DDoS protection, and the exact same method works fine with the other APIs I've tried.
Does anyone have any idea what might be happening? Here's my code snippet, although it's fairly standard:
return await axios.get(url, { headers: headers })
    .then(() => {
        console.log("done")
    })
    .catch(err => failed(err));
Neither the console.log nor the failed function ever gets called; I've put breakpoints there to check. The contents of the failed function are not relevant because that function is never called. I've tried Node 14.7.0 and Node 12.18.3 LTS and it didn't work on either version.
Here's the cURL request from Axios-Curlirize:
curl -X GET -H "accept-language:en-CA,en-GB;q=0.9,en-US;q=0.8,en;q=0.7" -H "dnt:1" -H "referer:https://www.google.com" -H "upgrade-insecure-requests:1" -H "user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36" -H "cookie-control:no-cache" -H "cache-control:max-age=0" -H "pragma:no-cache" "https://www.famousfootwear.com/stores/product/LocateCartAndNewVariantNearby?zipCode=98662&radius=25&variantToAdd=70094-110-07"
This cURL command works fine in terminal and in Postman but not in my application.
Any help is appreciated!
Figured it out. The remote website only supports HTTP/2 and Axios only supports HTTP/1.1. Guess I've got to find a solution other than Axios.
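For anyone hitting the same wall: one option is Node's built-in http2 module. A rough sketch against the URL from the question (error handling omitted):
const http2 = require('http2');

// Open an HTTP/2 session to the origin, then issue the GET on a stream
const client = http2.connect('https://www.famousfootwear.com');

const req = client.request({
    ':method': 'GET',
    ':path': '/stores/product/LocateCartAndNewVariantNearby?zipCode=98662&radius=25&variantToAdd=70094-110-07',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'accept-language': 'en-CA,en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
    'referer': 'https://www.google.com'
});

let body = '';
req.setEncoding('utf8');
req.on('data', chunk => { body += chunk; });
req.on('end', () => {
    console.log(body);
    client.close();
});
req.end();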
In Node.js (using the Hapi framework) I'm creating a link for the user to allow my app to read their user account. Google handles that request and asks about granting permissions. Then Google redirects back to my server with a GET parameter containing the response code, and here I have an issue.
Google Chrome isn't sending the cookie with the session ID.
If I mark that cookie as a session cookie in a cookie-editing extension, it is sent. The behavior is the same in PHP, but PHP marks the cookie as a session cookie when creating the session, so it isn't a problem there. I'm using the plugin hapi-auth-cookie; it creates the session and handles everything about it. I also mark that cookie as non-HttpOnly in the hapi-auth-cookie settings, because that was the first difference I noticed when inspecting the PHP session cookie and mine in Node.js. I get a 401 "missing authentication" response on each redirect. If I place the cursor in the address bar and hit Enter, everything works fine, so it is an issue with the redirect.
My question is basically: what may be causing this behavior? I should also mention that Firefox sends the cookie with each request without any issues.
Headers after redirect (no cookie with session):
{
    "host": "localhost:3000",
    "connection": "keep-alive",
    "cache-control": "max-age=0",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "x-client-data": "CJS2eQHIprbJAQjEtskECKmdygE=",
    "x-chrome-connected": "id=110052060380026604986,mode=0,enable_account_consistency=false",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4"
}
Headers after hitting enter in adress bar (what will work fine):
{
    "host": "localhost:3000",
    "connection": "keep-alive",
    "cache-control": "max-age=0",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4",
    "cookie": "SESSID=very_long_string"
}
Cookies marked SameSite=Strict are not sent by the browser if the referrer is a different site, which is what happens when the request is a redirect from a different site. Using Lax will get around this issue, or you can make your site deal with not being able to access strict cookies on the first request.
I came across this issue recently and wrote more detail on strict cookies, referrers and redirects.
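To make that concrete, here is a rough sketch of the Set-Cookie header involved, using Node's built-in http module (the cookie name and value are placeholders, not hapi-auth-cookie defaults):
const http = require('http');

http.createServer((req, res) => {
    // With SameSite=Strict the browser will NOT attach this cookie to a request that
    // arrives as a redirect from another site (e.g. coming back from Google's OAuth page).
    // With SameSite=Lax it IS attached to top-level GET navigations, including such redirects.
    res.setHeader('Set-Cookie', 'SESSID=placeholder_value; HttpOnly; Path=/; SameSite=Lax');
    res.end('ok');
}).listen(3000);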
This issue is caused by hapi-auth-cookie not yet dealing with isSameSite (a newer Hapi feature). We can set it manually, e.g.:
const server = new Hapi.Server({
    connections: {
        state: {
            isSameSite: 'Lax'
        }
    }
});
But please consider that by default you have the 'Strict' option, and in many cases you may not want to change that value.
A recent version of Chrome was displaying this warning in the console:
"A cookie associated with a cross-site resource at [domain] was set without the SameSite attribute. A future release of Chrome will only deliver cookies with cross-site requests if they are set with SameSite=None and Secure."
My server redirects a user to an authentication server if they don't have a valid cookie. Upon authentication, the user is redirected back to my server with a validation code. If the code is verified, the user is redirected again into the website with a valid cookie.
I added the SameSite=Secure option to the cookie but Chrome ignored the cookie after a redirect from the authentication server. Removing that option fixed the problem, but the warning still appears.
A standalone demo of this issue: https://gist.github.com/isaacs/8d957edab609b4d122811ee945fd92fd
It's a bug in Chrome.