I'm using node-fetch and https-proxy-agent to make a request through a proxy. However, I get a 400 error code from the site I'm scraping, but only when I pass the agent; without it, everything works fine.
import fetch from 'node-fetch';
import Proxy from 'https-proxy-agent';
const ip = PROXIES[Math.floor(Math.random() * PROXIES.length)]; // PROXIES is a list of ips
const proxyAgent = Proxy(`http://${ip}`);
fetch(url, {
  agent: proxyAgent,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36'
  }
}).then(res => res.text()).then(console.log);
This results in a 400 error response from the server.
I have absolutely no idea why this is happening. If you want to reproduce the issue, I'm scraping https://azlyrics.com. Please let me know what is wrong.
The issue has been fixed: I hadn't noticed that I was making requests to an HTTPS site through an HTTP proxy. The site uses HTTPS, but the proxies were HTTP only. Switching to HTTPS proxies works. Thank you.
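For reference, a minimal sketch of the corrected setup, assuming a recent https-proxy-agent (v7+, where HttpsProxyAgent is a named export) and that PROXIES now holds HTTPS-capable proxy addresses:
import fetch from 'node-fetch';
import { HttpsProxyAgent } from 'https-proxy-agent';

// PROXIES is assumed to be a list of host:port strings for HTTPS-capable proxies
const ip = PROXIES[Math.floor(Math.random() * PROXIES.length)];
const proxyAgent = new HttpsProxyAgent(`https://${ip}`);

const res = await fetch(url, {
  agent: proxyAgent,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.72 Safari/537.36'
  }
});
console.log(await res.text());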
I am using axios with an API (the CoWIN API, https://apisetu.gov.in/public/marketplace/api/cowin/cowin-public-v2) that has a strong kind of protection against web requests.
When I was getting a 403 error on my dev machine (Windows), I solved it by just adding a 'User-Agent' header.
But after deploying to Heroku, I am still getting the same error.
const { data } = await axios.get(url, {
  headers: {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
  },
})
Using a fake user-agent in your headers can help with this problem, but there are other variables you may want to consider.
For example, if you are making multiple HTTP requests, you may want to keep several fake user-agents and randomize the user-agent for every request you make. This can help limit the chances of your scraper being detected.
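As a rough sketch of that idea with axios (the user-agent strings below are only illustrative examples), picking a random user-agent per request could look like this:
const axios = require('axios');

// Illustrative pool of user-agent strings; in practice keep a larger, up-to-date list
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'
];

async function get(url) {
  // Choose a different fake user-agent for every request
  const userAgent = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  const { data } = await axios.get(url, { headers: { 'user-agent': userAgent } });
  return data;
}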
If that still doesn't work, you may want to optimize your headers further. Besides sending HTTP requests with a randomized user-agent, you can further imitate a browser's request headers by adding more headers than just the "user-agent", and then ensuring that the selected user-agent is consistent with the information sent in the rest of the headers.
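For example, a fuller header set that stays consistent with a Chrome-on-Windows user-agent might look roughly like this (a sketch; the exact values real browsers send vary by version):
const headers = {
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
  'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'accept-language': 'en-US,en;q=0.9',
  'accept-encoding': 'gzip, deflate, br',
  'upgrade-insecure-requests': '1'
};

const { data } = await axios.get(url, { headers });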
You can check out here for more information.
That site will not only provide information on how to keep your headers consistent with the user-agent, but will also provide more solutions in case the above was still unsuccessful.
In my situation, it was the case that I had to bypass Cloudflare. You can determine whether this applies to you as well by logging your error to the terminal and checking whether the "server" key in the response headers says "cloudflare", in which case you can use this documentation for further assistance.
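For the detection step described above, a minimal axios sketch (run inside an async function; url and headers are the same placeholders used earlier) might look like:
try {
  await axios.get(url, { headers });
} catch (err) {
  // axios attaches the full response to the error for non-2xx status codes
  if (err.response) {
    // If this prints "cloudflare", the 403 is coming from Cloudflare's protection layer
    console.log(err.response.status, err.response.headers['server']);
  }
}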
I am trying to achieve results similar to what I see in the Network tab of the Google Chrome browser. I'm interested in the sizes of a website's resources.
I wrote some code using nodejs and puppeteer where I manipulate the userAgent header. I am quite sure that the header is properly set, but instead of the mobile version of an image from the web server I keep getting the desktop version. As an example, take https://www.nba.com/
When opening the above webpage with Google Chrome as if it were viewed from an iPhone, one of the responses is the mobile version of splash_screen.jpeg. When opening the very same website as if it were viewed from a PC, I get the PC version of splash_screen.jpg. When I set exactly the same userAgents with puppeteer as in the Google Chrome requests, there is no difference in the responses.
To set up the userAgent I'm doing:
if (process.argv[4] == 'mobile') {
  // iPhone (Safari) user-agent
  await page.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');
} else {
  // Android (Chrome on Nexus 5) user-agent
  await page.setUserAgent('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Mobile Safari/537.36');
}
To process the response data I do:
page._client.on('Network.dataReceived', (event) => {
  // Look up the request through puppeteer's private network manager
  const request = page._networkManager._requestIdToRequest.get(event.requestId);
  // Skip data: URLs
  if (request && request.url().startsWith('data:')) {
    return;
  }
  // some extra stuff down here
});
I expect my output to be the same as in the Network tab of the Google Chrome browser. Is there any way to achieve this?
Thanks in advance.
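As a side note, response sizes can also be collected with puppeteer's public response event instead of the private _client / _networkManager internals; a minimal sketch (note that response.buffer() returns the decoded body, while Chrome's Network tab shows the transferred, often compressed, size, so the numbers will not match exactly):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');

  page.on('response', async (response) => {
    if (response.url().startsWith('data:')) return; // skip data: URLs, as in the question
    try {
      const body = await response.buffer(); // decoded response body
      console.log(body.length, response.url());
    } catch (e) {
      // some responses (e.g. redirects) have no body
    }
  });

  await page.goto('https://www.nba.com/', { waitUntil: 'networkidle2' });
  await browser.close();
})();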
I am using requests and beautifulsoup to go through the popular comic store comixology in order to make a list of all comic titles, issues, and release dates for all of them, so I am requesting a massive number of web pages. Unfortunately, partway through I will get the error:
you do not have access to (URL) on this server
I tried using a function that recursively retries the request, but this isn't working.
I'm not putting the whole code in because it is very long.
def getUrl(url):
    try:
        page = requests.get(url)
    except requests.exceptions.RequestException:
        # Retry on request errors; note there is no retry limit here
        return getUrl(url)
    return page
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is to set a legitimate user-agent, so add headers to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('http://example.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
Additionally, you can add another set of headers to better pretend to be a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
In node.js (using the Hapi framework) I'm creating a link for the user to allow my app to read their account. Google handles that request and asks about granting permissions. Then Google redirects to my server with a GET parameter containing the response code, and here I have an issue.
Google Chrome isn't sending the cookie with the session ID.
If I mark that cookie as a session cookie in a cookie-editing extension, it is sent. The same thing happens in PHP, but PHP marks the cookie as a session cookie when creating the session, so it isn't a problem there. I'm using the plugin hapi-auth-cookie; it creates the session and handles everything about it. I also mark that cookie as non-HttpOnly in the hapi-auth-cookie settings, because that was the first difference I noticed when inspecting the PHP session cookie and mine in node.js. I get a 401 "missing authentication" response on each redirect. If I place the cursor in the address bar and hit Enter, everything works fine, so it is an issue with the redirect.
My question is basically: what may be causing this behavior? I should also mention that Firefox sends the cookie with each request without any issues.
Headers after redirect (no cookie with session):
{
  "host": "localhost:3000",
  "connection": "keep-alive",
  "cache-control": "max-age=0",
  "upgrade-insecure-requests": "1",
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
  "x-client-data": "CJS2eQHIprbJAQjEtskECKmdygE=",
  "x-chrome-connected": "id=110052060380026604986,mode=0,enable_account_consistency=false",
  "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
  "accept-encoding": "gzip, deflate, sdch, br",
  "accept-language": "pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4"
}
Headers after hitting Enter in the address bar (this works fine):
{
  "host": "localhost:3000",
  "connection": "keep-alive",
  "cache-control": "max-age=0",
  "upgrade-insecure-requests": "1",
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
  "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
  "accept-encoding": "gzip, deflate, sdch, br",
  "accept-language": "pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4",
  "cookie": "SESSID=very_long_string"
}
Strict cookies are not sent by the browser if the referrer is a different site, which is exactly what happens when the request is a redirect from a different site. Using Lax will get around this issue, or you can make your site handle not having access to strict cookies on the first request.
I came across this issue recently and wrote more detail on strict cookies, referrers and redirects.
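A minimal plain-Node sketch (not Hapi-specific) that makes the behavior easy to observe: set one Strict and one Lax cookie, then arrive at the page via a redirect from another site and check which of the two the browser sends back.
const http = require('http');

http.createServer((req, res) => {
  // Two cookies that differ only in their SameSite attribute
  res.setHeader('Set-Cookie', [
    'strict_cookie=1; Path=/; SameSite=Strict',
    'lax_cookie=1; Path=/; SameSite=Lax'
  ]);
  // On a top-level redirect coming from another site, browsers send only lax_cookie
  res.end('cookies received: ' + (req.headers.cookie || 'none'));
}).listen(3000);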
This issue is caused by hapi-auth-cookie not yet handling isSameSite (a newer Hapi feature). We can set it manually, e.g.:
const server = new Hapi.Server({
  connections: {
    state: {
      isSameSite: 'Lax'
    }
  }
});
But please consider that by default you have the 'Strict' option, and in many cases you may not want to change that value.
A recent version of Chrome was displaying this warning in the console:
A cookie associated with a cross-site resource at was set
without the SameSite attribute. A future release of Chrome will only
deliver cookies with cross-site requests if they are set with
SameSite=None and Secure.
My server redirects a user to an authentication server if they don't have a valid cookie. Upon authentication, the user is redirected back to my server with a validation code. If the code is verified, the user is redirected again into the website with a valid cookie.
I added the SameSite=Secure option to the cookie but Chrome ignored the cookie after a redirect from the authentication server. Removing that option fixed the problem, but the warning still appears.
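For reference, a cookie that satisfies the requirement described in the warning would be set with a header along these lines (the name and value here mirror the session cookie shown earlier; SameSite=None only takes effect when paired with Secure and delivered over HTTPS):
Set-Cookie: SESSID=very_long_string; Path=/; Secure; SameSite=None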
A standalone demo of this issue: https://gist.github.com/isaacs/8d957edab609b4d122811ee945fd92fd
It's a bug in Chrome.