I'm trying to access hotbit.io with Puppeteer, but I'm met with "Checking your browser before accessing www.hotbit.io" the moment Puppeteer loads the page.
When I run my script with headless: false, it redirects to the page after about 5 seconds. My problem is that I want to run it with headless: true.
When I run it with headless: true, it times out on the Cloudflare page.
[Screenshot at timeout]
I have tried:
"puppeteer-extra-plugin-stealth"
"Cloudflare-scraper (https://www.npmjs.com/package/cloudflare-scraper)". This has very limited documentation (non-existing), but I saw under "Issues" on their github, that it is not supported anymore.
It seems like, that cloudflare knows, that I'm having headless activated.
Does anyone know, how I can skip the cloudflare redirecting page?
Thank you #BGPHiJACK!
It helped to set the user agent to: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0
Right after initializing the page, I set the user agent (a fuller sketch combining this with the stealth plugin follows the snippet below):
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0')
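For reference, here is a minimal sketch of how the pieces fit together, assuming puppeteer-extra and puppeteer-extra-plugin-stealth are installed; the target URL and timeout are just illustrative:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Spoof the user agent before navigating, as in the answer above.
  await page.setUserAgent('Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
  await page.goto('https://www.hotbit.io', { waitUntil: 'networkidle2', timeout: 60000 });
  // ... interact with the page here ...
  await browser.close();
})();

The stealth plugin patches many of the signals headless Chrome normally leaks; per the answer above, the spoofed user agent is what made the difference in this case.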
In TrackJS, some user agents are parsed as normal browsers, e.g.:
Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)
This is parsed as Chrome Mobile 59.0.3071.
I tried to do this with Ignore Rules in the settings, but it doesn't work.
So I need to filter errors by a token in the user agent.
Is it possible to do this without JS?
More similar user agents: https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
The TrackJS UI doesn't allow you to create Ignore Rules against the raw UserAgent, only the parsed browser and operating system. Instead, use the client-side ignore capability with the onError function.
Build your function to detect the tokens you want to exclude, and return false from it when you don't want the error to be sent; a sketch follows below.
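Here is a minimal sketch of that idea, assuming the current TrackJS browser agent; the token and the list of blocked tokens are placeholders, and the exact config shape may differ for older agent versions:

TrackJS.install({
  token: "YOUR_TOKEN",
  onError: function (payload) {
    // Drop errors reported from crawler-like user agents (e.g. Google-Read-Aloud).
    var blockedTokens = ["Google-Read-Aloud", "Googlebot"];
    var ua = window.navigator.userAgent || "";
    for (var i = 0; i < blockedTokens.length; i++) {
      if (ua.indexOf(blockedTokens[i]) !== -1) {
        return false; // do not send this error to TrackJS
      }
    }
    return true; // send everything else
  }
});

Because the check runs in the client before the error is transmitted, no Ignore Rule in the TrackJS UI is needed.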
I have 100 BOL (bill of lading) codes that I need to search on the website below. However, I can't find a URL pattern I could substitute the codes into to keep searching automatically. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests

url = 'https://www.msc.com/track-a-shipment'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}

form_data = {'first': 'true',
             'pn': '1',
             'kd': 'python'}

def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)

getJobs()
tl;dr: You'll likely need to use a headless browser like Selenium to go to the page, input the code, and click the search button.
The URL to retrieve is generated by the JavaScript that runs when you click Search.
The search button posts the request to their server, so when it redirects you to the results link, the server knows what response to give you.
To auto-generate the link, you'd have to analyze the JavaScript, understand how it builds the request, reproduce that yourself, post it to their server, and then make a subsequent GET request to retrieve the results, just as the ASP.NET front end does.
Alternatively, you can use a headless browser like Selenium to go to the page, enter the code, and click the search button. After the headless browser navigates to the results, you can parse them from there; see the sketch below.
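A minimal sketch of that approach using selenium-webdriver for Node, looping over the codes from the question; the CSS selectors are hypothetical placeholders and would need to be replaced with the real ones from the page:

const { Builder, By, until } = require('selenium-webdriver');

async function trackShipment(driver, code) {
  await driver.get('https://www.msc.com/track-a-shipment');
  // Hypothetical selectors -- inspect the page to find the real input and button.
  const input = await driver.wait(until.elementLocated(By.css('input[type="text"]')), 10000);
  await input.sendKeys(code);
  const button = await driver.findElement(By.css('button[type="submit"]'));
  await button.click();
  // Wait until the (hypothetical) results container appears, then grab the HTML.
  await driver.wait(until.elementLocated(By.css('.trackingResults')), 15000);
  return driver.getPageSource();
}

(async () => {
  const codes = ['MSCUZH129687', 'MSCUJZ365758'];
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    for (const code of codes) {
      const html = await trackShipment(driver, code);
      console.log(code, html.length); // parse the HTML here instead of logging
    }
  } finally {
    await driver.quit();
  }
})();

This requires a driver such as chromedriver to be installed; the same flow works with Selenium's Python bindings if you'd rather stay in Python.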
I am using "request" module to get page contents with following headers
var headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0' };
still, the page I am trying to fetch somehow displays different content than View > Source from browser (looks like it detects for javascript support) , before diving into phantomjs (which I want to avoid due performance limitations) is there any way to get the html as it is on the browser?.
Thanks
I have a web application that I am working on, and we have three servers: Production, Staging (QA), and Dev.
Is there a way to have a specific browser point to one server and another browser point to a different server? E.g., Firefox points to Production, Safari to Staging, and Chrome to Dev?
As you pointed out:
navigator.appName resolves to "Microsoft Internet Explorer", not "Internet Explorer" as you have written.
Also, the first character of navigator.appVersion will not give you the version of the browser. In IE 10, it resolves to "5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0;"
To make your code work, you need to update it to something like:
function get_browser_version() {
    var N = navigator.appName, ua = navigator.userAgent, tem;
    var M = ua.match(/(opera|chrome|safari|firefox|msie)\/?\s*(\.?\d+(\.\d+)*)/i);
    if (M && (tem = ua.match(/version\/([\.\d]+)/i)) != null) M[2] = tem[1];
    M = M ? [M[1], M[2]] : [N, navigator.appVersion, '-?'];
    return M[1];
}
var browser = navigator.appName;
var version = get_browser_version();
if (browser == "Microsoft Internet Explorer") {
    // Compare numerically; a string comparison would treat "10.0" as less than "8.1".
    if (parseFloat(version) <= 8.1)
        document.location.href = "lores.htm";
}
How can I force a Windows Phone to use the desktop view mode in the mobile browser?
In the settings it is possible to set the browser to use desktop view, because some features seem to be missing in the mobile view, causing my site not to be displayed correctly.
If you want to make websites display in desktop mode in the WebBrowser control, you must change its user agent. You can do so using this:
webBrowser.Navigate(new Uri("http://www.google.com"), null, "User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)");
That code changes the WebBrowser's user agent to that of desktop Internet Explorer 10.
However, it will only change the user agent for the page navigated to. When the user clicks a link, the user agent will be changed back. To fix this, handle the WebBrowser's Navigating event like this:
private void webBrowser_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    string url = e.Uri.ToString();
    if (!url.Contains("#changedua"))
    {
        e.Cancel = true;
        url = url + "#changedua";
        webBrowser.Navigate(new Uri(url), null, "User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)");
    }
}
In this code, we check whether the URL contains the flag "#changedua". If it does, we allow the navigation. If it does not, we cancel the navigation and navigate again using our custom user agent, appending the flag to show that the request has already been handled.