Puppeteer page request fails only on AWS EC2 instance - node.js

I've written a small JavaScript program using Node (v12.16.2) and Puppeteer (v2.1.1) that I'm trying to run on an AWS EC2 instance. I'm doing a goto of the URL appended below. It works fine on a local (non-AWS) Linux machine with similar versions, but on the EC2 instance it fails and doesn't show the page at all. I've tried running with headless=false and devtools=true. In the browser console, I see this:
Uncaught TypeError: Cannot read property 'length' of undefined
at il_Ev (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1862)
at il_Hv (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1849)
at il_Yv.initialize (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1867)
at il__i (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:270)
at il_Gl.il_Wj.H (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:322)
at rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1869
As I mentioned, this same code works fine on a different Linux machine, and the URL also loads with no errors when opened directly in a browser. I'm stumped. Does anyone know what might be going on? Other pages, like google.com, load fine on the EC2 instance, FYI. TIA.
Reid
https://www.google.com/imgres?imgurl=https%3A%2F%2Fimg-s-msn-com.akamaized.net%2Ftenant%2Famp%2Fentityid%2FAACPW4S.img%3Fh%3D552%26w%3D750%26m%3D6%26q%3D60%26u%3Dt%26o%3Df%26l%3Df%26x%3D992%26y%3D672&imgrefurl=https%3A%2F%2Fwww.msn.com%2Fen-us%2Flifestyle%2Fpets-animals%2F49-adorable-puppy-pictures-that-will-make-you-melt%2Fss-AACSrEY&tbnid=Ad7wBCCmAXPRDM&vet=12ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw..i&docid=jawDJ74qdYREJM&w=750&h=500&q=puppies&ved=2ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw
Here's an excerpt of the relevant code, which is pretty simple:
const browser = await puppeteer.launch({
  headless: false,
  devtools: true,
  slowMo: 150
});
/* Get the first page rather than creating a new one unnecessarily. */
let page = (await browser.pages())[0];
await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
);
await page.setViewport({
  width: 1524,
  height: 768
});
try {
  await page.goto("https://www.google.com/imgres?imgurl=https%3A%2F%2Fimg-s-msn-com.akamaized.net%2Ftenant%2Famp%2Fentityid%2FAACPW4S.img%3Fh%3D552%26w%3D750%26m%3D6%26q%3D60%26u%3Dt%26o%3Df%26l%3Df%26x%3D992%26y%3D672&imgrefurl=https%3A%2F%2Fwww.msn.com%2Fen-us%2Flifestyle%2Fpets-animals%2F49-adorable-puppy-pictures-that-will-make-you-melt%2Fss-AACSrEY&tbnid=Ad7wBCCmAXPRDM&vet=12ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw..i&docid=jawDJ74qdYREJM&w=750&h=500&q=puppies&ved=2ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw", {
    timeout: 0,
    // waitUntil: ["load"]
    // waitUntil: ["networkidle2"]
  });
  await page.waitForSelector('#irc_shc', {
    visible: true,
    timeout: 0
  });
} catch (e) {
  console.log("error: e = ", e);
}
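
Since the failure only shows up in the page's own console, it can help to forward those messages to Node so they're visible even when running headless on the EC2 instance. A minimal sketch, assuming it is placed right after the page is obtained and before the goto:

// Forward in-page console messages and uncaught page exceptions to the Node process.
page.on("console", msg => console.log("page console:", msg.type(), msg.text()));
page.on("pageerror", err => console.log("page error:", err.message));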

It turns out this was just a temporary Google page error.

Related

How to find / click elements inside iframe in puppeteer?

Setup:
puppeteer - puppeteer#13.7.0
nodejs - v10.19.0
Puppeteer launch setup:
"args": [
  '--ignore-certificate-errors',
  '--ignore-certificate-errors-spki-list',
  '--no-sandbox',
  '--disable-gpu',
  '--start-maximized',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--window-size=1200,800',
  '--single-process',
  '--disable-infobars',
  '--disable-web-security',
  '--disable-features=IsolateOrigins,site-per-process',
  '--window-position=0,0',
  '--no-zygote',
  '--user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36"',
]
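For context, a minimal sketch of how an args list like this would typically be passed to puppeteer.launch (the headless and defaultViewport values here are assumptions, not part of the original setup):

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  headless: false,        // assumption: headed, since the flow clicks through the Gmail UI
  defaultViewport: null,  // assumption: let --window-size control the viewport
  args: [
    // ...the args listed above
  ],
});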
I'm trying to upload an image onto Gmail.
The "image uploader" is an iframe that gets dynamically loaded when you click on the "insert photo" icon in the panel where you type the recipient / subject / body / etc.
When I query the iframe, I can find the div (including searching by textContent and getting the ID), but for some reason there are two things I cannot do:
1) get the bounding rect
2) return the element handle to Puppeteer
For some reason, passing the element handle from inside frame.evaluate() comes back as a null value, so I can't call ".click()". I thought maybe I could get the boundaries and call page.mouse.click() based on the coordinates returned, but that also fails.
Here's what I did, with comments as explainers:
// The iframe URL is actually docs.google.com
const frame = page.frames().find(f => f.url().startsWith('https://docs.google.com/'));
let upload_tab = await frame.evaluate(() => {
  let upload = [...document.querySelectorAll('div[role="tab"]')].filter(
    d => d.textContent == "Upload"
  );
  // this works
  console.log(upload[0].id);
  // this works
  console.log(upload[0].textContent);
  // this fails
  console.log(upload[0].getBoundingClientRect());
  // this fails (returning "upload_tab" outside of
  // frame.evaluate() shows up as NULL)
  return upload[0];
});
// this fails b/c upload_tab is null
await upload_tab.click();
Questions:
How is it that upload[0].getBoundingClientRect() fails while upload[0].textContent works?
How can I pass the element from the frame back to Puppeteer?
Any suggestions as to workarounds?
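
One likely explanation: frame.evaluate() serializes whatever the callback returns, and DOM elements aren't serializable, so the result comes back as null/undefined. A minimal sketch of a workaround with frame.evaluateHandle(), which returns a handle instead of a serialized value (same frame and selector as above assumed):

// Get a JSHandle to the tab element instead of trying to serialize it.
const uploadHandle = await frame.evaluateHandle(() => {
  return [...document.querySelectorAll('div[role="tab"]')]
    .find(d => d.textContent == "Upload");
});

// asElement() converts the JSHandle into an ElementHandle (or null if it isn't a DOM node).
const uploadTab = uploadHandle.asElement();
if (uploadTab) {
  await uploadTab.click();
}

Alternatively, frame.$$('div[role="tab"]') returns ElementHandles directly, which can then be inspected one by one via elementHandle.evaluate().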

puppeteer bypass cloudflare by enable cookies and Javascript

(In Node.js -> server side only.)
I'm doing some web scraping, and some pages are protected by the Cloudflare anti-DDoS page. I'm trying to bypass this page. Searching around, I found a lot of articles on the stealth method or reCAPTCHA. But the thing is, Cloudflare is not even trying to give me a captcha; it keeps getting stuck on the "wait 5 seconds" page because it displays in red (TURN ON JAVASCRIPT AND RELOAD) and (TURN ON COOKIES AND RELOAD). By the way, my JavaScript does seem to be active, because my program runs on a lot of websites and processes their JavaScript fine.
This is my code:
// vm = this;
vm.puppeteer.use(vm.StealthPlugin());
vm.puppeteer.use(vm.AdblockerPlugin({
  blockTrackers: true
}));
let browser = await vm.puppeteer.launch({
  headless: true
});
let browserPage = await browser.newPage();
await browserPage.goto(link, {
  waitUntil: 'networkidle2',
  timeout: 40 * 1000
});
await browserPage.waitForTimeout(20 * 1000);
let body = await browserPage.evaluate(() => {
  return document.documentElement.outerHTML;
});
I also tried removing StealthPlugin and AdblockerPlugin, but Cloudflare keeps telling me there is no JavaScript and no cookies.
Can anyone help me, please?
Setting your own User-Agent and Accept-Language header should work, because your headless browser needs to pretend to be a real person browsing.
You can use page.setExtraHTTPHeaders() and page.setUserAgent() to do so.
await browserPage.setExtraHTTPHeaders({
  'Accept-Language': 'en'
});
// You can use any UserAgent you want
await browserPage.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
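
Note that both calls only affect requests made after they run, so they need to come before browserPage.goto(link, ...). A minimal sketch of the order, using the same variables as above:

await browserPage.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
await browserPage.setExtraHTTPHeaders({ 'Accept-Language': 'en' });
await browserPage.goto(link, { waitUntil: 'networkidle2', timeout: 40 * 1000 });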

mimic chrome download requests in postman

In my Chrome extension I am blocking the user's downloads, and I am downloading the files on some other computer (a security app, but that's not important for the post) based on the download URL.
The idea works fine in a lot of cases. For example, if a user tries to download the file http://www.orimi.com/pdf-test.pdf with his Chrome browser, the extension blocks the download, sends the URL to some other server, and that server downloads the link.
I have a problem with websites that require certain headers when downloading the file. Is there a way to mimic the exact Chrome request from another app?
I tried using chrome.webRequest.onBeforeSendHeaders.addListener to get all the request headers and then use those headers to download the file from somewhere else (I mean not via Chrome but via Postman), and I get Unauthorized.
Here is a small code example from the background script:
chrome.downloads.onCreated.addListener(function (e) {
  console.log(`============= begin onCreated =============`);
  console.log(e);
  console.log(`============= end onCreated =============`);
});
chrome.webRequest.onBeforeSendHeaders.addListener(
  function (details) {
    console.log(`============= begin onBeforeSendHeaders =============`);
    console.log(details);
    console.log(`============= end onBeforeSendHeaders =============`);
    return {
      requestHeaders: details.requestHeaders
    };
  }, {
    urls: ["<all_urls>"]
  },
  ["blocking", "requestHeaders"]);
And when I try to download a file from LinkedIn, I get this output:
============= begin onBeforeSendHeaders =============
{
  frameId: 0,
  initiator: "https://www.linkedin.com",
  method: "GET",
  parentFrameId: -1,
  requestHeaders: [
    { name: "Upgrade-Insecure-Requests", value: "1" },
    { name: "User-Agent", value: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb…ML, like Gecko) Chrome/84.0.4147.89 Safari/537.36" },
    { name: "Accept", value: "text/html,application/xhtml+xml,application/xml;q=…,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" },
    { name: "Sec-Fetch-Site", value: "same-origin" },
    { name: "Sec-Fetch-Mode", value: "navigate" },
    { name: "Sec-Fetch-User", value: "?1" },
    { name: "Sec-Fetch-Dest", value: "document" }
  ],
  requestId: "305371",
  tabId: 844,
  timeStamp: 1596530356040.051,
  type: "main_frame",
  url: "https://www.linkedin.com/dms/C4D06AQGz1fU0o0r3ZQ/messaging-attachmentFile/0?m=AQLmTdbXe5dgKgAAAXO4axnTdJJLiaU6EnZbq_fhQzg_1697ToPaTbJ3jw&ne=1&v=beta&t=QCDqVeorWfXEAgQBCQdo9hbEQrxwM97zqzCvLuBE2Cw#S6555100544749322240_500"
}
============= end onBeforeSendHeaders =============
============= begin onCreated =============
{
  bytesReceived: 0,
  canResume: false,
  danger: "safe",
  exists: true,
  fileSize: 0,
  filename: "",
  finalUrl: "https://www.linkedin.com/dms/D5D06AQGz1fU0o0r3ZK/messaging-attachmentFile/0?m=AQLpTdbXe5dgKgAAAXO4axnTdJJLiaU6EnZbe_fhQzg_1697ToPaTbJ3jw&ne=1&v=beta&t=QCDqVeorWfXEAgQBCQdo9hbEQrxwM97zqzCvLuBE2Cw#S6555100544749322246_500",
  id: 7100,
  incognito: false,
  mime: "application/octet-stream",
  paused: false,
  referrer: "https://www.linkedin.com/in/natali-melman-a785a349/",
  startTime: "2020-08-04T08:39:16.060Z",
  state: "in_progress",
  totalBytes: 0,
  url: "https://www.linkedin.com/dms/D5D06AQGz1fU0o0r3ZK/messaging-attachmentFile/0?m=AQLpTdbXe5dgKgAAAXO4axnTdJJLiaU6EnZbe_fhQzg_1697ToPaTbJ3jw&ne=1&v=beta&t=QCDqVeorWfXEAgQBCQdo9hbEQrxwM97zqzCvLuBE2Cw#S6555100544749322256_500"
}
============= end onCreated =============
I tried to mimic this exact request in Postman (same headers, parameters, etc.) but I get a 401 error code.
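
One thing that may explain the 401: in recent Chrome versions, onBeforeSendHeaders does not report Cookie (or a few other privileged headers such as Referer) unless the listener opts in with "extraHeaders", so the replayed request could be missing the session cookie the site expects. A sketch of the opt-in, assuming the same listener as above:

chrome.webRequest.onBeforeSendHeaders.addListener(
  function (details) {
    // With "extraHeaders", requestHeaders also includes Cookie, Referer, etc.
    console.log(details.requestHeaders);
    return { requestHeaders: details.requestHeaders };
  },
  { urls: ["<all_urls>"] },
  ["blocking", "requestHeaders", "extraHeaders"]);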

Why does http.get take so long in NodeJS?

Using NodeJS and http.get, I am trying to see if a website uses a redirect. I tried a few URLs which all worked great. However, when I ran the code with washingtonpost.com it took over 5 seconds. In my browser the website works just fine. What could be the issue?
console.time("Done. Script executed in");
const http = require("http");

function checkRedirectHttp(input) {
  return new Promise((resolve) => {
    http.get(input, { method: 'HEAD' }, (res) => { resolve([res.headers.location, res.statusCode]); })
      .on('error', (e) => { throw { Error: `Cannot reach website ${input}` }; });
  });
}

checkRedirectHttp("http://www.washingtonpost.com/").then(result => {
  console.log(result);
  console.timeEnd("Done. Script executed in");
});
Output:
[
'http://www.washingtonpost.com/gdpr-consent/?next_url=https%3a%2f%2fwww.washingtonpost.com%2f',
302
]
Done. Script executed in: 8.101s
I ran your code, enhanced it some and slowly added back the actual headers that are sent from my browser when I go to the same link in the browser. When I changed the request to a "GET" (no longer a "HEAD") and added the following headers from my browser:
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q=0.9",
"cookie": "a very long cookie here"
then the response went from 9 seconds to 71ms.
So, apparently the server doesn't like the HEAD request and doesn't like that a bunch of headers it expects to be there are missing. Probably, it is detecting that this isn't a real browser and it's either analyzing something for 8 seconds or it's just purposely delaying a response to a "fake client".
Also, if you use the http://www.washingtonpost.com URL instead of https://www.washingtonpost.com, it redirects to https every time for me. So, you may as well just start with the https:// form of the URL.
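
For reference, a minimal sketch of the enhanced request described above (a plain GET over https with the browser-like headers; the cookie value is intentionally left out here, just as in the answer):

const https = require("https");

function checkRedirectHttps(input) {
  return new Promise((resolve, reject) => {
    const options = {
      headers: {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9"
      }
    };
    https.get(input, options, (res) => {
      res.resume(); // drain the body so the socket is released
      resolve([res.headers.location, res.statusCode]);
    }).on("error", reject);
  });
}

checkRedirectHttps("https://www.washingtonpost.com/").then(console.log, console.error);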

I'm requesting html content from a site with axios in JS but the website is blocking my request

I want my script to pull the html data from a site, but it is returning a page that says it knows my script is a bot and giving it an 'I am not a robot' test to pass.
Instead of returning the content of the site it returns a page that partly reads...
"As you were browsing, something about your browser\n made us think you were a bot."
My code is...
const axios = require('axios');
const url = "https://www.bhgre.com/Better-Homes-and-Gardens-Real-Estate-Blu-Realty-49231c/Brady-Johnson-7469865a";

axios(url, {
  headers: {
    'Mozilla': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.3 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/43.4.0',
  }
})
  .then(response => {
    const html = response.data;
    console.log(html);
  })
  .catch(console.error);
I've tried a few different headers, but there's no fooling the site into thinking my script is human. This is in Node.js.
Maybe this does or does not have bearing on my issue, but this code will hopefully live on the backend of a site I'm building in React. I'm not trying to scrape the site as a one-off. I would like my site to read a little bit of content from this site, instead of having to manually update my site with bits of its content whenever it changes.
Accessing every site using axios or curl is not possible. There are various kinds of checks, including CORS, that can prevent someone from accessing a site directly via a client other than the browser.
You can achieve the same thing using phantom (https://www.npmjs.com/package/phantom). This is commonly used by scrapers, and if you're afraid that the other site may block you for repeated access, you can use a random interval before making requests. If you need to read something from the returned HTML page, you can use cheerio (https://www.npmjs.com/package/cheerio).
Hope it helps.
Below is the code that I tried, and it worked for your URL:
const phantom = require('phantom');

(async () => {
  const url = "https://www.bhgre.com/Better-Homes-and-Gardens-Real-Estate-Blu-Realty-49231c/Brady-Johnson-7469865a";
  const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
  const page = await instance.createPage();
  const status = await page.open(url);
  if (status !== 'success') {
    console.error(status);
    await instance.exit();
    return;
  }
  const content = await page.property('content');
  await instance.exit();
  console.log(content);
})();
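
Since the answer mentions cheerio for reading the returned HTML, here is a minimal sketch of parsing the content string from the snippet above (the 'h1' selector is just an illustrative assumption, not something taken from the actual page):

const cheerio = require('cheerio');

// "content" is the HTML string returned by page.property('content') above.
const $ = cheerio.load(content);
// Example: grab the page title and the text of the first h1 (assumed selector).
console.log($('title').text());
console.log($('h1').first().text());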
