How to find / click elements inside an iframe in Puppeteer? - node.js

Setup:
puppeteer - puppeteer#13.7.0
nodejs - v10.19.0
Puppeteer launch setup:
"args" :[
'--ignore-certificate-errors',
'--ignore-certifcate-errors-spki-list',
'--no-sandbox',
'--disable-gpu',
'--start-maximized',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--window-size=1200,800',
'--single-process',
'--disable-infobars',
'--disable-web-security',
'--disable-features=IsolateOrigins,site-per-process',
'--window-position=0,0',
'--no-zygote',
'--no-sandbox',
'--user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36"',
]
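For reference, these flags go into the args option of puppeteer.launch(). A minimal sketch, inside an async function (headless: true is my assumption; the question does not say):

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
    headless: true, // assumption; not stated in the question
    args: [
        '--no-sandbox',
        '--disable-features=IsolateOrigins,site-per-process',
        // ...plus the rest of the flags listed above
    ],
});

The --disable-features=IsolateOrigins,site-per-process flag is worth highlighting for this question: it disables out-of-process iframes, which is the usual trick that lets page.frames() see cross-origin frames such as the docs.google.com uploader below.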
I'm trying to upload an image in Gmail.
The "image uploader" is an iframe that gets dynamically loaded when you click the "insert photo" icon in the panel where you type the recipient / subject / body / etc.
When I query the iframe, I can find the div (including searching by textContent and getting its ID), but there are two things I cannot do:
1) get the bounding rect
2) return the element handle to Puppeteer
For some reason, passing the element handle out of frame.evaluate() yields a null value, so I can't call .click() on it. I thought maybe I could get the boundaries and call page.mouse.click() with the coordinates returned, but that also fails.
Here's what I did, with comments as explanation:
////////////////////////////////////////
// iframe url is actually docs.google.com
////////////////////////////////////////
// page.frames() returns a plain array, so no await is needed here
const frame = page.frames().find(f => f.url().startsWith('https://docs.google.com/'));

let upload_tab = await frame.evaluate(() => {
    let upload = [...document.querySelectorAll('div[role="tab"]')].filter(
        d => d.textContent == "Upload"
    );
    ////////////////////////
    // this works
    ////////////////////////
    console.log(upload[0].id);
    ////////////////////////
    // this works
    ////////////////////////
    console.log(upload[0].textContent);
    ////////////////////////
    // this fails
    ////////////////////////
    console.log(upload[0].getBoundingClientRect());
    ////////////////////////////////////////////////
    // this fails (returning "upload_tab" outside of
    // frame.evaluate() shows up as NULL)
    ////////////////////////////////////////////////
    return upload[0];
});
////////////////////////
// this fails b/c upload_tab is null
////////////////////////
await upload_tab.click();
Questions:
1) How is it that upload[0].getBoundingClientRect() fails while upload[0].textContent works?
2) How can I pass the element from the frame back to Puppeteer?
3) Any suggestions for workarounds?
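For context: frame.evaluate() can only return JSON-serializable values, so a DOM element comes back as null, and a DOMRect (whose properties live on its prototype rather than on the object itself) serializes to an empty object. A minimal sketch of two workarounds, assuming the same "Upload" tab markup as above:

// Workaround 1: use frame.evaluateHandle() so the element comes back as a
// handle Puppeteer can click, instead of being JSON-serialized to null.
const handle = await frame.evaluateHandle(() =>
    [...document.querySelectorAll('div[role="tab"]')]
        .find(d => d.textContent == "Upload")
);
const uploadTab = handle.asElement();
if (uploadTab) {
    await uploadTab.click();
}

// Workaround 2: copy the DOMRect into a plain object before returning it,
// so it survives serialization.
const rect = await frame.evaluate(() => {
    const el = [...document.querySelectorAll('div[role="tab"]')]
        .find(d => d.textContent == "Upload");
    const r = el.getBoundingClientRect();
    return { x: r.x, y: r.y, width: r.width, height: r.height };
});

Note that the rect's coordinates are relative to the iframe's own viewport; to feed them to page.mouse.click() you would also have to add the offset of the iframe element within the page.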

Related

XPath: working in DevTools but returns an empty object with document.evaluate()

The following command works as expected in the DevTools console (for example, here - https://news.ycombinator.com/)
$x('//a[@class="storylink"]') (Edge browser)
But the following code:
const page = await browser.newPage();
await page.goto("https://news.ycombinator.com/");
let urls = await page.evaluate(() => {
    var item = document.evaluate(
        '//a[@class="storylink"]',
        document,
        null,
        XPathResult.FIRST_ORDERED_NODE_TYPE,
        null
    ).singleNodeValue;
    return item;
});
await browser.close();
returns an empty object: {}. The same happens on every other website.
Why is that happening?
If you are automating Chrome with Puppeteer from Node.js, then the page object you have already exposes a method $x for XPath evaluation; see https://pptr.dev/#?product=Puppeteer&version=v10.4.0&show=api-pagexexpression. page.evaluate() can only return serializable values, so a DOM node comes back as an empty object, whereas $x returns proper ElementHandles. That means that doing
let urls = await page.$x('//a[@class="storylink"]')
should suffice.
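One caveat (not in the original answer): page.$x() returns ElementHandles rather than strings, so to end up with actual link URLs, as the urls variable name suggests, one more evaluate step is needed. A minimal sketch, assuming the same Hacker News markup:

const handles = await page.$x('//a[@class="storylink"]');
const urls = await Promise.all(
    handles.map(h => h.evaluate(a => a.href))
);
console.log(urls);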

Puppeteer page.evaluate randomly stopped working when parsing website

I created a web parser for option alerts about 3 weeks ago and all was going well. As of today, I checked on it and for some reason it was returning empty values. I thought maybe the website had been reformatted, but nothing looks different. I have been trying many fixes for the last few hours, so I'm hoping I could get some help. Below is the code I use for parsing the website:
const browser = await puppeteer.launch({
    args: ['--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--disable-gpu'],
    dumpio: true,
    headless: true
});
const page = await browser.newPage();
await page.goto(process.env.ALERTS_PARSER_WEBSITE);
// page.on("console", msg => console.log("PAGE LOG:", msg));
const data = await page.evaluate(() =>
    Array.from(document.querySelectorAll("table > tbody > tr"), (row) =>
        Array.from(row.querySelectorAll("th, td"), (cell) => cell.innerText)
    )
);
I then map the data into my own array and pass it back to my front-end. The website I am trying to parse is Barchart's Unusual Options Activity page. You can inspect the site there and see that the query selector should work. I'm really on my last leg with this one, so any help would be greatly appreciated.
Not sure what the cause may be, but I managed to get the data only with puppeteer.launch({ headless: false }); and
page.setDefaultTimeout(300_000);
// ...
await page.waitForSelector("table > tbody > tr");
(the last may be needed only on slow machines like mine).
Maybe the site has started using some protection against headless mode.
P.S. When I try to take a page screenshot in headless mode, I instantly get this: [screenshot omitted]
P.P.S. It seems the solution is simple for now. As response.request().redirectChain() is empty, the site apparently only checks the user agent header of the first request. So this seems to fix the issue for headless mode (the difference can be inferred by comparing await browser.userAgent() values in the two modes):
await page.goto('https://www.barchart.com/options/unusual-activity/stocks?orderBy=tradeTime&orderDir=desc');
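Putting the pieces together, a minimal sketch of the fixed scraper under the answer's assumptions (the barchart.com URL is the one from the snippet above; the table-parsing code is the asker's own):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Strip the "HeadlessChrome" marker the site appears to key on.
    await page.setUserAgent((await browser.userAgent()).replace('HeadlessChrome', 'Chrome'));

    await page.goto('https://www.barchart.com/options/unusual-activity/stocks?orderBy=tradeTime&orderDir=desc');
    await page.waitForSelector('table > tbody > tr');

    const data = await page.evaluate(() =>
        Array.from(document.querySelectorAll('table > tbody > tr'), (row) =>
            Array.from(row.querySelectorAll('th, td'), (cell) => cell.innerText)
        )
    );

    console.log(data);
    await browser.close();
})();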

Puppeteer: bypass Cloudflare by enabling cookies and JavaScript

(In Node.js -> server side only).
I'm doing some web scraping, and some pages are protected by the Cloudflare anti-DDoS page. I'm trying to bypass this page. Searching around, I found a lot of articles on the stealth method or reCAPTCHA. But the thing is, Cloudflare is not even trying to give me a captcha; it just stays stuck on the "wait 5 seconds" page, because it displays in red "TURN ON JAVASCRIPT AND RELOAD" and "TURN ON COOKIES AND RELOAD". My JavaScript seems to be active, by the way, because my program runs on a lot of other websites and processes their JavaScript fine.
This is my code:
// vm = this;
vm.puppeteer.use(vm.StealthPlugin())
vm.puppeteer.use(vm.AdblockerPlugin({
    blockTrackers: true
}))
let browser = await vm.puppeteer.launch({
    headless: true
});
let browserPage = await browser.newPage();
await browserPage.goto(link, {
    waitUntil: 'networkidle2',
    timeout: 40 * 1000
});
await browserPage.waitForTimeout(20 * 1000);
let body = await browserPage.evaluate(() => {
    return document.documentElement.outerHTML;
});
I also tried removing StealthPlugin and AdblockerPlugin, but Cloudflare keeps telling me there is no JavaScript and no cookies.
Can anyone help me, please?
Setting your own User-Agent and Accept-Language header should work, because your headless browser needs to pass as a real person browsing.
You can use page.setExtraHTTPHeaders() and page.setUserAgent() to do so.
await browserPage.setExtraHTTPHeaders({
    'Accept-Language': 'en'
});
// You can use any UserAgent you want
await browserPage.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
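For completeness, a minimal sketch of how these calls slot into the question's flow (the plugin setup mirrors the asker's code; link stands for the protected URL from the question). Both calls must run before goto() so the very first request already carries the spoofed headers:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

const link = 'https://example.com/'; // placeholder for the question's protected URL

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const browserPage = await browser.newPage();

    // Set headers and user agent before navigating.
    await browserPage.setExtraHTTPHeaders({ 'Accept-Language': 'en' });
    await browserPage.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');

    await browserPage.goto(link, { waitUntil: 'networkidle2', timeout: 40 * 1000 });
    const body = await browserPage.evaluate(() => document.documentElement.outerHTML);

    await browser.close();
})();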

Puppeteer page request fails only on AWS EC2 instance

I've written a small JavaScript program using Node (v12.16.2) and Puppeteer (v2.1.1) that I'm trying to run on an AWS EC2 instance. I'm doing a goto of the URL appended to this post. It works fine on a local (non-AWS) Linux machine with similar versions, but on EC2 it fails, not showing the page at all. I've tried running with headless=false and devtools=true. In the browser console, I see this:
Uncaught TypeError: Cannot read property 'length' of undefined
at il_Ev (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1862)
at il_Hv (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1849)
at il_Yv.initialize (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1867)
at il__i (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:270)
at il_Gl.il_Wj.H (rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:322)
at rs=ACT90oFtPziyty36T_zhgMUEStuCtJgAkQ:1869
As I mentioned, this same code works fine on a different Linux machine, and the page loads fine directly in a browser; no errors. I'm stumped. Does anyone know what might be going on? Other pages, like google.com, load fine on the EC2 instance, FYI. TIA.
Reid
https://www.google.com/imgres?imgurl=https%3A%2F%2Fimg-s-msn-com.akamaized.net%2Ftenant%2Famp%2Fentityid%2FAACPW4S.img%3Fh%3D552%26w%3D750%26m%3D6%26q%3D60%26u%3Dt%26o%3Df%26l%3Df%26x%3D992%26y%3D672&imgrefurl=https%3A%2F%2Fwww.msn.com%2Fen-us%2Flifestyle%2Fpets-animals%2F49-adorable-puppy-pictures-that-will-make-you-melt%2Fss-AACSrEY&tbnid=Ad7wBCCmAXPRDM&vet=12ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw..i&docid=jawDJ74qdYREJM&w=750&h=500&q=puppies&ved=2ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw
Here's an excerpt of the relevant code, which is pretty simple:
const browser = await puppeteer.launch({
    headless: false,
    devtools: true,
    slowMo: 150
});
/* Get the first page rather than creating a new one unnecessarily. */
let page = (await browser.pages())[0];
/* Note: browser.userAgent() only reads the UA and takes no arguments;
   page.setUserAgent() is what actually sets it. */
await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
);
await page.setViewport({
    width: 1524,
    height: 768
});
try {
    await page.goto("https://www.google.com/imgres?imgurl=https%3A%2F%2Fimg-s-msn-com.akamaized.net%2Ftenant%2Famp%2Fentityid%2FAACPW4S.img%3Fh%3D552%26w%3D750%26m%3D6%26q%3D60%26u%3Dt%26o%3Df%26l%3Df%26x%3D992%26y%3D672&imgrefurl=https%3A%2F%2Fwww.msn.com%2Fen-us%2Flifestyle%2Fpets-animals%2F49-adorable-puppy-pictures-that-will-make-you-melt%2Fss-AACSrEY&tbnid=Ad7wBCCmAXPRDM&vet=12ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw..i&docid=jawDJ74qdYREJM&w=750&h=500&q=puppies&ved=2ahUKEwig1NfB0Y7oAhXGHc0KHSzuCMUQMygeegQIARBw", {
        timeout: 0,
        // waitUntil: ["load"]
        // waitUntil: ["networkidle2"]
    });
    await page.waitForSelector('#irc_shc', {
        visible: true,
        timeout: 0
    });
} catch (e) {
    console.log("error: e = ", e);
}
It turns out this was just a temporary Google page error.

When web scraping/testing, how do I get past the notifications popup?

Goal: I'm trying to scrape pictures from Instagram, using Puppeteer to programmatically log in to my account and start mining data.
The issue: I can log in fine, but then I get hit with a popup asking if I want notifications (I turned headless off to see this in action). I'm following the example code for this found here: https://github.com/checkly/puppeteer-examples/blob/master/3.%20login/instagram.js which uses the try block below to find the notification popup and click the 'Not now' button.
// check if the app asks for notifications
try {
    await loginPage.waitForSelector(".aOOlW.HoLwm", {
        timeout: 5000
    });
    await loginPage.click(".aOOlW.HoLwm");
} catch (err) {
}
The problem is it doesn't actually click the 'Not now' button, so my script is stuck in limbo. The selector points to the right div, so what gives?
Can you please try enabling "notifications" using browserContext.overridePermissions?
You can override the notifications permission. This is the code that would, for example, disable the Allow Notifications popup when logging into Facebook:
let crawl = async function () {
    let browser = await puppeteer.launch({ headless: false });
    const context = browser.defaultBrowserContext();
    // URL plus an array of permissions (this returns a promise, so await it)
    await context.overridePermissions("https://www.facebook.com", ["geolocation", "notifications"]);
    let page = await browser.newPage();
    await page.goto("https://www.facebook.com");
    await page.type("#email", process.argv[2]);
    await page.type("#pass", process.argv[3]);
    await page.click("#u_0_2");
    await page.waitFor(1000);
    await page.waitForSelector("#pagelet_composer");
    let content2 = await page.$$("#pagelet_composer");
    console.log(content2); // .$$ returns an array of ElementHandles; .$ would return a single ElementHandle
}
crawl();
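Adapted to the question's Instagram case, the same idea would look roughly like this (the instagram.com origin is my assumption based on the question; the login flow stays as in the asker's script):

const browser = await puppeteer.launch({ headless: false });
const context = browser.defaultBrowserContext();
// Grant the notifications permission for instagram.com up front, so the
// "Turn on Notifications" dialog never appears and no button needs clicking.
await context.overridePermissions("https://www.instagram.com", ["notifications"]);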
