Incredibly weird puppeteer behaviour - node.js

I use this code:
// <-- add event on top of file
process.on("unhandledRejection", (reason, p) => {
console.error("Unhandled Rejection at: Promise", p, "reason:", reason);
// browser.close(); // <-- no need to close the browser here
});
const puppeteer = require('puppeteer');
async function getPic() {
try{ // <-- wrap the whole block in try catch
const browser = await puppeteer.launch(/*{headless: false}*/);
const page = await browser.newPage();
await page.setViewport({width: 1000, height: 500}); // <-- add await here so it sets viewport after it creates the page
//await page.goto('https://www.google.com'); //Old way of doing. It doesn't work for some reason...
page.goto('https://www.google.com/').catch(error => console.log("error on page.goto()", error));
// wait for either of events to trigger
await Promise.race([
page.waitForNavigation({waitUntil: 'domcontentloaded'}),
page.waitForNavigation({waitUntil: 'load'})
]);
await page.screenshot({path: 'pic.png'});
await browser.close(); // <-- close browser after everything is done
} catch (error) {
console.log(error);
}
}
getPic();
Then the program hangs. After 30 seconds, I get this error:
error on page.goto() Error: Navigation Timeout Exceeded: 30000ms exceeded at Promise.then (C:\...\pupet test\node_modules\puppeteer\lib\NavigatorWatcher.js:71:21) at <anonymous>
But I also get the picture I requested!
1. So how is it that page.goto() fails but I still get the picture, which means that page.goto() actually worked!?
2. What can I do to mitigate this weird error?

The program hangs because you called goto without await (or promise chaining) and then raced it against two waitForNavigation calls. This confuses the browser, because all three lines do mostly the same thing behind the scenes: they navigate and wait for the navigation to finish.
Use async/await with promises, and do not call async methods as if they were synchronous. In your example, goto should be called like this:
await page.goto('https://www.google.com');
If you want to wait until the page has loaded, goto covers that too. You don't need waitForNavigation after goto.
await page.goto('https://www.google.com', {waitUntil: 'load'});
The waitUntil option also accepts domcontentloaded, networkidle2, and networkidle0. The docs explain each of them in full.
The screenshot still works because goto runs asynchronously in the background while you await the navigation afterwards, so the page has loaded by the time the screenshot is taken.
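The timing behind this can be reproduced without a browser. The sketch below is only an illustration: fakeGoto and fakeWaitForNavigation are hypothetical stand-ins with made-up delays, not Puppeteer calls. It shows how a promise that is not awaited, but has its own .catch(), can fail after the rest of the function has already moved on.

```javascript
// Hypothetical stand-ins for page.goto() and page.waitForNavigation().
// They only mimic the timing; the delays are invented for the demo.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fakeGoto() {
  await delay(50); // pretend the navigation promise times out late
  throw new Error('Navigation Timeout Exceeded');
}

async function fakeWaitForNavigation() {
  await delay(10); // pretend the load event fires early
}

async function run() {
  const events = [];
  // Not awaited: the rejection is handled by .catch() on its own schedule.
  const pending = fakeGoto().catch((err) => {
    events.push('goto failed: ' + err.message);
  });
  await fakeWaitForNavigation(); // the function carries on from here
  events.push('screenshot taken');
  await pending; // only so the demo can return the complete log
  return events;
}
```

Calling run() resolves to ['screenshot taken', 'goto failed: Navigation Timeout Exceeded']: the screenshot step finishes first and the goto failure is reported afterwards, which matches the order seen in the question.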
Here is the simplified code, without the promise race.
try{ // <-- wrap the whole block in try catch
const browser = await puppeteer.launch(/*{headless: false}*/);
const page = await browser.newPage();
await page.setViewport({width: 1000, height: 500}); // <-- add await here so it sets viewport after it creates the page
await page.goto('https://www.google.com/', {waitUntil: 'load'});
await page.screenshot({path: 'pic.png'});
await browser.close(); // <-- close browser after everything is done
} catch (error) {
console.log(error);
}
Here is how it works perfectly on the sandbox.
The Puppeteer docs are a good place to start learning about this.

Related

Getting changes in object with puppeteer

I'm trying to learn how to track changes in a div. I found a post that showed the following code:
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.exposeFunction('onCustomEvent', text => console.log(text));
await page.goto('https://www.time.ir', {waitUntil: 'networkidle0'});
await page.evaluate(() => {
$('#digitalClock').bind("DOMSubtreeModified", function(e) {
window.onCustomEvent(e.currentTarget.textContent.trim());
});
});
})();
When running this it pulls the time from the webpage and every second console.logs the new time - exactly what I'm looking for. However I'm having issues with any other page for some reason. For example, the very similar code below gives me an error:
(node:1801) UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: $ is not defined
await page.exposeFunction('onCustomEvent', text => console.log(text));
await page.goto('https://www.clocktab.com', {waitUntil: 'networkidle0'});
await page.evaluate(() => {
$('#digit2').bind("DOMSubtreeModified", function(e) {
window.onCustomEvent(e.currentTarget.textContent.trim());
});
});
I'm not sure what the difference between them is, other than the page I navigate to and the element I watch for the changing value. Additionally, I read somewhere that DOMSubtreeModified is deprecated now, so if there's a better way to get what I'm looking for, that would be great!
Thanks in advance
The difference is that the second website does not load jQuery, so $ is not defined when the evaluation function runs in the page.
Replace it with vanilla JS:
document.querySelector('#digit2').addEventListener ("DOMSubtreeModified", function(e) {
window.onCustomEvent(e.currentTarget.textContent.trim());
})
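The question also asks for an alternative to the deprecated DOMSubtreeModified event. One option is the standard MutationObserver API. The following is a minimal sketch, not tested against the site in the question; watchText and the exposed name onTextChange are hypothetical names chosen for illustration, and page is assumed to be a Puppeteer Page that has already navigated.

```javascript
// Sketch: report text changes of one element back to Node via MutationObserver.
// `watchText` and 'onTextChange' are hypothetical names, not part of Puppeteer.
async function watchText(page, selector, onChange) {
  // Make the Node-side callback callable inside the page as window.onTextChange.
  await page.exposeFunction('onTextChange', onChange);

  // Everything inside this callback runs in the browser, not in Node.
  await page.evaluate((sel) => {
    const target = document.querySelector(sel);
    if (!target) {
      throw new Error('No element matches ' + sel);
    }
    const observer = new MutationObserver(() => {
      window.onTextChange(target.textContent.trim());
    });
    // Watch added/removed nodes and text edits anywhere under the target.
    observer.observe(target, {
      childList: true,
      characterData: true,
      subtree: true,
    });
  }, selector);
}
```

With the second page from the question, this could be called as watchText(page, '#digit2', text => console.log(text)) after page.goto has finished.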
Suggestion: when I debug a Puppeteer evaluation function, I copy and paste its body into the browser console on the page.

Trying to crawl a website using puppeteer but getting a timeout error

I'm trying to search the Kwik Trip website for daily deals using nodeJs but I keep getting a timeout error when I try to crawl it. Not quite sure what could be happening. Does anyone know what may be going on?
Below is my code, I'm trying to wait for .agendaItemWrap to load before it brings back all of the HTML because it's a SPA.
function getQuickStar(req, res){
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const navigationPromise = page.waitForNavigation({waitUntil: "domcontentloaded"});
await page.goto('https://www.kwiktrip.com/savings/daily-deals');
await navigationPromise;
await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
const body = await page.evaluate(() => {
return document.querySelector('body').innerHTML;
});
console.log(body);
await browser.close();
} catch (error) {
console.log(error);
}
})();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located in an iframe, not in the page's main frame.
You then need to wait for the iframe and call waitForSelector on that particular frame.
Quick tip: you don't need page.waitForNavigation with page.goto, because you can set the waitUntil condition in the goto options. By default it waits for the page's load event.
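As a rough sketch of the iframe approach described above: wait for the iframe element, get its frame, then wait for the selector inside that frame. The helper name getIframeHtml and the iframe selector passed to it are assumptions for illustration; the actual selector for the deals iframe has to be read from the page itself.

```javascript
// Sketch: wait for an element that lives inside an iframe and return that frame's HTML.
// `getIframeHtml` is a hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function getIframeHtml(page, iframeSelector, innerSelector) {
  // Wait for the <iframe> element itself to appear in the main frame.
  const iframeHandle = await page.waitForSelector(iframeSelector);

  // Resolve the element handle to the Frame it hosts.
  const frame = await iframeHandle.contentFrame();
  if (!frame) {
    throw new Error('Element ' + iframeSelector + ' has no content frame');
  }

  // Now wait for the target element inside that frame, not inside the page.
  await frame.waitForSelector(innerSelector, { timeout: 30000 });
  return frame.content();
}
```

For the question's page this might be called as getIframeHtml(page, 'iframe', '.agendaItemWrap'), where 'iframe' is a placeholder that only works if the deals iframe is the first one on the page.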

Puppeteer (Cluster) closing page when I interact with it

In a NodeJS v10.x.x environment, when trying to create a PDF page from some HTML code, I'm getting a closed page issue every time I try to do something with it (setCacheEnabled, setRequestInterception, etc...):
async (page, data) => {
try {
const {options, urlOrHtml} = data;
const finalOptions = { ...config.puppeteerOptions, ...options };
// Set caching flag (if provided)
const cache = finalOptions.cache;
if (cache != undefined) {
delete finalOptions.cache;
await page.setCacheEnabled(cache); //THIS LINE IS CAUSING THE PAGE TO BE CLOSED
}
// Setup timeout option (if provided)
let requestOptions = {};
const timeout = finalOptions.timeout;
if (timeout != undefined) {
delete finalOptions.timeout;
requestOptions.timeout = timeout;
}
requestOptions.waitUntil = 'networkidle0';
if (urlOrHtml.match(/^http/i)) {
await page.setRequestInterception(true); //THIS LINE IS CAUSING ERROR DUE TO THE PAGE BEING ALREADY CLOSED
page.once('request', request => {
if(finalOptions.method === "POST" && finalOptions.payload !== undefined) {
request.continue({method: 'POST', postData: JSON.stringify(finalOptions.payload)});
}
});
// Request is for a URL, so request it
await page.goto(urlOrHtml, requestOptions);
}
return await page.pdf(finalOptions);
} catch (err) {
logger.info(err);
}
};
I read somewhere that this issue can be caused by a missing await, but that doesn't look like my case.
I'm not using Puppeteer directly, but through this library, which creates a cluster on top of it and handles the processes:
https://github.com/thomasdondorf/puppeteer-cluster
You already gave the solution, but as this is a common problem with the library (I'm the author 🙂) I would like to provide some more insights.
How the task function works
When a job is queued and ready to be executed, puppeteer-cluster will create a page and call the task function (given to cluster.task) with the created page object and the queued data. The cluster then waits until the Promise is finished (fulfilled or rejected) and will close the page and execute the next job in the queue.
Because an async function implicitly returns a Promise, the page is closed as soon as the async function given to cluster.task finishes. There is no magic happening to determine whether the page might be used in the future.
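The behaviour described above can be imitated with a toy model. This is not puppeteer-cluster code: runJob, work, and the page object are hypothetical stand-ins that only reproduce the "await the task, then close the page" sequence.

```javascript
// Toy model of "await the task, then close the page". NOT puppeteer-cluster code.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical runner: creates a "page", awaits the task, then closes the page.
async function runJob(task) {
  const page = { closed: false };
  await task(page); // the task's promise is the only signal the runner has
  page.closed = true;
}

// Hypothetical work that needs the page to stay open for a while.
async function work(page, log) {
  await delay(20);
  log.push(page.closed ? 'page already closed' : 'page still open');
}

async function demo() {
  const log = [];

  // Promise neither awaited nor returned: the task settles at once,
  // so the runner closes the page before the work has finished.
  let floating;
  await runJob(async (page) => { floating = work(page, log); });
  await floating;

  // Promise awaited: the runner waits for the work before closing the page.
  await runJob(async (page) => { await work(page, log); });

  return log;
}
```

demo() resolves to ['page already closed', 'page still open'], which is the difference between a task that drops its promise and one that awaits it.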
Waiting for asynchronous events
Below is a code sample with a common mistake. The user might want to wait for an external event before closing the page as in the (not working) example below:
Non-working (!) code sample:
await cluster.task(async ({ page, data }) => {
await page.goto('...');
setTimeout(async () => { // user is waiting for an asynchronous event
await page.evaluate(/* ... */); // Will throw an error as the page is already closed
}, 1000);
});
In this code, the page is already closed before the asynchronous function is executed. The correct way to do this is to return a Promise instead.
Working code sample:
await cluster.task(async ({ page, data }) => {
await page.goto('...');
// will wait until the Promise resolves
await new Promise((resolve, reject) => {
setTimeout(async () => { // user is waiting for an asynchronous event
try {
await page.evaluate(/* ... */);
resolve();
} catch (err) {
reject(err); // handle the error; rejecting also lets the task finish instead of hanging
}
}, 1000);
});
});
In this code sample, the task function waits for the inner promise to resolve before it finishes itself. This keeps the page open until the asynchronous function calls resolve. In addition, the code uses a try..catch block, as the library is not able to catch errors thrown inside asynchronous code blocks.
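When the asynchronous event is just a delay, a simpler variant is to wrap setTimeout in a promise and await it, so the whole flow stays inside the task's own promise. This is a sketch under assumptions: delayedTitleTask and its waitMs parameter are hypothetical, and the queued data is assumed to be a URL string.

```javascript
// Promise-based delay, so the wait can be awaited inside the task.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical task function; `waitMs` exists only to make the delay adjustable.
async function delayedTitleTask({ page, data }, waitMs = 1000) {
  await page.goto(data); // assumes the queued data is a URL string
  await delay(waitMs);   // the task promise stays pending during the wait
  // The page is still open here because the task has not settled yet.
  return page.evaluate(() => document.title);
}
```

It would be registered with await cluster.task(delayedTitleTask). Any error thrown by goto or evaluate rejects the task promise directly, so no inner try..catch is needed for the library to see it.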
I got it.
I was indeed forgetting an await on the call to the function I posted.
That call was in another file that I use for the cluster instance creation:
async function createCluster() {
//We will protect our app with a Cluster that handles all the processes running in our headless browser
const cluster = await Cluster.launch({
concurrency: Cluster[config.cluster.concurrencyModel],
maxConcurrency: config.cluster.maxConcurrency
});
// Event handler to be called in case of problems
cluster.on('taskerror', (err, data) => {
console.log(`Error on cluster task... ${data}: ${err.message}`);
});
// Incoming task for the cluster to handle
await cluster.task(async ({ page, data }) => {
main.postController(page, data); // <-- I WAS MISSING A return await HERE
});
return cluster;
}

Puppeteer close javascript alert box

I'm trying to click on a page button on this website but when I enter the site an alert box shows up and I don't know how to close it.
I just started experimenting with Puppeteer; this is the simple code I'm using right now:
const ptr = require('puppeteer');
ptr.launch().then(async browser => {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 800 });
await page.goto('https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/');
//This is the alert button selector
await page.click("#BoxAlertBtnOk");
//This is the button on the page i want to click on
await page.click("input[value='Perito RE / G.F.']");
await page.screenshot({
path: 'screenshot.png',
fullPage: true
});
await browser.close();
});
This is the error I get: UnhandledPromiseRejectionWarning: Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint
Any help would be really appreciated, thanks!
There are a few things going on on that page:
The alert box only appears after the page has loaded (the body tag has an onload attribute), so you should wait until the network is idle.
Clicking those "Perito" buttons opens a new window/tab because of the window.open() call in the onclick handler.
The new tab redirects multiple times and shows a login page if the user is not already logged in.
Solution:
1. Make sure to load the page properly.
Just add { waitUntil: "networkidle0" } to .goto or .waitForNavigation.
await page.goto(
"https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
{ waitUntil: "networkidle0" }
// <-- Make sure the whole page is completely loaded
);
2. Wait for the element before clicking
As other answers already suggest, wait for the element using waitFor.
// wait and click the alert button
await page.waitFor("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
3. Optionally, wait a few seconds before taking the screenshot after clicking the button.
// optional, add few seconds before taking this screenshot
// just to make sure it works even on slow machine
await page.waitFor(2000);
await page.screenshot({
path: "screenshot_before.png",
fullPage: true
});
4. Use page.evaluate and document.querySelector to get the element
page.click will not handle every kind of click. Sometimes different events are bound to an element and you have to treat it separately.
// we can click using querySelector and the native click() method,
// because page.click alone does not trigger the onclick handler on this page
await page.evaluate(() =>
document.querySelector("input[value='Perito RE / G.F.']").click()
);
5. Treat the new tab separately
Together with browser.once('targetcreated'), new Promise, and browser.pages() you can catch the newly created tab and work on it.
Note: Read final code at end of the answer before using this.
// this is the final page after clicking the input on previous page
// https://italy.grupporealemutua.it/FIM/sps/IDPRMA/saml20/login
function newTabCatcher(browser) {
// we resolve this promise after doing everything we need to do on this page
// or in error
return new Promise((resolve, reject) => {
// set the listener before clicking the button to have proper interaction
// we listen for only one new tab
browser.once("targetcreated", async function() {
console.log("New Tab Created");
try {
// get the newly created window
const tabs = await browser.pages();
const lastTab = tabs[tabs.length - 1];
// Wait for navigation to finish as well as specific login form
await Promise.all([
lastTab.waitForNavigation({ waitUntil: "networkidle0" }),
lastTab.waitFor("#div_login")
]);
// browser will switch to this tab just when it takes the screenshot
await lastTab.screenshot({
path: "screenshot_newtab.png",
fullPage: true
});
resolve(true);
} catch (error) {
reject(error);
}
});
});
}
Final Code:
Just for clarity, here is how I used all code snippets specified above.
const ptr = require("puppeteer");
ptr.launch({ headless: false }).then(async browser => {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 800 });
await page.goto(
"https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
{ waitUntil: "networkidle0" }
// <-- Make sure the whole page is completely loaded
);
// wait and click the alert button
await page.waitFor("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
// optional, add few seconds before taking this screenshot
// just to make sure it works even on slow machine
await page.waitFor(2000);
await page.screenshot({
path: "screenshot_before.png",
fullPage: true
});
// we can click using querySelector and the native click() method,
// because page.click alone does not trigger the onclick handler on this page
await page.evaluate(() =>
document.querySelector("input[value='Perito RE / G.F.']").click()
);
// here we go and process the new tab
// aka get screenshot, fill form etc
await newTabCatcher(browser);
// rest of your code
// ...
await browser.close();
});
Result:
It worked flawlessly!
Note:
Notice how I used new Promise and async/await together. This might not be the best practice, but now you have a lead on what to look for when creating a scraper for some old websites.
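As an alternative to combining new Promise with the targetcreated event, the new tab could also be awaited with browser.waitForTarget. This is a hedged sketch rather than the answer's method: openInNewTab is a hypothetical helper, and it assumes the popup's opener() is the clicking page's target, which holds for a plain window.open() but not for links opened with noopener.

```javascript
// Sketch: click something that opens a new tab and return the new tab's Page.
// `openInNewTab` is a hypothetical helper, not a Puppeteer API.
async function openInNewTab(browser, page, selector) {
  const openerTarget = page.target();

  // Start waiting BEFORE clicking so the new target cannot be missed.
  const targetPromise = browser.waitForTarget(
    (target) => target.opener() === openerTarget
  );

  // Same native click as in step 4, executed inside the page.
  await page.evaluate((sel) => document.querySelector(sel).click(), selector);

  const newTarget = await targetPromise;
  return newTarget.page(); // resolves to the Page object of the new tab
}
```

The returned page can then be used like lastTab in newTabCatcher, for example to wait for #div_login and take the screenshot.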
If it's relevant to anyone else who is facing dialog boxes, the following code solved it for me:
this.page.on('dialog', async dialog => {
await dialog.dismiss();
});
Your button, #BoxAlertBtnOk, appears on the page only after a moment, so when you call await page.click("#BoxAlertBtnOk"); it is not yet visible. Wait until it is visible, then act on it:
await page.waitForSelector("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
await page.waitForSelector("input[value='Perito RE / G.F.']");
await page.click("input[value='Perito RE / G.F.']");

Catch multiple xhr responses on multiple page.click (without page reload or change)

page.on('response', response => {
// allow only XHR
if ('xhr' !== response.request().resourceType()){
return ;
}
console.log(response.url());
});
await page.click('span#first');
await page.click('span#second');
await page.click('span#third');
In the example above, only the ajax requests triggered by span#first are caught by page.on. The ones for span#second and span#third are not. Using page.waitForNavigation seems to have no effect.
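One thing that may be worth trying here is to pair each click with the response it triggers by using page.waitForResponse, so the next click only happens once the previous XHR has come back. This is an untested sketch, not a confirmed fix for the problem above: clickAndCaptureXhr is a hypothetical helper, and it assumes every click fires at least one XHR request.

```javascript
// Sketch: click an element and wait for the first XHR response that follows.
// `clickAndCaptureXhr` is a hypothetical helper; `page` is a Puppeteer Page.
async function clickAndCaptureXhr(page, selector, timeout = 30000) {
  const [response] = await Promise.all([
    // Register the wait first so a fast response cannot slip past it.
    page.waitForResponse(
      (res) => res.request().resourceType() === 'xhr',
      { timeout }
    ),
    page.click(selector),
  ]);
  return response.url();
}
```

The three clicks would then run in sequence, for example with a for...of loop over ['span#first', 'span#second', 'span#third'] that awaits clickAndCaptureXhr(page, selector) on each iteration.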
