Node.JS + Puppeteer: browser.close freezes process - node.js

I have a really simple function I'm running.
module.exports.test = async () => {
const browser = await puppeteer.launch(puppetOptions);
try {
const page = await browser.newPage();
await page.goto('https://google.com');
// helper function that pauses for five seconds before moving on
await pause(5000);
await browser.close();
console.log('browser closed');
} catch (err) {
console.log(err);
await browser.close();
}
}
I run it from my index.js file:
const server = app.listen(process.env.PORT || 5000, () => {
test();
});
Now I run it in my terminal with node index.js. It opens a browser. It opens a new page and navigates to Google. It waits five seconds. It closes the browser. Everything appears great. I hit ctrl + c in my terminal to stop the process, but nothing happens. Typically this works. If I remove the browser.close function, ctrl + c goes back to working as expected, and ends the process. This function I'm running is the result of me breaking down a more complex function that appears to have a memory leak, so it really seems that browser.close is the culprit. For the life of me though, I can't figure out why it would be causing an issue when simplified this much. This is happening in both headless, and headfull modes. Here are the puppeteer launch options:
puppetOptions = {
defaultViewport: null,
args: [
"--incognito",
"--no-sandbox",
"--single-process",
"--no-zygote"
],
}
puppetOptionsHeadfull = {
headless: false,
executablePath: 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe',
}
EDIT: I tried this in my bash terminal as well and have the same issue. When I try to manually close the terminal to abort it, an error pops up.
Processes are running in session:
| WPID PID COMMAND
| 10900 1122 winpty node.exe index.js
Close anyway?
EDIT 2: Narrowed this down to it being an issue with puppeteer-extra most likely, however, it seems like a bug in the core package. Fairly recent open issue on their repo that reflects this bug found here: https://github.com/berstend/puppeteer-extra/issues/421
I'll leave this question open just in case anyone else stumbles on it with the same issue, they don't pull their hair out debugging it.

I have been having this issue as well. It seems for me that the two arguments '--no-sandbox', and '--disable-setuid-sandbox' are culprits, although it's not intuitive why that would be the case. Try removing the no sandbox argument

Related

Puppeteer pdf image not rendering correctly [duplicate]

I am working on creating PDF from web page.
The application on which I am working is single page application.
I tried many options and suggestion on https://github.com/GoogleChrome/puppeteer/issues/1412
But it is not working
const browser = await puppeteer.launch({
executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
ignoreHTTPSErrors: true,
headless: true,
devtools: false,
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
await page.goto(fullUrl, {
waitUntil: 'networkidle2'
});
await page.type('#username', 'scott');
await page.type('#password', 'tiger');
await page.click('#Login_Button');
await page.waitFor(2000);
await page.pdf({
path: outputFileName,
displayHeaderFooter: true,
headerTemplate: '',
footerTemplate: '',
printBackground: true,
format: 'A4'
});
What I want is to generate PDF report as soon as Page is loaded completely.
I don't want to write any type of delays i.e. await page.waitFor(2000);
I can not do waitForSelector because the page has charts and graphs which are rendered after calculations.
Help will be appreciated.
You can use page.waitForNavigation() to wait for the new page to load completely before generating a PDF:
await page.goto(fullUrl, {
waitUntil: 'networkidle0',
});
await page.type('#username', 'scott');
await page.type('#password', 'tiger');
await page.click('#Login_Button');
await page.waitForNavigation({
waitUntil: 'networkidle0',
});
await page.pdf({
path: outputFileName,
displayHeaderFooter: true,
headerTemplate: '',
footerTemplate: '',
printBackground: true,
format: 'A4',
});
If there is a certain element that is generated dynamically that you would like included in your PDF, consider using page.waitForSelector() to ensure that the content is visible:
await page.waitForSelector('#example', {
visible: true,
});
Sometimes the networkidle events do not always give an indication that the page has completely loaded. There could still be a few JS scripts modifying the content on the page. So watching for the completion of HTML source code modifications by the browser seems to be yielding better results. Here's a function you could use -
const waitTillHTMLRendered = async (page, timeout = 30000) => {
const checkDurationMsecs = 1000;
const maxChecks = timeout / checkDurationMsecs;
let lastHTMLSize = 0;
let checkCounts = 1;
let countStableSizeIterations = 0;
const minStableSizeIterations = 3;
while(checkCounts++ <= maxChecks){
let html = await page.content();
let currentHTMLSize = html.length;
let bodyHTMLSize = await page.evaluate(() => document.body.innerHTML.length);
console.log('last: ', lastHTMLSize, ' <> curr: ', currentHTMLSize, " body html size: ", bodyHTMLSize);
if(lastHTMLSize != 0 && currentHTMLSize == lastHTMLSize)
countStableSizeIterations++;
else
countStableSizeIterations = 0; //reset the counter
if(countStableSizeIterations >= minStableSizeIterations) {
console.log("Page rendered fully..");
break;
}
lastHTMLSize = currentHTMLSize;
await page.waitForTimeout(checkDurationMsecs);
}
};
You could use this after the page load / click function call and before you process the page content. e.g.
await page.goto(url, {'timeout': 10000, 'waitUntil':'load'});
await waitTillHTMLRendered(page)
const data = await page.content()
In some cases, the best solution for me was:
await page.goto(url, { waitUntil: 'domcontentloaded' });
Some other options you could try are:
await page.goto(url, { waitUntil: 'load' });
await page.goto(url, { waitUntil: 'domcontentloaded' });
await page.goto(url, { waitUntil: 'networkidle0' });
await page.goto(url, { waitUntil: 'networkidle2' });
You can check this at puppeteer documentation:
https://pptr.dev/#?product=Puppeteer&version=v11.0.0&show=api-pagewaitfornavigationoptions
I always like to wait for selectors, as many of them are a great indicator that the page has fully loaded:
await page.waitForSelector('#blue-button');
In the latest Puppeteer version, networkidle2 worked for me:
await page.goto(url, { waitUntil: 'networkidle2' });
Wrap the page.click and page.waitForNavigation in a Promise.all
await Promise.all([
page.click('#submit_button'),
page.waitForNavigation({ waitUntil: 'networkidle0' })
]);
I encountered the same issue with networkidle when I was working on an offscreen renderer. I needed a WebGL-based engine to finish rendering and only then make a screenshot. What worked for me was a page.waitForFunction() method. In my case the usage was as follows:
await page.goto(url);
await page.waitForFunction("renderingCompleted === true")
const imageBuffer = await page.screenshot({});
In the rendering code, I was simply setting the renderingCompleted variable to true, when done. If you don't have access to the page code you can use some other existing identifier.
You can also use to ensure all elements have rendered
await page.waitFor('*')
Reference: https://github.com/puppeteer/puppeteer/issues/1875
As for December 2020, waitFor function is deprecated, as the warning inside the code tell:
waitFor is deprecated and will be removed in a future release. See
https://github.com/puppeteer/puppeteer/issues/6214 for details and how
to migrate your code.
You can use:
sleep(millisecondsCount) {
if (!millisecondsCount) {
return;
}
return new Promise(resolve => setTimeout(resolve, millisecondsCount)).catch();
}
And use it:
(async () => {
await sleep(1000);
})();
Keeping in mind the caveat that there's no silver bullet to handle all page loads, one strategy is to monitor the DOM until it's been stable (i.e. has not seen a mutation) for more than n milliseconds. This is similar to the network idle solution but geared towards the DOM rather than requests and therefore covers a different subset of loading behaviors.
Generally, this code would follow a page.waitForNavigation({waitUntil: "domcontentloaded"}) or page.goto(url, {waitUntil: "domcontentloaded"}), but you could also wait for it alongside, say, waitForNetworkIdle() using Promise.all() or Promise.race().
Here's a simple example:
const puppeteer = require("puppeteer"); // ^14.3.0
const waitForDOMStable = (
page,
options={timeout: 30000, idleTime: 2000}
) =>
page.evaluate(({timeout, idleTime}) =>
new Promise((resolve, reject) => {
setTimeout(() => {
observer.disconnect();
const msg = `timeout of ${timeout} ms ` +
"exceeded waiting for DOM to stabilize";
reject(Error(msg));
}, timeout);
const observer = new MutationObserver(() => {
clearTimeout(timeoutId);
timeoutId = setTimeout(finish, idleTime);
});
const config = {
attributes: true,
childList: true,
subtree: true
};
observer.observe(document.body, config);
const finish = () => {
observer.disconnect();
resolve();
};
let timeoutId = setTimeout(finish, idleTime);
}),
options
)
;
const html = `<!DOCTYPE html><html lang="en"><head>
<title>test</title></head><body><h1></h1><script>
(async () => {
for (let i = 0; i < 10; i++) {
document.querySelector("h1").textContent += i + " ";
await new Promise(r => setTimeout(r, 1000));
}
})();
</script></body></html>`;
let browser;
(async () => {
browser = await puppeteer.launch({headless: true});
const [page] = await browser.pages();
await page.setContent(html);
await waitForDOMStable(page);
console.log(await page.$eval("h1", el => el.textContent));
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
For pages that continually mutate the DOM more often than the idle value, the timeout will eventually trigger and reject the promise, following the typical Puppeteer fallback. You can set a more aggressive overall timeout to fit your needs or tailor the logic to ignore (or only monitor) a particular subtree.
Answers so far haven't mentioned a critical fact: it's impossible to write a one-size-fits-all waitUntilPageLoaded function that works on every page. If it were possble, Puppeteer would surely provide it.
Such a function can't rely on a timeout, because there's always some page that takes longer to load than that timeout. As you extend the timeout to reduce the failure rate, you introduce unnecessary delays when working with fast pages. Timeouts are generally a poor solution, opting out of Puppeteer's event-driven model.
Waiting for idle network requests might not always work if the responses involve long-running DOM updates that take longer than 500ms to trigger a render.
Waiting for the DOM to stop changing might miss slow network requests, long-delayed JS triggers, or ongoing DOM manipulation that might cause the listener never to settle, unless specially handled.
And, of course, there's user interaction: captchas, prompts and cookie/subscription modals that need to be clicked through and dismissed before the page is in a sensible state for a full-page screenshot (for example).
Since every page has different, arbitrary JS behavior, the typical approach is to write event-driven logic that works for a specific page. Making precise, directed assumptions is much better than cobbling together a boatload of hacks that tries to solve every edge case.
If your use case is to write a load event that works on every page, my suggestion is to use some combination of the tools described here that is most balanced to meet your needs (speed vs. accuracy, development time/code complexitiy vs accuracy, etc). Use fail-safes for everything rather than blindly assuming all pages will cooperate with your assumptions. Think hard about what extent you really need to try to handle every web page. Prepare to compromise and accept some degree of failures you can live with.
Here's a quick rundown of the strategies you can mix and match to wait for loads to fit your needs:
page.goto() and page.waitForNavigation() default to the load event, which "is fired when the whole page has loaded, including all dependent resources such as stylesheets and images" (MDN), but this is often too pessimistic; there's no need to wait for a ton of data you don't care about. Often the data is available without waiting for all external resources, so domcontentloaded should be faster. See my post Avoiding Puppeteer Antipatterns for further discussion.
On the other hand, if there are JS-triggered networks requests after load, you'll miss that data. Hence networkidle2 and networkidle0, which wait 500 ms after the number of active network requests are 2 or 0. The motivation for the 2 version is that some sites keep ongoing requests open, which would cause networkidle0 to time out.
If you're waitng for a specific network response that might have a payload (or, for the general case, implementing your own network idle monitor), use page.waitForResponse(). page.waitForRequest(), page.waitForNetworkIdle() and page.on("request", ...) are also useful here.
If you're waiting for a particular selector to be visible, use page.waitForSelector(). If you're waiting for a load on a specific page, identify a selector that indicates the state you want to wait for. Generally speaking, for scripts specific to one page, this is the main tool to wait for the state you want, whether you're extracting data or clicking something. Frames and shadow roots thwart this function.
page.waitForFunction() lets you wait for an arbitrary predicate, for example, checking that the page's HTML or a specific list is a certain length. It's also useful for quickly dipping into frames and shadow roots to wait for predicates that depend on nested state. This function is also handy for detecting DOM mutations.
The most general tool is page.evaluate(), which plugs code into the browser. You can put just about any conditions you want here; most other Puppeteer functions are convenience wrappers for common cases you could implement by hand with evaluate.
I can't leave comments, but I made a python version of Anand's answer for anyone who finds it useful (i.e. if they use pyppeteer).
async def waitTillHTMLRendered(page: Page, timeout: int = 30000):
check_duration_m_secs = 1000
max_checks = timeout / check_duration_m_secs
last_HTML_size = 0
check_counts = 1
count_stable_size_iterations = 0
min_stabe_size_iterations = 3
while check_counts <= max_checks:
check_counts += 1
html = await page.content()
currentHTMLSize = len(html);
if(last_HTML_size != 0 and currentHTMLSize == last_HTML_size):
count_stable_size_iterations += 1
else:
count_stable_size_iterations = 0 # reset the counter
if(count_stable_size_iterations >= min_stabe_size_iterations):
break
last_HTML_size = currentHTMLSize
await page.waitFor(check_duration_m_secs)
For me the { waitUntil: 'domcontentloaded' } is always my go to.
I found that networkidle doesnt work well...

How to get puppeteer to simply load a web page?

I can't get puppeteer to do anything. I'm simply trying to get it to show google.com and I can't even get it to do that. Here's my code:
console.log('Loading puppeteer...');
const puppeteer = require('puppeteer');
async function test() {
console.log('Launching browser...');
const browser = await puppeteer.launch({headless: false});
console.log('Creating new page...');
const page = await browser.newPage();
console.log('Requesting url...');
await page.goto('https://www.google.com');
console.log('Closing browser...');
await browser.close();
}
test().catch(e=>{console.log(e)});
Chromium crashes every single time I try do do anything...
Then I get a timeout error:
Loading puppeteer...
Launching browser...
TimeoutError: waiting for target failed: timeout 30000ms exceeded
...
...
I've been searching for a solution for literally weeks. Does this thing just not work anymore?
After looking at this thread, which identifies this as a well-known issue with Puppeteer, here is some more information on Puppeteer timeout problems.
Puppeteer.launch() has two parts that can cause timeout problems. One is goto timing out, and the other is waitfor timing out. Since I don't know what could be causing your specific issue, I'll give you potential solutions for both.
Possible issue #1: Goto is timing out.
I'll directly quote the person who posted this solution, rudiedirkx:
In my case the goto timeout happens because of a forever-loading blocking resource (js or css). That'll never trigger the page's load or domcontentloaded. A bug in Puppeteer IMO, but whatever.
My fix (FINALLY!) is to do what Lighthouse does in its Driver: a Promise.race() for a custom 'timeout'-ish. The shorter version I used:
const LOAD_FAIL = Math.random();
const sleep = options => new Promise(resolve => {
options.timer = setTimeout(resolve, options.ms, options.result === undefined ? true : options.result);
});
const sleepOptions = {ms: TIMEOUT - 1000, result: LOAD_FAIL};
const response = await Promise.race([
sleep(sleepOptions),
page.goto(url, {timeout: TIMEOUT + 1000}),
]);
clearTimeout(sleepOptions.timer);
const success = response !== LOAD_FAIL;
Possible issue #2: Waitfor is timing out.
Alternatively you can try the solution to a waitfor timeout given by dealeros, adding --enable-blink-features=HTMLImports in args:
browser = await puppeteer.launch({
//headless: false,
'args': [
'--enable-blink-features=HTMLImports'
]
});
If neither of those worked
If neither of these solutions work, I recommend browsing that thread to find more solutions people have suggested and see if you can narrow down the problem. Use this code to generate some console logs and see if you can find what's going wrong:
page
.on('console', message =>
console.log(`${message.type().substr(0, 3).toUpperCase()} ${message.text()}`))
.on('pageerror', ({ message }) => console.log(message))
.on('response', response =>
console.log(`${response.status()} ${response.url()}`))
.on('requestfailed', request =>
console.log(`${request.failure().errorText} ${request.url()}`));
These options both resolved the issue for me:
Kill all Chromium processes
pkill -o chromium
Reinstall node packages (if step 1 doesn't help)
rm -rf node_modules
npm install

Close Browser after Navigation Timeout

I have this code below made with nodejs + puppeteer, whose goal is to take a screenshot of the user's site:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://MY_WEBSITE/try/slowURL',{timeout: 30000, waitUntil: 'networkidle0' });//timeout 30 seconds
await page.setViewport({width: 1920, height: 1080});
await page.screenshot({path: pathUpload});
await browser.close();
Its operation is quite simple, but to test the timeout I created a page (http://MY_WEBSITE/try/slowURL) that takes 200 seconds to load.
According to the puppeteer timeout (timeout: 30000), there is a 100% chance of a Navigation Timeout Exceeded: 30000ms exceeded error happening, especially because I'm forcing it.
THE PROBLEM
Through the htop command (used in linux), even after the system crashes and shows "TimeoutError", I can see that the browser has not been closed.
And if the browser is not closed, as scans were done, there is a good chance that the server will run out of memory, and I don't want that.
How can I solve this problem?
You want to wrap your code into a try..catch..finally statement to handle the error and close the browser.
Code Sample
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(/* ... */);
// more code which might throw...
} catch (err) {
console.error('error', err.message);
} finally {
await browser.close();
}
Your main code is executed inside a try block. The catch block shows any kind of error that might happened. The finally part is the part of your script that is always executed, not only when an error is thrown. That way, independent of whether an error happened or not, your script will call the browser.close function.

Is there a way to override "tab closing" in puppeteer cluster?

Puppeteer cluster closing tabs before I can take screenshot.
I am using puppeteer cluster with maxConcurrency 8. I need to take a screenshot after each page loads[Approx. 20000 urls]. Page.screenshot is not useful for me. My screenshot should include URL bar and desktop. Its basically like a full desktop screenshot. So I am using ImageMagick for taking a screenshot, (and xvfb for multiple screen management)
The problem is:
sometimes, screenshot is taken before switching to the right tab.
blank screenshot, coz current tab is closed, and tab which is not yet loaded came to front.
sometimes, error is thrown as screenshot couldnt be taken, because all the tabs were closed.
What I am doing is: when each page loads, I call page.bringToFront and spawn a child_process, which takes screenshot of the desktop using image magic import command.
cluster.queue(postUrl.href); //for adding urls to queue
await page.waitForNavigation(); // Wait for page to load before screenshot
//taking screenshot
const { spawnSync} = require('child_process');
const child = spawnSync('import', [ '-window', 'root', path]);
Dont want to setup waittime after page load, nodejs ImageMagick didnt work, and promise also didnt seem to work.
I do not want the puppeteer to close tab on its own. Instead, can it give callback event once page is loaded, wait for the callback function to be executed and returned and then the tab is closed??
As soon as the Promise of the cluster.task function is resolved, the page will be closed:
await cluster.task(async ({ page, data }) => {
// when this function is done, the page will be closed
});
To keep the page open you can await another Promise at the end before closing:
await cluster.task(async ({ page, data }) => {
// ...
await new Promise(resolve => {
// more code...
// call resolve() when you are done
});
});
Calling the resolve() function at the end will resolve the last Promise and therefore also resolve the whole async function. Therefore, it will close the page. Keep in mind that you want to increase the timeout value to something greater than 30 (default) if necessary when launching the cluster:
const cluster = await Cluster.launch({
// ...
timeout: 120000 // 2 minutes
});

Process blocked at console.log when combining readline and puppeteer

The following code will be blocked forever before console.log("this line.....");.
const puppeteer = require('puppeteer');
const readline = require("readline");
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout
});
async function main() {
browser = await puppeteer.launch();
rl.close();
await browser.close();
console.log("this line will not be executed.");
}
main();
Moving rl.close() below of console.log solves this problem, removing browser = ..... and await browser.close() did the same.
Is this a bug of puppeteer? Or does there are some mechanism I don't understand?
Puppeteer version: 1.11.0
Node.js version: 10.14.2
OS: Windows 10 1803
It seems this is worth to be reported as an issue to the puppeteer GitHub repository. Something really weird happens to stdin and event loop after this combination (Chrome does exits, but the Node.js remains, and after the Ctrl+C abort the prompt appears twice in the Windows shell as if ENTER was buffered till the exit).
FWIW, this issue disappears if terminal option of readline.createInterface() is set to false.
It seems like you do not completely understand how ASYNC/AWAIT works in js.
If you use await inside an async function it will pause the async function and wait for the Promise to resolve prior to moving on.
A code inside your async function will be processed consistently as if it were synchronous, but without blocking the main thread.
async function main() {
browser = await puppeteer.launch(); // will be executed first
rl.close();// will be executed second (wait untill everithing above is finished)
await browser.close(); // will be executed third (wait untill everithing above is finished)
console.log("this line will not be executed."); // will be executed forth (wait untill everithing above is finished)
}

Resources