Is there a way to override "tab closing" in puppeteer cluster? - node.js

Puppeteer cluster is closing tabs before I can take a screenshot.
I am using puppeteer-cluster with maxConcurrency 8. I need to take a screenshot after each page loads (approx. 20,000 URLs). page.screenshot is not useful for me: my screenshot should include the URL bar and the desktop, so it is basically a full desktop screenshot. I am therefore using ImageMagick to take the screenshot (and xvfb for multiple screen management).
The problems are:
Sometimes the screenshot is taken before switching to the right tab.
Sometimes the screenshot is blank, because the current tab was closed and a tab that had not loaded yet came to the front.
Sometimes an error is thrown because the screenshot couldn't be taken, as all the tabs were closed.
What I am doing is: when each page loads, I call page.bringToFront and spawn a child_process, which takes a screenshot of the desktop using the ImageMagick import command.
cluster.queue(postUrl.href); //for adding urls to queue
await page.waitForNavigation(); // Wait for page to load before screenshot
//taking screenshot
const { spawnSync} = require('child_process');
const child = spawnSync('import', [ '-window', 'root', path]);
I don't want to set up a wait time after page load. The Node.js ImageMagick package didn't work, and wrapping the call in a promise didn't seem to work either.
I do not want Puppeteer to close the tab on its own. Instead, can it fire a callback once the page is loaded, wait for the callback function to run and return, and only then close the tab?

As soon as the Promise of the cluster.task function is resolved, the page will be closed:
await cluster.task(async ({ page, data }) => {
// when this function is done, the page will be closed
});
To keep the page open you can await another Promise at the end before closing:
await cluster.task(async ({ page, data }) => {
// ...
await new Promise(resolve => {
// more code...
// call resolve() when you are done
});
});
Calling the resolve() function at the end will resolve the last Promise and therefore also the whole async function, which closes the page. Keep in mind that you may need to increase the timeout value to something greater than the default of 30 seconds (30000 ms) when launching the cluster:
const cluster = await Cluster.launch({
// ...
timeout: 120000 // 2 minutes
});
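One way to apply this to the desktop-screenshot case is to wrap the external command in a Promise and await it inside the task, so the tab stays open until the command has exited. This is a minimal sketch and not part of the original answer: runCommand is a hypothetical helper name, and the commented usage assumes that ImageMagick's import command is installed and that the output path is a placeholder.

```javascript
const { spawn } = require('child_process');

// Run a command and resolve with its exit code once the process has exited.
function runCommand(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    child.on('error', reject); // e.g. the binary is not installed
    child.on('close', code => resolve(code));
  });
}

// Hypothetical use inside the task function (not executed here):
// await cluster.task(async ({ page, data: url }) => {
//   await page.goto(url);
//   await page.bringToFront();
//   await runCommand('import', ['-window', 'root', '/tmp/shot.png']);
//   // the page is closed only after this point
// });
```

This only guarantees that the tab is not closed before the command finishes. With maxConcurrency above 1, tabs that share one display may still take the foreground from each other.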

Related

How to get puppeteer to simply load a web page?

I can't get puppeteer to do anything. I'm simply trying to get it to show google.com and I can't even get it to do that. Here's my code:
console.log('Loading puppeteer...');
const puppeteer = require('puppeteer');
async function test() {
console.log('Launching browser...');
const browser = await puppeteer.launch({headless: false});
console.log('Creating new page...');
const page = await browser.newPage();
console.log('Requesting url...');
await page.goto('https://www.google.com');
console.log('Closing browser...');
await browser.close();
}
test().catch(e=>{console.log(e)});
Chromium crashes every single time I try to do anything...
Then I get a timeout error:
Loading puppeteer...
Launching browser...
TimeoutError: waiting for target failed: timeout 30000ms exceeded
...
...
I've been searching for a solution for literally weeks. Does this thing just not work anymore?
After looking at this thread, which identifies this as a well-known issue with Puppeteer, here is some more information on Puppeteer timeout problems.
There are two places where a Puppeteer script commonly runs into timeout problems. One is goto timing out, and the other is waitFor timing out. Since I don't know what could be causing your specific issue, I'll give you potential solutions for both.
Possible issue #1: Goto is timing out.
I'll directly quote the person who posted this solution, rudiedirkx:
In my case the goto timeout happens because of a forever-loading blocking resource (js or css). That'll never trigger the page's load or domcontentloaded. A bug in Puppeteer IMO, but whatever.
My fix (FINALLY!) is to do what Lighthouse does in its Driver: a Promise.race() for a custom 'timeout'-ish. The shorter version I used:
const LOAD_FAIL = Math.random();
const sleep = options => new Promise(resolve => {
options.timer = setTimeout(resolve, options.ms, options.result === undefined ? true : options.result);
});
const sleepOptions = {ms: TIMEOUT - 1000, result: LOAD_FAIL};
const response = await Promise.race([
sleep(sleepOptions),
page.goto(url, {timeout: TIMEOUT + 1000}),
]);
clearTimeout(sleepOptions.timer);
const success = response !== LOAD_FAIL;
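The same idea can be packaged as a small reusable helper. The following is a sketch rather than the code from the quoted post: raceWithTimeout and the LOAD_FAIL sentinel in the usage comment are illustrative names, and the 29000 ms value is an arbitrary example.

```javascript
// Resolve with `fallback` if `promise` has not settled within `ms` milliseconds.
function raceWithTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(resolve, ms, fallback);
  });
  // Clear the timer whichever side wins, so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical use with Puppeteer (not executed here):
// const LOAD_FAIL = Symbol('load failed');
// const response = await raceWithTimeout(page.goto(url), 29000, LOAD_FAIL);
// const success = response !== LOAD_FAIL;
```

Using a unique sentinel value makes it possible to tell a timeout apart from any value that goto itself might resolve with.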
Possible issue #2: Waitfor is timing out.
Alternatively you can try the solution to a waitfor timeout given by dealeros, adding --enable-blink-features=HTMLImports in args:
browser = await puppeteer.launch({
//headless: false,
'args': [
'--enable-blink-features=HTMLImports'
]
});
If neither of those worked
If neither of these solutions works, I recommend browsing that thread to find more solutions people have suggested and see if you can narrow down the problem. Use this code to generate some console logs and see if you can find what's going wrong:
page
.on('console', message =>
console.log(`${message.type().substr(0, 3).toUpperCase()} ${message.text()}`))
.on('pageerror', ({ message }) => console.log(message))
.on('response', response =>
console.log(`${response.status()} ${response.url()}`))
.on('requestfailed', request =>
console.log(`${request.failure().errorText} ${request.url()}`));
These options both resolved the issue for me:
1. Kill all Chromium processes:
pkill -o chromium
2. Reinstall node packages (if step 1 doesn't help):
rm -rf node_modules
npm install

Close Browser after Navigation Timeout

I have this code below made with nodejs + puppeteer, whose goal is to take a screenshot of the user's site:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://MY_WEBSITE/try/slowURL',{timeout: 30000, waitUntil: 'networkidle0' });//timeout 30 seconds
await page.setViewport({width: 1920, height: 1080});
await page.screenshot({path: pathUpload});
await browser.close();
Its operation is quite simple, but to test the timeout I created a page (http://MY_WEBSITE/try/slowURL) that takes 200 seconds to load.
According to the puppeteer timeout (timeout: 30000), there is a 100% chance of a Navigation Timeout Exceeded: 30000ms exceeded error happening, especially because I'm forcing it.
THE PROBLEM
Using the htop command (on Linux), I can see that the browser has not been closed, even after the script crashes and shows "TimeoutError".
If the browser is not closed after each scan, there is a good chance that the server will eventually run out of memory, and I don't want that.
How can I solve this problem?
You want to wrap your code into a try..catch..finally statement to handle the error and close the browser.
Code Sample
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(/* ... */);
// more code which might throw...
} catch (err) {
console.error('error', err.message);
} finally {
await browser.close();
}
Your main code is executed inside a try block. The catch block logs any error that might have happened. The finally block is always executed, whether or not an error was thrown. That way, your script calls browser.close regardless of whether an error happened.
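If several scripts need the same guarantee, the pattern can be moved into a helper. This is a minimal sketch under stated assumptions: withBrowser is a hypothetical name, and the launch function is passed in as a parameter so that the helper itself does not depend on a particular library.

```javascript
// Launch a browser, run `work` with it, and always close it afterwards.
// Errors from `work` are re-thrown to the caller after the browser is closed.
async function withBrowser(launch, work) {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    await browser.close();
  }
}

// Hypothetical use with Puppeteer (not executed here):
// await withBrowser(() => puppeteer.launch(), async browser => {
//   const page = await browser.newPage();
//   await page.goto(url, { timeout: 30000, waitUntil: 'networkidle0' });
//   await page.screenshot({ path: pathUpload });
// });
```

Unlike the sample above, this version does not swallow the error, so the caller can still decide how to report a navigation timeout.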

Puppeteers waitFor functions fail BEFORE the page finished rendering

How come waitForFunction, waitForSelector, await page.evaluate etc. all give errors UNLESS I put a 10 seconds delay after reading the page?
I would think these were made to wait for something to happen on the page, but without my 10 seconds delay (just after page.goto) - all of them fail with errors.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://sunnythailand.com')
console.log("Waiting 10 seconds")
await new Promise((resolve)=>setTimeout(()=> resolve() ,10000));
console.log("Here we go....")
console.log("waitForFunction START")
await page.waitForFunction('document.querySelector(".scrapebot_description").textContent.length > 0');
console.log("waitForFunction FOUND scrapebot")
console.log("Waiting for evaluate")
const name = await page.evaluate(() => document.querySelector('.scrapebot_description').textContent)
console.log("Evaluate: " + name)
await browser.close()
})()
My theory is that our sunnythailand.com page sends an "end of page" or something BEFORE it finished rendering, and then all the waitFor functions go crazy and fail with all kinds of strange errors.
So I guess my question is... how do we get waitFor to actually WAIT for the event to happen or class to appear etc...?
Don't use a fixed timeout, because you don't know how long the page will take to load fully. That depends on each user's internet bandwidth.
Instead, rely on the promise that waits for your class:
await page.waitForSelector('.scrapebot_description');
This waits for that particular class to appear, and then the rest will work fine.
Please remove this line:
//await new Promise((resolve)=>setTimeout(()=> resolve() ,5000));
Please let me know your test result after this. I am sure it will solve the problem.
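Put together, the wait and the read can be sketched as one function. This is an illustration, not the answerer's code: readDescription is a hypothetical name and the 30000 ms timeout is an assumed value. A likely reason the original waitForFunction failed without the delay is that document.querySelector returns null while the element does not exist yet, so reading textContent throws instead of evaluating to false.

```javascript
// Wait until the element exists, then return its text content.
// `page` is expected to be a Puppeteer Page object.
async function readDescription(page, selector = '.scrapebot_description') {
  await page.waitForSelector(selector, { timeout: 30000 });
  return page.evaluate(
    sel => document.querySelector(sel).textContent,
    selector
  );
}

// Hypothetical use (not executed here):
// await page.goto('https://sunnythailand.com');
// console.log('Evaluate: ' + await readDescription(page));
```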

Generating PDF of a Web Page

I'm trying to generate a PDF file of a web page and save it to local disk to email later.
I tried this approach, but the problem is that it's not working for pages like this. I'm able to generate the PDF, but it doesn't match the web page content.
It's very clear that the PDF is generated before document ready, or it might be something else. I'm unable to figure out the exact issue. I'm just looking for an approach where I can save the web page output as a PDF.
I hope generating a PDF of a web page is more suitable in Node than in PHP? If a solution in PHP is available it would be a big help, but a Node implementation is also fine.
Its very clear that pdf is generated before document ready
Very true, so it is necessary to wait until after scripts are loaded and executed.
You linked to an answer that uses phantom node module.
The module has been upgraded since then and now supports async/await functions, which make the script much more readable.
Here is a solution that uses the async/await version (version 4.x, requires Node 8+):
const phantom = require('phantom');
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms));
(async function() {
const instance = await phantom.create();
const page = await instance.createPage();
await page.property('viewportSize', { width: 1920, height: 1024 });
const status = await page.open('http://www.chartjs.org/samples/latest/charts/pie.html');
// If a page has no set background color, it will have gray bg in PhantomJS
// so we'll set white background ourselves
await page.evaluate(function(){
document.querySelector('body').style.background = '#fff';
});
// Let's benchmark
console.time('wait');
// Wait until the script creates the canvas with the charts
while (0 == await page.evaluate(function(){ return document.querySelectorAll("canvas").length }) ) {
await timeout(250);
}
// Make sure animation of the chart has played
await timeout(500);
console.timeEnd('wait');
await page.render('screen.pdf');
await instance.exit();
})();
On my dev machine it takes 600ms to wait for the chart to be ready. Much better than to await timeout(3000) or any other arbitrary number of seconds.
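The polling loop above never gives up if the canvas does not appear. A bounded variant can be sketched as follows. This is not part of the original answer: pollUntil and its option names are hypothetical, and the default interval and timeout are assumed values.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Call `check` every `intervalMs` until it returns a truthy value,
// or reject once `timeoutMs` has elapsed.
async function pollUntil(check, { intervalMs = 250, timeoutMs = 10000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await check();
    if (result) return result;
    await sleep(intervalMs);
  }
  throw new Error(`Condition not met within ${timeoutMs} ms`);
}

// Hypothetical use with the phantom page from the answer (not executed here):
// await pollUntil(() => page.evaluate(function () {
//   return document.querySelectorAll('canvas').length;
// }));
```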
I did something similar using the html-pdf package.
The code is simple; you can use it like this:
const pdf = require('html-pdf');
// html is the HTML string to render; options is the html-pdf options object
pdf.create(html, options).toFile('./YourPDFName.pdf', function(err, res) {
if (err) {
console.log(err);
}
});
See more about it on the package page here.
Hope it helps you.

cucumber js execution - ELIFECYCLE ERR

I'm new to JS and trying cucumber js for the first time.
This is what my step definitions look like:
Pseudocode:
Given("I launch Google.com", async () => {
  await Launch.launchGoogle();
});
When("I enter search text cucumber js", async () => {
  await Launch.searchCucumber();
});
This is what my Launch.js looks like:
module.exports.launchGoogle = async function () {
  // driver.get needs a full URL including the scheme
  await driver.get("https://www.google.com");
};
module.exports.searchCucumber = async function () {
  await driver.findElement(By.name("q")).sendKeys("cucumber");
};
In this case, when I run the feature with 2 steps, I get ELIFECYCLE ERR at the end of the first step.
When I remove the await in the step definitions, it runs fine. But the console reports 2 steps passed even before the Chrome browser is launched. That is, it fires the Given and When steps and shows the result while the code in Launch.js is still executing.
Please help: how do I solve this?
I just figured out that the default step timeout is 5000 ms. Since launching the browser and loading the URL took longer than that, the step was failing. I increased the step timeout to 30000 ms and it works fine.
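The answer does not show how the timeout was raised. One way to do it is sketched below, as a configuration fragment rather than the answerer's actual code. It assumes the package is installed as @cucumber/cucumber (older releases were published as cucumber), and the file name in the comment is hypothetical.

```javascript
// features/support/timeouts.js (hypothetical file name)
const { setDefaultTimeout } = require('@cucumber/cucumber');

// Raise the default timeout for all steps and hooks from 5000 ms to 30000 ms.
setDefaultTimeout(30 * 1000);

// Alternatively, a single slow step can be given its own timeout:
// Given("I launch Google.com", { timeout: 30 * 1000 }, async () => {
//   await Launch.launchGoogle();
// });
```

A per-step timeout keeps the short default for the remaining steps, so an unexpectedly slow step elsewhere still fails quickly.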

Resources