How to get puppeteer to simply load a web page? - node.js

I can't get puppeteer to do anything. I'm simply trying to get it to show google.com and I can't even get it to do that. Here's my code:
console.log('Loading puppeteer...');
const puppeteer = require('puppeteer');

async function test() {
  console.log('Launching browser...');
  const browser = await puppeteer.launch({ headless: false });
  console.log('Creating new page...');
  const page = await browser.newPage();
  console.log('Requesting url...');
  await page.goto('https://www.google.com');
  console.log('Closing browser...');
  await browser.close();
}

test().catch(e => { console.log(e); });
Chromium crashes every single time I try to do anything...
Then I get a timeout error:
Loading puppeteer...
Launching browser...
TimeoutError: waiting for target failed: timeout 30000ms exceeded
...
...
I've been searching for a solution for literally weeks. Does this thing just not work anymore?

After looking at this thread, which identifies this as a well-known issue with Puppeteer, here is some more information on Puppeteer timeout problems.
A Puppeteer script commonly times out in one of two places: page.goto and the waitFor calls. Since I don't know what is causing your specific issue, I'll give you potential solutions for both.
Possible issue #1: Goto is timing out.
I'll directly quote the person who posted this solution, rudiedirkx:
In my case the goto timeout happens because of a forever-loading blocking resource (js or css). That'll never trigger the page's load or domcontentloaded. A bug in Puppeteer IMO, but whatever.
My fix (FINALLY!) is to do what Lighthouse does in its Driver: a Promise.race() for a custom 'timeout'-ish. The shorter version I used:
const TIMEOUT = 30000; // not defined in the original snippet; your overall navigation budget in ms
const LOAD_FAIL = Math.random(); // unique sentinel so the fallback can be told apart from a real response

const sleep = options => new Promise(resolve => {
  options.timer = setTimeout(resolve, options.ms, options.result === undefined ? true : options.result);
});

const sleepOptions = { ms: TIMEOUT - 1000, result: LOAD_FAIL };
const response = await Promise.race([
  sleep(sleepOptions),
  page.goto(url, { timeout: TIMEOUT + 1000 }),
]);
clearTimeout(sleepOptions.timer);

const success = response !== LOAD_FAIL;
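Promise.race resolves with whichever promise settles first, so the sleep acts as an escape hatch that fires just before goto's own timeout would. If success is false, you know navigation never completed within your budget and can decide to bail out or carry on, instead of letting a forever-loading resource block the script.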
Possible issue #2: Waitfor is timing out.
Alternatively, you can try the solution to a waitFor timeout given by dealeros: adding --enable-blink-features=HTMLImports to args:
browser = await puppeteer.launch({
  // headless: false,
  args: [
    '--enable-blink-features=HTMLImports'
  ]
});
If neither of those worked
If neither of these solutions works, I recommend browsing that thread to find more solutions people have suggested and see if you can narrow down the problem. Use this code to generate some console logs and see if you can find what's going wrong:
page
  .on('console', message =>
    console.log(`${message.type().substr(0, 3).toUpperCase()} ${message.text()}`))
  .on('pageerror', ({ message }) => console.log(message))
  .on('response', response =>
    console.log(`${response.status()} ${response.url()}`))
  .on('requestfailed', request =>
    console.log(`${request.failure().errorText} ${request.url()}`));
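Note that these listeners only capture events fired after they are attached, so register them right after browser.newPage() and before calling page.goto. A minimal sketch against the question's script:

const page = await browser.newPage();
page.on('requestfailed', request =>
  console.log(`${request.failure().errorText} ${request.url()}`));
await page.goto('https://www.google.com');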

These options both resolved the issue for me:
1. Kill all Chromium processes:
pkill -o chromium
2. Reinstall node packages (if step 1 doesn't help):
rm -rf node_modules
npm install

Related

Node.JS + Puppeteer: browser.close freezes process

I have a really simple function I'm running.
module.exports.test = async () => {
  const browser = await puppeteer.launch(puppetOptions);
  try {
    const page = await browser.newPage();
    await page.goto('https://google.com');
    // helper function that pauses for five seconds before moving on
    await pause(5000);
    await browser.close();
    console.log('browser closed');
  } catch (err) {
    console.log(err);
    await browser.close();
  }
}
I run it from my index.js file:
const server = app.listen(process.env.PORT || 5000, () => {
  test();
});
Now I run it in my terminal with node index.js. It opens a browser, opens a new page, navigates to Google, waits five seconds, and closes the browser. Everything appears great. Then I hit ctrl + c in my terminal to stop the process, but nothing happens (typically this works). If I remove the browser.close call, ctrl + c goes back to working as expected and ends the process.
This function is the result of me breaking down a more complex function that appears to have a memory leak, so it really seems that browser.close is the culprit. For the life of me, though, I can't figure out why it would cause an issue when simplified this much. This happens in both headless and headful modes. Here are the puppeteer launch options:
puppetOptions = {
  defaultViewport: null,
  args: [
    "--incognito",
    "--no-sandbox",
    "--single-process",
    "--no-zygote"
  ],
}

puppetOptionsHeadfull = {
  headless: false,
  executablePath: 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe',
}
EDIT: I tried this in my bash terminal as well and have the same issue. When I try to manually close the terminal to abort it, an error pops up.
Processes are running in session:
| WPID PID COMMAND
| 10900 1122 winpty node.exe index.js
Close anyway?
EDIT 2: Narrowed this down to it being an issue with puppeteer-extra most likely, however, it seems like a bug in the core package. Fairly recent open issue on their repo that reflects this bug found here: https://github.com/berstend/puppeteer-extra/issues/421
I'll leave this question open just in case anyone else stumbles on it with the same issue, so they don't pull their hair out debugging it.
I have been having this issue as well. It seems that, for me, the two arguments '--no-sandbox' and '--disable-setuid-sandbox' are the culprits, although it's not intuitive why that would be the case. Try removing the no-sandbox argument.
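For example, a sketch of the question's launch options with the sandbox-related flags taken out (everything else kept as-is):

puppetOptions = {
  defaultViewport: null,
  args: [
    "--incognito",
    "--single-process",
    "--no-zygote"
    // "--no-sandbox" (and "--disable-setuid-sandbox", if present) removed
  ],
}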

Close Browser after Navigation Timeout

I have this code below made with nodejs + puppeteer, whose goal is to take a screenshot of the user's site:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://MY_WEBSITE/try/slowURL', { timeout: 30000, waitUntil: 'networkidle0' }); // timeout 30 seconds
await page.setViewport({ width: 1920, height: 1080 });
await page.screenshot({ path: pathUpload });
await browser.close();
Its operation is quite simple, but to test the timeout I created a page (http://MY_WEBSITE/try/slowURL) that takes 200 seconds to load.
According to the puppeteer timeout (timeout: 30000), there is a 100% chance of a Navigation Timeout Exceeded: 30000ms exceeded error happening, especially because I'm forcing it.
THE PROBLEM
Using the htop command (on Linux), I can see that even after the script fails with "TimeoutError", the browser has not been closed.
And if browsers are not closed as scans are done, there is a good chance the server will run out of memory, and I don't want that.
How can I solve this problem?
You want to wrap your code into a try..catch..finally statement to handle the error and close the browser.
Code Sample
const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(/* ... */);
  // more code which might throw...
} catch (err) {
  console.error('error', err.message);
} finally {
  await browser.close();
}
Your main code is executed inside the try block. The catch block logs any error that might have happened. The finally block is always executed, not only when an error is thrown. That way, whether an error happened or not, your script will call browser.close.
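Applied to the screenshot code from the question, that could look like this sketch (pathUpload is assumed to be defined elsewhere, as in the question):

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto('http://MY_WEBSITE/try/slowURL', { timeout: 30000, waitUntil: 'networkidle0' });
  await page.screenshot({ path: pathUpload });
} catch (err) {
  console.error('error', err.message); // e.g. the forced TimeoutError from the slow page
} finally {
  await browser.close(); // runs even after a timeout, so no orphaned browsers
}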

Is there a way to override "tab closing" in puppeteer cluster?

Puppeteer cluster is closing tabs before I can take a screenshot.
I am using puppeteer-cluster with maxConcurrency 8. I need to take a screenshot after each page loads [approx. 20,000 urls]. page.screenshot is not useful for me: my screenshot should include the URL bar and the desktop; it's basically a full desktop screenshot. So I am using ImageMagick to take the screenshot (and xvfb for multiple-screen management).
The problem is:
sometimes the screenshot is taken before switching to the right tab;
a blank screenshot, because the current tab was closed and a tab which is not yet loaded came to the front;
sometimes an error is thrown because the screenshot couldn't be taken, since all the tabs were closed.
What I am doing is: when each page loads, I call page.bringToFront and spawn a child_process, which takes a screenshot of the desktop using ImageMagick's import command.
cluster.queue(postUrl.href); // for adding urls to the queue

await page.waitForNavigation(); // wait for page to load before screenshot

// taking the screenshot
const { spawnSync } = require('child_process');
const child = spawnSync('import', ['-window', 'root', path]);
I don't want to set up a fixed wait time after page load; the Node.js ImageMagick bindings didn't work for me, and promises also didn't seem to work.
I do not want puppeteer to close the tab on its own. Instead, can it fire a callback once the page is loaded, wait for the callback function to execute and return, and only then close the tab?
As soon as the Promise of the cluster.task function is resolved, the page will be closed:
await cluster.task(async ({ page, data }) => {
  // when this function is done, the page will be closed
});
To keep the page open you can await another Promise at the end before closing:
await cluster.task(async ({ page, data }) => {
  // ...
  await new Promise(resolve => {
    // more code...
    // call resolve() when you are done
  });
});
Calling the resolve() function at the end will resolve the last Promise and therefore also resolve the whole async function, which closes the page. Keep in mind that you may want to increase the timeout value (given in milliseconds) to something greater than the 30-second default when launching the cluster, if necessary:
const cluster = await Cluster.launch({
  // ...
  timeout: 120000 // 2 minutes
});
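Here is a sketch of how the question's desktop-screenshot flow could fit this pattern; the import command comes from the question, while the queued URL, the outPath filename, and the concurrency settings are placeholders:

const { Cluster } = require('puppeteer-cluster');
const { spawn } = require('child_process');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE, // tabs in one browser, so bringToFront works
    maxConcurrency: 8,
    timeout: 120000 // give slow pages room before the cluster gives up
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    await page.bringToFront(); // make sure the right tab is visible
    // hold the task (and thus the tab) open until the screenshot is done
    await new Promise((resolve, reject) => {
      const outPath = `${Date.now()}.png`; // placeholder path
      const child = spawn('import', ['-window', 'root', outPath]);
      child.on('exit', code =>
        code === 0 ? resolve() : reject(new Error(`import exited with code ${code}`)));
    });
    // only now does the task resolve, so only now is the tab closed
  });

  cluster.queue('https://example.com'); // placeholder url
  await cluster.idle();
  await cluster.close();
})();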

Nodejs/Puppeteer - Navigation timeout

I need help understanding how timeouts work, especially with node/puppeteer.
I read all the Stack Overflow questions and GitHub issues about this, but I can't figure out what is wrong. Probably my code...
When I run this file, I receive the error shown in the image. You can see the ways I tried to fix it; nothing works.
Can someone explain why this happens and the best approach to avoid it? Is there a better way to get these Projects?
// go to the seeds every x amount of time
var https = require('https');
var Q = require('q');
var fs = require('fs');
var puppeteer = require('puppeteer');
var Projeto = require('./Projeto.js');

const url = 'https://www.99freelas.com.br/projects?categoria=web-e-desenvolvimento';

/* const idToScrape;
   should receive the url and the specific parameters of each seed */

async function genScraper() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //page.setDefaultNavigationTimeout(60000);
  page.waitForNavigation({ timeout: 60000, waitUntil: 'domcontentloaded' });
  await page.goto(url);

  var projetos = await page.evaluate(() => {
    let qtProjs = document.querySelectorAll('.result-list li').length;
    let listaDeProjs = Array.from(document.querySelectorAll('.result-list li'));
    let tempProjetos = [];
    for (var i = 0; i <= listaDeProjs.length; i++) {
      let titulo = listaDeProjs[i].children[1].children[0].textContent;
      let descricao = listaDeProjs[i].children[2].textContent;
      let habilidades = listaDeProjs[i].children[3].textContent;
      let publicado = listaDeProjs[i].children[1].children[1].children[0].textContent;
      let tempoRestante = listaDeProjs[i].children[1].children[1].children[1].textContent;
      //let infoCliente;
      proj = new Projeto(titulo, descricao, habilidades, publicado, tempoRestante);
      tempProjetos.push(proj);
    }
    return tempProjetos;
  });
  console.log(projetos);
  browser.close();
}

genScraper();
I recommend you avoid using the method waitForNavigation before the goto call.
Basically, it would be better to use goto with its default timeout of 30000 ms. In my opinion, if a website takes more than 30 seconds to load or respond, something is probably wrong.
Instead, I would do something like this:
await page.goto(url, {
  waitUntil: 'networkidle0'
});
Depending on the version of puppeteer you're using, you will see different behaviours. I am using version 1.4.0 and it has been working well so far.
The documentation states the following:
The page.goto will throw an error if:
there's an SSL error (e.g. in case of self-signed certificates).
target URL is invalid.
the timeout is exceeded during navigation.
the main resource failed to load.
So, check that none of the previous scenarios is happening.
Also, you can curl the URL from your terminal to see if it responds to outside calls; cross-origin problems are common too.
Honestly, there is no way to say for sure what is triggering your timeout, but that checklist should help. I had a timeout problem recently and the cause was my server configuration, so I also suggest checking whether the machine running this code has the necessary memory.
In your for loop,
for( var i=0; i<=listaDeProjs.length; i++ ) {
...
}
the condition should be i<listaDeProjs.length; with <=, the last iteration reads one element past the end of the array, so listaDeProjs[i] is undefined and the evaluate callback throws.
Your evaluate script will also fail if anything along a path like this is undefined (e.g., if children[1] or children[0] is undefined):
listaDeProjs[i].children[1].children[0].textContent;
You can guard against that with lodash:
_.get(listaDeProjs[i], "children[1].children[0].textContent", "")
That will default to "" if there is no such value. (Keep in mind that the evaluate callback runs in the browser, so lodash must be available in the page context, e.g. injected with page.addScriptTag.)
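An alternative that needs no library (my suggestion, not part of the original answer) is optional chaining, which the Chromium versions bundled with recent Puppeteer releases support inside evaluate:

let titulo = listaDeProjs[i]?.children[1]?.children[0]?.textContent ?? '';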
Additionally, the following works perfectly fine with your code in 1.7 via https://try-puppeteer.appspot.com/
await page.goto(url, {
  waitUntil: 'networkidle2',
  timeout: 5000 // milliseconds; pass a number, not a string
});

Puppeteers waitFor functions fail BEFORE the page finished rendering

How come waitForFunction, waitForSelector, page.evaluate, etc. all give errors UNLESS I put a 10-second delay after loading the page?
I would think these were made to wait for something to happen on the page, but without my 10-second delay (just after page.goto) all of them fail with errors.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://sunnythailand.com');

  console.log("Waiting 10 seconds");
  await new Promise(resolve => setTimeout(() => resolve(), 10000));
  console.log("Here we go....");

  console.log("waitForFunction START");
  await page.waitForFunction('document.querySelector(".scrapebot_description").textContent.length > 0');
  console.log("waitForFunction FOUND scrapebot");

  console.log("Waiting for evaluate");
  const name = await page.evaluate(() => document.querySelector('.scrapebot_description').textContent);
  console.log("Evaluate: " + name);

  await browser.close();
})()
My theory is that our sunnythailand.com page fires its "end of page" signal before it has finished rendering, and then all the waitFor functions go crazy and fail with all kinds of strange errors.
So I guess my question is... how do we get waitFor to actually WAIT for the event to happen or class to appear etc...?
Don't use a fixed timeout, because you can't know how long the full page will take to load; it varies from person to person with their internet bandwidth.
Instead, rely on a promise tied to your class:
await page.waitForSelector('.scrapebot_description');
Waiting for your particular class means it will work fine. Please remove this line:
//await new Promise((resolve)=>setTimeout(()=> resolve() ,5000));
Please let me know your test result after this; I am sure it will solve the issue.
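Putting that together, here is a sketch of the question's script with the fixed delay removed and a selector wait in its place (same URL and selector as above):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://sunnythailand.com');

  // wait until the element exists instead of sleeping a fixed 10 seconds
  await page.waitForSelector('.scrapebot_description');
  // optionally also wait for it to contain text, like the original waitForFunction did
  await page.waitForFunction('document.querySelector(".scrapebot_description").textContent.length > 0');

  const name = await page.evaluate(() => document.querySelector('.scrapebot_description').textContent);
  console.log("Evaluate: " + name);

  await browser.close();
})();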
