Why does puppeteer page.goto() throw a timeout error? - node.js

The following code throws an error, why?
Navigation Timeout Exceeded: 60000ms exceeded
I'm using puppeteer version 1.19.0
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setCacheEnabled(false);
try {
const response = await page.goto("https://www.gatsbyjs.com", {
waitUntil: "networkidle0",
timeout: 60000
});
console.log("Status code:", response.status());
} catch (error) {
console.log(error.message);
}
await browser.close();
})();
Some other URLs work fine, so I wonder if there is anything special with this particular URL?

If you change the waitUntil : "networkidle2" . There is no time out.
networkidle2 - consider navigation to be finished when there are no
more than 2 network connections for at least 500 ms.
As pointed out in Erez's answer . 'serviceworker' may be holding the connection.You can check it by going to chrome://serviceworker-internals/ . Or Devtools -> Application Tab - Service Wokers
Serive Worker: chrome://serviceworker-internals/
Scope: https://www.gatsbyjs.com/
Registration ID: 295
Navigation preload enabled: false
Navigation preload header length: 4
Active worker:
Installation Status: ACTIVATED
Running Status: RUNNING
Fetch handler existence: EXISTS
Script: https://www.gatsbyjs.com/sw.js
Version ID: 10330
Renderer process ID: 11892
Renderer thread ID: 18124
DevTools agent route ID: 8
From Network : installingWorker ServiceWorker {scriptURL: "https://www.gatsbyjs.com/sw.js", state: "installing", onerror: null, onstatechange: null}
References :
Navigation Timeout Exceeded when using networkidle0 and no insight into what timed out
Support ServiceWorkers #2634

Removing waitUntil: "networkidle0" works so I'm assuming the site is still holding a connection to the server.
I couldn't figure out which connection it is (maybe the service worker?) using the developers tools (accessible in non headless mode by running await puppeteer.launch({ headless: false }))

Related

Puppeteer - Failed to launch the browser process! in Windows Service

I have a small javascript setup to convert html to pdf using the javascript library puppeteer.
Hosting the service by opening the command panel and starting node index.js everything works fine. The express-api hosts the service under the predefined port and requesting the service I get the converted PDF back.
Now, installing the javascript as Windows-Service by using the library node-windows and requesting the service, I get the following error message:
Failed to launch the browser process!
Now I'm not sure where to search for the root cause. Is it possible that this could be a permission issue?
Following my puppeteer javascript code :
const ValidationError = require('./../errors/ValidationError.js')
const puppeteer = require('puppeteer-core');
module.exports = class PdfService{
static async htmlToPdf(html){
if(!html){
throw new ValidationError("no html");
}
const browser = await puppeteer.launch({
headless: true,
executablePath: process.env.EDGE_PATH,
args: ["--no-sandbox"]
});
const page = await browser.newPage();
await page.setContent(html, {
waitUntil: "networkidle2"
});
const pdf = await page.pdf({format: 'A4',printBackground: true});
await browser.close();
return pdf;
}
}

On heroku, puppeteer's Network.webSocketFrameReceived event is never triggered. Why?

I have built a small app that I deployed to heroku. Locally, the whole thing is working as expected. But when deployed, the Network.webSocketFrameReceived event is never triggered. It is a node app that runs on express with a minimal websocket server.
The goal of the app is to open some url using headless chrome (i am using puppeteer here), record the websocket frames and parse them if they contain some specific fields, close connection when successful. Then move to next url.
async function openUrlAndParseFrames(page, url) {
await new Promise(async function (resolve) {
const parseWebsocketFrame = (response) => {
console.log('parsing websocket frame...', response);
let payload;
try {
// some parsing here
} catch (e) {
console.error(`Error while parsing payload ${response.response.payloadData}`)
}
}
console.log('Go to url', url);
await page.goto(url);
const cdp = await page.target().createCDPSession();
await cdp.send('Network.enable');
await cdp.send('Page.enable');
cdp.on('Network.webSocketFrameReceived', parseWebsocketFrame);
});
}
Is it not possible to make this websocket connection on heroku using puppeteer? I never receive the "parsing websocket frame..." logs...
PS:
I am aware of this special args I need to set for puppeteer to run on heroku
puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
Also I added the buildpacks heroku/nodejs and https://github.com/jontewks/puppeteer-heroku-buildpack
I found the answer myself. The real problem was, that the IP range (from Heroku) was blocked and I didn't even access the page I was trying to but was blocked with a 403 from CloudFront.
I figured it out by logging the page content. const websiteContent = await page.content(); Which showed the error page html.
After trying various things I decided to move away from Heroku and now successfully deployed to Google App Engine.

Loading dynamic webpage with Puppeteer works on localhost but not Heroku

Node.js app with Express, deployed on Heroku. It's just dynamic webpages. Loading static webpages works fine.
Loading dynamic webpages works on localhost, but on Heroku it throws me code=H12, desc="Request timeout", service=30000ms, status=503.
In addition, fresh after doing heroku restart or making a deployment, there always seems to be one instance of a status=200 that loads only the static portion of a dynamic webpage.
Screenshot of logs here.
I've tried the following, which have all led to either the same or other unexpected results when deployed on Heroku (such as Error R14 (Memory quota exceeded) and code=H13 desc="Connection closed without response"):
Switching the Puppeteer Heroku buildpack I was using. I've tried the ones mentioned in this troubleshooting guide and this comment.
Adding headless: true in Puppeteer's launch arguments.
Adding the --no-sandbox, --disable-setuid-sandbox, --single-process, and --no-zygote flags in args of Puppeteer's launch arguments. (Reference: this comment & this comment)
Setting the waitUntil argument in Puppeteer's goto function to domcontentloaded, networkidle0 and networkidle2. (Reference: this comment)
Passing a timeout argument in Puppeteer goto function; I've tried 30000 and 60000 specifically, as well as 0 per this comment.
Using the waitForSelector function.
Clearing Heroku's build cache, as per this article.
Printing the url variable (see my code below) in the console. Output is as expected.
I've observed that:
With the code I have right now (see below), the try-catch-finally block never catches any error. It's always one of the following: I get an incomplete result (static portion of requested dynamic webpage), or the app crashes (code=H13 desc="Connection closed without response"). So I haven't been able to get anything out of attempting to print exception in the console from within the catch block.
Any ideas on how I could get this to work?
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
let browser;
...
app.listen(port, async() => {
browser = await puppeteer
.launch({
timeout: 0,
headless: true,
args: [
"--no-sandbox",
"--disable-setuid-sandbox",
"--single-process",
"--no-zygote",
],
});
});
...
app.get("/appropriate-route-name", async (req, res) => {
let url = req.query.url;
let page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: "networkidle2",
});
res.send({ data: await page.content() });
} catch (exception) {
res.send({ data: null });
} finally {
await browser.close();
}
}
Was able to get it to work by using user-agents. Dynamic pages now load just fine on Heroku; requests don't time out every single time anymore.
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
var userAgent = require("user-agents");
...
app.get("/route-name", async (req, res) => {
let url = req.query.url;
let browser = await puppeteer.launch({
args: ["--no-sandbox"],
});
let page = await browser.newPage();
try {
await page.setUserAgent(userAgent.toString()); // added this
await page.goto(url, {
timeout: 30000,
waitUntil: "newtorkidle2", // or "networkidle0", depending on what you need
});
res.send({ data: await page.content() });
} catch (e) {
res.send({ data: null });
} finally {
await browser.close();
}
});

Trying to crawl a website using puppeteer but getting a timeout error

I'm trying to search the Kwik Trip website for daily deals using nodeJs but I keep getting a timeout error when I try to crawl it. Not quite sure what could be happening. Does anyone know what may be going on?
Below is my code, I'm trying to wait for .agendaItemWrap to load before it brings back all of the HTML because it's a SPA.
function getQuickStar(req, res){
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const navigationPromise = page.waitForNavigation({waitUntil: "domcontentloaded"});
await page.goto('https://www.kwiktrip.com/savings/daily-deals');
await navigationPromise;
await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
const body = await page.evaluate(() => {
return document.querySelector('body').innerHTML;
});
console.log(body);
await browser.close();
} catch (error) {
console.log(error);
}
})();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appear your desired selector is located into an iframe, and not into the page.mainframe.
You then need to wait for your iframe, and perform the waitForSelector on this particular iframe.
Quick tip : you don't need any page.waitForNavigation with a page.goto, because you can set the waitUntil condition into the options. By default it waits for the page onLoad event.

Puppeteer - Protocol error (Page.navigate): Target closed

As you can see with the sample code below, I'm using Puppeteer with a cluster of workers in Node to run multiple requests of websites screenshots by a given URL:
const cluster = require('cluster');
const express = require('express');
const bodyParser = require('body-parser');
const puppeteer = require('puppeteer');
async function getScreenshot(domain) {
let screenshot;
const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'] });
const page = await browser.newPage();
try {
await page.goto('http://' + domain + '/', { timeout: 60000, waitUntil: 'networkidle2' });
} catch (error) {
try {
await page.goto('http://' + domain + '/', { timeout: 120000, waitUntil: 'networkidle2' });
screenshot = await page.screenshot({ type: 'png', encoding: 'base64' });
} catch (error) {
console.error('Connecting to: ' + domain + ' failed due to: ' + error);
}
await page.close();
await browser.close();
return screenshot;
}
if (cluster.isMaster) {
const numOfWorkers = require('os').cpus().length;
for (let worker = 0; worker < numOfWorkers; worker++) {
cluster.fork();
}
cluster.on('exit', function (worker, code, signal) {
console.debug('Worker ' + worker.process.pid + ' died with code: ' + code + ', and signal: ' + signal);
Cluster.fork();
});
cluster.on('message', function (handler, msg) {
console.debug('Worker: ' + handler.process.pid + ' has finished working on ' + msg.domain + '. Exiting...');
if (Cluster.workers[handler.id]) {
Cluster.workers[handler.id].kill('SIGTERM');
}
});
} else {
const app = express();
app.use(bodyParser.json());
app.listen(80, function() {
console.debug('Worker ' + process.pid + ' is listening to incoming messages');
});
app.post('/screenshot', (req, res) => {
const domain = req.body.domain;
getScreenshot(domain)
.then((screenshot) =>
try {
process.send({ domain: domain });
} catch (error) {
console.error('Error while exiting worker ' + process.pid + ' due to: ' + error);
}
res.status(200).json({ screenshot: screenshot });
})
.catch((error) => {
try {
process.send({ domain: domain });
} catch (error) {
console.error('Error while exiting worker ' + process.pid + ' due to: ' + error);
}
res.status(500).json({ error: error });
});
});
}
Some explanation:
Each time a request arrives a worker will process it and kill itself at the end
Each worker creates a new browser instance with a single page, and if a page took more than 60sec to load, it will retry reloading it (in the same page because maybe some resources has already been loaded) with timeout of 120sec
Once finished both the page and the browser will be closed
My problem is that some legitimate domains get errors that I can't explain:
Error: Protocol error (Page.navigate): Target closed.
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I read at some git issue (that I can't find now) that it can happen when the page redirects and adds 'www' at the start, but I'm hoping it's false...
Is there something I'm missing?
What "Target closed" means
When you launch a browser via puppeteer.launch it will start a browser and connect to it. From there on any function you execute on your opened browser (like page.goto) will be send via the Chrome DevTools Protocol to the browser. A target means a tab in this context.
The Target closed exception is thrown when you are trying to run a function, but the target (tab) was already closed.
Similar error messages
The error message was recently changed to give more meaningful information. It now gives the following message:
Error: Protocol error (Target.activateTarget): Session closed. Most likely the page has been closed.
Why does it happen
There are multiple reasons why this could happen.
You used a resource that was already closed
Most likely, you are seeing this message because you closed the tab/browser and are still trying to use the resource. To give an simple example:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await browser.close();
await page.goto('http://www.google.com');
In this case the browser was closed and after that, a page.goto was called resulting in the error message. Most of the time, it will not be that obvious. Maybe an error handler already closed the page during a cleanup task, while your script is still crawling.
The browser crashed or was unable to initialize
I also experience this every few hundred requests. There is an issue about this on the puppeteer repository as well. It seems to be the case, when you are using a lot of memory or CPU power. Maybe you are spawning a lot of browser? In these cases the browser might crash or disconnect.
I found no "silver bullet" solution to this problem. But you might want to check out the library puppeteer-cluster (disclaimer: I'm the author) which handles these kind of error cases and let's you retry the URL when the error happens. It can also manage a pool of browser instances and would also simplify your code.
For me removing '--single-process' from args fixed the issue.
puppeteerOptions: {
headless: true,
args: [
'--disable-gpu',
'--disable-dev-shm-usage',
'--disable-setuid-sandbox',
'--no-first-run',
'--no-sandbox',
'--no-zygote',
'--deterministic-fetch',
'--disable-features=IsolateOrigins',
'--disable-site-isolation-trials',
// '--single-process',
],
}
I was just experiencing the same issue every time I tried running my puppeteer script*. The above did not resolve this issue for me.
I got it to work by removing and reinstalling the puppeteer package:
npm remove puppeteer
npm i puppeteer
*I only experienced this issue when setting the headless option to 'false`
I've wound up at this thread a few times, and the typical culprit is that I forgot to await a Puppeteer page call that returned a promise, causing a race condition.
Here's a minimal example of what this can look like:
const puppeteer = require("puppeteer");
let browser;
(async () => {
browser = await puppeteer.launch({headless: true});
const [page] = await browser.pages();
page.goto("https://www.stackoverflow.com"); // whoops, forgot await!
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Output is:
C:\Users\foo\Desktop\puppeteer-playground\node_modules\puppeteer\lib\cjs\puppeteer\common\Connection.js:217
this._callbacks.set(id, { resolve, reject, error: new Error(), method });
^
Error: Protocol error (Page.navigate): Target closed.
at C:\Users\foo\Desktop\puppeteer-playground\node_modules\puppeteer\lib\cjs\puppeteer\common\Connection.js:217:63
In this case, it seems like an unmissable error, but in a larger chunk of code and the promise is nested or in a condition, it's easy to overlook.
You'll get a similar error for forgetting to await a page.click() or other promise call, for example, Error: Protocol error (Runtime.callFunctionOn): Target closed., which can be seen in the question UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Target closed. (Puppeteer)
This is a contribution to the thread as a canonical resource for the error and may not be the solution to OP's problem, although the fundamental race condition seems to be a likely cause.
In 2021 I'm receiving the very similar following error Error: Error pdf creationError: Protocol error (Target.setDiscoverTargets): Target closed., I solved it by playing with different args, so if your production server has a pipe:true flag in puppeteer.launch obj it will produce errors.
Also --disable-dev-shm-usage flag do the trick
The solution below works for me:
const browser = await puppeteer.launch({
headless: true,
// pipe: true, <-- delete this property
args: [
'--no-sandbox',
'--disable-dev-shm-usage', // <-- add this one
],
});
Check your jest-puppeteer.config.js file.
I made the below mistake
module.exports = {
launch: {
headless: false,
browserContext: "default",
},
};
and after correcting it as below
module.exports = {
launch: {
headless: false
},
browserContext: "default",
};
everything worked just fine!!!
After hours of frustrations I realized that this happens when it goes to a new page and I need to be using await page.waitForNavigation() before I do anything and after I press a button or do any action that will cause it to redirect.

Resources