Puppeteer - Protocol error (Page.navigate): Target closed - node.js

As you can see in the sample code below, I'm using Puppeteer with a cluster of workers in Node to serve multiple requests for website screenshots of a given URL:
const cluster = require('cluster');
const express = require('express');
const bodyParser = require('body-parser');
const puppeteer = require('puppeteer');
async function getScreenshot(domain) {
    let screenshot;
    const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'] });
    const page = await browser.newPage();
    try {
        await page.goto('http://' + domain + '/', { timeout: 60000, waitUntil: 'networkidle2' });
        screenshot = await page.screenshot({ type: 'png', encoding: 'base64' });
    } catch (error) {
        try {
            await page.goto('http://' + domain + '/', { timeout: 120000, waitUntil: 'networkidle2' });
            screenshot = await page.screenshot({ type: 'png', encoding: 'base64' });
        } catch (error) {
            console.error('Connecting to: ' + domain + ' failed due to: ' + error);
        }
    }
    await page.close();
    await browser.close();
    return screenshot;
}
if (cluster.isMaster) {
    const numOfWorkers = require('os').cpus().length;
    for (let worker = 0; worker < numOfWorkers; worker++) {
        cluster.fork();
    }
    cluster.on('exit', function (worker, code, signal) {
        console.debug('Worker ' + worker.process.pid + ' died with code: ' + code + ', and signal: ' + signal);
        cluster.fork();
    });
    cluster.on('message', function (handler, msg) {
        console.debug('Worker: ' + handler.process.pid + ' has finished working on ' + msg.domain + '. Exiting...');
        if (cluster.workers[handler.id]) {
            cluster.workers[handler.id].kill('SIGTERM');
        }
    });
} else {
    const app = express();
    app.use(bodyParser.json());
    app.listen(80, function () {
        console.debug('Worker ' + process.pid + ' is listening to incoming messages');
    });
    app.post('/screenshot', (req, res) => {
        const domain = req.body.domain;
        getScreenshot(domain)
            .then((screenshot) => {
                try {
                    process.send({ domain: domain });
                } catch (error) {
                    console.error('Error while exiting worker ' + process.pid + ' due to: ' + error);
                }
                res.status(200).json({ screenshot: screenshot });
            })
            .catch((error) => {
                try {
                    process.send({ domain: domain });
                } catch (error) {
                    console.error('Error while exiting worker ' + process.pid + ' due to: ' + error);
                }
                res.status(500).json({ error: error });
            });
    });
}
Some explanation:
Each time a request arrives, a worker processes it and kills itself at the end
Each worker creates a new browser instance with a single page. If the page takes more than 60 seconds to load, it retries loading it (in the same page, since some resources may already have been loaded) with a timeout of 120 seconds
Once finished, both the page and the browser are closed
My problem is that some legitimate domains get errors that I can't explain:
Error: Protocol error (Page.navigate): Target closed.
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I read in some GitHub issue (which I can't find now) that it can happen when the page redirects and adds 'www' at the start, but I'm hoping that's not the case...
Is there something I'm missing?

What "Target closed" means
When you launch a browser via puppeteer.launch, it starts a browser and connects to it. From then on, any function you execute against the opened browser (like page.goto) is sent to the browser via the Chrome DevTools Protocol. A target means a tab in this context.
The Target closed exception is thrown when you are trying to run a function, but the target (tab) was already closed.
Similar error messages
The error message was recently changed to give more meaningful information. It now gives the following message:
Error: Protocol error (Target.activateTarget): Session closed. Most likely the page has been closed.
Why does it happen
There are multiple reasons why this could happen.
You used a resource that was already closed
Most likely, you are seeing this message because you closed the tab/browser and are still trying to use the resource. To give a simple example:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await browser.close();
await page.goto('http://www.google.com');
In this case the browser was closed, and after that page.goto was called, resulting in the error message. Most of the time it will not be that obvious: maybe an error handler already closed the page during a cleanup task while your script was still crawling.
The browser crashed or was unable to initialize
I also experience this every few hundred requests. There is an issue about this on the Puppeteer repository as well. It seems to happen when you are using a lot of memory or CPU power. Maybe you are spawning a lot of browsers? In these cases the browser might crash or disconnect.
I have found no "silver bullet" for this problem, but you might want to check out the library puppeteer-cluster (disclaimer: I'm the author), which handles these kinds of error cases and lets you retry the URL when the error happens. It can also manage a pool of browser instances and would simplify your code.

For me removing '--single-process' from args fixed the issue.
puppeteerOptions: {
headless: true,
args: [
'--disable-gpu',
'--disable-dev-shm-usage',
'--disable-setuid-sandbox',
'--no-first-run',
'--no-sandbox',
'--no-zygote',
'--deterministic-fetch',
'--disable-features=IsolateOrigins',
'--disable-site-isolation-trials',
// '--single-process',
],
}

I was just experiencing the same issue every time I tried running my puppeteer script*. The above did not resolve this issue for me.
I got it to work by removing and reinstalling the puppeteer package:
npm remove puppeteer
npm i puppeteer
*I only experienced this issue when setting the headless option to `false`.

I've wound up at this thread a few times, and the typical culprit is that I forgot to await a Puppeteer page call that returned a promise, causing a race condition.
Here's a minimal example of what this can look like:
const puppeteer = require("puppeteer");
let browser;
(async () => {
browser = await puppeteer.launch({headless: true});
const [page] = await browser.pages();
page.goto("https://www.stackoverflow.com"); // whoops, forgot await!
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Output is:
C:\Users\foo\Desktop\puppeteer-playground\node_modules\puppeteer\lib\cjs\puppeteer\common\Connection.js:217
this._callbacks.set(id, { resolve, reject, error: new Error(), method });
^
Error: Protocol error (Page.navigate): Target closed.
at C:\Users\foo\Desktop\puppeteer-playground\node_modules\puppeteer\lib\cjs\puppeteer\common\Connection.js:217:63
In this case it seems like an unmissable error, but in a larger chunk of code, where the promise is nested or inside a conditional, it's easy to overlook.
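The fix is simply to await the call, so the finally block's browser.close() can only run after the navigation has settled:
await page.goto("https://www.stackoverflow.com"); // awaited, so the browser is closed only after navigation finishes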
You'll get a similar error for forgetting to await a page.click() or other promise call, for example, Error: Protocol error (Runtime.callFunctionOn): Target closed., which can be seen in the question UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Target closed. (Puppeteer)
This is a contribution to the thread as a canonical resource for the error and may not be the solution to OP's problem, although the fundamental race condition seems to be a likely cause.

In 2021 I was getting a very similar error: Error: Error pdf creationError: Protocol error (Target.setDiscoverTargets): Target closed. I solved it by playing with different args: if your production server passes a pipe: true flag in the puppeteer.launch object, it can produce these errors.
Adding the --disable-dev-shm-usage flag also does the trick.
The solution below works for me:
const browser = await puppeteer.launch({
headless: true,
// pipe: true, <-- delete this property
args: [
'--no-sandbox',
'--disable-dev-shm-usage', // <-- add this one
],
});

Check your jest-puppeteer.config.js file.
I made the below mistake
module.exports = {
    launch: {
        headless: false,
        browserContext: "default",
    },
};
and after correcting it as below
module.exports = {
    launch: {
        headless: false,
    },
    browserContext: "default",
};
everything worked just fine!!!

After hours of frustration I realized that this happens when the script navigates to a new page: I need to use await page.waitForNavigation() after I press a button or perform any action that causes a redirect, before doing anything else.
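A minimal sketch of that pattern (the '#submit' selector is just a placeholder); starting the waitForNavigation before the click avoids missing a fast redirect:
// Start waiting for the navigation *before* triggering it, then await both together.
await Promise.all([
    page.waitForNavigation(),   // resolves once the redirect/navigation has finished
    page.click('#submit'),      // placeholder selector for the button that triggers the redirect
]);
// The new page is loaded here; it is now safe to query or click elements on it.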

Related

How to re-run a Puppeteer bot when it encounters an error

I've used puppeteer-cluster, and I want to create something that acts similarly to how the cluster automatically restarts the Puppeteer bot when it encounters an error. I want to re-run my bot when I hit an unknown error. Sometimes my bot moves a little slowly due to the network and fails, or it can't find a specific button and fails, but it would work on the next run, so I want to remedy this by restarting the bot automatically.
I tried to do this with cluster, but it seemed like overkill. Is there a better way I can accomplish this?
const activateCluster = async (posts) => {
    const cluster = await Cluster.launch({
        puppeteerOptions: {
            headless: false,
            defaultViewport: null,
        },
        puppeteer,
        // monitor: true,
        retryLimit: 5,
        timeout: 180000,
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 1,
    });
    cluster.on('taskerror', (err, data, willRetry) => {
        if (willRetry) {
            console.warn(`Encountered an error while crawling ${data}. ${err.message}\nThis job will be retried`);
        } else {
            console.error(`Failed to crawl ${data}: ${err.message}`);
        }
    });
    await cluster.task(startBot(page))
    cluster.queue()
    await cluster.idle();
    await cluster.close();
}
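For comparison, the same "re-run on unknown error" behaviour without puppeteer-cluster can be a plain retry loop around a single run of the bot. This is only a sketch: the maxAttempts value, the launch options, and the assumption that startBot(page) contains the existing bot logic are all illustrative:
const puppeteer = require('puppeteer');

const runBotWithRetries = async (maxAttempts = 5) => {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
        try {
            const page = await browser.newPage();
            await startBot(page);   // assumed: the existing bot logic from the question
            return;                 // success, stop retrying
        } catch (err) {
            console.warn(`Attempt ${attempt} failed: ${err.message}. Retrying...`);
        } finally {
            await browser.close();  // always release the browser before the next attempt
        }
    }
    throw new Error(`Bot failed after ${maxAttempts} attempts`);
};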

Blocking specific resources (css, images, videos, etc) using crawlee and playwright

I'm using crawlee#3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';
const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
// extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see from the screenshot that the images aren't loaded. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 3,
async requestHandler({ page, request }) {
console.log(`Processing: ${request.url}`);
await playwrightUtils.blockRequests(page);
await page.screenshot({ path: 'cnn_no_images2.png' });
},
});
This way, I'm not able to block specific resources, and my guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer, so maybe someone has tried this before.
Also, I've tried "route interception", but again, I couldn't make it work with PlaywrightCrawler.
You can set up any listeners or run code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 3,
preNavigationHooks: [async ({ page }) => {
await playwrightUtils.blockRequests(page);
}],
async requestHandler({ page, request }) {
console.log(`Processing: ${request.url}`);
await page.screenshot({ path: 'cnn_no_images2.png' });
},
});

Loading dynamic webpage with Puppeteer works on localhost but not Heroku

Node.js app with Express, deployed on Heroku. It's just dynamic webpages. Loading static webpages works fine.
Loading dynamic webpages works on localhost, but on Heroku it throws me code=H12, desc="Request timeout", service=30000ms, status=503.
In addition, fresh after doing heroku restart or making a deployment, there always seems to be one instance of a status=200 that loads only the static portion of a dynamic webpage.
I've tried the following, which have all led to either the same or other unexpected results when deployed on Heroku (such as Error R14 (Memory quota exceeded) and code=H13 desc="Connection closed without response"):
Switching the Puppeteer Heroku buildpack I was using. I've tried the ones mentioned in this troubleshooting guide and this comment.
Adding headless: true in Puppeteer's launch arguments.
Adding the --no-sandbox, --disable-setuid-sandbox, --single-process, and --no-zygote flags in args of Puppeteer's launch arguments. (Reference: this comment & this comment)
Setting the waitUntil argument in Puppeteer's goto function to domcontentloaded, networkidle0 and networkidle2. (Reference: this comment)
Passing a timeout argument in Puppeteer goto function; I've tried 30000 and 60000 specifically, as well as 0 per this comment.
Using the waitForSelector function.
Clearing Heroku's build cache, as per this article.
Printing the url variable (see my code below) in the console. Output is as expected.
I've observed that:
With the code I have right now (see below), the try-catch-finally block never catches any error. It's always one of the following: either I get an incomplete result (the static portion of the requested dynamic webpage), or the app crashes (code=H13 desc="Connection closed without response"). So I haven't been able to get anything out of attempting to print the exception to the console from within the catch block.
Any ideas on how I could get this to work?
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
let browser;
...
app.listen(port, async () => {
    browser = await puppeteer
        .launch({
            timeout: 0,
            headless: true,
            args: [
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--single-process",
                "--no-zygote",
            ],
        });
});
...
app.get("/appropriate-route-name", async (req, res) => {
    let url = req.query.url;
    let page = await browser.newPage();
    try {
        await page.goto(url, {
            waitUntil: "networkidle2",
        });
        res.send({ data: await page.content() });
    } catch (exception) {
        res.send({ data: null });
    } finally {
        await browser.close();
    }
});
I was able to get it to work by using the user-agents package. Dynamic pages now load just fine on Heroku; requests don't time out every single time anymore.
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
var userAgent = require("user-agents");
...
app.get("/route-name", async (req, res) => {
let url = req.query.url;
let browser = await puppeteer.launch({
args: ["--no-sandbox"],
});
let page = await browser.newPage();
try {
await page.setUserAgent(userAgent.toString()); // added this
await page.goto(url, {
timeout: 30000,
waitUntil: "newtorkidle2", // or "networkidle0", depending on what you need
});
res.send({ data: await page.content() });
} catch (e) {
res.send({ data: null });
} finally {
await browser.close();
}
});

Trying to crawl a website using puppeteer but getting a timeout error

I'm trying to search the Kwik Trip website for daily deals using Node.js, but I keep getting a timeout error when I try to crawl it. I'm not quite sure what could be happening. Does anyone know what may be going on?
Below is my code. I'm trying to wait for .agendaItemWrap to load before bringing back all of the HTML, because it's a SPA.
function getQuickStar(req, res) {
    (async () => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            const navigationPromise = page.waitForNavigation({ waitUntil: "domcontentloaded" });
            await page.goto('https://www.kwiktrip.com/savings/daily-deals');
            await navigationPromise;
            await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
            const body = await page.evaluate(() => {
                return document.querySelector('body').innerHTML;
            });
            console.log(body);
            await browser.close();
        } catch (error) {
            console.log(error);
        }
    })();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located inside an iframe, not in the page's main frame.
You then need to wait for the iframe and perform the waitForSelector on that particular frame.
Quick tip: you don't need page.waitForNavigation together with page.goto, because you can set the waitUntil condition in the goto options. By default it waits for the page load event.
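A rough sketch of that approach; the bare 'iframe' selector is an assumption (adjust it to match the actual frame on the page):
const frameElement = await page.waitForSelector('iframe'); // assumption: the deals widget is the only/first iframe
const frame = await frameElement.contentFrame();           // get the Frame object for that element
await frame.waitForSelector('.agendaItemWrap', { timeout: 30000 });
const body = await frame.evaluate(() => document.querySelector('body').innerHTML);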

Why does puppeteer page.goto() throw a timeout error?

The following code throws an error, why?
Navigation Timeout Exceeded: 60000ms exceeded
I'm using puppeteer version 1.19.0
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setCacheEnabled(false);
try {
const response = await page.goto("https://www.gatsbyjs.com", {
waitUntil: "networkidle0",
timeout: 60000
});
console.log("Status code:", response.status());
} catch (error) {
console.log(error.message);
}
await browser.close();
})();
Some other URLs work fine, so I wonder if there is anything special with this particular URL?
If you change waitUntil to "networkidle2", there is no timeout.
networkidle2 - consider navigation to be finished when there are no
more than 2 network connections for at least 500 ms.
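Applied to the snippet from the question, only the waitUntil value changes:
const response = await page.goto("https://www.gatsbyjs.com", {
    waitUntil: "networkidle2", // instead of "networkidle0"
    timeout: 60000
});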
As pointed out in Erez's answer, a 'serviceworker' may be holding the connection. You can check it by going to chrome://serviceworker-internals/, or DevTools -> Application tab -> Service Workers.
Service Worker: chrome://serviceworker-internals/
Scope: https://www.gatsbyjs.com/
Registration ID: 295
Navigation preload enabled: false
Navigation preload header length: 4
Active worker:
Installation Status: ACTIVATED
Running Status: RUNNING
Fetch handler existence: EXISTS
Script: https://www.gatsbyjs.com/sw.js
Version ID: 10330
Renderer process ID: 11892
Renderer thread ID: 18124
DevTools agent route ID: 8
From Network : installingWorker ServiceWorker {scriptURL: "https://www.gatsbyjs.com/sw.js", state: "installing", onerror: null, onstatechange: null}
References :
Navigation Timeout Exceeded when using networkidle0 and no insight into what timed out
Support ServiceWorkers #2634
Removing waitUntil: "networkidle0" works, so I'm assuming the site is still holding a connection to the server.
I couldn't figure out which connection it is (maybe the service worker?) using the developer tools (accessible in non-headless mode by running await puppeteer.launch({ headless: false })).
