I've used puppeteer-cluster, and I want to build something that behaves the same way: the cluster automatically restarts the Puppeteer bot when it encounters an error. I want to re-run my bot whenever it hits an unknown error. Sometimes the bot runs slowly because of the network and fails, or it can't find a specific button and fails, but it would work on the next run, so I want to handle this by restarting the bot automatically.
I tried to do this with the cluster, but it seemed like overkill. Is there a better way to accomplish this?
const activateCluster = async (posts) =>{
const cluster = await Cluster.launch( {puppeteerOptions: {
headless: false,
defaultViewport: null,
},
puppeteer,
// monitor:true,
retryLimit:5,
timeout:180000,
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 1,
});
cluster.on('taskerror', (err, data, willRetry) => {
if (willRetry) {
console.warn(`Encountered an error while crawling ${data}. ${err.message}\nThis job will be retried`);
} else {
console.error(`Failed to crawl ${data}: ${err.message}`);
}
});
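// cluster.task() expects a task function; it receives the page and the queued data
await cluster.task(async ({ page, data }) => startBot(page));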
cluster.queue()
await cluster.idle();
await cluster.close();
}
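For the "restart on any error" behaviour without a cluster, a plain retry loop may be enough. The following is a minimal sketch, not a drop-in solution: withRetries, retries and delayMs are hypothetical names that do not come from puppeteer or puppeteer-cluster, and it assumes the bot is an async function that rejects when it fails.

```javascript
// Hypothetical helper (not part of puppeteer or puppeteer-cluster).
// Calls `task` until it succeeds or `retries` attempts have failed.
async function withRetries(task, { retries = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // `task` receives the attempt number in case it wants to log it
      return await task(attempt);
    } catch (err) {
      lastError = err;
      console.warn(`Attempt ${attempt} of ${retries} failed: ${err.message}`);
      if (attempt < retries) {
        // Short pause before the next attempt
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller
  throw lastError;
}
```

A call could look like withRetries(() => startBot(page), { retries: 5 }), assuming startBot returns a Promise. Opening a fresh page or browser inside the task on each attempt is probably safer than reusing one that may be in a broken state.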
I'm using crawlee#3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';
const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
// extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see that the images aren't loaded from the screenshot. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 3,
async requestHandler({ page, request }) {
console.log(`Processing: ${request.url}`);
await playwrightUtils.blockRequests(page);
await page.screenshot({ path: 'cnn_no_images2.png' });
},
});
This way, the resources are not blocked. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer, so maybe someone has tried this before.
Also, I've tried "route interception", but again, I couldn't make it work with PlaywrightCrawler.
You can register listeners or run any other code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 3,
preNavigationHooks: [async ({ page }) => {
await playwrightUtils.blockRequests(page);
}],
async requestHandler({ page, request }) {
console.log(`Processing: ${request.url}`);
await page.screenshot({ path: 'cnn_no_images2.png' });
},
});
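The same hook is also a plausible place for the "route interception" approach mentioned in the question. The sketch below relies only on Playwright's page.route(), route.request().resourceType(), route.abort() and route.continue(); the helper name blockResourceTypes and the default list of types are assumptions made for illustration.

```javascript
// Hypothetical helper: aborts every request whose resource type is listed
// in `blockedTypes` and lets all other requests through.
async function blockResourceTypes(page, blockedTypes = ["image", "media", "font"]) {
  const blocked = new Set(blockedTypes);
  // "**/*" matches every URL, so the handler sees all requests of the page
  await page.route("**/*", (route) => {
    if (blocked.has(route.request().resourceType())) {
      return route.abort();
    }
    return route.continue();
  });
}

// Assumed wiring, mirroring the preNavigationHooks example:
// const crawler = new PlaywrightCrawler({
//   preNavigationHooks: [async ({ page }) => blockResourceTypes(page)],
//   async requestHandler({ page }) { /* ... */ },
// });
```

Registering the route before navigation matters: a handler added inside requestHandler only takes effect after the page has already loaded its resources.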
I have built a small app that I deployed to heroku. Locally, the whole thing is working as expected. But when deployed, the Network.webSocketFrameReceived event is never triggered. It is a node app that runs on express with a minimal websocket server.
The goal of the app is to open a URL in headless Chrome (I am using Puppeteer here), record the websocket frames, parse them if they contain specific fields, and close the connection when successful. Then it moves on to the next URL.
async function openUrlAndParseFrames(page, url) {
await new Promise(async function (resolve) {
const parseWebsocketFrame = (response) => {
console.log('parsing websocket frame...', response);
let payload;
try {
// some parsing here
} catch (e) {
console.error(`Error while parsing payload ${response.response.payloadData}`)
}
}
console.log('Go to url', url);
await page.goto(url);
const cdp = await page.target().createCDPSession();
await cdp.send('Network.enable');
await cdp.send('Page.enable');
cdp.on('Network.webSocketFrameReceived', parseWebsocketFrame);
});
}
Is it not possible to make this websocket connection on heroku using puppeteer? I never receive the "parsing websocket frame..." logs...
PS:
I am aware of the special args I need to set for Puppeteer to run on Heroku:
puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
Also I added the buildpacks heroku/nodejs and https://github.com/jontewks/puppeteer-heroku-buildpack
I found the answer myself. The real problem was that the Heroku IP range was blocked: I never actually reached the page I was trying to open, and instead got a 403 from CloudFront.
I figured it out by logging the page content with const websiteContent = await page.content(); which showed the HTML of the error page.
After trying various things I decided to move away from Heroku and now successfully deployed to Google App Engine.
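A blocked request like this is easy to miss, because page.goto() resolves normally for HTTP error statuses such as 403. The following is a small sketch of a guard that surfaces the status right away; gotoOrThrow is a hypothetical helper name, and the 200-character excerpt length is arbitrary.

```javascript
// Hypothetical helper: navigates and throws when the server answers with
// an HTTP error status instead of silently continuing on an error page.
async function gotoOrThrow(page, url, options = {}) {
  const response = await page.goto(url, options);
  // page.goto() may resolve with null (for example, for a hash-only navigation)
  if (response === null) {
    return null;
  }
  const status = response.status();
  if (status >= 400) {
    // Include the start of the body to show what was actually served
    const html = await page.content();
    throw new Error(`Navigation to ${url} returned HTTP ${status}: ${html.slice(0, 200)}`);
  }
  return response;
}
```

In the websocket example, calling this instead of page.goto(url) would have reported the 403 in the logs instead of leaving the frame listener waiting forever.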
Node.js app with Express, deployed on Heroku. It's just dynamic webpages. Loading static webpages works fine.
Loading dynamic webpages works on localhost, but on Heroku it throws me code=H12, desc="Request timeout", service=30000ms, status=503.
In addition, fresh after doing heroku restart or making a deployment, there always seems to be one instance of a status=200 that loads only the static portion of a dynamic webpage.
Screenshot of logs here.
I've tried the following, which have all led to either the same or other unexpected results when deployed on Heroku (such as Error R14 (Memory quota exceeded) and code=H13 desc="Connection closed without response"):
Switching the Puppeteer Heroku buildpack I was using. I've tried the ones mentioned in this troubleshooting guide and this comment.
Adding headless: true in Puppeteer's launch arguments.
Adding the --no-sandbox, --disable-setuid-sandbox, --single-process, and --no-zygote flags in args of Puppeteer's launch arguments. (Reference: this comment & this comment)
Setting the waitUntil argument in Puppeteer's goto function to domcontentloaded, networkidle0 and networkidle2. (Reference: this comment)
Passing a timeout argument in Puppeteer goto function; I've tried 30000 and 60000 specifically, as well as 0 per this comment.
Using the waitForSelector function.
Clearing Heroku's build cache, as per this article.
Printing the url variable (see my code below) in the console. Output is as expected.
I've observed that:
With the code I have right now (see below), the try-catch-finally block never catches any error. It's always one of the following: I get an incomplete result (static portion of requested dynamic webpage), or the app crashes (code=H13 desc="Connection closed without response"). So I haven't been able to get anything out of attempting to print exception in the console from within the catch block.
Any ideas on how I could get this to work?
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
let browser;
...
app.listen(port, async() => {
browser = await puppeteer
.launch({
timeout: 0,
headless: true,
args: [
"--no-sandbox",
"--disable-setuid-sandbox",
"--single-process",
"--no-zygote",
],
});
});
...
app.get("/appropriate-route-name", async (req, res) => {
let url = req.query.url;
let page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: "networkidle2",
});
res.send({ data: await page.content() });
} catch (exception) {
res.send({ data: null });
} finally {
await browser.close();
}
});
Was able to get it to work by using user-agents. Dynamic pages now load just fine on Heroku; requests don't time out every single time anymore.
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
var userAgent = require("user-agents");
...
app.get("/route-name", async (req, res) => {
let url = req.query.url;
let browser = await puppeteer.launch({
args: ["--no-sandbox"],
});
let page = await browser.newPage();
try {
await page.setUserAgent(new userAgent().toString()); // added this; user-agents exports a class, so create an instance first
await page.goto(url, {
timeout: 30000,
waitUntil: "networkidle2", // or "networkidle0", depending on what you need
});
res.send({ data: await page.content() });
} catch (e) {
res.send({ data: null });
} finally {
await browser.close();
}
});
In a NodeJS v10.x.x environment, when trying to create a PDF page from some HTML code, I'm getting a closed page issue every time I try to do something with it (setCacheEnabled, setRequestInterception, etc...):
async (page, data) => {
try {
const {options, urlOrHtml} = data;
const finalOptions = { ...config.puppeteerOptions, ...options };
// Set caching flag (if provided)
const cache = finalOptions.cache;
if (cache != undefined) {
delete finalOptions.cache;
await page.setCacheEnabled(cache); //THIS LINE IS CAUSING THE PAGE TO BE CLOSED
}
// Setup timeout option (if provided)
let requestOptions = {};
const timeout = finalOptions.timeout;
if (timeout != undefined) {
delete finalOptions.timeout;
requestOptions.timeout = timeout;
}
requestOptions.waitUntil = 'networkidle0';
if (urlOrHtml.match(/^http/i)) {
await page.setRequestInterception(true); //THIS LINE IS CAUSING ERROR DUE TO THE PAGE BEING ALREADY CLOSED
page.once('request', request => {
if(finalOptions.method === "POST" && finalOptions.payload !== undefined) {
request.continue({method: 'POST', postData: JSON.stringify(finalOptions.payload)});
}
});
// Request is for a URL, so request it
await page.goto(urlOrHtml, requestOptions);
}
return await page.pdf(finalOptions);
} catch (err) {
logger.info(err);
}
};
I read somewhere that this issue could be caused due to some await missing, but that doesn't look like my case.
I'm not using Puppeteer directly, but through this library, which creates a cluster on top of it and handles the processes:
https://github.com/thomasdondorf/puppeteer-cluster
You already gave the solution, but as this is a common problem with the library (I'm the author 🙂) I would like to provide some more insights.
How the task function works
When a job is queued and ready to be executed, puppeteer-cluster will create a page and call the task function (given to cluster.task) with the created page object and the queued data. The cluster then waits until the Promise is finished (fulfilled or rejected) and will close the page and execute the next job in the queue.
As an async-function is implicitly creating a Promise, this means as soon as the async-function given to the cluster.task function is finished, the page is closed. There is no magic happening to determine if the page might be used in the future.
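The lifecycle described above can be imitated without a browser. The sketch below is a simplified stand-in, not the library's real implementation: FakePage, runJob and demo are hypothetical names, and the fake page only tracks whether it has been closed.

```javascript
// Stand-in for a Puppeteer page: it only remembers whether it was closed.
class FakePage {
  constructor() { this.closed = false; }
  async evaluate() {
    if (this.closed) throw new Error("Target closed.");
    return "ok";
  }
  async close() { this.closed = true; }
}

// Imitates the described flow: create a page, await the task's Promise,
// then close the page no matter how the task ended.
async function runJob(task, data) {
  const page = new FakePage();
  try {
    return await task({ page, data });
  } finally {
    await page.close();
  }
}

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function demo() {
  const log = [];
  // Task A: the timer callback is not awaited, so the page is closed first.
  await runJob(async ({ page }) => {
    setTimeout(() => {
      page.evaluate().then(
        () => log.push("A: ok"),
        (err) => log.push(`A: ${err.message}`)
      );
    }, 10);
  });
  // Task B: the task's Promise settles only after the delayed work is done.
  await runJob(async ({ page }) => {
    await delay(10);
    log.push(`B: ${await page.evaluate()}`);
  });
  await delay(30); // give task A's timer time to fire
  return log;
}
```

Running demo() should record a "Target closed." failure for task A and a successful evaluate for task B, which is the difference between work that outlives the task function and work that is awaited inside it.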
Waiting for asynchronous events
Below is a code sample with a common mistake. The user might want to wait for an external event before closing the page as in the (not working) example below:
Non-working (!) code sample:
await cluster.task(async ({ page, data }) => {
await page.goto('...');
setTimeout(async () => { // user is waiting for an asynchronous event
await page.evaluate(/* ... */); // Will throw an error as the page is already closed
}, 1000);
});
In this code, the page is already closed before the asynchronous function is executed. The correct way to do this is to return a Promise instead.
Working code sample:
await cluster.task(async ({ page, data }) => {
await page.goto('...');
// will wait until the Promise resolves
await new Promise(resolve => {
setTimeout(async () => { // user is waiting for an asynchronous event
try {
await page.evaluate(/* ... */);
resolve();
} catch (err) {
// handle error
}
}, 1000);
});
});
In this code sample, the task function waits until the inner Promise is resolved before its own Promise resolves. This keeps the page open until the asynchronous function calls resolve. In addition, the code uses a try..catch block, as the library is not able to catch errors thrown inside asynchronous callbacks.
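When the "asynchronous event" is a real event rather than a timer, the same idea can be wrapped in a reusable Promise that also rejects, so a task does not hang if the event never arrives. The following is a minimal sketch using Node's EventEmitter; waitForEvent and the default timeout are hypothetical and not part of puppeteer-cluster.

```javascript
const { EventEmitter } = require("events");

// Hypothetical helper: resolves with the payload of the first `eventName`
// event and rejects if nothing arrives within `timeoutMs`.
function waitForEvent(emitter, eventName, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const onEvent = (payload) => {
      clearTimeout(timer);
      resolve(payload);
    };
    const timer = setTimeout(() => {
      // Remove the listener so it does not leak after the timeout
      emitter.off(eventName, onEvent);
      reject(new Error(`Timed out waiting for "${eventName}"`));
    }, timeoutMs);
    emitter.once(eventName, onEvent);
  });
}
```

Inside a task function, something like await waitForEvent(someEmitter, "done") keeps the page open until the event fires. Because a timeout rejects the Promise, the task itself rejects and the error is no longer lost inside a callback.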
I got it.
I was indeed forgetting an await on the call to the function I posted.
That call was in another file that I use for the cluster instance creation:
async function createCluster() {
//We will protect our app with a Cluster that handles all the processes running in our headless browser
const cluster = await Cluster.launch({
concurrency: Cluster[config.cluster.concurrencyModel],
maxConcurrency: config.cluster.maxConcurrency
});
// Event handler to be called in case of problems
cluster.on('taskerror', (err, data) => {
console.log(`Error on cluster task... ${data}: ${err.message}`);
});
// Incoming task for the cluster to handle
await cluster.task(async ({ page, data }) => {
main.postController(page, data); // <-- I WAS MISSING A return await HERE
});
return cluster;
}
As you can see with the sample code below, I'm using Puppeteer with a cluster of workers in Node to run multiple requests of websites screenshots by a given URL:
const cluster = require('cluster');
const express = require('express');
const bodyParser = require('body-parser');
const puppeteer = require('puppeteer');
async function getScreenshot(domain) {
let screenshot;
const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'] });
const page = await browser.newPage();
try {
await page.goto('http://' + domain + '/', { timeout: 60000, waitUntil: 'networkidle2' });
} catch (error) {
try {
await page.goto('http://' + domain + '/', { timeout: 120000, waitUntil: 'networkidle2' });
screenshot = await page.screenshot({ type: 'png', encoding: 'base64' });
} catch (error) {
console.error('Connecting to: ' + domain + ' failed due to: ' + error);
}
}
await page.close();
await browser.close();
return screenshot;
}
if (cluster.isMaster) {
const numOfWorkers = require('os').cpus().length;
for (let worker = 0; worker < numOfWorkers; worker++) {
cluster.fork();
}
cluster.on('exit', function (worker, code, signal) {
console.debug('Worker ' + worker.process.pid + ' died with code: ' + code + ', and signal: ' + signal);
cluster.fork();
});
cluster.on('message', function (handler, msg) {
console.debug('Worker: ' + handler.process.pid + ' has finished working on ' + msg.domain + '. Exiting...');
if (cluster.workers[handler.id]) {
cluster.workers[handler.id].kill('SIGTERM');
}
});
} else {
const app = express();
app.use(bodyParser.json());
app.listen(80, function() {
console.debug('Worker ' + process.pid + ' is listening to incoming messages');
});
app.post('/screenshot', (req, res) => {
const domain = req.body.domain;
getScreenshot(domain)
.then((screenshot) => {
try {
process.send({ domain: domain });
} catch (error) {
console.error('Error while exiting worker ' + process.pid + ' due to: ' + error);
}
res.status(200).json({ screenshot: screenshot });
})
.catch((error) => {
try {
process.send({ domain: domain });
} catch (error) {
console.error('Error while exiting worker ' + process.pid + ' due to: ' + error);
}
res.status(500).json({ error: error });
});
});
}
Some explanation:
Each time a request arrives a worker will process it and kill itself at the end
Each worker creates a new browser instance with a single page, and if a page takes more than 60 seconds to load, it retries loading it (in the same page, because some resources may already have been loaded) with a timeout of 120 seconds
Once finished both the page and the browser will be closed
My problem is that some legitimate domains get errors that I can't explain:
Error: Protocol error (Page.navigate): Target closed.
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I read at some git issue (that I can't find now) that it can happen when the page redirects and adds 'www' at the start, but I'm hoping it's false...
Is there something I'm missing?
What "Target closed" means
When you launch a browser via puppeteer.launch, it will start a browser and connect to it. From then on, any function you execute on the opened browser (like page.goto) is sent to the browser via the Chrome DevTools Protocol. A target means a tab in this context.
The Target closed exception is thrown when you are trying to run a function, but the target (tab) was already closed.
Similar error messages
The error message was recently changed to give more meaningful information. It now gives the following message:
Error: Protocol error (Target.activateTarget): Session closed. Most likely the page has been closed.
Why does it happen
There are multiple reasons why this could happen.
You used a resource that was already closed
Most likely, you are seeing this message because you closed the tab/browser and are still trying to use the resource. To give a simple example:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await browser.close();
await page.goto('http://www.google.com');
In this case the browser was closed and after that, a page.goto was called resulting in the error message. Most of the time, it will not be that obvious. Maybe an error handler already closed the page during a cleanup task, while your script is still crawling.
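One way to make this kind of mistake harder is to tie the lifetime of a page to a single awaited callback, so that cleanup can only run after the work has finished. This is a sketch under assumptions: withPage is a hypothetical helper, and it relies only on browser.newPage(), page.isClosed() and page.close() from Puppeteer's API.

```javascript
// Hypothetical helper: opens a page, awaits `work(page)`, and closes the
// page afterwards, whether `work` succeeded or threw.
async function withPage(browser, work) {
  const page = await browser.newPage();
  try {
    return await work(page);
  } finally {
    // Skip close() if something else (e.g. an error handler) already closed it
    if (!page.isClosed()) {
      await page.close();
    }
  }
}
```

The guarantee only holds if everything that touches the page is awaited inside work; a Promise that escapes the callback can still run into the closed page.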
The browser crashed or was unable to initialize
I also experience this every few hundred requests. There is an issue about this on the puppeteer repository as well. It seems to happen when you are using a lot of memory or CPU power. Maybe you are spawning a lot of browsers? In these cases the browser might crash or disconnect.
I found no "silver bullet" solution to this problem. But you might want to check out the library puppeteer-cluster (disclaimer: I'm the author), which handles these kinds of error cases and lets you retry the URL when the error happens. It can also manage a pool of browser instances and would simplify your code.
For me removing '--single-process' from args fixed the issue.
puppeteerOptions: {
headless: true,
args: [
'--disable-gpu',
'--disable-dev-shm-usage',
'--disable-setuid-sandbox',
'--no-first-run',
'--no-sandbox',
'--no-zygote',
'--deterministic-fetch',
'--disable-features=IsolateOrigins',
'--disable-site-isolation-trials',
// '--single-process',
],
}
I was just experiencing the same issue every time I tried running my puppeteer script*. The above did not resolve this issue for me.
I got it to work by removing and reinstalling the puppeteer package:
npm remove puppeteer
npm i puppeteer
*I only experienced this issue when setting the headless option to false.
I've wound up at this thread a few times, and the typical culprit is that I forgot to await a Puppeteer page call that returned a promise, causing a race condition.
Here's a minimal example of what this can look like:
const puppeteer = require("puppeteer");
let browser;
(async () => {
browser = await puppeteer.launch({headless: true});
const [page] = await browser.pages();
page.goto("https://www.stackoverflow.com"); // whoops, forgot await!
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Output is:
C:\Users\foo\Desktop\puppeteer-playground\node_modules\puppeteer\lib\cjs\puppeteer\common\Connection.js:217
this._callbacks.set(id, { resolve, reject, error: new Error(), method });
^
Error: Protocol error (Page.navigate): Target closed.
at C:\Users\foo\Desktop\puppeteer-playground\node_modules\puppeteer\lib\cjs\puppeteer\common\Connection.js:217:63
In this case, it seems like an unmissable error, but in a larger chunk of code, where the promise is nested or inside a condition, it's easy to overlook.
You'll get a similar error for forgetting to await a page.click() or other promise call, for example, Error: Protocol error (Runtime.callFunctionOn): Target closed., which can be seen in the question UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Target closed. (Puppeteer)
This is a contribution to the thread as a canonical resource for the error and may not be the solution to OP's problem, although the fundamental race condition seems to be a likely cause.
In 2021 I received a very similar error: Error: Error pdf creationError: Protocol error (Target.setDiscoverTargets): Target closed. I solved it by experimenting with different args. In my case, having the pipe: true flag in the puppeteer.launch options on the production server produced the error.
Adding the --disable-dev-shm-usage flag also does the trick.
The solution below works for me:
const browser = await puppeteer.launch({
headless: true,
// pipe: true, <-- delete this property
args: [
'--no-sandbox',
'--disable-dev-shm-usage', // <-- add this one
],
});
Check your jest-puppeteer.config.js file.
I made the below mistake
module.exports = {
launch: {
headless: false,
browserContext: "default",
},
};
and after correcting it as below
module.exports = {
launch: {
headless: false
},
browserContext: "default",
};
everything worked just fine!!!
After hours of frustration, I realized that this happens when the script navigates to a new page: after I press a button or perform any other action that causes a redirect, I need to await page.waitForNavigation() before I do anything else.
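A common way to write this is to start waiting for the navigation before triggering it, so the navigation cannot finish unnoticed between the two calls. The sketch below uses Puppeteer's page.waitForNavigation() and page.click(); the helper name clickAndWaitForNavigation is hypothetical.

```javascript
// Hypothetical helper: clicks `selector` and resolves once the navigation
// caused by the click has finished.
async function clickAndWaitForNavigation(page, selector, options = {}) {
  const [response] = await Promise.all([
    page.waitForNavigation(options), // start listening first
    page.click(selector),            // then trigger the navigation
  ]);
  return response;
}
```

Usage might look like await clickAndWaitForNavigation(page, "#submit", { waitUntil: "networkidle2" }), where the selector is only an example.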