I have a scraping algorithm in Node.js with Puppeteer that scrapes 5 pages concurrently; when it finishes with one page, it pulls the next URL from a queue and opens it in the same tab. The CPU is always at 100%. How can I make Puppeteer use less CPU?
This process runs on a DigitalOcean droplet with 4 GB of RAM and 2 vCPUs.
I've launched the Puppeteer instance with some args to try to make it lighter, but nothing changed:
puppeteer.launch({
  args: ['--no-sandbox', '--disable-accelerated-2d-canvas', '--disable-gpu'],
  headless: true,
});
Are there any other args I can give to make it less CPU hungry?
I've also blocked image loading:
await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType().toUpperCase() === 'IMAGE')
    request.abort();
  else
    request.continue();
});
These are my default args; please test them and tell me if your scraper runs more smoothly.
Please note that --no-sandbox isn't secure when navigating to vulnerable sites, but it's OK if you're testing your own sites or apps. Make sure you know what you're doing.
const options = {
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-accelerated-2d-canvas',
    '--no-first-run',
    '--no-zygote',
    '--single-process', // <- this one doesn't work on Windows
    '--disable-gpu'
  ],
  headless: true
}
return await puppeteer.launch(options)
There are a few factors that can play into this. First, check whether the site(s) you're visiting use a lot of CPU. Heavy scripts, and canvas rendering in particular, can easily chew through your CPU.
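If the pages you scrape are script- or media-heavy, you can extend the request interception you already have to skip more resource types. A minimal sketch; the list of blocked types is an assumption, so keep 'script' enabled if the page builds its content with JavaScript:

const puppeteer = require('puppeteer');

// Resource types that often burn CPU/bandwidth but are rarely needed for scraping.
// This list is an assumption; adjust it to what your scraper actually needs.
const BLOCKED = new Set(['image', 'media', 'font', 'stylesheet']);

async function openLightweightPage(browser, url) {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (BLOCKED.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return page;
}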
If you're using Docker for your deployment, then make sure you use dumb-init. There's a nice repo here that goes into why you'd use such a thing, but essentially the process that gets PID 1 in your Docker image has some hiccups when it comes to handling termination:
EXPOSE 8080
ENTRYPOINT ["dumb-init", "--"]
CMD ["yarn", "start"]
This is something I've witnessed and fixed on browserless.io, as I use Docker to handle deployments there; runaway CPU usage was one of the symptoms.
To avoid the parallel execution that causes high CPU usage, I had to run jobs sequentially using the p-iteration NPM package. In my case that's not a problem because my jobs don't take much time. You can use either the forEachSeries or the mapSeries function, depending on your scenario, as in the sketch below.
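A minimal sketch of the sequential approach with p-iteration; the scrapePage helper and the urls array are placeholders for your own queue and scraping logic:

const puppeteer = require('puppeteer');
const { forEachSeries } = require('p-iteration');

// Placeholder for your real scraping logic.
async function scrapePage(page, url) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return page.title();
}

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const urls = ['https://example.com', 'https://example.org'];

  // forEachSeries waits for each async callback to finish before starting the next,
  // so only one page is being processed at any time.
  await forEachSeries(urls, async url => {
    const title = await scrapePage(page, url);
    console.log(url, '->', title);
  });

  await browser.close();
})();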
Background: a Win10 host running Hyper-V with a Win10 VM and a Linux/Docker VM (the official Node Docker image with xvfb, based on https://github.com/beemi/puppeteer-headful). Both are configured with the PIA VPN (the Win10 VM using their app, the Linux/Docker VM using OpenVPN).
See test code below:
const puppeteer = require('puppeteer')

async function getpage() {
  const browser = await puppeteer.launch({
    executablePath: process.env.PUPPETEER_EXEC_PATH,
    headless: false,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--font-render-hinting=none',
      '--disable-dev-shm-usage'
    ],
    ignoreDefaultArgs: ['--enable-automation']
  })
  const page = await browser.newPage()
  await page.setDefaultNavigationTimeout(0)
  console.log(new Date().toLocaleTimeString())
  await page.goto('https://www.stackoverflow.com', { waitUntil: 'networkidle0' })
  console.log(new Date().toLocaleTimeString())
  await page.close()
  await browser.close()
}

getpage()
Win10 VM connected to the VPN: 3 seconds
Docker VM connected to the VPN: 132 seconds
Docker VM not connected to the VPN: 1 second
Note that headless true/false does not affect the timing.
Since the Win10 VM is fast, I don't think there is any issue with the VPN itself. I've tried to curl a large file inside the Docker container with the VPN on and I get fast speeds, so I don't think it's an issue with the container (or the VPN). I've also tried a variety of different modern websites and get similar results.
For some reason, Node or Puppeteer does not seem to like going through a VPN.
I wanted to know if anyone using puppeteer-cluster could elaborate on how Cluster.launch({settings}) protects against sharing of cookies and web data between pages in different contexts.
Do the browser contexts here actually isolate cookies, so that user data is not shared or tracked? Browserless' now-infamous page seems to think not (here), and that .launch({}) should be called in the task, not ahead of the queue.
So my question is: how do we know whether puppeteer-cluster is sharing cookies/data between queued tasks? And what options does the library offer to lower the chances of being labeled a bot?
Setup: I am using page.authenticate with a proxy service and a random user agent, and I'm still occasionally getting blocked (403) by the site I'm testing against.
const { Cluster } = require('puppeteer-cluster');

async function run() {
  // Create a cluster with 2 workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER, // or Cluster.CONCURRENCY_PAGE
    maxConcurrency: 2, // the number of Chrome instances open
    monitor: false,
    // These are cluster options, not Puppeteer launch options
    sameDomainDelay: 1000,
    retryDelay: 3000,
    workerCreationDelay: 3000,
    puppeteerOptions: {
      executablePath, // defined elsewhere
      args: [
        "--proxy-server=pro.proxy.net:2222",
        "--incognito",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--disable-setuid-sandbox",
        "--no-first-run",
        "--no-sandbox",
        "--no-zygote"
      ],
      headless: false
    }
  });

  // Task: authenticate against the proxy, wait a randomized delay, then navigate
  const extract = async ({ page, data: dataJson }) => {
    await page.setExtraHTTPHeaders(headers); // headers, proxy_user, proxy_pass, delay defined elsewhere
    await page.authenticate({
      username: proxy_user,
      password: proxy_pass
    });
    // Randomized delay
    await delay(2000 + (Math.floor(Math.random() * 998) + 1));
    const response = await page.goto(dataJson.url);
  };

  // Define a default task that delegates to extract
  await cluster.task(async ({ page, data }) => {
    await extract({ page, data });
  });

  // Loop over inputs and queue them into the cluster
  var dataJson = {
    url: url // defined elsewhere
  };
  cluster.queue(dataJson, extract);

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
}
Direct answer
Author of puppeteer-cluster here. The library does not actively block cookies, but makes use of browser.createIncognitoBrowserContext():
Creates a new incognito browser context. This won't share cookies/cache with other browser contexts.
In addition, the docs state that "Incognito browser contexts don't write any browsing data to disk" (source), so a restarted browser cannot reuse any cookies from disk, as no data was written.
Regarding the library, this means that when a job is executed, a new incognito context is created, which does not share any data (cookies, etc.) with other contexts. So as long as Chromium properly implements incognito browser contexts, no data is shared between jobs.
The page you linked only talks about browser.newPage() (which shares cookies between pages) and not about incognito contexts.
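For illustration, this is roughly what the isolation looks like in plain Puppeteer (a sketch of the mechanism the library relies on, not puppeteer-cluster's internal code):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  // Each incognito context has its own cookies/cache, isolated from the others
  // and from the default context.
  const contextA = await browser.createIncognitoBrowserContext();
  const contextB = await browser.createIncognitoBrowserContext();

  const pageA = await contextA.newPage();
  const pageB = await contextB.newPage();

  await pageA.goto('https://example.com');
  await pageB.goto('https://example.com');

  // Cookies set in contextA are not visible in contextB.
  console.log((await pageA.cookies()).length, (await pageB.cookies()).length);

  // Closing a context disposes of its pages and its browsing data.
  await contextA.close();
  await contextB.close();
  await browser.close();
})();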
Why websites might identify you as a bot
Some websites will still block you, because they use different measures to detect bots. There are headless browser detection tests as well as fingerprinting libraries that might report you as a bot if the user agent does not match the browser fingerprint. You might be interested in this answer of mine, which explains in more detail how these fingerprints work.
You can try a library like puppeteer-extra, which comes with a stealth plugin, to help you address the problem. However, this is basically a cat-and-mouse game: the fingerprinting tests might change, or other sites might use a different "detection" mechanism. All in all, there is no way to guarantee that a website does not detect you.
In case you want to use puppeteer-extra, be aware that you can use it in conjunction with puppeteer-cluster (example code).
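A minimal sketch of that combination, assuming the puppeteer-extra stealth plugin and puppeteer-cluster's option of passing in a custom puppeteer instance:

const { Cluster } = require('puppeteer-cluster');
const vanillaPuppeteer = require('puppeteer');
const { addExtra } = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

(async () => {
  // Wrap regular puppeteer with puppeteer-extra and register the stealth plugin.
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());

  // Hand the wrapped instance to puppeteer-cluster.
  const cluster = await Cluster.launch({
    puppeteer,
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    console.log(url, await page.title());
  });

  cluster.queue('https://example.com');
  await cluster.idle();
  await cluster.close();
})();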
You can always use Playwright, which is harder to recognize as a bot than Puppeteer and supports multiple browsers (Chromium, Firefox, WebKit); see the sketch below.
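A minimal sketch, just to show the shape of the API; the choice of Firefox and the target URL are arbitrary assumptions:

const { firefox } = require('playwright');

(async () => {
  // Playwright can drive Chromium, Firefox, or WebKit with the same API.
  const browser = await firefox.launch({ headless: true });
  const context = await browser.newContext(); // isolated cookies/storage per context
  const page = await context.newPage();

  await page.goto('https://example.com');
  console.log(await page.title());

  await browser.close();
})();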
While running my Puppeteer app with PM2's cluster mode enabled, during concurrent requests, only one of the processes seems to be utilized instead of all 4 (1 for each of my cores). Here's the basic flow of my program:
helpers.startChrome()
  .then((resp) => {
    http.createServer(async function (req, res) {
      const { webSocketUrl } = JSON.parse(resp.body);
      let browser = await puppeteer.connect({ browserWSEndpoint: webSocketUrl });
      const page = await browser.newPage();
      ... //do puppeteer stuff
      await page.close();
      await browser.disconnect();
    })
  })
and here is the startChrome() function:
startChrome: function(){
  return new Promise(async (resolve, reject) => {
    const opts = {
      //chromeFlags: ["--no-sandbox", "--headless", "--use-gl=egl"],
      userDataDir: "D:/pupeteercache",
      output: 'json'
    };

    // Launch Chrome using chrome-launcher.
    const chrome = await chromeLauncher.launch(opts);
    opts.port = chrome.port;

    // Fetch the DevTools /json/version info so the caller can puppeteer.connect() to it.
    const resp = await util.promisify(request)(`http://localhost:${opts.port}/json/version`);
    resolve(resp);
  })
}
First, I use a package called chrome-launcher to start up Chrome, and then I set up a simple HTTP server that listens for incoming requests to my app. When a request is received, I connect to the Chrome endpoint I set up through chrome-launcher at the beginning.
When I now run this app in PM2's cluster mode, 4 separate Chrome tabs are opened (not sure why it works this way, but alright), and everything seems to run fine. But when I send the server 10 concurrent requests to test whether all processes are being used, only the first one is. I know this because when I run PM2 monit, only the first process is using any memory.
Can someone explain why all the processes aren't utilized? Is it because of how I'm using chrome-launcher, with a single browser and multiple tabs instead of multiple browsers?
You cannot use the same user data directory for multiple instances at the same time. If you pass a user data directory, no matter what kind of launcher it is, Chrome will attach to the already running process and simply open a new tab there instead.
Puppeteer normally creates a temporary profile whenever it launches the browser. So if you want to utilize 4 instances, pass a different user data directory to each instance, for example as in the sketch below.
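A minimal sketch of one way to do that under PM2, using the instance id that PM2 exposes; the NODE_APP_INSTANCE environment variable and the directory layout are assumptions, so adapt them to your setup:

const chromeLauncher = require('chrome-launcher');

async function startChrome() {
  // PM2 sets NODE_APP_INSTANCE to 0..N-1 in cluster mode (assumption: default PM2 config).
  const instanceId = process.env.NODE_APP_INSTANCE || '0';

  const chrome = await chromeLauncher.launch({
    // A separate profile per process avoids attaching to an already running Chrome.
    userDataDir: `D:/pupeteercache/instance-${instanceId}`,
    chromeFlags: ['--headless', '--no-sandbox']
  });

  return chrome; // chrome.port can then be used to build the DevTools endpoint URL
}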
While setting up my Node.js Puppeteer proxy setup I ran into a few things I don't understand. My system is Linux Mint 19, and I run Puppeteer on Node.js. Everything works well when I run this script:
const puppeteer = require('puppeteer');
const pptrFirefox = require('puppeteer-firefox');
(async () => {
const browser = await puppeteer.launch({
headless: false,
args:[ '--proxy-server=socks5://127.0.0.1:9050']
});
const page = await browser.newPage();
await page.goto('http://www.whatismyproxy.com/');
await page.screenshot({path: 'example.png'}).then(()=>{console.log("I took screenshot")});
await browser.close();
})();
The proxy runs through the Tor service on my system. While my IP is changed and the privacy part works, Google and other websites recognize me as a bot (even with the proxy server off). When I switch to "puppeteer-firefox", the proxy flags do not work, but I am not recognized as a bot.
My goal is to not be recognized as a bot and to run my Puppeteer session incognito (in the future from Tails Linux, through a proxy). I'm already excited to read your answers :). I assure you this is only for development purposes. Regards to all.
Although Puppeteer and Puppeteer-Firefox share the same API, the arguments you pass via the args option are browser specific.
Firefox doesn't support setting a proxy through command-line arguments. But you can create a profile with the proxy configured and launch Firefox using that profile. There are many posts explaining how to create a profile and launch Firefox with it; this is one of them.
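A rough sketch of that idea: write the proxy preferences into a user.js inside a profile directory and point Firefox at it. The profile path and the SOCKS settings are assumptions, and whether puppeteer-firefox forwards extra args to the Firefox binary depends on its (long-deprecated) version, so treat this only as an illustration of the profile approach:

const fs = require('fs');
const path = require('path');
const pptrFirefox = require('puppeteer-firefox');

(async () => {
  // Create a throwaway profile directory with the proxy preferences pre-set.
  const profileDir = path.join(__dirname, 'ff-proxy-profile');
  fs.mkdirSync(profileDir, { recursive: true });
  fs.writeFileSync(path.join(profileDir, 'user.js'), [
    'user_pref("network.proxy.type", 1);',            // manual proxy configuration
    'user_pref("network.proxy.socks", "127.0.0.1");',
    'user_pref("network.proxy.socks_port", 9050);',
    'user_pref("network.proxy.socks_remote_dns", true);'
  ].join('\n'));

  // Tell Firefox to use that profile instead of a temporary one.
  const browser = await pptrFirefox.launch({
    headless: false,
    args: ['-profile', profileDir]
  });

  const page = await browser.newPage();
  await page.goto('http://www.whatismyproxy.com/');
  await browser.close();
})();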
I want my Puppeteer code to run in one container while using (perhaps via the "executablePath" launch param?) a Chrome binary from another container. Is this possible? Is there any known solution for that?
Use case:
Worker code runs in multiple k8s pods (as containers). "Sometimes" (maybe often, maybe not) a worker needs to run code that uses Puppeteer. I don't want to make the Docker image gigantic and constrained the way the puppeteer/chrome container is (1.5 GB, if I recall correctly); I just want my code to be supplied with the needed binary from another running container.
Note: this is not a question about containerizing Puppeteer; I know that's a possibility.
Along with the answers here and here, here is how you can do this. The basic idea is to run Chrome in one Docker container and connect to it from another, then use that connection whenever you need it. It will still need some maintenance (error handling, timeouts, concurrency), but that is not the issue here.
Master
You install Puppeteer in the master container, but skip the bundled Chromium download by setting PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true; then use this container only to connect to the worker Chrome running in another container:
const browser = await puppeteer.connect({
browserWSEndpoint: "ws://123.123.123.123:8080",
ignoreHTTPSErrors: true
});
Worker
You set up a fully running Chrome here and expose its WebSocket. There are different ways to do this; here is the simplest one:
const http = require('http');
const httpProxy = require('http-proxy');
const puppeteer = require('puppeteer');

const proxy = httpProxy.createProxyServer();

http
  .createServer()
  .on('upgrade', async (req, socket, head) => {
    // Launch a browser for this connection and proxy the WebSocket upgrade to it.
    const browser = await puppeteer.launch();
    const target = browser.wsEndpoint();
    proxy.ws(req, socket, head, { target });
  })
  .listen(8080);