Background: a Win10 host running Hyper-V with a Win10 VM and a Linux/Docker VM (the official node Docker image with xvfb, based on https://github.com/beemi/puppeteer-headful). Both are configured with PIA VPN (Win10 using their app, Linux/Docker using OpenVPN).
See test code below:
const puppeteer = require('puppeteer')
async function getpage() {
const browser = await puppeteer.launch({ executablePath: process.env.PUPPETEER_EXEC_PATH, headless: false, args: ['--no-sandbox', '--disable-setuid-sandbox', '--font-render-hinting=none', '--disable-dev-shm-usage'], ignoreDefaultArgs: ["--enable-automation"]})
const page = await browser.newPage()
await page.setDefaultNavigationTimeout(0);
console.log(new Date().toLocaleTimeString())
await page.goto('https://www.stackoverflow.com', {waitUntil: 'networkidle0'})
console.log(new Date().toLocaleTimeString())
await page.close()
await browser.close()
}
getpage()
Win10 VM connected to the VPN: 3 seconds
Docker VM connected to the VPN: 132 seconds
Docker VM not connected to the VPN: 1 second
Note that headless false/true does not affect the time.
Since the Win10 VM is fast, I don't think there is any issue with the VPN itself. I've also used curl to download a large file inside the Docker container over the VPN and got fast speeds, so I don't think it's an issue with the container (or the VPN). I've tried a variety of different modern websites and get similar results.
For some reason, Node or Puppeteer does not seem to like going through a VPN.
Related
While running my Puppeteer app with PM2's cluster mode enabled, during concurrent requests, only one of the processes seems to be utilized instead of all 4 (1 for each of my cores). Here's the basic flow of my program:
helpers.startChrome()
.then((resp) => {
http.createServer(async function (req, res) { // async is required because the handler uses await
// Chrome's /json/version response exposes the endpoint as webSocketDebuggerUrl
const {webSocketDebuggerUrl} = JSON.parse(resp.body);
let browser = await puppeteer.connect({browserWSEndpoint: webSocketDebuggerUrl});
const page = await browser.newPage();
... //do puppeteer stuff
await page.close();
await browser.disconnect();
})
})
and here is the startChrome() function:
startChrome: function(){
return new Promise(async (resolve, reject) => {
const opts = {
//chromeFlags: ["--no-sandbox", "--headless", "--use-gl=egl"],
userDataDir: "D:/pupeteercache",
output: 'json'
};
// Launch chrome using chrome-launcher.
const chrome = await chromeLauncher.launch(opts);
opts.port = chrome.port;
// Ask Chrome's DevTools HTTP endpoint for its version info, which includes the WebSocket URL.
const resp = await util.promisify(request)(`http://localhost:${opts.port}/json/version`);
resolve(resp);
})
}
First, I use a package called chrome-launcher to start Chrome. I then set up a simple HTTP server that listens for incoming requests to my app. When a request is received, I connect to the Chrome endpoint I set up through chrome-launcher at the beginning.
When I run this app in PM2's cluster mode, 4 separate Chrome tabs are opened (not sure why it works this way, but all right), and everything seems to run fine. But when I send the server 10 concurrent requests to see whether all processes are being used, only the first one is. I know this because when I run pm2 monit, only the first process is using any memory.
Can someone explain why not all of the processes are utilized? Is it because I'm using chrome-launcher to run one browser with multiple tabs instead of running multiple browsers?
You cannot use the same user data directory for multiple instances at the same time. If you pass a user data directory that is already in use, then no matter which launcher you use, Chrome will pick up the running process and open a new tab in it instead.
Puppeteer creates a temporary profile each time it launches the browser. So if you want to run 4 instances, pass a different user data directory to each one.
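One way to give every PM2 worker its own directory is to derive the path from something unique to the process. The sketch below is only an illustration: the `profileDirFor` helper and the `pptr-profile-` prefix are made up for this example, and it assumes PM2 exposes the worker index through the `NODE_APP_INSTANCE` environment variable in cluster mode (falling back to the process ID otherwise).

```javascript
const os = require('os');
const path = require('path');

// Build a profile directory that is unique to this worker process.
// Assumption: PM2 sets NODE_APP_INSTANCE in cluster mode; the pid fallback
// covers runs outside PM2. The "pptr-profile-" prefix is hypothetical.
function profileDirFor(baseDir, env = process.env, pid = process.pid) {
  const instance =
    env.NODE_APP_INSTANCE !== undefined ? env.NODE_APP_INSTANCE : `pid-${pid}`;
  return path.join(baseDir, `pptr-profile-${instance}`);
}

// Each worker would then hand its own directory to the launcher, e.g.
// chromeLauncher.launch({ userDataDir: profileDirFor(os.tmpdir()) })
const userDataDir = profileDirFor(os.tmpdir());
console.log(userDataDir);
```

With distinct directories, each worker should start its own Chrome process instead of attaching a tab to the first one.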
While setting up my Node.js Puppeteer proxy server I ran into some confusion. I'm on Linux Mint 19 and run Puppeteer on Node.js. Everything works when I run this code:
const puppeteer = require('puppeteer');
const pptrFirefox = require('puppeteer-firefox');
(async () => {
const browser = await puppeteer.launch({
headless: false,
args:[ '--proxy-server=socks5://127.0.0.1:9050']
});
const page = await browser.newPage();
await page.goto('http://www.whatismyproxy.com/');
await page.screenshot({path: 'example.png'}).then(()=>{console.log("I took screenshot")});
await browser.close();
})();
The proxy is provided by the Tor app running on the system. While my IP is changed and privacy works, Google and other websites recognize me as a bot (even with the proxy server off). When I switch to "puppeteer-firefox", the proxy flags do not work, but I am not recognized as a bot.
My goal is to not be recognized as a bot and to run my Puppeteer session incognito (in the future from Tails Linux, through a proxy). I'm looking forward to your answers :). I assure you this is only for development purposes. Regards to all.
Although Puppeteer and Puppeteer-Firefox share the same API, the flags you pass through the args option are browser-specific.
Firefox doesn't support setting a proxy through command-line arguments. Instead, you can create a profile with the proxy configured and launch Firefox with that profile. There are many posts explaining how to create a profile and launch Firefox with it. This is one of them.
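As a rough illustration of the profile approach, the sketch below writes a `user.js` file with Firefox's manual SOCKS proxy preferences into a fresh temporary directory. The helper names (`socksProxyPrefs`, `createProxyProfile`) and the `ff-proxy-profile-` prefix are invented for this example. Whether puppeteer-firefox passes a `-profile` argument through to Firefox unchanged is an assumption that would need checking.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Render Firefox preferences that route traffic through a SOCKS5 proxy.
function socksProxyPrefs(host, port) {
  const prefs = {
    'network.proxy.type': 1, // 1 = manual proxy configuration
    'network.proxy.socks': host,
    'network.proxy.socks_port': port,
    'network.proxy.socks_version': 5,
    'network.proxy.socks_remote_dns': true, // resolve DNS through the proxy
  };
  return (
    Object.entries(prefs)
      .map(([key, value]) => `user_pref(${JSON.stringify(key)}, ${JSON.stringify(value)});`)
      .join('\n') + '\n'
  );
}

// Create a throwaway profile directory that contains only user.js.
function createProxyProfile(host, port) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'ff-proxy-profile-'));
  fs.writeFileSync(path.join(dir, 'user.js'), socksProxyPrefs(host, port));
  return dir;
}

// Same Tor SOCKS address as in the question.
const profileDir = createProxyProfile('127.0.0.1', 9050);
console.log(profileDir);
// Assumption: the directory is then given to Firefox via its -profile flag,
// e.g. args: ['-profile', profileDir].
```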
I've started a small project with Node.js and Puppeteer that requires a proxy, and I've had some problems connecting through VPNGate's proxy servers.
This is the code I've used so far:
async function getIpTest(){
// Declare the variables locally instead of leaking implicit globals.
const ips = await new ipGeneration(40);
console.log(ips['#HostName']);
const proxConnect = '--proxy-server=' + ips['#HostName'] + '.opengw.net';
const browser= await puppeteer.launch({
headless: false,
ignoreHTTPSErrors: true,
args: [proxConnect]
});
const page = await browser.newPage();
// The Basic scheme needs a space between "Basic" and the base64 credentials.
await page.setExtraHTTPHeaders({'Proxy-Authorization': 'Basic ' + Buffer.from('vpn:vpn').toString('base64')});
await page.goto('http://www.whatsmyip.org/');
}
where
ipGeneration()
is just a module I made to parse their CSV file,
and
proxConnect= '--proxy-server=' + ips['#HostName'] + '.opengw.net';
is part of the parsing and yields the same results if I put it as a string directly in the puppeteer.launch args.
I tried changing the port, and not using any. I tried a dozen different proxy addresses, and tried connecting directly by IP or by hostname.
I've looked everywhere online but can't find out why it is not working (I should mention that everything works when I launch Puppeteer without the proxy).
Is it just that VPN Gate won't work with Puppeteer?
EDIT: I was messing around and saw that they have config data for connecting through OpenVPN. Could a simple working solution be to go Node > OpenVPN > VPN Gate servers? I'll try this now.
I want my code that uses puppeteer to run in one container while using (perhaps via the "executablePath" launch param?) a Chrome binary from another container. Is this possible? Is there any known solution for that?
Use case:
Worker code runs in multiple k8s pods (as containers). "Sometimes" (it might be often or rarely) a worker needs to run code that uses puppeteer. I don't want to make the Docker image as gigantic and limited as the puppeteer/Chrome image is (1.5 GB, if I recall correctly). I just want my code to be supplied with the needed binary from another running container.
Note: this is not a question about containerizing puppeteer; I know that's a possibility.
Building on this answer here and here, this is how you can do it. The basic idea is to run Chrome in one Docker container and connect to it from another whenever it is needed. It will need some maintenance, error handling, timeouts and concurrency control, but that is not the issue here.
Master
Install puppeteer on the master, but skip the Chromium download by setting PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true during installation. Use this instance to connect to the worker browsers running in the other container.
const browser = await puppeteer.connect({
browserWSEndpoint: "ws://123.123.123.123:8080",
ignoreHTTPSErrors: true
});
Worker
Set up a fully working Chrome here and expose its WebSocket. There are different ways to do this; here is the simplest one.
const http = require('http');
const httpProxy = require('http-proxy');
const puppeteer = require('puppeteer');
const proxy = httpProxy.createProxyServer();
http
  .createServer()
  .on('upgrade', async (req, socket, head) => {
    // Launch a browser for each incoming WebSocket connection
    // and proxy the socket to that browser's endpoint.
    const browser = await puppeteer.launch();
    const target = browser.wsEndpoint();
    proxy.ws(req, socket, head, { target });
  })
  .listen(8080);
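The answer leaves error handling and timeouts out of scope. As a minimal sketch of what the master side might add, the helper below retries a connection attempt a few times and gives up on any attempt that hangs. `withTimeout`, `connectWithRetry`, and their option names are hypothetical and not part of Puppeteer; the connect function is passed in, so the sketch makes no claims about Puppeteer's API beyond the `puppeteer.connect` call already shown above.

```javascript
// Race a promise against a timer so a hung connection attempt fails fast.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Call connectFn until it succeeds or the attempts run out.
async function connectWithRetry(
  connectFn,
  { attempts = 3, timeoutMs = 5000, delayMs = 500 } = {}
) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await withTimeout(connectFn(), timeoutMs);
    } catch (err) {
      lastError = err;
      // Wait briefly before the next attempt, except after the last one.
      if (i < attempts) await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// On the master this might be used as:
// const browser = await connectWithRetry(() =>
//   puppeteer.connect({ browserWSEndpoint: 'ws://123.123.123.123:8080' }));
```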
I have a scraping algorithm in Node.js with Puppeteer that scrapes 5 pages concurrently; when it finishes with one page, it pulls the next URL from a queue and opens it in the same page. The CPU is always at 100%. How can I make Puppeteer use less CPU?
This process runs on a DigitalOcean droplet with 4 GB of RAM and 2 vCPUs.
I've launched the Puppeteer instance with some args to try to make it lighter, but nothing changed.
puppeteer.launch({
args: ['--no-sandbox', "--disable-accelerated-2d-canvas","--disable-gpu"],
headless: true,
});
Are there any other args I can give to make it less CPU hungry?
I've also blocked images loading
await page.setRequestInterception(true);
page.on('request', request => {
if (request.resourceType().toUpperCase() === 'IMAGE')
request.abort();
else
request.continue();
});
These are my default args. Please test them and tell me whether things run smoothly.
Please note that --no-sandbox isn't secure when navigating to untrusted sites, but it's OK if you're testing your own sites or apps. So make sure you know what you're doing.
const options = {
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--single-process', // <- this one doesn't work on Windows
'--disable-gpu'
],
headless: true
}
return await puppeteer.launch(options)
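Since the comment above says `--single-process` does not work on Windows, one option is to build the argument list per platform. This is only a sketch of that idea: `buildLaunchArgs` and `BASE_ARGS` are names made up for the example, and the flags themselves are copied from the list above.

```javascript
// Flags from the answer that are used on every platform.
const BASE_ARGS = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-accelerated-2d-canvas',
  '--no-first-run',
  '--no-zygote',
  '--disable-gpu',
];

// Add --single-process everywhere except Windows, where the answer
// reports that it does not work.
function buildLaunchArgs(platform = process.platform) {
  const args = [...BASE_ARGS];
  if (platform !== 'win32') {
    args.push('--single-process');
  }
  return args;
}

console.log(buildLaunchArgs());
// The result would be passed as: puppeteer.launch({ args: buildLaunchArgs(), headless: true })
```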
A few factors can play into this. First, check whether the site(s) you're visiting use a lot of CPU. Canvas rendering and heavy scripts can easily chew through your CPU.
If you're using docker to do your deployment then make sure you use dumb-init. There's a nice repo here that goes into why you'd use such a thing, but essentially the process ID that gets assigned in your docker image has some hiccups when it comes to handling termination:
EXPOSE 8080
ENTRYPOINT ["dumb-init", "--"]
CMD ["yarn", "start"]
I've seen and fixed this kind of issue on browserless.io, where I use Docker for deployments; high CPU usage was one of the symptoms.
To avoid the parallel execution that was causing high CPU usage, I had to execute jobs sequentially using the p-iteration NPM package. In my case that's not a problem because my jobs don't take much time.
You can use either the forEachSeries or mapSeries function, depending on your scenario.
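To show what running jobs in series means, here is a dependency-free sketch of the idea. This `mapSeries` is a hand-written stand-in for illustration, not p-iteration's implementation, and `fakeJob` is a hypothetical placeholder for real scraping work.

```javascript
// Run an async mapper over the items one at a time, preserving order.
// Each job starts only after the previous one has finished.
async function mapSeries(items, mapper) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await mapper(items[i], i));
  }
  return results;
}

// Hypothetical stand-in for a scraping job: waits briefly, then returns a label.
async function fakeJob(url) {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return `done: ${url}`;
}

mapSeries(['https://example.com/a', 'https://example.com/b'], fakeJob)
  .then((results) => console.log(results));
```

Compared with `Promise.all`, this trades throughput for a lower peak load, because at most one job is active at any moment.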