Running puppeteer with containerized chrome binary from another container - node.js

I want my code using puppeteer to run in one container while using (perhaps via the "executablePath" launch param?) a Chrome binary from another container. Is this possible? Is there any known solution for that?
Use case:
Worker code runs in multiple Kubernetes pods (as containers). Sometimes (it might be often or not) a worker needs to run code that uses puppeteer. I don't want to make the Docker image as gigantic and limited as the puppeteer/chrome container is (1.5 GB if I recall correctly); I just want my code to be supplied with the needed binary from another running container.
Notice: this is not a question about containerizing puppeteer; I know that's a possibility.

Along with this answer here and here, here is how you can do this. The idea is to run Chrome in one Docker container, connect to it from another, and use it whenever needed. The setup will need some maintenance, error handling, timeouts, and concurrency control, but that is not the issue here.
Master
Install puppeteer on the master, but skip the Chromium download there by setting the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true environment variable. Use this instance only to connect to the worker's browser running in another container:
const browser = await puppeteer.connect({
  browserWSEndpoint: "ws://123.123.123.123:8080",
  ignoreHTTPSErrors: true
});
Worker
You set up a fully running Chrome here and expose its WebSocket endpoint. There are different ways to do this; here is the simplest one.
const http = require('http');
const httpProxy = require('http-proxy');
const puppeteer = require('puppeteer');

const proxy = httpProxy.createProxyServer();

http
  .createServer()
  .on('upgrade', async (req, socket, head) => {
    const browser = await puppeteer.launch();
    const target = browser.wsEndpoint();
    proxy.ws(req, socket, head, { target });
  })
  .listen(8080);
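As noted above, this setup still needs error handling and timeouts. A minimal, hedged sketch of the timeout part (the helper name and defaults are my own): wrap any call against the remote browser so a hung worker cannot stall the master forever.

```javascript
// Race a promise against a timer; whichever settles first wins.
// Prevents puppeteer.connect() or page operations from hanging
// indefinitely when the remote Chrome container is unresponsive.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so the process can exit cleanly.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage on the master:
// const browser = await withTimeout(
//   puppeteer.connect({ browserWSEndpoint: "ws://123.123.123.123:8080" }),
//   5000,
//   'connect'
// );
```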

Related

Is it possible to use puppeteer to pass a Javascript object to nodejs?

Background
I am using PoseNet (see the in-browser demo here) for keypoint detection. I have set it up to run on a WebRTC MediaStream, such that:
Client: Runs in a chrome tab on machine A. Initializes WebRTC connection and sends a MediaStream to Server. Receives back real time keypoint data from Server via WebRTC's DataChannel.
Server: Runs in a chrome tab on machine B, receives a WebRTC stream and passes the corresponding MediaStream to PoseNet. PoseNet does its thing and computes keypoints. This keypoint data is then sent back to the client via WebRTC's DataChannel (if you have a better idea, I'm all ears).
Problem: I would like to have the server receive multiple streams from various clients and run Posenet on each, sending real time keypoint data to all clients. Though I'm not thrilled about the server utilizing Chrome, I am fine with using puppeteer and Chrome's headless mode for now, mainly to abstract away WebRTC's complexity.
Approaches
I have tried two approaches, being heavily in favor of approach #2:
Approach #1
Run @tensorflow/tfjs inside the puppeteer context (i.e. inside a headless chrome tab). However, I cannot seem to get the PoseNet browser demo working in headless mode due to a WebGL error (it does work in non-headless mode though). I tried the following, passing args to puppeteer.launch() to enable WebGL, though I haven't had any luck (see here and here for reference):
const puppeteer = require('puppeteer');

async function main() {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--enable-webgl-draft-extensions',
      '--enable-webgl-image-chromium',
      '--enable-webgl-swap-chain',
      '--enable-webgl2-compute-context'
    ]
  });
  const page = await browser.newPage();
  await page.goto('https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html', {
    waitUntil: 'networkidle2'
  });

  // Make chromium console calls available to nodejs console
  page.on('console', msg => {
    for (let i = 0; i < msg.args().length; ++i)
      console.log(`${i}: ${msg.args()[i]}`);
  });
}

main();
In headless mode, I am receiving this error message.
0: JSHandle:Initialization of backend webgl failed
0: JSHandle:Error: WebGL is not supported on this device
This leaves me with question #1: How do I enable WebGL in puppeteer?
Approach #2
Preferably, I would like to run PoseNet using the @tensorflow/tfjs-node backend to accelerate computation. Therefore, I would like to link puppeteer and @tensorflow/tfjs-node, such that:
The puppeteer-chrome-tab talks WebRTC with the client and makes a MediaStream object available to node.
node takes this MediaStream and passes it to PoseNet (and thus @tensorflow/tfjs-node), where the machine learning magic happens. node then passes the detected keypoints back to puppeteer-chrome-tab, which uses its RTCDataChannel to communicate them back to the client.
Problem
The problem is that I cannot seem to get access to puppeteer's MediaStream object within node to pass it to PoseNet. I only get access to JSHandles and ElementHandles. Is it possible to pass the JavaScript object associated with the handle to node?
Concretely, this error is thrown:
UnhandledPromiseRejectionWarning: Error: When running in node, pixels must be an HTMLCanvasElement like the one returned by the `canvas` npm package
    at NodeJSKernelBackend.fromPixels (/home/work/code/node_modules/@tensorflow/tfjs-node/dist/nodejs_kernel_backend.js:1464:19)
    at Engine.fromPixels (/home/work/code/node_modules/@tensorflow/tfjs-core/dist/engine.js:749:29)
    at fromPixels_ (/home/work/code/node_modules/@tensorflow/tfjs-core/dist/ops/browser.js:85:28)
    at Object.fromPixels (/home/work/code/node_modules/@tensorflow/tfjs-core/dist/ops/operation.js:46:29)
    at toInputTensor (/home/work/code/node_modules/@tensorflow-models/posenet/dist/util.js:164:60)
    at /home/work/code/node_modules/@tensorflow-models/posenet/dist/util.js:198:27
    at /home/work/code/node_modules/@tensorflow/tfjs-core/dist/engine.js:349:22
    at Engine.scopedRun (/home/work/code/node_modules/@tensorflow/tfjs-core/dist/engine.js:359:23)
    at Engine.tidy (/home/work/code/node_modules/@tensorflow/tfjs-core/dist/engine.js:348:21)
    at Object.tidy (/home/work/code/node_modules/@tensorflow/tfjs-core/dist/globals.js:164:28)
Logging the pixels argument that is passed to NodeJSKernelBackend.prototype.fromPixels = function (pixels, numChannels) {..}, it evaluates to an ElementHandle. I am aware that I can access serializable properties of a JavaScript object using puppeteer's page.evaluate. However, if I were to pass the CanvasRenderingContext2D's imageData (obtained via getImageData()) to node by calling page.evaluate(..), that would mean stringifying an entire raw image and then reconstructing it in node's context.
This leaves me with question #2: Is there any way to make an object from puppeteer's context accessible (read-only) directly inside node, without having to go through e.g. puppeteer.evaluate(..)?
I recommend another approach, which is to ditch the idea of using puppeteer on the server-side and instead implement an actual WebRTC client in Node.js which then directly uses PoseNet via @tensorflow/tfjs-node.
Why not to use puppeteer on the server-side
Using puppeteer on the server-side introduces a lot of complexity. On top of active WebRTC connections to multiple clients, you now also have to manage one browser (or at least one tab) per connection. So not only do you have to think about what happens when the connection to a client fails, but you also have to prepare for other scenarios: browser crashes, page crashes, WebGL support (per page), the document in the browser not loading, memory/CPU usage of the browser instances, ...
That said, let's go over your approaches.
Approach 1: Running Tensorflow.js inside puppeteer
You should be able to get this running by using only the cpu backend. You can set the backend like this before using any other code:
tf.setBackend('cpu');
You might also be able to get WebGL running (as you are not the only one having problems with WebGL and puppeteer). But even if you get it running, you are now running a Node.js script to start a Chrome browser that starts a WebRTC session and Tensorflow.js training inside a website. Complexity-wise, this will be very hard to debug if any problems occur...
Approach 2: Transferring the data between puppeteer and Node.js
This approach will be nearly impossible to pull off without a large slowdown (in sending and receiving frames). puppeteer needs to serialize any exchanged data; there is no such thing as shared memory or shared data objects between the Node.js and browser environments. This means each frame (all the pixels...) has to be serialized to transfer it from the browser environment to Node.js. Performance-wise, this might work okay for small images, but will become worse the bigger your images are.
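To put a rough number on that slowdown, here is a hedged back-of-the-envelope sketch (the digits-per-channel figure is an assumption) of what a single RGBA frame costs once it has to be JSON-encoded through page.evaluate:

```javascript
// Rough estimate of the JSON payload for one RGBA frame serialized out
// of the browser: every channel byte becomes a decimal number plus a
// separator in the resulting JSON array.
function estimateJsonBytes(width, height, avgDigitsPerChannel = 3) {
  const channelValues = width * height * 4; // 4 channels per pixel (RGBA)
  return channelValues * (avgDigitsPerChannel + 1); // digits + comma each
}

// A 640x480 frame comes out to roughly 4.9 MB of JSON per frame; at
// 30 fps that is on the order of 150 MB/s through the DevTools protocol.
```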
All in all, you are introducing a lot of complexity if you want to go with one of your two approaches. Therefore, let's look at the alternative.
Alternative approach: Send your video stream directly to your server
Instead of using puppeteer to establish a WebRTC connection, you can implement a WebRTC peer directly. I read from your question that you fear the complexity, but it is probably worth the hassle.
To implement a WebRTC server, you can use the library node-webrtc, which lets you implement a WebRTC peer on the server-side. There are multiple examples, one of which is very interesting for your use case: the video-compositing example, which establishes a connection between a client (browser) and a server (Node.js) to stream a video. The server then modifies the sent frames and puts a "watermark" on top of them.
Code Sample
The following code shows the most relevant lines from the video-compositing example. The code reads a frame from the input stream and creates a node-canvas object from it.
const lastFrameCanvas = createCanvas(lastFrame.width, lastFrame.height);
const lastFrameContext = lastFrameCanvas.getContext('2d', { pixelFormat: 'RGBA24' });
const rgba = new Uint8ClampedArray(lastFrame.width * lastFrame.height * 4);
const rgbaFrame = createImageData(rgba, lastFrame.width, lastFrame.height);
i420ToRgba(lastFrame, rgbaFrame);
lastFrameContext.putImageData(rgbaFrame, 0, 0);
context.drawImage(lastFrameCanvas, 0, 0);
You now have a canvas object, which you can feed into PoseNet like this:
const net = await posenet.load();
// ...
const input = tf.browser.fromPixels(lastFrameCanvas);
const pose = await net.estimateSinglePose(input, /* ... */);
The resulting data then needs to be transferred back to the client, which can be done using a data channel. The repository also has an example (ping-pong) for that, which is much simpler than the video example.
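As a hedged sketch of that return path (the channel label and the pose payload shape are my assumptions, based on PoseNet's output and the ping-pong example's data channel):

```javascript
// Serialize a PoseNet pose for transport over an RTCDataChannel.
// Keep only what the client needs; a timestamp lets it drop stale frames.
function encodePose(pose) {
  return JSON.stringify({
    t: Date.now(),
    score: pose.score,
    keypoints: pose.keypoints,
  });
}

// Hypothetical wiring in the node-webrtc peer connection setup:
// const channel = peerConnection.createDataChannel('pose');
// channel.send(encodePose(pose));
```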
Although you might fear the complexity of using node-webrtc, I recommend giving this approach and node-webrtc-examples a try. You can check out the repository first. All examples are ready to try and play around with.

Running my Puppeteer app within PM2's cluster mode doesn't take advantage of the multiple processes

While running my Puppeteer app with PM2's cluster mode enabled, during concurrent requests, only one of the processes seems to be utilized instead of all 4 (1 for each of my cores). Here's the basic flow of my program:
helpers.startChrome()
  .then((resp) => {
    http.createServer(async function (req, res) {
      const {webSocketUrl} = JSON.parse(resp.body);
      let browser = await puppeteer.connect({browserWSEndpoint: webSocketUrl});
      const page = await browser.newPage();
      // ... do puppeteer stuff
      await page.close();
      await browser.disconnect();
    })
  })
and here is the startChrome() function:
startChrome: function(){
  return new Promise(async (resolve, reject) => {
    const opts = {
      //chromeFlags: ["--no-sandbox", "--headless", "--use-gl=egl"],
      userDataDir: "D:/pupeteercache",
      output: 'json'
    };

    // Launch chrome using chrome-launcher.
    const chrome = await chromeLauncher.launch(opts);
    opts.port = chrome.port;

    // Fetch the DevTools endpoint info so callers can use puppeteer.connect().
    const resp = await util.promisify(request)(`http://localhost:${opts.port}/json/version`);
    resolve(resp);
  })
}
First, I use a package called chrome-launcher to start up Chrome. I then set up a simple HTTP server that listens for incoming requests to my app. When a request is received, I connect to the Chrome endpoint I set up through chrome-launcher at the beginning.
When I now try to run this app within PM2's cluster mode, 4 separate Chrome tabs are opened up (not sure why it works this way, but alright), and everything seems to be running fine. But when I send the server 10 concurrent requests to test whether all processes are getting used, only the first one is. I know this because when I run PM2 monit, only the first process is using any memory.
Can someone explain to me why all the processes aren't utilized? Is it because of how I'm using chrome-launcher to only use one browser with multiple tabs instead of running multiple browsers?
You cannot use the same user data directory for multiple instances at the same time. If you pass a user data directory, no matter what kind of launcher it is, it will automatically pick the already-running process and create a new tab in it instead.
Puppeteer creates a temporary profile whenever it launches the browser. So if you want to utilize 4 instances, pass a different user data directory to each instance.

Firefox proxy server for Puppeteer Node.js

While setting up my Node.js puppeteer proxy server I ran into some confusion. My system is Linux Mint 19, and I run puppeteer on Node.js. All works well when I run this:
const puppeteer = require('puppeteer');
const pptrFirefox = require('puppeteer-firefox');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--proxy-server=socks5://127.0.0.1:9050']
  });
  const page = await browser.newPage();
  await page.goto('http://www.whatismyproxy.com/');
  await page.screenshot({path: 'example.png'}).then(() => {console.log("I took screenshot")});
  await browser.close();
})();
The proxy runs through the Tor app on the system. While my IP is changed and privacy works, Google and other websites recognize me as a bot (even without the proxy server on). When I switch to "puppeteer-firefox", the proxy flags do not work, but I am not recognized as a bot.
My goal is to not be recognized as a bot and to run my puppeteer session incognito (in the future from Tails Linux, through a proxy). I am already very excited about your answers :). I assure you this is only for development purposes. Regards to all.
Although Puppeteer and Puppeteer-Firefox share the same API, the arguments you pass via args are browser-specific.
Firefox doesn't support passing a proxy through command-line arguments. But you can create a profile and launch Firefox using that profile. There are many posts explaining how to create a profile and launch Firefox with it; this is one of them.
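A hedged sketch of what that profile approach boils down to: the pref names below are standard Firefox preferences, and writing them to a profile's user.js is one common way to apply them (the helper function and profile path are my own illustration).

```javascript
// Build the user.js contents that route a Firefox profile through a
// SOCKS5 proxy (matching the Tor proxy from the question).
function socksPrefs(host, port) {
  return [
    'user_pref("network.proxy.type", 1);', // 1 = manual proxy configuration
    `user_pref("network.proxy.socks", "${host}");`,
    `user_pref("network.proxy.socks_port", ${port});`,
    'user_pref("network.proxy.socks_remote_dns", true);', // resolve DNS via proxy
  ].join('\n');
}

// Hypothetical usage, writing into a profile directory you created:
// const fs = require('fs');
// fs.writeFileSync('/path/to/profile/user.js', socksPrefs('127.0.0.1', 9050));
```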

Puppeteer proxy connection with VPN Gate

I've been starting a small project with Node.js and Puppeteer that requires the use of a proxy, and I've had some problems connecting through VPN Gate's proxy servers.
This is the code I've used so far:
async function getIpTest(){
  const ips = await new ipGeneration(40);
  console.log(ips['#HostName']);
  const proxConnect = '--proxy-server=' + ips['#HostName'] + '.opengw.net';
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [proxConnect]
  });
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({'Proxy-Authorization': 'Basic ' + Buffer.from('vpn:vpn').toString('base64')});
  await page.goto('http://www.whatsmyip.org/');
}
where
IPGeneration()
is just a module I made to parse their CSV file,
and
proxConnect = '--proxy-server=' + ips['#HostName'] + '.opengw.net';
is part of the parsing and yields the same results if I put the string directly in puppeteer.launch's args.
I tried changing the port, or not using any. I tried a dozen different proxy addresses, and tried to connect directly by IP or by hostname.
I've tried to look everywhere online but can't seem to find why it is not working (I should mention everything works when I launch puppeteer without the proxy).
Is it just VPN Gate that won't work with puppeteer?
EDIT: I was messing around and saw that they have config data to connect through OpenVPN. Could a simple working solution be node > OpenVPN > VPN Gate servers? I'll try this now.

How to wait for a Redis connection?

I'm currently trying to use Node.js Kue for processing jobs in a queue, but I believe I'm not doing it right.
Indeed, the way I'm working now, I have two different services (which in this case I'm running with Docker Compose): one Web API built with Express which sends jobs to the queue, and one processing module. The issue here is with the processing module.
I've coded it as follows:
var kue = require('kue');
var config = require('./config');

var queue = kue.createQueue({
  prefix: config.redis.queuePrefix,
  redis: {
    port: config.redis.port,
    host: config.redis.host
  }
});

queue.process('jobType', function (job, done) {
  // do processing here...
});
When we run this with Node, it sits there waiting for things to be placed on the queue to do the processing.
There are two issues however:
It requires Redis to be available before this module runs. If we run it without Redis already available, it crashes because the host is not accessible, and the process ends.
If Redis suddenly becomes unavailable, the processing module also crashes because it cannot establish the connection, and the process is killed.
How can I avoid these problems?
My guess is that I should somehow make the code "wait" for Redis, but I have no idea on how to do this.
How can this be done in this case?
You can use a promise to wait until Redis is loaded, then run your module.
loadRedis().then(() => {
  // load your module
});
Or you can use a generator (driven by a runner such as co) to pause until Redis is loaded.
function* init(){
  const redisLoaded = yield loadRedis();
  // load your module
}
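Here is a hedged sketch of what loadRedis() could look like in practice: retry a health check until Redis answers, with a fixed delay between attempts. The helper and its defaults are my own; in a real setup the check would be a Redis client PING.

```javascript
// Keep retrying an async health check until it succeeds or the
// attempts run out; resolves with whatever the check returns.
async function waitFor(check, { retries = 10, delayMs = 1000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await check();
    } catch (err) {
      if (attempt >= retries) throw err; // give up, surface the last error
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical usage with a Redis client and the Kue queue from above:
// waitFor(() => redisClient.ping(), { retries: 30, delayMs: 2000 })
//   .then(() => queue.process('jobType', handler));
```

This covers the startup case; surviving Redis going away mid-run additionally needs a reconnect strategy on the client itself.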
