Is Puppeteer-Cluster Stealthy enough to pass bot tests? - node.js

I wanted to know if anyone using Puppeteer-Cluster could elaborate on how Cluster.launch({settings}) protects against sharing of cookies and web data between pages in different contexts.
Do the browser contexts here actually isolate cookies, so that user data is not shared or tracked? Browserless' now-infamous page (linked here) seems to think not, and that .launch({}) should be called in the task, not ahead of the queue.
So my question is: how do we know whether puppeteer-cluster is sharing cookies / data between queued tasks? And what kind of options does the library offer to lower the chances of being labeled a bot?
Setup: I am using page.authenticate with a proxy service and a random user agent, and I still occasionally get blocked (403) by the site I'm testing against.
const { Cluster } = require("puppeteer-cluster");

async function run() {
  // Create a cluster with 2 workers, each in its own browser
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER, // or Cluster.CONCURRENCY_PAGE
    maxConcurrency: 2, // the number of Chrome instances open
    monitor: false,
    sameDomainDelay: 1000, // cluster options, not puppeteerOptions
    retryDelay: 3000,
    workerCreationDelay: 3000,
    puppeteerOptions: {
      executablePath,
      args: [
        "--proxy-server=pro.proxy.net:2222",
        "--incognito",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--disable-setuid-sandbox",
        "--no-first-run",
        "--no-sandbox",
        "--no-zygote"
      ],
      headless: false
    }
  });

  // Task run for each queued URL
  const extract = async ({ page, data: dataJson }) => {
    await page.setExtraHTTPHeaders(headers);
    await page.authenticate({
      username: proxy_user,
      password: proxy_pass
    });
    // Randomized delay between roughly 2 and 3 seconds
    await delay(2000 + (Math.floor(Math.random() * 998) + 1));
    const response = await page.goto(dataJson.url);
  };

  // Register the task, then loop over inputs and queue them into the cluster
  await cluster.task(extract);
  var dataJson = {
    url: url
  };
  cluster.queue(dataJson);

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
}

Direct answer
Author of puppeteer-cluster here. The library does not actively block cookies, but makes use of browser.createIncognitoBrowserContext():
Creates a new incognito browser context. This won't share cookies/cache with other browser contexts.
In addition, the docs state that "Incognito browser contexts don't write any browsing data to disk" (source), so restarting the browser cannot reuse any cookies from disk, as no data was written.
Regarding the library, this means that when a job is executed, a new incognito context is created, which does not share any data (cookies, etc.) with other contexts. So as long as Chromium properly implements incognito browser contexts, no data is shared between jobs.
The page you linked only talks about browser.newPage() (which shares cookies between pages) and not about incognito contexts.
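For illustration, here is a minimal sketch (the URL and cookie values are made up) showing that pages created in separate incognito contexts do not see each other's cookies:
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();

  // Two isolated incognito contexts inside one browser process
  const contextA = await browser.createIncognitoBrowserContext();
  const contextB = await browser.createIncognitoBrowserContext();

  const pageA = await contextA.newPage();
  const pageB = await contextB.newPage();

  await pageA.goto("https://example.com");
  await pageA.setCookie({ name: "session", value: "abc", url: "https://example.com" });

  await pageB.goto("https://example.com");
  console.log(await pageB.cookies()); // [] - the cookie set in contextA is not visible here

  await browser.close();
})();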
Why websites might identify you as a bot
Some websites will still block you, because they use different measures to detect bots. There are headless browser detection tests as well as fingerprinting libraries that might report you as a bot if the user agent does not match the browser fingerprint. You might be interested in this answer by me that explains in more detail how these fingerprints work.
You can try a library like puppeteer-extra, which comes with a stealth plugin, to help you address the problem. However, this is basically a cat-and-mouse game: the fingerprinting tests might change, or other sites might use a different "detection" mechanism. All in all, there is no way to guarantee that a website does not detect you.
In case you want to use puppeteer-extra, be aware that you can use it in conjunction with puppeteer-cluster (example code).
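A sketch along the lines of that example (the task body and URL are placeholders): wrap vanilla Puppeteer with puppeteer-extra, register the stealth plugin, and hand the patched instance to the cluster via the puppeteer launch option:
const { addExtra } = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const vanillaPuppeteer = require("puppeteer");
const { Cluster } = require("puppeteer-cluster");

async function main() {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());

  const cluster = await Cluster.launch({
    puppeteer, // use the patched puppeteer instead of the default
    maxConcurrency: 2
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url); // stealth evasions are applied to every page
  });

  cluster.queue("https://example.com");
  await cluster.idle();
  await cluster.close();
}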

You can always use Playwright, which is considerably harder to recognize as a bot than Puppeteer and has options for using multiple browser engines, etc.

Related

How to let the frontend know when a background job is done?

On Heroku, long requests can cause H12 timeout errors.
The request must then be processed...by your application...within 30 seconds to
avoid the timeout.
src
Heroku suggests moving long tasks to background jobs.
Sending an email...Accessing a remote API...
Web scraping / crawling...you should move this heavy lifting into a background job which can run asynchronously from your web request.
src
Heroku's docs say requests shouldn't take longer than 500ms to return a response.
It’s important for a web application to serve end-user requests as
fast as possible. A good rule of thumb is to avoid web requests which
run longer than 500ms. If you find that your app has requests that
take one, two, or more seconds to complete, then you should consider
using a background job instead.
src
So if I have a background job, how do I tell the frontend when the background job is done and what the job returns?
On Heroku their example code just returns the background job id. But this won't give the frontend the information it needs.
app.post('/job', async (req, res) => {
  let job = await workQueue.add();
  res.json({ id: job.id });
});
For example, this method won't tell the frontend when an image has finished uploading. Nor will the frontend know when a call to an external API, like an exchange-rate API, has returned a result, or what that result is.
Someone suggested using job.finished(), but doesn't this get you back where you started? Now your requests are waiting for the queue to finish in order to respond, so they take just as long as when there was no queue, and this could lead to timeout errors again.
const result = await job.finished();
res.send(result);
This example uses Bull, Redis, and Node.js.
Someone suggested websockets, but I haven't found an example of this yet.
The idea of using a queue for long tasks is that you post the task and
then return immediately. I guess you are updating the database as last
step in your job, and only use the completed event for notifying the
clients. What you need to do in this case is to implement either a
websocket or similar realtime communication and push the notification
to relevant clients. This can become complicated so you can save some
time with a solution like https://pusher.com/ or similar...
https://github.com/OptimalBits/bull/issues/1901
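Along the lines of that suggestion, here is a hedged sketch (the route, queue name, and event name for the client are assumptions carried over from the snippets above) that pushes a notification to clients over socket.io when a Bull job completes:
const http = require("http");
const express = require("express");
const { Server } = require("socket.io");
const Queue = require("bull");

const app = express();
const server = http.createServer(app);
const io = new Server(server);
const workQueue = new Queue("work", process.env.REDIS_URL);

app.post("/job", async (req, res) => {
  const job = await workQueue.add({ /* payload */ });
  res.json({ id: job.id }); // still return immediately
});

// When any job completes, push its id and result to connected clients
// instead of making them wait on the request.
workQueue.on("global:completed", (jobId, result) => {
  io.emit("job:completed", { id: jobId, result });
});

server.listen(process.env.PORT || 3000);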
I also saw a solution in heroku's full example, which I didn't originally see:
frontend
// Fetch updates for each job
async function updateJobs() {
  for (let id of Object.keys(jobs)) {
    let res = await fetch(`/job/${id}`);
    let result = await res.json();
    if (!!jobs[id]) {
      jobs[id] = result;
    }
    render();
  }
}

// Attach click handlers and kick off background processes
window.onload = function () {
  document.querySelector("#add-job").addEventListener("click", addJob);
  document.querySelector("#clear").addEventListener("click", clear);
  setInterval(updateJobs, 200);
};
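For completeness, a hypothetical sketch of the server-side route this polling assumes (workQueue is carried over from the earlier snippet; Heroku's actual example code may differ): look the job up by id and return its state and, once completed, its return value:
app.get("/job/:id", async (req, res) => {
  const job = await workQueue.getJob(req.params.id);
  if (job === null) {
    return res.status(404).json({ id: req.params.id, state: "not found" });
  }
  const state = await job.getState(); // 'waiting' | 'active' | 'completed' | 'failed' | ...
  res.json({ id: job.id, state, result: job.returnvalue });
});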

Is it fine not to await a log.write() promise inside a Cloud Run container?

I'm using @google-cloud/logging to log some stuff out of my express app over on Cloud Run.
Something like this:
routeHandler.ts
import { Logging } from "@google-cloud/logging";

const logging = new Logging({ projectId: process.env.PROJECT_ID });
const logName = LOG_NAME;
const log = logging.log(logName);

const resource = {
  type: "cloud_run_revision",
  labels: { ... }
};

export const routeHandler: RequestHandler = async (req, res, next) => {
  try {
    // EXAMPLE: LOG A WARNING
    const metadata = { resource, severity: "WARNING" };
    const entry = log.entry(metadata, "SOME WARNING MSG");
    await log.write(entry);
    return res.sendStatus(200);
  }
  catch (err) {
    // EXAMPLE: LOG AN ERROR
    const metadata = { resource, severity: "ERROR" };
    const entry = log.entry(metadata, "SOME ERROR MSG");
    await log.write(entry);
    return res.sendStatus(500);
  }
};
You can see that log.write(entry) is asynchronous, so in theory it would be recommended to await it. But here is what the documentation from @google-cloud/logging says:
Doc link
And I have no problem with that. In my real case, even if log.write() fails, it is inside a try-catch and any errors will be handled just fine.
My problem is that it kind of conflicts with the Cloud Run documentation:
Doc link
Note: if I don't wait for the log.write() call, I'll end the request cycle by responding to the request.
And Cloud Run does behave like that. A couple of weeks back, I tried to respond immediately to the request and fire a long background job. The process kind of halted for a while, and I think it restarted once it got another request: completely unpredictable. When I ran that test I even had MIN_INSTANCE=1 set on my Cloud Run service container, and even that didn't allow my background job to run smoothly. Therefore, I don't think it's fine to leave the process doing background stuff after I've finished handling a request (the "fire and forget" approach).
So, what should I do here?
Posting this answer as a Community Wiki based on @Karl-JorhanSjögren's correct assumption in the comments.
For Log calls on apps running in Cloud Run you are indeed encouraged to take a Fire and Forget approach, since you don't really need to force synchronicity on that.
As mentioned in the comments replying to your concern about the CPU being disabled after the request is fulfilled: the CPU is throttled first, so that the instance can be brought back up quickly, and only completely disabled after a longer period of inactivity. So firing off small logging calls that in most cases finish within milliseconds shouldn't be a problem.
What is mentioned in the documentation is targeted at processes that run for longer periods of time.
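As a minimal sketch of that fire-and-forget approach (adapting the route handler above; the error message is made up), you can drop the await but still attach a rejection handler so a failed write is at least surfaced:
export const routeHandler: RequestHandler = (req, res) => {
  const metadata = { resource, severity: "WARNING" };
  const entry = log.entry(metadata, "SOME WARNING MSG");

  // No await: the response is not blocked on the log write.
  // A small call like this usually finishes within milliseconds.
  log.write(entry).catch((err) => console.error("log.write failed:", err));

  return res.sendStatus(200);
};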

Running my Puppeteer app within PM2's cluster mode doesn't take advantage of the multiple processes

While running my Puppeteer app with PM2's cluster mode enabled, during concurrent requests, only one of the processes seems to be utilized instead of all 4 (1 for each of my cores). Here's the basic flow of my program:
helpers.startChrome()
  .then((resp) => {
    http.createServer(async function (req, res) {
      const { webSocketUrl } = JSON.parse(resp.body);
      let browser = await puppeteer.connect({ browserWSEndpoint: webSocketUrl });
      const page = await browser.newPage();
      ... // do puppeteer stuff
      await page.close();
      await browser.disconnect();
    });
  });
and here is the startChrome() function:
startChrome: function () {
  return new Promise(async (resolve, reject) => {
    const opts = {
      //chromeFlags: ["--no-sandbox", "--headless", "--use-gl=egl"],
      userDataDir: "D:/pupeteercache",
      output: 'json'
    };
    // Launch Chrome using chrome-launcher.
    const chrome = await chromeLauncher.launch(opts);
    opts.port = chrome.port;
    // Fetch the DevTools endpoint info so puppeteer.connect() can use it later.
    const resp = await util.promisify(request)(`http://localhost:${opts.port}/json/version`);
    resolve(resp);
  });
}
First, I use a package called chrome-launcher to start up Chrome, and I then set up a simple HTTP server that listens for incoming requests to my app. When a request is received, I connect to the Chrome endpoint I set up through chrome-launcher at the beginning.
When I now try to run this app within PM2's cluster mode, 4 separate Chrome tabs are opened up (not sure why it works this way, but alright), and everything seems to run fine. But when I send the server 10 concurrent requests to test whether all processes are getting used, only the first one is. I know this because when I run PM2 monit, only the first process is using any memory.
Can someone explain why all the processes aren't utilized? Is it because I'm using chrome-launcher to run one browser with multiple tabs instead of running multiple browsers?
You cannot use the same user data directory for multiple instances at the same time. If you pass a user data directory, no matter what kind of launcher it is, it will automatically attach to the already-running process and create a new tab on it instead.
Puppeteer creates a temporary profile whenever you launch the browser. So if you want to utilize 4 instances, pass a different user data directory to each instance.
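A hedged sketch of that fix (the directory path is made up): in cluster mode, PM2 exposes a per-process NODE_APP_INSTANCE environment variable, which you can use to give each process its own profile directory so each one launches its own Chrome:
// Inside startChrome(): derive a per-instance user data directory.
// NODE_APP_INSTANCE is 0, 1, 2, ... for PM2 cluster processes.
const instanceId = process.env.NODE_APP_INSTANCE || "0";

const opts = {
  userDataDir: `D:/pupeteercache/instance-${instanceId}`,
  output: "json"
};

const chrome = await chromeLauncher.launch(opts);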

How to trigger background-processes, how to receive intermediate results?

I have a NodeJS / background-process issue that I don't know how to solve elegantly, the right way.
The user submits some (~10 or more) URLs via a textarea and then they should be processed asynchronously (a screenshot has to be taken with Puppeteer, some information gathered, the screenshot processed with sharp, and the result persisted in MongoDB: the screenshot via GridFS and the URL in its own collection with a reference to the screenshot).
While this async process is calculated in the background, the page should be updated whenever a URL got processed.
There are so many ways to do that, but which one is the most correct/straightforward/resource saving way?
Browserify and do it in the browser? No, too much stuff on the client side. AJAX/Axios posts that wait for the URLs to be processed and reflect the results on the page? Trigger the process before the response gets sent back to the client, or let the client start the processing?
So, I made a workflow engine of sorts that supports long-running jobs, following this tutorial: https://farazdagi.com/2014/rest-and-long-running-jobs/
In short: when a request comes in, you immediately return a status code and a job id; when the job completes, you log the result somewhere and serve it from there.
For this I used an EventEmitter inside a promise. It's only my solution, maybe not elegant, maybe outright wrong. I made a little POC for you.
const express = require('express');
const events = require('events');

const app = express();
const emitter = new events.EventEmitter();

// Simulates a long-running job
const actualWork = function () {
  return new Promise((res, rej) => {
    setTimeout(res, 1000);
  });
};

emitter.on('workCompleted', function (payload) {
  // log the completed job somewhere (e.g. persist payload.id and its result)
});

app.get('/someroute', (req, res) => {
  // Respond immediately with the job id ...
  res.json({ msg: 'request initiated', id: 'some_id' });
  // ... and let the work complete in the background.
  actualWork()
    .then(() => {
      emitter.emit('workCompleted', { id: 'some_id' });
    });
});

app.get('/someroute/:id/status', (req, res) => {
  // look up the logged result for req.params.id and return it
});

Node one async call slows another in parallel

For over a year we've seen interesting patterns that don't always show themselves but occasionally repeat, and we've never been able to figure out why; I'm hoping someone can make sense of it. It may be our approach, it may be the environment (Node 8.x and Koa), it may be a number of things.
We make two async calls in parallel to our dependencies using the request-promise module.
Simplified code for a single API dependency:
const httpRequest = require("request-promise");

module.exports = function (url) {
  const requestOptions = {
    uri: ...,
    json: true,
    resolveWithFullResponse: true
  };

  return httpRequest(requestOptions)
    .then(response => {
      status = response.statusCode;
      tmDiff = moment().diff(tmStart);
      return createResponseObject({
        status,
        payload: response.body
      });
    })
    .catch(err => { .... });
};
Parallel calls:
const apiResponses = yield {
  api1: foo(ctx, route),
  api2: bar(ctx)
};
Yet we've seen situations in our response-time charts where, if one service is slow, the latency of the other, separate service seems to follow. It doesn't matter which services they are; the pattern has been noticed across more than 5 services that may be called in parallel. Does anyone have any ideas what could be causing this apparently shared latency?
If the latency is caused by a temporarily slowed network connection, then it would be logical to expect both parallel requests to feel that same effect. ping or tracert during the slowdown might give you useful diagnostics to see if it's a general transport issue. If your node.js server (which runs Javascript single threaded) was momentarily busy doing something else with the CPU (serving another request, garbage collecting, etc...), then that would affect the apparent responsiveness of API calls just because it took a moment for node.js to get around to processing the waiting responses.
There are tools that monitor the responsiveness of your own http server on a continual basis (you can set whatever monitoring interval you want). If you have a CPU-hog somewhere, those tools would show a similar lag in responsiveness. There are also tools that monitor the health of your network connection which would also show a connectivity glitch. These are the types of monitoring tools that people whose job it is to maintain a healthy server farm might use. I don't have a name handy for either one, but you can presumably find such tools by searching.
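As a minimal sketch of that kind of monitoring (the interval and threshold are arbitrary), you can measure event-loop lag in-process: if the CPU is busy with another request or garbage collection, the observed delay exceeds the interval, and both "parallel" calls will appear slow together:
// Crude event-loop lag monitor: logs whenever a timer fires much
// later than scheduled, indicating the process was busy elsewhere.
const INTERVAL_MS = 500;
let last = Date.now();

setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS;
  if (lag > 100) {
    console.warn(`event loop lag: ${lag}ms`);
  }
  last = now;
}, INTERVAL_MS);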
