Selenium WebDriver JS Scraping in Parallel [nodejs] - node.js

I'm trying to create a pool of PhantomJS WebDrivers [using webdriverjs] like
var driver = new Webdriver.Builder().withCapabilities(Webdriver.Capabilities.phantomjs()).build();
Once the pool gets populated [I see n phantom processes spawned], I try to do driver.get [using the different drivers in the pool] on different URLs, expecting them to run in parallel [as driver.get is async].
But I always see them executed sequentially.
Can't we load different URLs in parallel via different WebDriver instances?
If that's not possible this way, how else could I solve this issue?
A very basic impl of my question looks like this:
var Webdriver = require('selenium-webdriver');

function getInstance() {
    return new Webdriver.Builder()
        .withCapabilities(Webdriver.Capabilities.phantomjs())
        .build();
}

var pool = [];
for (var i = 0; i < 3; i++) {
    pool.push(getInstance());
}

pool[0].get("http://mashable.com/2014/01/14/outdated-web-features/").then(function () {
    console.log(0);
});
pool[1].get("http://google.com").then(function () {
    console.log(1);
});
pool[2].get("http://techcrunch.com").then(function () {
    console.log(2);
});
PS: Have already posted it here
Update:
I tried selenium grid with the following setup, as it was mentioned that it can run tests in parallel:
Hub:
java -jar selenium/selenium-server-standalone-2.39.0.jar -host 127.0.0.1 -port 4444 -role hub -nodeTimeout 600
Phantom:
phantomjs --webdriver=7777 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 --debug=true
phantomjs --webdriver=7877 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 --debug=true
phantomjs --webdriver=6777 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 --debug=true
Still I see the get commands getting queued and executed sequentially instead of in parallel. [But they do get properly distributed across the 3 instances.]
Am I still missing something?
Why does the doc say "scale by distributing tests on several machines ( parallel execution )"?
What counts as parallel for the hub? I'm getting clueless.

I guess I found the issue.
Basically https://code.google.com/p/selenium/source/browse/javascript/node/selenium-webdriver/executors.js#39 is a synchronous, blocking operation [at least the get].
Whenever the get command is issued, node's main thread gets stuck there. No further code executes.

A little late, but for me it worked with webdriver.promise.createFlow.
You just have to wrap your code in webdriver.promise.createFlow(function() { ... }); and it works for me! Here's an example from Make parallel requests to a Selenium Webdriver grid. All thanks to the answerer there...
var flows = [0, 1, 2, 3].map(function (index) {
    return webdriver.promise.createFlow(function () {
        var driver = new webdriver.Builder().forBrowser('firefox').usingServer('http://someurl:44111/wd/hub/').build();
        console.log('Get');
        driver.get('http://www.somepage.com').then(function () {
            console.log('Screenshot');
            driver.takeScreenshot().then(function (data) {
                console.log('foo/test' + index + '.png');
                //var decodedImage = new Buffer(data, 'base64')
                driver.quit();
            });
        });
    });
});

I had the same issue; I finally got around the problem using child_process.
The way my app is set up, I have many tasks that do different things and need to run simultaneously (each using a different driver instance); obviously it was not working.
I now start those tasks in a child_process (which runs a new V8 process), and everything does run in parallel.

Related

Node.js cluster for only a specific function within an Express app

I am trying to run a Node.js cluster within my Express app, but only for one specific function.
My app is a standard Express app generated with the express app generator.
My app initially scrapes an eCommerce website to get a list of categories in an array. I want to be able to then scrape each category's products, concurrently, using child processes.
I do not want to have the whole Express app inside the child processes. When the app starts up I want only one process to scrape for the initial categories. Once that is done I only want the function that scrapes the products to be run concurrently in the cluster.
I have tried the following:
delegation-controller.js
var { em } = require('./entry-controller');
const cluster = require('cluster');
const numCPUs = require('os').cpus().length;

class DelegationController {
    links = [];

    constructor() {
        em.on('PageLinks', links => {
            this.links = links;
            this.startCategoryCrawl();
        });
    }

    startCategoryCrawl() {
        if (cluster.isMaster) {
            console.log(`Master ${process.pid} is running`);
            for (let i = 0; i < numCPUs; i++) {
                cluster.fork();
            }
            cluster.on('exit', (worker, code, signal) => {
                console.log(`worker ${worker.process.pid} died`);
            });
        } else {
            console.log(`Worker ${process.pid} started`);
            process.exit();
        }
    }
}

module.exports = DelegationController;
But then I got an error:
/ecommerce-scraper/bin/www:58
throw error;
^
Error: bind EADDRINUSE null:3000
Which I am guessing is because it is trying to start the Express server again while the port is already in use.
Am I able to do what I am trying to do, or am I misunderstanding how Node.js clusters work?
I believe this is not a case where you should use the cluster module. Instead you need the child_process module, which lets you create a separate process. Here is the documentation.
I typically create my own Worker bootstrap that sits on top of my application. For things that need to run once, I have a convenient runonce function that is given a name and a callback. The function asks the primary process for an open (non-busy) process and gets back a PID. If the PID matches (because every process will claim ownership), the callback executes; if not, the function returns.
Example:
https://gist.github.com/jonshipman/abe627c687a46e7f5ea4b36bb919666c
NodeJS clustering creates identical copies of your application (through cluster.fork()). It's up to your application to ensure that actions aren't run twice when they aren't expected to be.
I believe that when using Express or https.createServer under cluster, things are set up so that the workers don't conflict by listening on the same port multiple times. Instead the primary process distributes the load among them internally.

Node.js library for executing long running background tasks

I have an architecture with an express.js webserver that accepts new tasks over a REST API.
Furthermore, I must have another process that creates and supervises many other tasks on other servers (a distributed system). This process should run in the background for a very long time (months, years).
Now the questions is:
1)
Should I create one single Node.js app with a task queue such as bull.js/Redis or Celery/Redis that basically launches this long-running task once at the beginning?
Or
2)
Should I have two processes: one for the REST API, and another daemon process that schedules and manages the tasks in the distributed system?
I heavily lean towards solution 2).
I am facing the same problem now. As we know, Node.js runs in a single thread, but we can create workers to run things in parallel, or to handle functions that take some time without affecting our main server. Fortunately, Node.js supports multi-threading.
Take a look at this example:
const {
    Worker, isMainThread, parentPort, workerData
} = require('worker_threads');

if (isMainThread) {
    module.exports = function parseJSAsync(script) {
        return new Promise((resolve, reject) => {
            const worker = new Worker(__filename, {
                workerData: script
            });
            worker.on('message', resolve);
            worker.on('error', reject);
            worker.on('exit', (code) => {
                if (code !== 0)
                    reject(new Error(`Worker stopped with exit code ${code}`));
            });
        });
    };
} else {
    const { parse } = require('some-js-parsing-library');
    const script = workerData;
    parentPort.postMessage(parse(script));
}
https://nodejs.org/api/worker_threads.html
Search some articles about multi-threading in Node.js, but remember one thing: state cannot be shared between threads. You can use a message broker like Kafka, RabbitMQ (my recommendation), or Redis for handling such needs.
Kafka is quite difficult to configure in production.
RabbitMQ is good because you can store messages, queues, etc. in local storage too. But personally I could not find any proper solution for load balancing these threads. Maybe this is not your answer, but I hope you get some clues here.

Cron job failed without a reason

I am in a situation where I have a CRON task on Google App Engine (using the flex environment) that just dies after some time, but I have no trace WHY (I checked the GAE logs: nothing; I tried try/catch and explicitly logging: no error).
I have explicitly verified that if I create a cron task that runs for 8 minutes (but doesn't do much, just sleeps and updates the database every second), it runs successfully. This is just to prove that CRON jobs can run for at least 8 minutes if not more, and that I have set up the Express & NodeJS combo correctly.
This is all fine, but my other cron job seems to die in 2-3 minutes, so quite fast. It is hitting some kind of limit, but I have no idea how to control for it, or even what limit it is, so all I can do is speculate.
I will tell more about my CRON task. It is basically rapidly querying a MongoDB database where every query is quite fast. I've tried the same code locally, and there are no problems.
My speculation is that I am somehow creating too many MongoDB requests at once and potentially running out of something?
Here's pseudocode (just to describe what scale of data we're talking about; the numbers and flow are exactly the same):
async function q1() {
    return await mongoExecute(async (db) => {
        const [l1, l2] = await Promise.all([
            db.collection('Obj1').count({uid1: c1, u2action: 'L'}),
            db.collection('Obj1').count({uid2: c2, u1action: 'L'}),
        ]);
        return l1 + l2;
    });
}

for (let i = 0; i < 8000; i++) {
    const allImportantInformation = await Promise.all([
        q1(),
        q2(),
        q3(),
        .....
        q10()
    ]);
    await mongoDb.saveToServer(document);
}
It gets to somewhere around i=1600 before the CRON job just dies without any explanation. The GAE Cron Jobs panel clearly says the job has failed.
Here is also my mongoExecute (a separate module that caches the db object, which hopefully is the correct practice to ensure that MongoDB pooling works correctly):
import { MongoClient, Db } from 'mongodb';

let db = null;
let promiseInProgress = null;

export async function mongoExecute<T> (executor: (instance: Db) => T): Promise<T | null> {
    if (!db) {
        if (!promiseInProgress) {
            promiseInProgress = new Promise(async (resolve, reject) => {
                const tempDb = await MongoClient.connect(process.env.MONGODB_URL);
                resolve(tempDb);
            });
        }
        db = await promiseInProgress;
    }
    try {
        const value = await executor(db);
        return value;
    } catch (error) {
        console.log(error);
        return null;
    }
}
What would be the solution? My idea is basically to ensure fewer requests are made at once (so all the promises would be sequential), and potentially to add a sleep between each cycle of the for loop.
I don't understand, because it works fine up until some specific point (and quite a big one; it's a different amount each time, sometimes 800, sometimes 1200, etc.).
Is there some "running out of TCP connections" scenario happening? Theoretically we shouldn't run out of anything, because we don't have much open at any given point.
It seems to work if I throw a 200ms wait between each cycle, & I suspect I can figure out a solution (all the items don't have to be updated in the same CRON execution), but it is a bit annoying, and I would like to know what's going on.
Is the garbage collector not keeping up fast enough, or why exactly is GAE silently failing my cron task?
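The throttling idea can be sketched like this. The helper names and the delay value are made up, and a stand-in worker replaces the real q1()..q10() batch.

```javascript
// Run one cycle at a time, with a pause between cycles, so the MongoDB
// driver never has more than one batch of queries in flight.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processSequentially(items, worker, delayMs) {
    const results = [];
    for (const item of items) {
        results.push(await worker(item)); // wait for the previous cycle to finish
        await sleep(delayMs);
    }
    return results;
}

// Usage with a stand-in worker (the real one would run the Promise.all batch):
processSequentially([1, 2, 3], async (i) => i * 2, 50)
    .then((out) => console.log(out)); // → [ 2, 4, 6 ]
```

This trades throughput for a bounded number of concurrent connections, which is exactly what adding the 200ms wait was doing by hand.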
I discovered what the bug was, and fixed it accordingly.
Let me rephrase that: I have no idea what the bug was, and having no errors at any point was discouraging, but I managed to fix (lucky guess) whatever was happening by updating my Node.js MongoDB driver to the latest version (from 2.x to 3.1.10).
No sleeps are needed in my code anymore.

CasperJS, parallel browsing WITH the testing framework

Question: I would like to know if it's possible to do parallel browsing with the testing framework in one script file, so with the tester module and the casperjs test command.
I've seen some people create two casper instances:
CasperJS simultaneous requests and https://groups.google.com/forum/#!topic/casperjs/Scx4Cjqp7hE , but as said in the doc, we can't create new casper instances in a test script.
So I tried doing something similar (simple example) with a casper testing script (just copy and execute this, it will work):
var url1 = "http://casperjs.readthedocs.org/en/latest/testing.html",
    url2 = "http://casperjs.readthedocs.org/en/latest/testing.html";

var casperActions = {
    process1: function () {
        casper.test.begin('\n********* First processus with our test suite : ***********\n', function suite(test) {
            "use strict";
            casper.start()
                .thenOpen(url1, function () {
                    this.echo("1", "INFO");
                });
            casper.wait(10000, function () {
                casper.test.comment("If parallel, it won't be printed before the comment of the second processus!");
            })
            .run(function () {
                this.test.comment('----------------------- First processus over ------------------------\n');
                test.done();
            });
        });
    },
    process2: function () {
        casper.test.begin('\n********* Second processus with our test suite : ***********\n', function suite(test) {
            "use strict";
            casper.start()
                .thenOpen(url1, function () {
                    this.echo("2", "INFO");
                });
            casper.test.comment("Hi, if parallel, I'm first!");
            casper.run(function () {
                this.test.comment('----------------------- Second processus over ------------------------\n');
                test.done();
            });
        });
    }
};

['process1', 'process2'].forEach(function (href) {
    casperActions[href]();
});
But it's not parallel: they are executed one by one.
Currently I do some parallel browsing, but with node, so not in the file itself, using child_process. If you split my previous code into two files (proc1.js, proc2.js; just the two scenarios -> casper.test.begin{...}) and launch the code below via node, something like this will work (on Linux; I have to find the equivalent syntax for Windows):
var exec = require("child_process").exec;

exec('casperjs test proc1.js', function (err, stdout, stderr) {
    console.log('stdout: ' + stdout);
    console.log('endprocess1');
});
exec('casperjs test proc2.js', function (err, stdout, stderr) {
    console.log('stdout: ' + stdout);
    console.log('endprocess2');
});
My problem is that the redirections and opening of new URLs are quite slow, so I want some of them to execute in parallel. I could make XXX files and launch them in parallel with node, but I don't want XXX files with 5 lines of code each. So if someone has succeeded (if it's possible) in opening URLs in parallel in the same testing file without node (so without multiple processes), please teach me!
And I would like to know the difference between chaining instructions and re-using the casper object each time,
so between this:
casper.test.begin('\n********* First processus with our test suite : ***********\n', function suite(test) {
    "use strict";
    casper.start()
        .thenOpen(url1, function () {
            this.echo("1", "INFO");
        })
        .wait(10000, function () {
            casper.test.comment("If parallel, it won't be printed before the comment of the second processus!");
        })
        .run(function () {
            this.test.comment('----------------------- First processus over ------------------------\n');
            test.done();
        });
});
And this:
casper.test.begin('\n********* First processus with our test suite : ***********\n', function suite(test) {
    "use strict";
    casper.start();
    casper.thenOpen(url1, function () {
        this.echo("1", "INFO");
    });
    casper.wait(10000, function () {
        casper.test.comment("If parallel, it won't be printed before the comment of the second processus!");
    });
    casper.run(function () {
        this.test.comment('----------------------- First processus over ------------------------\n');
        test.done();
    });
});
Will chaining my instructions block the whole chain if one of my steps fails (promise rejected), instead of executing every casper step?
So it would be better to chain instructions with dependent steps [like thenClick(selector)] and use the casper object with independent steps (like opening a new URL), wouldn't it?
Edit: I tried, and if a step fails, chained or not, it stops all the following steps, so I don't see the difference between using chained steps or not...
Well, chaining or using the casper object each time is just a matter of taste; it does the same thing, and we can't launch several instances of casper in a testing script. If you have a loop that opens some links, you'll have to wait for each page to load sequentially.
To launch parallel browsing with the testing framework, you have to use multiple processes, so using node does the trick.
After digging, I finally split the files with too many redirections so that none is longer than my main scenario, which can't be split. A folder with 15 files executes, in parallel, in 2-4 min on a local machine.
There's no official support for parallel browsing in CasperJS right now. There are multiple workarounds; I've set up several different environments and am about to test which one is best.
I've seen one person working with multiple casper instances this way:
var google = require('casper').create();
var yahoo = require('casper').create();

google.start('http://google.com/');
yahoo.start('http://yahoo.com/', function () {
    this.echo(google.getTitle());
});

google.run(function () {});
yahoo.run(function () {});

setTimeout(function () {
    yahoo.exit();
}, 5000);
Currently I am running multiple caspers in node using 'child_process'. It is very heavy on both CPU and memory.

Background process, loading bar

Most server-side scripting languages have an exec function (node, php, ruby, etc.). This allows the programming language to interact with the shell.
I wish to use exec() in node.js to run large processes, things I used to do with AJAX requests in the browser. I would like a simple progress/loading bar to display the progress to the user.
I found out the hard way in the example below that the callback/asynchronous nature of the exec function makes this example take over 5 seconds to load.
What I want is some way to get the content of the browser updated (via AJAX) with the current state of the execution, like a loading bar. But I don't want the executed file to be dependent on the browser.
Any ideas?
my route
exports.exec = function (req, res) {
    // http://nodejs.org/api.html#_child_processes
    var sys = require('sys');
    var exec = require('child_process').exec;
    var child;
    child = exec("node exec/test.js", function (error, stdout, stderr) {
        var o = {
            "error": error,
            "stdout": stdout,
            "stderr": stderr,
        };
        o = JSON.stringify(o);
        res.send(o);
    });
};
my test.js file
var sys = require('sys');

var count = 0;
var interval = setInterval(
    function () {
        sys.print('hello' + count);
        count++;
        if (count == 5) {
            clearInterval(interval);
        }
    },
    1000);
You should use socket.io for this.
Here is a demo app to get you started with socket.io.
You emit an event for every 1% using socket.io; the browser listens for it and updates a bar.
You can't use exec, because you need streamed output (exec buffers everything until the process exits).
Therefore you should use child_process.spawn instead.
On the server.
var spawn = require('child_process').spawn,
    exec = spawn('node exec/test.js');

exec.stdout.on('data', function (message) {
    socket.emit('process', message);
});
On the sub process:
console.log('10%')
// ...
console.log('20%')
// ...
console.log('30%')
If your sub process is a node script, you could do something a lot more elegant.
Rather than parsing a stdout stream, you could use Node's cluster module to send messages between the master and the worker processes.
I made a fork of the previous demo app adding a route /exec which demonstrates how to achieve this.
When I have more time I'll make another demo app; this is quite an interesting and educational test. Thanks for the idea :D.
