Node JS with CouchDB for lots o' parsing

My team and I are experimenting with NodeJS (with jsdom/jQuery) to parse a lot of HTML documents stored in CouchDB. NodeJS is single threaded, so having 8 cores in a server does not help us at all initially. How can I best create child processes (workers, perhaps?) to process each individual file as it's pulled out of CouchDB?
Here is my thought process:
Main NodeJS script loops through CouchDB view getting the HTML files from documents every X minutes
Spawn a process to parse (jsdom/jQuery) and store the results from each HTML file
We aren't running a webserver at all to handle any of this (it's all command line), so I am unsure how to handle this beyond a generic "set up CRON to just run each parsing job separately". It seems that workers are generally used to process requests coming in from a webserver.
Thoughts?

Use the cluster module:
var cluster = require("cluster");
var numCPUs = require('os').cpus().length;
var htmlDocs = [...];
if (cluster.isMaster) {
// Fork workers.
for (var i = 0; i < numCPUs; i++) {
cluster.fork();
}
cluster.on('exit', function(worker) {
console.log('worker ' + worker.process.pid + ' died');
});
} else {
// cluster.worker.id starts at 1, so subtract 1 to cover index 0.
for (var i = cluster.worker.id - 1; i < htmlDocs.length; i += numCPUs) {
couch.doWork(htmlDocs[i]);
}
}
This is a classic case of doing work on members in an array and then splitting that work out over multiple processes by having each process do a subset of the array.
Note how we increment i by the number of processes. With four workers, for example, worker 1 handles the 1st, 5th, 9th, etc. documents, worker 2 handles the 2nd, 6th, 10th, etc.
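To see which documents each worker would pick up, the striding can be pulled out into a small pure function. This is only an illustrative sketch: indicesForWorker is a hypothetical helper, and it assumes 1-based worker ids (as with cluster.worker.id), four workers and ten documents.

```javascript
// Hypothetical helper: returns the array indices handled by one worker.
// workerId is 1-based; workerCount is the number of forked workers.
function indicesForWorker(workerId, workerCount, total) {
  var indices = [];
  for (var i = workerId - 1; i < total; i += workerCount) {
    indices.push(i);
  }
  return indices;
}

// With 4 workers and 10 documents:
for (var id = 1; id <= 4; id++) {
  console.log('worker ' + id + ' -> ' + indicesForWorker(id, 4, 10).join(', '));
}
// worker 1 -> 0, 4, 8
// worker 2 -> 1, 5, 9
// worker 3 -> 2, 6
// worker 4 -> 3, 7
```

Every index is covered exactly once, so no coordination between the workers is needed as long as each one sees the same array in the same order.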

Related

How to fork a process in node that writes express response

I'd like to fork a long-running express request in node and send the express response from the child, allowing the parent to serve other requests. I'm already using cluster, but I'd like to fork another process in addition to the cluster for specific long-running requests. What I'd like to prevent is all the processes in the cluster being consumed by one specific kind of long-running request while most of the other requests are fast.
Thanks
var express = require('express');
var webserver = express();
webserver.get("/test", function(request, response) {
// long running HTTP request
response.send(...);
});
What I'm thinking of is something like the following, although I'm not sure it works:
var cp = require('child_process');
var express = require('express');
var webserver = express();
webserver.get("/test", function(request, response) {
var child = cp.fork('do_nothing.js');
child.on("message", function(message) {
if(message == "start") {
response.send(...);
process.exit();
}
});
child.send("start");
});
Let me know if anyone knows how to do this.
Edit: So, the idea is that the child could take a long time. There are a limited number of processes in the cluster serving express responses and I don't want to consume them all on a specific long-running request type. In the code below, the entire cluster would be consumed by the long running express requests.
while(1) {
if(rand() % 100 == 0) {
if(fork() == 0) {
sleep(hour(1));
exit(0);
}
} else {
sleep(second(1));
}
waitpid(WAIT_ANY, &status, WNOHANG);
}
Edit: I am going to mark the self-answer as solved. I'm sure there's a way to pass a socket to a child but it's not really necessary because the cluster master can manage all child processes. Thanks for your help.

Your second code block is confusing because it appears that you're killing the parent process with process.exit() rather than the child.
In any case, if we assume the problem is this:
You have a cluster of "regular processes".
Occasionally, you want to take an incoming request that was assigned to one of the cluster processes and pass it off to a long running child that will eventually send the response.
After sending the response, the long running child process should exit.
You have a couple options.
You can have the clustered process that was assigned the request, start up a child, send it some initial data and listen for a message back from the child. When it gets the message back from the child, it can send the response and kill the child. This appears to be what you're attempting to do in your second code block.
You can have the clustered process that was assigned the request, start up a child and reassign the request socket to the child process and the child can then own that socket from then on. When it finally sends the response, it can then exit itself.
The first is simpler because no socket assignment from one process to another is required. To implement the second, you'd have to write or find the code to do the socket reassignment and then reconstitute the socket as an express request within the child. The cluster module does something like this, so the code is there to be found and learned from, but I'm not aware of a trivial way to do it.
Personally, I don't see any particular downside to the first. I suppose if the clustered process were to die for some reason, you'd lose the long running request socket, but hopefully you can just code your clustered processes not to die unnecessarily.
You can read this article on sending a socket to a new node.js process:
Sending a socket to a forked process
And, this node.js doc on sending a socket:
Example: sending a socket object
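The first option might look roughly like the sketch below. It is not taken from the question: runInChild, the 'slow-child' argument and the job fields (name, delayMs) are hypothetical names chosen for illustration, and a setTimeout stands in for the long-running work. The same file acts as both parent and child so the example stays self-contained.

```javascript
var cp = require('child_process');

// Child side: only active when this file is forked with the
// (hypothetical) 'slow-child' argument.
if (process.argv[2] === 'slow-child') {
  process.on('message', function (job) {
    // Stand-in for the long-running work.
    setTimeout(function () {
      process.send({ result: 'done: ' + job.name });
    }, job.delayMs);
  });
}

// Parent side: fork a child, hand it the job, and resolve with its reply.
function runInChild(job) {
  return new Promise(function (resolve, reject) {
    var child = cp.fork(__filename, ['slow-child']);
    child.once('message', function (reply) {
      child.kill(); // the child is no longer needed once it has replied
      resolve(reply);
    });
    child.once('error', reject);
    // If the child exits without replying, fail instead of hanging forever.
    // After a successful resolve() this reject() is a no-op.
    child.once('exit', function (code) {
      reject(new Error('child exited before replying (code ' + code + ')'));
    });
    child.send(job);
  });
}

// Inside an Express route this might be used as:
// webserver.get('/test', function (request, response) {
//   runInChild({ name: 'report', delayMs: 5000 }).then(function (reply) {
//     response.send(reply);
//   });
// });
```

The clustered process stays free to serve other requests while it waits, because the waiting is just a pending promise; only the child is tied up by the slow work.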

So, I've verified that this is not necessary for my use case, but I was able to get it working using the code below. It's not exactly what the OP asks for, but it works.
What it's doing is sending an instruction to the cluster master, which forks the additional process upon receipt of the slow express request.
Since the express request doesn't need to know the status of the newly forked cluster worker, it just handles the slow request as normal and then exits.
The instruction to the cluster master informs the master not to replace the dying slow express request process, so the number of workers reverts to the original number after the slow request finishes.
The pool will increase in size when there are slow requests, but revert to normal. This will prevent like 20 simultaneous slow requests from bringing down the cluster.
var numberOfWorkers = 10;
var workerCount = 0;
var slowRequestPids = { };
if (cluster.isMaster) {
for(var i = 0; i < numberOfWorkers; i++) {
workerCount++;
cluster.fork();
}
cluster.on('exit', function(worker) {
workerCount--;
var pidString = String(worker.process.pid);
if(pidString in slowRequestPids) {
delete slowRequestPids[pidString];
if(workerCount >= numberOfWorkers) {
logger.info('not forking replacement for slow process');
return;
}
}
logger.info('forking replacement for a process that died unexpectedly');
workerCount++;
cluster.fork();
});
// Node.js 6+ passes the worker as the first argument of cluster 'message' events.
cluster.on("message", function(worker, msg) {
if(typeof msg.fork != "undefined" && workerCount < 100) {
logger.info("forking additional process upon slow request");
slowRequestPids[msg.fork] = 1;
workerCount++;
cluster.fork();
}
});
return;
}
webserver.use("/slow", function(req, res) {
process.send({fork: String(process.pid) });
sleep.sleep(300);
res.send({ response_from: "virtual child" });
res.on("finish", function() {
logger.info('process exits, restoring cluster to original size');
process.exit();
});
});

Node cluster; only one process being used

I'm running a clustered node app, with 8 worker processes. I'm giving output when serving requests, and the output includes the ID of the process which handled the request:
app.get('/some-url', function(req, res) {
console.log('Request being handled by process #' + process.pid);
res.status(200).send('yayyy');
});
When I furiously refresh /some-url, I see in the output that the same process is handling the request every time.
I used node load-test to query my app. Again, even with 8 workers available, only one of them handles every single request. This is obviously undesirable as I wish to load-test the clustered app to see the overall performance of all processes working together.
Here's how I'm initializing the app:
var cluster = require('cluster');
if (cluster.isMaster) {
for (var i = 0; i < 8; i++) cluster.fork();
} else {
var app = require('express')();
// ... do all setup on `app`...
var server = require('http').createServer(app);
server.listen(8000);
}
How do I get all my workers working?

Your request does not use any resources. I suspect that the same worker is always called because it finishes handling each request before the next one comes in.
What happens if you do some calculation inside the handler that takes longer than the time between requests? As it stands, the worker is never busy between accepting a request and answering it.
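One way to try this is to make every request hold the CPU for a while, so that a worker is still busy when the next connection arrives. The sketch below is an assumption-laden experiment, not a fix: burnCpu and the 200 ms duration are made up, while the 8 workers and port 8000 come from the question.

```javascript
var cluster = require('cluster');
var http = require('http');

// Hypothetical helper: spin for roughly `ms` milliseconds to keep this worker busy.
function burnCpu(ms) {
  var end = Date.now() + ms;
  var spins = 0;
  while (Date.now() < end) {
    spins++;
  }
  return spins;
}

// Fork 8 workers (as in the question); each one serves slow, CPU-bound responses.
function start() {
  if (cluster.isMaster) {
    for (var i = 0; i < 8; i++) cluster.fork();
  } else {
    http.createServer(function (req, res) {
      burnCpu(200); // assumed duration, long enough to overlap with the next request
      console.log('Request being handled by process #' + process.pid);
      res.end('yayyy');
    }).listen(8000);
  }
}

// start(); // uncomment to run, then send several concurrent requests
```

If different pids show up once the requests overlap, the earlier behaviour was just a side effect of each request finishing instantly.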

Node.js Cluster: Managing Workers

We're diving deeper into the Node.js architecture to fully understand how to scale our application.
The clear solution is to use cluster: https://nodejs.org/api/cluster.html. Everything seems fine apart from the description of worker management:
Node.js does not automatically manage the number of workers for you, however. It is your responsibility to manage the worker pool for your application's needs.
I was searching for how to really manage the workers, but most solutions say:
Start so many workers as you've got cores.
But I would like to dynamically scale my worker count up or down depending on the current load on the server. So if the server is under load and the queue is getting longer, I would like to start another worker. Conversely, when there isn't much load, I would like to shut workers down (and keep, for example, a minimum of 2).
For me, the ideal place to do this would be the master process queue, with an event fired when a new request arrives at the master process. At that point we could decide whether we need another worker.
Do you have any solution for, or experience with, managing workers from the master process in cluster? Starting and killing them dynamically?
Regards,
Radek

The following code will help you understand how to create workers on a per-request basis.
This program forks a new worker for every 10 requests, up to the number of CPU cores.
Note: you need to open http://localhost:8000/ and refresh the page to increase the request count.
var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;
var numReqs = 0;
var initialRequest = 10;
var maxcluster = 10;
var totalcluster = 2;
if (cluster.isMaster) {
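// Fork the initial workers.
for (var i = 0; i < totalcluster; i++) {
cluster.fork();
console.log('cluster master');
}
// Count requests from every worker, including ones forked later.
// (Node.js 6+ passes the worker as the first argument.)
cluster.on('message', function(worker, msg) {
if (msg.cmd && msg.cmd == 'notifyRequest') {
numReqs++;
}
});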
setInterval(function() {
console.log("numReqs =", numReqs);
isNeedWorker(numReqs) && cluster.fork();
}, 1000);
} else {
console.log('cluster one initilize');
// Worker processes have a http server.
http.Server(function(req, res) {
res.writeHead(200);
res.end("hello world\n");
// Send message to master process
process.send({ cmd: 'notifyRequest' });
}).listen(8000);
}
function isNeedWorker(numReqs) {
if( numReqs >= initialRequest && totalcluster < numCPUs ) {
initialRequest = initialRequest + 10;
totalcluster = totalcluster + 1;
return true;
} else {
return false;
}
}
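The program above only ever adds workers. To also shrink the pool when load drops, as the question asks, the decision can be separated from the forking. The following is a sketch under assumed numbers: desiredWorkers and resizePool are hypothetical helpers, and "one worker per 10 requests per interval, between 2 and 8 workers" is an arbitrary policy.

```javascript
var cluster = require('cluster');

// Hypothetical policy: aim for one worker per `perWorker` requests seen in the
// last interval, clamped to the range [min, max].
function desiredWorkers(requestsLastInterval, perWorker, min, max) {
  var wanted = Math.ceil(requestsLastInterval / perWorker);
  return Math.max(min, Math.min(max, wanted));
}

// Master only: fork or disconnect workers until the pool matches `target`.
function resizePool(target) {
  var ids = Object.keys(cluster.workers);
  for (var i = ids.length; i < target; i++) {
    cluster.fork();
  }
  for (var j = ids.length; j > target; j--) {
    // disconnect() lets the worker finish its open connections before exiting
    cluster.workers[ids[j - 1]].disconnect();
  }
}

console.log(desiredWorkers(0, 10, 2, 8));   // 2
console.log(desiredWorkers(35, 10, 2, 8));  // 4
console.log(desiredWorkers(500, 10, 2, 8)); // 8
```

In the master's setInterval callback this could be called as resizePool(desiredWorkers(numReqs, 10, 2, numCPUs)) followed by resetting numReqs to 0, so that each tick reacts to the load of the last interval only. A disconnected worker stays in cluster.workers until it has actually exited, so a real implementation would need to track workers that are already shutting down.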

To manage your workers manually, you need a messaging layer for inter-process communication (IPC). With IPC, the master and the workers can communicate effectively; by default this behaviour is already implemented natively in the process module. However, I find the native implementation not flexible or robust enough to handle horizontal scaling driven by network requests.
One obvious solution is to use Redis as a message broker for this kind of master and slave communication. However, this solution also has its faults, namely context latency, which is directly linked to each command and reply.
Further research led me to RabbitMQ, a great fit for distributing time-consuming tasks among multiple workers. The main idea behind Work Queues (aka Task Queues) is to avoid doing a resource-intensive task immediately and having to wait for it to complete. Instead, we schedule the task to be done later. We encapsulate a task as a message and send it to the queue. A worker process running in the background will pop the tasks and eventually execute the job. When you run many workers, the tasks will be shared between them.
To implement a robust server, read this link; it may give some insights. Link

Nodejs Clustering and expressjs sessions

I'm trying to build a nodejs application that takes advantage of multicore machines (a.k.a. clustering), and I have a question about sessions. My code looks like this:
var cluster = exports.cluster = require('cluster');
var numCPUs = require('os').cpus().length;
if (cluster.isMaster) {
for (var i = 0; i < numCPUs; i++) {
cluster.fork();
}
cluster.on('exit', function(worker, code, signal) {
console.log('worker ' + worker.process.pid + ' died. Trying to respawn...');
cluster.fork();
});
} else {
//spawn express etc
}
My question is: does a single user hit a random node instance on every request, or, for example, if he hits node N4 the first time he opens the page, does he keep hitting node N4 on every request until his session expires? For those who didn't understand my question, I will try to explain what I'm worried about:
A user enters my page and logs in on node N3, where I set req.session.userdata to some data. He refreshes the page and hits node N4. Will I be able to access req.session.userdata from the different node? Does that mean there is a chance for the user to get randomly logged out, or am I just not understanding how clustering with express works?

You're correct that the in memory session store in Connect/Express is unsuitable for supporting more than one instance. The solution is to implement a session store with a backing database. My recommendation is connect-redis, and example code is at Session Undefined - Using Connect-Redis / ExpressJS / Node
But there are dozens of options.
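The difference between a per-process store and a shared one can be shown with a toy model. This is not real Express or connect-redis code: makeWorker is a hypothetical stand-in for a worker process, and a plain Map plays the role of the session store.

```javascript
// Toy model, not real Express: each "worker" reads and writes sessions
// through whatever store it was given.
function makeWorker(store) {
  return {
    login: function (sessionId, userdata) { store.set(sessionId, userdata); },
    lookup: function (sessionId) { return store.get(sessionId); }
  };
}

// Separate in-memory stores: a login on N3 is invisible to N4.
var n3 = makeWorker(new Map());
var n4 = makeWorker(new Map());
n3.login('abc', { user: 'alice' });
console.log(n4.lookup('abc')); // undefined

// One shared store (the role a backing database plays): both workers agree.
var shared = new Map();
var s3 = makeWorker(shared);
var s4 = makeWorker(shared);
s3.login('abc', { user: 'alice' });
console.log(s4.lookup('abc')); // { user: 'alice' }
```

In a real cluster the workers are separate processes and cannot share a Map, which is why the shared store has to live outside them, for example in Redis.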

Node.js Getting Slave Node Info

OK, so I am creating a node/mongo based CMS. I have the basic framework down. I am wanting to add live stats to the admin panel and I was wondering if I could do the following:
I create slave nodes from the master node using the native Cluster call (v0.6).
function start_workers(num_workers){
for (var i = 0; i < num_workers; i++) {
exports.workers[i] = cluster.fork();
console.log('Worker: '+exports.workers[i].pid+' Is Online');
}
}
Which spawns X number of nodes. I was wondering can I get stats on each worker I spawn?
for instance:
for(var x in workers){
console.log(workers[x].process.cwd());
}
Thanks a lot for your time!
