Node.js is famous for concurrency, but I'm confused about how to make it actually work concurrently. I fired two requests from Chrome one after the other, very quickly, and I expected the console output to be:
"get a new request"
immediately after my second request, "get a new request" should be printed
after several seconds, "end the new request"
after several seconds, "end the new request"
However, what I saw is:
"get a new request"
after several seconds, "end the new request"
"get a new request"
after several seconds, "end the new request"
That means the second request is NOT handled until the first one is done. My sample code is below; is there anything I missed?
var http = require("http");
var url = require("url");

function start(route) {
  http.createServer(function(request, response) {
    console.log('get a new request');

    // a time-consuming loop
    for (var i = 0; i < 10000000000; ++i) {
    }

    route(url.parse(request.url).pathname);
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end();
    console.log('end the new request');
  }).listen(5858);
}

function saySomething(something) {
  console.log(something);
}

exports.start = start;
exports.saySomething = saySomething;
You don't have to do anything.
It's based on non-blocking I/O. Put simply, there is an event loop: a chunk of synchronous code runs to completion, and once it's done, the next iteration of the loop picks up the next chunk. Any time an async operation is started (a db fetch, setTimeout, reading a file, etc.), its callback is queued and the event loop moves on. This way no code ever sits there waiting.
It's not threaded. In your example, the for loop is one continuous chunk of code, so JS will run the entire loop before it can handle another HTTP request.
Try breaking the for loop up with setTimeout so that node can get back to the event loop between chunks and, in your case, handle the next web request.
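For illustration, here is a minimal sketch of that idea (the slice size of one million is an arbitrary choice): the loop runs in bounded slices, and setTimeout hands control back to the event loop between slices so pending requests can be served.

function longLoop(total, done) {
  var i = 0;
  function chunk() {
    // Run a bounded slice of the work...
    var end = Math.min(i + 1000000, total);
    for (; i < end; ++i) {
      // the time-consuming work goes here
    }
    if (i < total) {
      setTimeout(chunk, 0); // ...then yield so other requests can be handled
    } else {
      done();
    }
  }
  chunk();
}

// Inside the request handler, instead of the blocking for loop:
// longLoop(10000000000, function() { response.end(); });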
node can't handle these:
for (var i=0; i<10000000000; ++i) {}
concurrently. But it handles I/O concurrently.
You might want to look at Clusters:
http://nodejs.org/api/cluster.html#cluster_how_it_works
http://rowanmanning.com/posts/node-cluster-and-express/
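For reference, here is a minimal sketch of the pattern those links describe, assuming the server from the question. Each worker runs its own event loop, so a CPU-bound handler in one worker doesn't block the others.

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork one worker per CPU; they all share the same listening port.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  http.createServer(function(request, response) {
    // ...the handler from the question goes here...
    response.end();
  }).listen(5858);
}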
~~This is the expected behavior, we call this "blocking". The solution for handling concurrent requests is making the code "non-blocking". As soon as you called response.writeHead the code began blocking, waiting for response.end.~~
EDIT 7/8/14:
Had to deal with this problem recently and found out you can use threads for this:
https://www.npmjs.org/package/webworker-threads
Webworker-threads provides an asynchronous API for CPU-bound tasks that's missing in Node.js:
var Worker = require('webworker-threads').Worker;

require('http').createServer(function (req, res) {
  // Each request gets its own background thread for the CPU-bound work
  var fibo = new Worker(function() {
    function fibo(n) {
      return n > 1 ? fibo(n - 1) + fibo(n - 2) : 1;
    }
    this.onmessage = function (event) {
      postMessage(fibo(event.data));
    };
  });
  fibo.onmessage = function (event) {
    res.end('fib(40) = ' + event.data);
  };
  fibo.postMessage(40);
}).listen(8000); // any free port
And it won't block the event loop because for each request, the fibo worker will run in parallel in a separate background thread.
All:
I am pretty new to Node async programming. I wonder: how can I write an Express request handler that performs a time-consuming, heavy calculation without blocking Express from handling the requests that follow?
I thought setTimeout could do that by putting the job on the event loop, but it still blocks other requests:
var express = require('express');
var router = express.Router();

function heavy(callback) {
  setTimeout(callback, 1);
}

router.get('/', function(req, res, next) {
  var callback = function(req, res) {
    var loop = +req.query.loop;
    for (var i = 0; i < loop; i++) {
      for (var j = 0; j < loop; j++) {}
    }
    res.send("finished task: " + Date.now());
  }.bind(null, req, res);
  heavy(callback);
});

module.exports = router; // export so the app can mount the router
I guess I did not understand how setTimeout works (my assumption was that after the 1 ms delay it would fire the callback in a separate thread/process without blocking other calls to heavy). Could anyone show me how to do this without blocking other requests to heavy()?
Thanks
Instead of setTimeout it's better to use process.nextTick or setImmediate (depending on when you want your callback to be run). But it is not enough to put long-running code into a function, because it will still block your thread, just a millisecond later.
You need to break your code up and run setImmediate or process.nextTick multiple times - like on every iteration, scheduling the next iteration from the current one. Otherwise you will not gain anything.
Example
Instead of code like this:
var a = 0, b = 10000000;

function numbers() {
  while (a < b) {
    console.log("Number " + a++);
  }
}

numbers();
you can use code like this:
var a = 0, b = 10000000;

function numbers() {
  var i = 0;
  // Do a bounded slice of the work...
  while (a < b && i++ < 100) {
    console.log("Number " + a++);
  }
  // ...then yield and schedule the next slice
  if (a < b) setImmediate(numbers);
}

numbers();
The first one will block your thread, and the second one will not (or, more precisely, it will block your thread for a very brief moment on each of its ~100,000 slices, letting other stuff run in between those moments).
You can also consider spawning an external process (a quick sketch below) or writing a native add-on in C/C++ where you can use threads.
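As an illustration of the external-process option, here is a minimal sketch; worker.js is a hypothetical script that contains the heavy loop and prints its result to stdout.

var spawn = require('child_process').spawn;

function heavy(callback) {
  // Offload the computation to a separate node process so this
  // process's event loop stays free.
  var child = spawn('node', ['worker.js']); // worker.js is hypothetical
  var output = '';
  child.stdout.on('data', function(chunk) { output += chunk; });
  child.on('close', function(code) {
    callback(code === 0 ? null : new Error('worker exited with ' + code), output);
  });
}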
For more info see:
How node.js server serve next request, if current request have huge computation?
Maximum call stack size exceeded in nodejs
Node; Q Promise delay
How to avoid jimp blocking the code node.js
NodeJS, Promises and performance
I'm writing some testing code in Node.js that just repeatedly sends HTTP POST requests to a web server. In simplified form:
var http = require('http');

function doPost(opts, data) {
  var post_req = http.request(opts, function(res) {
    res.setEncoding('utf8');
    res.on('data', function (chunk) { });
  });
  post_req.write(JSON.stringify(data));
  post_req.end();
}

// opts, interval, and msg are defined elsewhere
setInterval(doPost, interval, opts, msg);
I'd prefer that these requests be issued sequentially, i.e. that a subsequent POST is not sent until the previous POST has received a response.
My question is: due to the non-blocking architecture of the underlying libuv library used by the runtime, is it possible that this code sends one POST out over the connection to the web server, but is then able to send another POST even though a response from the server has not yet arrived?
If I imagine this with a select() loop, I'd be free to call write() for the second POST and just get EWOULDBLOCK. Or, if the network drops, will it just build up a backlog of POST requests queued up for the IO thread pool? It's unclear to me what behavior I should expect in this case. Is there something I must do to enforce completion of a POST before the next POST can start?
Node.js inherently runs on a single thread; to run multiple processes, you'll have to use clusters, which are somewhat akin to multi-threading in Java. (See the Node.js documentation on clusters.) For example, your code will look something like this:
var cluster = require('cluster');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork workers.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
  cluster.on('exit', function(worker, code, signal) {
    console.log('worker ' + worker.process.pid + ' died');
  });
}
else {
  // call the code in doPost
  doPost(opts, data);
}
I think I've found my answer. I ran some tests under packet capture and found that when the network drops, it's important to throttle your POST requests; otherwise requests get enqueued in the IO pool and, depending on the state of connectivity, some may send, others may not, and message order gets mangled.
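For what it's worth, here is one way to enforce the strict ordering asked about above: issue the next POST only from the previous response's 'end' event, so there is never more than one request in flight (opts, msg, and interval are the same variables as in the question).

var http = require('http');

function doPostSequential(opts, data, next) {
  var post_req = http.request(opts, function(res) {
    res.setEncoding('utf8');
    res.on('data', function(chunk) { /* consume the response */ });
    res.on('end', next); // response fully received: safe to continue
  });
  post_req.on('error', next); // don't stall the chain on network errors
  post_req.write(JSON.stringify(data));
  post_req.end();
}

(function loop() {
  doPostSequential(opts, msg, function() {
    setTimeout(loop, interval); // wait, then send the next POST
  });
})();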
I'm trying to scrape some URLs from a web service. It's working perfectly, but I need to scrape something like 10,000 pages from the same service.
I do this by creating multiple PhantomJS processes that each open and evaluate a different URL (it's the same service; all I change is one parameter in the URL).
The problem is I don't want to open 10,000 pages at once, since I don't want their service to crash, and I don't want my server to crash either.
I'm trying to write some logic that opens/evaluates/inserts-to-DB ~10 pages and then sleeps for a minute or so.
Let's say this is what I have now:
var numOfRequests = 10000; // Total requests
for (var dataIndex = 0; dataIndex < numOfRequests; dataIndex++) {
  phantom.create({'port' : freeport}, function(ph) {
    ph.createPage(function(page) {
      page.open("http://..." + data[dataIncFirstPage], function(status) {
I want to insert somewhere in the middle something like:
if (dataIndex % 10 == 0) {
  sleep(60); // I can use the sleep module
}
Everywhere I try to place the sleep call, the program crashes/freezes/loops forever...
Any idea what I should try?
I've tried placing the above code as the first line after the for loop, but this doesn't work (maybe because of the callback functions that are waiting to fire...).
Placing it inside the phantom.create() callback doesn't work either.
Realize that NodeJS runs asynchronously: in your for-loop, each method call is executed one after the other, but that phantom.create call returns almost immediately (its callback fires later), and then the next cycle of the for-loop kicks in.
To answer your question, you want the sleep command at the end of the phantom.create block, still inside the for-loop. Like this:
var numOfRequests = 10000; // Total requests

for (var dataIndex = 0; dataIndex < numOfRequests; dataIndex++) {
  phantom.create({ 'port' : freeport }, function(ph) {
    // ..whatever in here
  });

  if (dataIndex % 10 == 0) {
    sleep(60); // I can use the sleep module
  }
}
Also, consider using a package to help with these control-flow issues. Async is a good one; it has a method, eachLimit, that will run a number of tasks concurrently, up to a limit. Handy! You will need to create an array of input objects, one per iteration you wish to run, like this:
var dataInputs = [ { id: 0, data: "/abc"}, { id : 1, data : "/def"} ];
function processPhantom( dataItem, callback ){
console.log("Starting processing for " + JSON.stringify( dataItem ) );
phantom.create( { 'port' : freeport }, function( ph ) {
// ..whatever in here.
//When done, in inner-most callback, call:
//callback(null); //let the next parallel items into the queue
//or
//callback( new Error("Something went wrong") ); //break the processing
} );
}
async.eachLimit( dataInputs, 10, processPhantom, function( err ){
//Can check for err.
//It is here that everything is finished.
console.log("Finished with async.eachLimit");
});
Sleeping for a minute isn't a bad idea, but in groups of 10 that will take you 1000 minutes, which is over 16 hours! It would be more convenient to fire a request only when there is space in your queue - and be sure to log which requests are in progress and which have completed.
I'm trying to write a small node application that will search through and parse a large number of files on the file system.
In order to speed up the search, we are attempting to use some sort of map-reduce. The plan is the following simplified scenario:
Web request comes in with a search query
3 processes are started that each get assigned 1000 (different) files
once a process completes, it would 'return' its results back to the main thread
once all processes complete, the main thread would continue by returning the combined result as a JSON result
The questions I have with this are:
Is this doable in Node?
What is the recommended way of doing it?
I've been fiddling, but have gotten no further than the following example using child processes:
initiator:
var child_process = require('child_process');

function Worker() {
  return child_process.fork("myProcess.js");
}

// workItems and itemsPerProcess are defined elsewhere
for (var i = 0; i < require('os').cpus().length; i++) {
  var worker = new Worker(); // avoid shadowing the global `process`
  worker.send(workItems.slice(i * itemsPerProcess, (i + 1) * itemsPerProcess));
}
myProcess.js
process.on('message', function(msg) {
  var valuesToReturn = [];
  // Do file reading here
  // How would I return valuesToReturn?
  process.exit(0);
});
A few side notes:
I'm aware the number of processes should depend on the number of CPUs on the server
I'm also aware of speed restrictions in a file system. Consider it a proof of concept before we move this to a database or Lucene instance :-)
Should be doable. As a simple example:
// parent.js
var child_process = require('child_process');
var numchild = require('os').cpus().length;
var done = 0;

for (var i = 0; i < numchild; i++) {
  var child = child_process.fork('./child');
  child.send((i + 1) * 1000);
  child.on('message', function(message) {
    console.log('[parent] received message from child:', message);
    done++;
    if (done === numchild) {
      console.log('[parent] received all results');
      ...
    }
  });
}
// child.js
process.on('message', function(message) {
  console.log('[child] received message from server:', message);
  // Simulate some work with a random timeout
  setTimeout(function() {
    process.send({
      child : process.pid,
      result : message + 1
    });
    process.disconnect();
  }, (0.5 + Math.random()) * 5000);
});
So the parent process spawns a number of child processes and passes each of them a message. It also installs an event handler to listen for any messages sent back from the child (with the result, for instance).
The child process waits for messages from the parent, and starts processing (in this case, it just starts a timer with a random timeout to simulate some work being done). Once it's done, it sends the result back to the parent process and uses process.disconnect() to disconnect itself from the parent (basically stopping the child process).
The parent process keeps track of the number of child processes started, and the number of them that have sent back a result. When those numbers are equal, the parent received all results from the child processes so it can combine all results and return the JSON result.
For a distributed problem like this, I've used zmq and it has worked really well. I'll give you a similar problem that I ran into and attempted to solve via processes (but failed), before turning to zmq.
Using bcrypt, or another expensive hashing algorithm, is wise, but it blocks the node process for around 0.5 seconds. We had to offload this to a different server, and as a quick fix I used essentially exactly what you did: run a child process, send messages to it, and get it to respond. The only issue we found is that, for whatever reason, our child process would pin an entire core when it was doing absolutely no work. (I still haven't figured out why this happened; we ran a trace and it appeared that epoll was failing on the stdout/stdin streams. It would also only happen on our Linux boxes and worked fine on OS X.)
edit:
The pinning of the core was fixed in https://github.com/joyent/libuv/commit/12210fe and was related to https://github.com/joyent/node/issues/5504, so if you run into the issue and you're using centos + kernel v2.6.32: update node, or update your kernel!
Regardless of the issues I had with child_process.fork(), here's a nifty pattern I always use.
client:
var child_process = require('child_process');
function FileParser() {
this.__callbackById = [];
this.__callbackIdIncrement = 0;
this.__process = child_process.fork('./child');
this.__process.on('message', this.handleMessage.bind(this));
}
FileParser.prototype.handleMessage = function handleMessage(message) {
var error = message.error;
var result = message.result;
var callbackId = message.callbackId;
var callback = this.__callbackById[callbackId];
if (! callback) {
return;
}
callback(error, result);
delete this.__callbackById[callbackId];
};
FileParser.prototype.parse = function parse(data, callback) {
this.__callbackIdIncrement = (this.__callbackIdIncrement + 1) % 10000000;
this.__callbackById[this.__callbackIdIncrement] = callback;
this.__process.send({
data: data, // optionally you could pass in the path of the file, and open it in the child process.
callbackId: this.__callbackIdIncrement
});
};
module.exports = FileParser;
child process:
process.on('message', function(message) {
var callbackId = message.callbackId;
var data = message.data;
function respond(error, response) {
process.send({
callbackId: callbackId,
error: error,
result: response
});
}
// parse data..
respond(undefined, "computed data");
});
We also need a pattern to synchronize the different processes: when each process finishes its task it will respond to us, we'll increment a count for each process that finishes, and we'll call the Semaphore's callback once we've hit the count we want.
function Semaphore(wait, callback) {
this.callback = callback;
this.wait = wait;
this.counted = 0;
}
Semaphore.prototype.signal = function signal() {
this.counted++;
if (this.counted >= this.wait) {
this.callback();
}
}
module.exports = Semaphore;
here's a use case that ties all the above patterns together:
var FileParser = require('./FileParser');
var Semaphore = require('./Semaphore');
var arrFileParsers = [];
for(var i = 0; i < require('os').cpus().length; i++){
var fileParser = new FileParser();
arrFileParsers.push(fileParser);
}
function getFiles() {
return ["file", "file"];
}
var arrResults = [];
function onAllFilesParsed() {
console.log('all results completed', JSON.stringify(arrResults));
}
var lock = new Semaphore(arrFileParsers.length, onAllFilesParsed);
arrFileParsers.forEach(function(fileParser) {
var arrFiles = getFiles(); // you need to decide how to split the files into 1k chunks
fileParser.parse(arrFiles, function (error, result) {
arrResults.push(result);
lock.signal();
});
});
Eventually I used http://zguide.zeromq.org/page:all#The-Load-Balancing-Pattern, where the client was using the nodejs zmq client, and the workers/broker were written in C. This allowed us to scale across multiple machines, instead of just a local machine with sub-processes. A rough sketch of the client side follows.
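For a flavor of the node side, here is a sketch using the legacy zmq npm package; the broker address is made up, and the message format depends entirely on what your broker/workers expect.

var zmq = require('zmq');

// REQ socket: send a request, wait for the broker to route back a reply
var requester = zmq.socket('req');
requester.connect('tcp://broker.example.com:5559'); // hypothetical address

requester.on('message', function(reply) {
  console.log('result from worker:', reply.toString());
});

requester.send(JSON.stringify({ task: 'bcrypt', data: 'some password' }));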
I'm trying to implement a long-polling strategy with node.js.
What I want is: when a request is made to node.js, it will wait a maximum of 30 seconds for some data to become available. If there is data, it will output it and finish; if there is no data, it will just wait out the 30 seconds and then finish.
Here is the basic code logic I came up with:
var http = require('http');

var poll_function = function(req, res, counter) {
  if (counter > 30) {
    res.writeHeader(200, {'Content-Type':'text/html;charset=utf8'});
    res.end('Output after 30 seconds!');
    return; // response finished; stop polling
  }

  var rand = Math.random();
  if (rand > 0.85) {
    res.writeHeader(200, {'Content-Type':'text/html;charset=utf8'});
    res.end('Output done because rand: ' + rand + '! in counter: ' + counter);
    return; // data "arrived"; don't schedule another check
  }

  // No data yet: check again in one second
  setTimeout(function() {
    poll_function.apply(this, [req, res, counter + 1]);
  }, 1000);
};

http.createServer(function(req, res) {
  poll_function(req, res, 1);
}).listen(8088);
What I figure is: when a request is made, poll_function is called, and it calls itself after 1 second via a setTimeout within itself. So it should remain asynchronous, meaning it will not block other requests and will provide its output when it's done.
I have used Math.random() logic here to simulate data becoming available at various intervals.
Now, what concerns me is:
1) Will there be any problem with it? I simply don't wish to deploy it without being sure it will not backfire!
2) Is it efficient? If not, any suggestions on how I can improve it?
Thanks,
Anjan
All nodejs code is non-blocking as long as you don't get stuck in a tight CPU loop (like while(true)) or use a library that has blocking I/O. Putting a setTimeout at the end of a function doesn't make it any more parallel; it just defers some CPU work until a later event.
Here is a simple demo chat server that randomly emits "Hello World" every 0 to 60 seconds to any and all connected clients.
// A simple chat server using long-poll and timeout
var Http = require('http');

// Array of open callbacks listening for a result
var listeners = [];

Http.createServer(function (req, res) {
  function onData(data) {
    clearTimeout(timeout); // don't fire the timeout after we've responded
    res.end(data);
  }
  listeners.push(onData);

  // Set a timeout of 30 seconds
  var timeout = setTimeout(function () {
    // Remove our callback from the listeners array
    listeners.splice(listeners.indexOf(onData), 1);
    res.end("Timeout!");
  }, 30000);
}).listen(8080);

console.log("Server listening on 8080");

function emitEvent(data) {
  for (var i = 0, l = listeners.length; i < l; i++) {
    listeners[i](data);
  }
  listeners.length = 0;
}

// Simulate random events
function randomEvents() {
  emitEvent("Hello World");
  setTimeout(randomEvents, Math.random() * 60000);
}
setTimeout(randomEvents, Math.random() * 60000);
This will be quite fast. The only dangerous part is the splice, which can be slow if the array gets very large. This could be made more efficient by, instead of closing each connection 30 seconds from when it started, closing all the handlers at once every 30 seconds (or 30 seconds after the last event); a sketch of that variant is below. But again, this is unlikely to be the bottleneck, since each of those array items is backed by a real client connection that is probably more expensive.
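A sketch of that batched variant, for comparison. Note the trade-off: a client that connects just before a sweep gets its "Timeout!" almost immediately rather than after a full 30 seconds.

var Http = require('http');

var listeners = [];

Http.createServer(function (req, res) {
  // No per-connection timer and no splice needed
  listeners.push(function onData(data) { res.end(data); });
}).listen(8080);

// One shared timer sweeps every waiting connection at once
setInterval(function () {
  for (var i = 0, l = listeners.length; i < l; i++) {
    listeners[i]("Timeout!");
  }
  listeners.length = 0; // drop all handlers in one step
}, 30000);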