Throttle CPU-heavy Node.js action to allow new calls to be processed

I have an Express application that accepts a request which results in 1K to 50K fs.link() calls being executed (it might even hit 500K).
The request (a POST) is not held up while this occurs: I immediately fire off a res.send(), which keeps the client happy.
But the server then "forks" the job below, which has to perform all the fs.link() calls. They do run asynchronously, but the amount of work (CPU, disk, etc.) means the Express service is not very responsive to new requests during this time.
Is there some easy way (other than child_process) to simulate forking a low-priority thread that would do this file linking?
Job.prototype.runJob = function (next) {
    var self = this;
    var max = this.files.length;
    var count = 0;
    async.each(this.files,
        function (file, step) {
            var src = path.join(self.sourcePath, file.path);
            var base = path.basename(src);
            var dest = path.join(self.root, base);
            fs.link(src, dest, function (err) {
                if (err) {
                    // logger.addLog('warn', "fs.link failed for file: %s", err.message, { file: src });
                    self.filesMissingList.push(src);
                    self.errors = true;
                    self.filesMissing++;
                } else {
                    self.filesFound++;
                }
                self.batch.update({ tilesCount: ++count, tilesMax: max, done: false });
                step(null);
            });
        },
        function (err) {
            self.batch.update({ tilesCount: count, tilesMax: max, done: true });
            next(null, "FalconView Linking of: " + self.type + " run completed");
        });
}

You could use the webworker-threads module, which is good for spinning CPU-intensive tasks onto other threads. Alternatively, you could abuse cluster, but it's really the wrong tool for the job. (The cluster module is really better for scaling up web services, not for doing intensive tasks.)
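If you are on a newer Node version, the built-in worker_threads module gives you the same kind of off-main-thread execution without an extra dependency. A rough sketch only (the worker file name and the shape of the file list are made up for illustration):

// main.js - hand the heavy linking work to a worker thread
const { Worker } = require('worker_threads');

function runLinkJobInWorker(files) {
    return new Promise(function (resolve, reject) {
        const worker = new Worker('./link-worker.js', { workerData: { files: files } });
        worker.on('message', resolve); // e.g. { missing: <count> } reported by the worker
        worker.on('error', reject);
    });
}

// link-worker.js - runs in its own thread, so sync calls don't block the server
const { parentPort, workerData } = require('worker_threads');
const fs = require('fs');

let missing = 0;
workerData.files.forEach(function (file) {
    try {
        fs.linkSync(file.src, file.dest); // src/dest precomputed by the caller in this sketch
    } catch (err) {
        missing++;
    }
});
parentPort.postMessage({ missing: missing });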

You can try using async.eachLimit instead of async.each. That way you can control how many link operations are in flight at once, which leaves the event loop enough breathing room to process incoming Express requests.
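For example, here is a sketch of the job above rewritten with eachLimit (the limit of 10 is just a starting point to tune against your disk and CPU):

Job.prototype.runJob = function (next) {
    var self = this;
    var max = this.files.length;
    var count = 0;

    // at most 10 fs.link() calls in flight at once, so the event loop
    // gets regular chances to service new Express requests
    async.eachLimit(this.files, 10, function (file, step) {
        var src = path.join(self.sourcePath, file.path);
        var dest = path.join(self.root, path.basename(src));
        fs.link(src, dest, function (err) {
            if (err) {
                self.filesMissingList.push(src);
                self.errors = true;
                self.filesMissing++;
            } else {
                self.filesFound++;
            }
            self.batch.update({ tilesCount: ++count, tilesMax: max, done: false });
            step(null);
        });
    }, function (err) {
        self.batch.update({ tilesCount: count, tilesMax: max, done: true });
        next(null, "FalconView Linking of: " + self.type + " run completed");
    });
};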

How to run asynchronous tasks synchronously?

I'm developing an app with the following Node.js stack: Express/Socket.IO + React. In React I have DataTables in which you can search, and with every keystroke the data gets dynamically updated. :)
I use Socket.IO for data fetching, so on every keystroke the client socket emits some parameters and the server then calls the callback to return data. This works like a charm, but it is not guaranteed that the data comes back in the same order the client requested it.
To simulate this, the server simply echoes back what was typed: when I type 'a', the server responds with that same 'a', and so on for every character.
I found the async module for Node.js and tried to use its queue to return tasks in the order they were received. For simplicity I delayed the first incoming task with setTimeout to simulate a slow database query:
Declaration:
const async = require('async');
var queue = async.queue(function(task, callback) {
    if (task.count == 1) {
        setTimeout(function() {
            callback();
        }, 3000);
    } else {
        callback();
    }
}, 10);
Usage:
socket.on('result', function(data, fn) {
    var filter = data.filter;
    if (filter.length === 1) { // TEST SYNCHRONOUSLY
        queue.push({name: filter, count: 1}, function(err) {
            fn(filter);
            // console.log('finished processing slow');
        });
    } else {
        // add some items to the queue
        queue.push({name: filter, count: filter.length}, function(err) {
            fn(data.filter);
            // console.log('finished processing fast');
        });
    }
});
But when I search for abc, the order in which the responses arrive in the client console is:
ab -> abc -> a (after 3 sec)
I want them to come back like this: a (after 3 sec) -> ab -> abc
My guess is that the queue starts the setTimeout, moves on, and the setTimeout only fires later on the event loop. As a result the later search filters are returned earlier than the slow one.
How can I solve this problem?
First a few comments, which might help clear up your understanding of async calls:
Using "timeout" to try and align async calls is a bad idea, that is not the idea about async calls. You will never know how long an async call will take, so you can never set the appropriate timeout.
I believe you are misunderstanding the usage of queue from async library you described. The documentation for the queue can be found here.
Copy pasting the documentation in here, in-case things are changed or down:
Creates a queue object with the specified concurrency. Tasks added to the queue are processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one becomes available. Once a worker completes a task, that task's callback is called.
The above means the queue only limits how many async tasks are being worked on at the same time; the individual tasks can still finish at different times.
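A tiny illustration of that point (not from the question's code): with a concurrency of 2, a slow task pushed first can still finish after a fast task pushed second:

const async = require('async');

const queue = async.queue(function (task, callback) {
    setTimeout(callback, task.delay); // simulate work of varying duration
}, 2); // two workers run in parallel

queue.push({ name: 'slow', delay: 3000 }, function () { console.log('slow done'); });
queue.push({ name: 'fast', delay: 10 }, function () { console.log('fast done'); });
// prints "fast done" almost immediately, "slow done" about 3 seconds later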
Potential solutions
There are a few solutions to your problem, depending on your requirements:
1. Send only one async call at a time and wait for it to finish before sending the next one.
2. Store the results and only display them to the user once all calls have finished.
3. Disregard all calls except for the latest async call.
In your case I would pick solution 3, since you are searching for something: why would you care about the results for "a" if the user is already searching for "abc" before the response for "a" comes back?
This can be done by giving each request a counter (or timestamp) and only keeping the response that carries the latest value.
SOLUTION:
Server:
exports = module.exports = function(io) {
    io.sockets.on('connection', function (socket) {
        socket.on('result', function(data, fn) {
            var filter = data.filter;
            var counter = data.counter;
            if (filter.length === 1 || filter.length === 5) { // TEST SYNCHRONOUSLY
                setTimeout(function() {
                    fn({ filter: filter, counter: counter }); // return to client
                }, 3000);
            } else {
                fn({ filter: filter, counter: counter }); // return to client
            }
        });
    });
}
Client:
export class FilterableDataTable extends Component {
    constructor(props) {
        super(props);
        this.state = {
            endpoint: "http://localhost:3001",
            filters: {},
            counter: 0
        };
        this.onLazyLoad = this.onLazyLoad.bind(this);
    }

    onLazyLoad(event) {
        var offset = event.first;
        if (offset === null) {
            offset = 0;
        }
        var filter = ''; // filter is the search character
        if (event.filters.result2 != undefined) {
            filter = event.filters.result2.value;
        }
        var returnedData = null;
        this.state.counter++;
        var self = this; // the plain callback below has its own `this`
        this.socket.emit('result', {
            offset: offset,
            limit: 20,
            filter: filter,
            counter: this.state.counter
        }, function(data) {
            returnedData = data;
            console.log(returnedData);
            // only use the response if it belongs to the latest request
            if (returnedData.counter === self.state.counter) {
                console.log('DATA: ' + JSON.stringify(returnedData));
            }
        });
    }
}
This does, however, send unneeded data to the client, which then simply ignores it. Does anybody have ideas for optimizing this kind of communication further? For example, a method to keep the old data at the server and only send the latest?
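One possible refinement (just a sketch, not from the original thread): remember the most recent counter per socket on the server and skip the reply entirely for requests that have already been superseded, so stale results never cross the wire. runQuery here is a placeholder for whatever actually fetches the data:

io.sockets.on('connection', function (socket) {
    var latestCounter = 0; // per-connection bookkeeping

    socket.on('result', function (data, fn) {
        latestCounter = data.counter;

        runQuery(data.filter, function (err, rows) { // placeholder for the real data fetch
            if (data.counter !== latestCounter) {
                return; // a newer keystroke already superseded this request, send nothing
            }
            fn({ filter: data.filter, counter: data.counter, rows: rows });
        });
    });
});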

setTimeout or child_process.spawn?

I have a REST service in Node.js with one specific request that runs a bunch of DB commands and other file processing which can take 10-15 seconds. Since I didn't want to hold up my browser request thread, I wrote a separate .js script to do the work, called the script using child_process.spawn() in my Node.js code, and immediately returned OK to the client. This works fine, but then so does calling the same script (as a local function) by just using a simple setTimeout.
router.post("/longRequest", function(req, res) {
console.log("Started long request with id: " + req.body.id);
var longRunningFunction = function() {
// Usually runs a bunch of things that take time.
// Simulating a 10 sec delay for sample code.
setTimeout(function() {
console.log("Done processing for 10 seconds")
}, 10000);
}
// Below line used to be
// child_process.spawn('longRunningFunction.js'
setTimeout(longRunningFunction, 0);
res.json({status: "OK"})
})
So, this works for my purpose. But what's the downside? I probably can't monitor the offline process as easily as with child_process.spawn, which would give me a process id. But does this cause problems in the long run? Will it hold up Node.js processing if the 10 seconds of processing grows to a lot more in the future?
The actual longRunningFunction is something that reads an Excel file, parses it and does a bulk load using tedious to a MS SQL Server.
var XLSX = require('xlsx');
var FileAPI = require('file-api'), File = FileAPI.File, FileList = FileAPI.FileList, FileReader = FileAPI.FileReader;
var Connection = require('tedious').Connection;
var Request = require('tedious').Request;
var TYPES = require('tedious').TYPES;

var importFile = function() {
    var file = new File(fileName);
    if (file) {
        var reader = new FileReader();
        reader.onload = function (evt) {
            var data = evt.target.result;
            var workbook = XLSX.read(data, {type: 'binary'});
            var ws = workbook.Sheets[workbook.SheetNames[0]];
            var headerNames = XLSX.utils.sheet_to_json(ws, { header: 1 })[0];
            var rows = XLSX.utils.sheet_to_json(ws);
            var bulkLoad = connection.newBulkLoad(tableName, function (error, rowCount) {
                if (error) {
                    console.log("bulk upload error: " + error);
                } else {
                    console.log('inserted %d rows', rowCount);
                }
                connection.close();
            });
            // setup your columns - always indicate whether the column is nullable
            Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                bulkLoad.addColumn(columnName, columnsAndDataTypes[columnName].dataType, { length: columnsAndDataTypes[columnName].len, nullable: true });
            })
            rows.forEach(function(row) {
                var addRow = {}
                Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                    addRow[columnName] = row[columnName];
                })
                bulkLoad.addRow(addRow);
            })
            // execute
            connection.execBulkLoad(bulkLoad);
        };
        reader.readAsBinaryString(file);
    } else {
        console.log("No file!!");
    }
};
So, this works for my purpose. But what's the downside ?
If you actually have a long running task capable of blocking the event loop, then putting it on a setTimeout() is not stopping it from blocking the event loop at all. That's the downside. It's just moving the event loop blocking from right now until the next tick of the event loop. The event loop will be blocked the same amount of time either way.
If you just did res.json({status: "OK"}) before running your code, you'd get the exact same result.
If your long running code (which you describe as file and database operations) is actually blocking the event loop even though it is properly written using async I/O operations, then the only way to stop blocking the event loop is to move that CPU-consuming work out of the node.js thread.
That is typically done by clustering, moving the work to worker processes or moving the work to some other server. You have to have this work done by another process or another server in order to get it out of the way of the event loop. A setTimeout() by itself won't accomplish that.
child_process.spawn() will accomplish that. So, if you have an actual event loop blocking problem to solve and the I/O is already as async optimized as possible, then moving it to a worker process is a typical node.js solution. You can communicate with that child process in a number of ways, but one possibility would be via stdin and stdout.
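For illustration only (the file names and the message format here are made up), a parent and a worker script talking over stdin/stdout could look roughly like this:

// parent.js - spawn the worker, feed it a job, listen for its report
var spawn = require('child_process').spawn;

var worker = spawn('node', ['worker.js'], { stdio: ['pipe', 'pipe', 'inherit'] });

worker.stdout.on('data', function (chunk) {
    console.log('worker says: ' + chunk.toString().trim());
});

// describe the heavy job on one line of JSON
worker.stdin.write(JSON.stringify({ file: 'upload.xlsx' }) + '\n');

// worker.js - read the job from stdin, do the slow work, report on stdout
process.stdin.setEncoding('utf8');
process.stdin.on('data', function (line) {
    var job = JSON.parse(line);
    // ... parse the Excel file and bulk-load it here ...
    process.stdout.write('done with ' + job.file + '\n');
});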

Node.js: too many child processes?

I am using node to recursively traverse a file system and make a system call for each file using child.exec. It works well when tested on a small structure with a couple of folders and files, but when run on the whole home directory it crashes after a while:
child_process.js:945
throw errnoException(process._errno, 'spawn');
^
Error: spawn Unknown system errno 23
at errnoException (child_process.js:998:11)
at ChildProcess.spawn (child_process.js:945:11)
at exports.spawn (child_process.js:733:9)
at Object.exports.execFile (child_process.js:617:15)
at exports.exec (child_process.js:588:18)
Does this happen because it uses up all resources? How can I avoid this?
EDIT: Code
Improvement and best-practice suggestions are always welcome :)
function processDir(dir, callback) {
    fs.readdir(dir, function (err, files) {
        if (err) {...}
        if (files) {
            async.each(files, function (file, cb) {
                    var filePath = dir + "/" + file;
                    var stats = fs.statSync(filePath);
                    if (stats) {
                        if (stats.isFile()) {
                            processFile(dir, file, function (err) {
                                if (err) {...}
                                cb();
                            });
                        } else if (stats.isDirectory()) {
                            processDir(filePath, function (err) {
                                if (err) {...}
                                cb();
                            });
                        }
                    }
                }, function (err) {
                    if (err) {...}
                    callback();
                }
            );
        }
    });
}
The issue can be caused by having too many files open simultaneously.
Consider using the async module's eachLimit to solve it:
https://github.com/caolan/async#eachLimit
async.eachLimit(
    files,
    20,
    function(file, callback) {
        // process file here and call callback
    },
    function(err) {
        // done
    }
);
In the example above you will process at most 20 files at a time.
Well, I don't know the reason for the failure, but if it is what you expect (using up all of the resources) or what others say (too many open files), you could try multitasking. JXcore (a fork of Node.JS) offers exactly that: it lets you run a task in a separate instance, but still inside one single process.
While a Node.JS app as a process has its limitations, JXcore with its sub-instances multiplies those limits: a single process with even one extra instance (or task, or, let's call it, sub-thread) doubles them!
So, let's say you run each of your spawn() calls in a separate task. Or, since tasks no longer run in the main thread, you can even use the synchronous method that JXcore offers: cmdSync().
Probably the best illustration is given by these few lines of code:
jxcore.tasks.setThreadCount(4);

var task = function(file) {
    var your_cmd = "do something with " + file;
    return jxcore.utils.cmdSync(your_cmd);
};

jxcore.tasks.addTask(task, "file1.txt", function(ret) {
    console.log("the exit code:", ret.exitCode);
    console.log("output:", ret.out);
});
Let me repeat: the task will not block the main thread, since it is running in a separate instance!
Multitasking API is documented here: Multitasking.
As has been established in comments, you are likely running out of file handles because you are running too many concurrent operations on your files. So, a solution is to limit the number of concurrent operations that run at once so too many files aren't in use at the same time.
Here's a somewhat different implementation that uses Bluebird promises to control both the async aspects of the operation and the concurrency aspects of the operation.
To make the management of the concurrency aspect easier, this collects the entire list of files into an array first and then processes the array of filenames rather than processing as you go. This makes it easier to use a built-in concurrency capability in Bluebird's .map() (which works on a single array) so we don't have to write that code ourselves:
var Promise = require("bluebird");
var fs = Promise.promisifyAll(require("fs"));
var path = require("path");

// recurse a directory, call a callback on each file (that returns a promise)
// run a max of numConcurrent callbacks at once
// returns a promise for when all work is done
function processDir(dir, numConcurrent, fileCallback) {
    var allFiles = [];

    function listDir(dir) {
        return fs.readdirAsync(dir).map(function(file) {
            var filePath = path.join(dir, file);
            return fs.statAsync(filePath).then(function(stats) {
                if (stats.isFile()) {
                    allFiles.push(filePath);
                } else if (stats.isDirectory()) {
                    return listDir(filePath);
                }
            }).catch(function() {
                // ignore errors on .stat - file could just be gone now
                return;
            });
        });
    }

    return listDir(dir).then(function() {
        return Promise.map(allFiles, function(filename) {
            return fileCallback(filename);
        }, {concurrency: numConcurrent});
    });
}

// example usage:
// pass the initial directory,
// the number of concurrent operations allowed at once
// and a callback function (that returns a promise) to process each file
processDir(process.cwd(), 5, function(file) {
    // put your own code here to process each file
    // this is code to cause each callback to take a random amount of time
    // for testing purposes
    var rand = Math.floor(Math.random() * 500) + 500;
    return Promise.delay(rand).then(function() {
        console.log(file);
    });
}).catch(function(e) {
    // error here
}).finally(function() {
    console.log("done");
});
FYI, I think you'll find that proper error propagation and proper error handling from many async operations is much, much easier with promises than the plain callback method.

Console.time always returns 0.000ms

I'm using node-webkit to create a album manager and I'm setting up a recursive scan to find all my photos. I'm scanning some 10k files, but console.time just keeps returning 0.000ms. I know the scan is happening pretty quick, but it's not that quick. Am I doing something wrong?
var fs = require('fs');
var path = 'I:/pictures/';

console.time('read-directory');

var scanDirectory = function(path) {
    fs.readdir(path, function(err, files) {
        if (err) {
            console.log(err);
        } else {
            files.forEach(function(file) {
                fs.stat(path + file, function(err, stats) {
                    if (err) {
                        console.log(err);
                    } else {
                        if (stats.isDirectory()) {
                            scanDirectory(path + file + '/');
                        } else {
                            console.log(path + file);
                        }
                    }
                });
            });
        }
    });
}

scanDirectory(path);
console.timeEnd('read-directory');
You are using fs.readdir, which is asynchronous, so your timer does not depend on your scanDirectory execution at all.
In fact, calling scanDirectory(path) just launches the function, and the timer is stopped immediately afterwards.
If you want, you can use fs.readdirSync, which, being synchronous, prevents execution from jumping straight to the timer end. The problem is that it will freeze your application (if you use it directly like that) for that whole time and probably slow the scan down.
To measure the execution time of your asynchronous function you can use the profiler tool of node-webkit, but you will need to filter and sum the entries manually.
The other solution is to use timely (an npm package) that can time synchronous or asynchronous functions.
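If you just want a rough end-to-end number without extra tooling, another option (a sketch only, reusing the scan from the question) is to count the outstanding fs calls and stop the timer when the last one completes:

var fs = require('fs');
var pending = 0;

console.time('read-directory');

function scanDirectory(dir) {
    pending++; // one outstanding readdir
    fs.readdir(dir, function (err, files) {
        if (!err) {
            files.forEach(function (file) {
                pending++; // one outstanding stat per entry
                fs.stat(dir + file, function (err, stats) {
                    if (!err && stats.isDirectory()) {
                        scanDirectory(dir + file + '/');
                    }
                    if (--pending === 0) console.timeEnd('read-directory');
                });
            });
        }
        if (--pending === 0) console.timeEnd('read-directory');
    });
}

scanDirectory('I:/pictures/');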

node.js process out of memory in http.request loop

In my Node.js server I can't figure out why it runs out of memory. My server makes a remote HTTP request for each HTTP request it receives, so I've tried to replicate the problem with the sample script below, which also runs out of memory.
This only happens if the number of iterations in the for loop is very high.
From my point of view, the problem is that node.js is queueing up the remote HTTP requests. How can I avoid this?
This is the sample script:
(function() {
    var http, i, mypost, post_data;
    http = require('http');
    post_data = 'signature=XXX%7CPSFA%7Cxxxxx_value%7CMyclass%7CMysubclass%7CMxxxxx&schedule=schedule_name_6569&company=XXXX';
    mypost = function(post_data, cb) {
        var post_options, req;
        post_options = {
            host: 'myhost.com',
            port: 8000,
            path: '/set_xxxx',
            method: 'POST',
            headers: {
                'Content-Length': post_data.length
            }
        };
        req = http.request(post_options, function(res) {
            var res_data;
            res.setEncoding('utf-8');
            res_data = '';
            res.on('data', function(chunk) {
                return res_data += chunk;
            });
            return res.on('end', function() {
                return cb();
            });
        });
        req.on('error', function(e) {
            return console.debug('TM problem with request: ' + e.message);
        });
        req.write(post_data);
        return req.end();
    };
    for (i = 1; i <= 1000000; i++) {
        mypost(post_data, function() {});
    }
}).call(this);
$ node -v
v0.4.9
$ node sample.js
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Thanks in advance,
gulden PT
Constraining the flow of requests into the server
It's possible to prevent overload of the built-in Server and its HTTP/HTTPS variants by setting the maxConnections property on the instance. Setting this property will cause node to stop accept()ing connections and force the operating system to drop requests when the listen() backlog is full and the application is already handling maxConnections requests.
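For instance (a minimal sketch, not from the original text), on a plain HTTP server this is a single property:

var http = require('http')

var server = http.createServer(function (req, res) {
    res.end('OK')
})

// stop accepting new sockets beyond 100 concurrent connections;
// excess requests wait in the listen() backlog or are dropped by the OS
server.maxConnections = 100
server.listen(8000)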
Throttling outgoing requests
Sometimes, it's necessary to throttle outgoing requests, as in the example script from the question.
Using node directly or using a generic pool
As the question demonstrates, unchecked use of the node network subsystem directly can result in out of memory errors. Something like node-pool makes the active pool management attractive, but it doesn't solve the fundamental problem of unconstrained queuing. The reason for this is that node-pool doesn't provide any feedback about the state of the client pool.
UPDATE: As of v1.0.7 node-pool includes a patch inspired by this post to add a boolean return value to acquire(). The code in the following section is no longer necessary and the example with the streams pattern is working code with node-pool.
Cracking open the abstraction
As demonstrated by Andrey Sidorov, a solution can be reached by tracking the queue size explicitly and mingling the queuing code with the requesting code:
var useExplicitThrottling = function () {
    var active = 0
    var remaining = 10
    var queueRequests = function () {
        while(active < 2 && --remaining >= 0) {
            active++;
            pool.acquire(function (err, client) {
                if (err) {
                    console.log("Error acquiring from pool")
                    if (--active < 2) queueRequests()
                    return
                }
                console.log("Handling request with client " + client)
                setTimeout(function () {
                    pool.release(client)
                    if(--active < 2) {
                        queueRequests()
                    }
                }, 1000)
            })
        }
    }
    queueRequests(10)
    console.log("Finished!")
}
Borrowing the streams pattern
The streams pattern is a solution which is idiomatic in node. Streams have a write operation which returns false when the stream cannot buffer more data. The same pattern can be applied to a pool object with acquire() returning false when the maximum number of clients have been acquired. A drain event is emitted when the number of active clients drops below the maximum. The pool abstraction is closed again and it's possible to omit explicit references to the pool size.
var useStreams = function () {
    var queueRequests = function (remaining) {
        var full = false
        pool.once('drain', function() {
            if (remaining) queueRequests(remaining)
        })
        while(!full && --remaining >= 0) {
            console.log("Sending request...")
            full = !pool.acquire(function (err, client) {
                if (err) {
                    console.log("Error acquiring from pool")
                    return
                }
                console.log("Handling request with client " + client)
                setTimeout(pool.release, 1000, client)
            })
        }
    }
    queueRequests(10)
    console.log("Finished!")
}
Fibers
An alternative solution can be obtained by providing a blocking abstraction on top of the queue. The fibers module exposes coroutines that are implemented in C++. By using fibers, it's possible to block an execution context without blocking the node event loop. While I find this approach to be quite elegant, it is often overlooked in the node community because of a curious aversion to all things synchronous-looking. Notice that, excluding the callcc utility, the actual loop logic is wonderfully concise.
/* This is the call-with-current-continuation found in Scheme and other
 * Lisps. It captures the current call context and passes a callback to
 * resume it as an argument to the function. Here, I've modified it to fit
 * JavaScript and node.js paradigms by making it a method on Function
 * objects and using function (err, result) style callbacks.
 */
Function.prototype.callcc = function(context /* args... */) {
    var that = this,
        caller = Fiber.current,
        fiber = Fiber(function () {
            that.apply(context, Array.prototype.slice.call(arguments, 1).concat(
                function (err, result) {
                    if (err)
                        caller.throwInto(err)
                    else
                        caller.run(result)
                }
            ))
        })
    process.nextTick(fiber.run.bind(fiber))
    return Fiber.yield()
}

var useFibers = function () {
    var remaining = 10
    while(--remaining >= 0) {
        console.log("Sending request...")
        try {
            client = pool.acquire.callcc(this)
            console.log("Handling request with client " + client);
            setTimeout(pool.release, 1000, client)
        } catch (x) {
            console.log("Error acquiring from pool")
        }
    }
    console.log("Finished!")
}
Conclusion
There are a number of correct ways to approach the problem. However, for library authors or applications that require a single pool to be shared in many contexts it is best to properly encapsulate the pool. Doing so helps prevent errors and produces cleaner, more modular code. Preventing unconstrained queuing then becomes an evented dance or a coroutine pattern. I hope this answer dispels a lot of FUD and confusion around blocking-style code and asynchronous behavior and encourages you to write code which makes you happy.
Yes, you are trying to queue 1,000,000 requests before even starting any of them. This version keeps only a limited number of requests (100) in flight at a time:
function do_1000000_req(cb) {
    var num_active = 0;
    var num_finished = 0;
    var num_sheduled = 0;

    function shedule() {
        while (num_active < 100 && num_sheduled < 1000000) {
            num_active++;
            num_sheduled++;
            mypost(post_data, function() {
                num_active--;
                num_finished++;
                if (num_finished == 1000000) {
                    cb();
                    return;
                } else if (num_sheduled < 1000000) {
                    shedule();
                }
            });
        }
    }

    shedule(); // kick off the first batch of 100
}

do_1000000_req(function() {
    console.log('done!');
});
The node-pool module can help you. For more details, see this post (in French): http://blog.touv.fr/2011/08/http-request-loop-in-nodejs.html
