In my app I use node-cache to hold data that is needed very frequently. This data can change, so I am using node-cron to periodically kick off a child process that pulls fresh data from Mongo and passes it back to node-cache to overwrite the key/value pairs.
The data is needed in a middleware. If I run the cron job in the main server.js file everything works fine, but the node-cache data obviously isn't accessible in the middleware. If I run the cron job in the middleware, it runs the cron + cache setup every time the middleware is used (i.e. it triggers the child process, the Mongo calls, etc.), defeating the purpose of the cache + cron approach.
How does one go about solving this issue? Thanks.
The cron code that kicks off the cache refresh (via child process) looks like this:
var companyCache = companyCache || new cache();

new CronJob('0 */15 * * * *', function () {
    var child = childProcess.fork(process.cwd() + "/lib/cacheCronWorker.js"); //,[],{execArgv: ['--debug-brk=55555']});
    console.log("initialized child process for cacheCron");
    child.send('refresh_company_cache');
    child.on('message', function (m) {
        if (m !== 'done') {
            // Receive results from child process
            companyCache.set(m.key, m.currObj, function (err, success) {
                if (err) {
                    console.log("node-cache hit an error: " + err);
                }
                if (!err && success) {
                    // value stored successfully
                }
            });
        } else if (m === 'done') {
            console.log("company cache load is " + m + " disconnecting child process");
            child.disconnect();
        }
    });
    console.log('You will see this message every 15 min when we refresh the cache');
}, null, true, 'America/Chicago', null, true);
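One common way to handle this (a sketch of my own, not from the original post) is to put the cache and the cron job in their own module and export the cache instance. Because Node caches modules, every require() of that file returns the same object, so the middleware reads exactly what the cron job writes. The file names below (lib/companyCache.js, lib/cacheCronWorker.js) follow the paths in the question; everything else is illustrative:

// lib/companyCache.js -- hypothetical module that owns both the cache and the cron job
var NodeCache = require('node-cache');
var CronJob = require('cron').CronJob;
var childProcess = require('child_process');

var companyCache = new NodeCache();

// The job starts once, the first time this module is required.
new CronJob('0 */15 * * * *', function () {
    var child = childProcess.fork(process.cwd() + "/lib/cacheCronWorker.js");
    child.send('refresh_company_cache');
    child.on('message', function (m) {
        if (m === 'done') {
            child.disconnect();
        } else {
            companyCache.set(m.key, m.currObj);
        }
    });
}, null, true, 'America/Chicago', null, true);

module.exports = companyCache;

// middleware.js -- requires the same instance, no cron code here
var companyCache = require('./lib/companyCache');

module.exports = function (req, res, next) {
    var company = companyCache.get(req.params.companyId); // hypothetical key
    // ...use the cached value, fall back to Mongo if it is undefined...
    next();
};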
I am using node-cron to send operational emails (currently it searches every minute for new emails that need to be sent).
My function (operationalEmails) looks for MongoDB records with a sent flag of false, sends the email, then sets the sent flag to true.
If an iteration of the cron has a large payload of records to send, it could take more than a minute.
How do I ensure the last iteration of the cron is complete before starting a new one?
// Run every minute
cron.schedule("* * * * *", () => {
    operationalEmails();
});
You would need to create a simple lock:
let running = false;

async function operationalEmails() {
    if (running) {
        return;
    }
    running = true;
    try {
        // do stuff: find unsent records, send the emails, set the sent flag to true
        // (await the async work here so the lock is held until it actually finishes)
    } finally {
        running = false;
    }
}
// Run every minute
cron.schedule("* * * * *", () => {
    operationalEmails();
});
I'm developing an app with the following Node.js stack: Express/Socket.IO + React. In React I have DataTables in which you can search, and with every keystroke the data gets dynamically updated.
I use Socket.IO for data fetching, so on every keystroke the client socket emits some parameters and the server then calls the callback to return data. This works like a charm, but it is not guaranteed that the returned data comes back in the same order the client sent the requests.
To simulate this: when I type 'a', the server responds with that same 'a', and so on for every character.
I found the async module for Node.js and tried to use its queue to return tasks in the same order they were received. For simplicity I delayed the single-character task with setTimeout to simulate a slow database query:
Declaration:
const async = require('async');

var queue = async.queue(function(task, callback) {
    if (task.count == 1) {
        setTimeout(function() {
            callback();
        }, 3000);
    } else {
        callback();
    }
}, 10);
Usage:
socket.on('result', function(data, fn) {
    var filter = data.filter;
    if (filter.length === 1) { // TEST SYNCHRONOUSLY
        queue.push({name: filter, count: 1}, function(err) {
            fn(filter);
            // console.log('finished processing slow');
        });
    } else {
        // add some items to the queue
        queue.push({name: filter, count: filter.length}, function(err) {
            fn(data.filter);
            // console.log('finished processing fast');
        });
    }
});
But what I see in the client console when I search for "abc" is:
ab -> abc -> a (after 3 sec)
I want it to come back like this: a (after 3 sec) -> ab -> abc
My thought is that the queue starts the setTimeout, moves on, and the setTimeout fires later on the event loop. This results in later search filters being returned earlier than the slow one.
How can I solve this problem?
First a few comments, which might help clear up your understanding of async calls:
Using "timeout" to try and align async calls is a bad idea, that is not the idea about async calls. You will never know how long an async call will take, so you can never set the appropriate timeout.
I believe you are misunderstanding the usage of queue from async library you described. The documentation for the queue can be found here.
Copy pasting the documentation in here, in-case things are changed or down:
Creates a queue object with the specified concurrency. Tasks added to the queue are processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one becomes available. Once a worker completes a task, that task's callback is called.
The above means that the queue simply prioritizes which async task a given worker picks up next. The different async tasks can still finish at different times.
Potential solutions
There are a few solutions to your problem, depending on your requirements.
1. You only send one async call at a time and wait for it to finish before sending the next one.
2. You store the results and only display them to the user when all calls have finished.
3. You disregard all calls except for the latest async call.
In your case I would pick solution 3, as you are searching for something. Why would you care about the results for "a" if the user is already searching for "abc" before the response for "a" arrives?
This can be done by giving each request a timestamp (or a counter, as below) and keeping only the response that matches the latest one.
SOLUTION:
Server:
exports = module.exports = function(io) {
    io.sockets.on('connection', function (socket) {
        socket.on('result', function(data, fn) {
            var filter = data.filter;
            var counter = data.counter;
            if (filter.length === 1 || filter.length === 5) { // TEST SYNCHRONOUSLY
                setTimeout(function() {
                    fn({ filter: filter, counter: counter }); // return to client
                }, 3000);
            } else {
                fn({ filter: filter, counter: counter }); // return to client
            }
        });
    });
};
Client:
import React, { Component } from 'react';

export class FilterableDataTable extends Component {
    constructor(props) {
        super(props);
        this.state = {
            endpoint: "http://localhost:3001",
            filters: {},
            counter: 0
        };
        this.onLazyLoad = this.onLazyLoad.bind(this);
        // this.socket is assumed to be initialized elsewhere (e.g. in componentDidMount)
    }

    onLazyLoad(event) {
        var offset = event.first;
        if (offset === null) {
            offset = 0;
        }
        var filter = ''; // filter is the search character
        if (event.filters.result2 != undefined) {
            filter = event.filters.result2.value;
        }
        var returnedData = null;
        this.state.counter++;
        this.socket.emit('result', {
            offset: offset,
            limit: 20,
            filter: filter,
            counter: this.state.counter
        }, (data) => { // arrow function so `this` still refers to the component
            returnedData = data;
            console.log(returnedData);
            if (returnedData.counter === this.state.counter) {
                console.log('DATA: ' + JSON.stringify(returnedData));
            }
        });
    }
}
This does, however, send unneeded data to the client, which in turn ignores it. Does anybody have ideas for further optimizing this kind of communication? For example, a way to keep track of stale responses on the server and only send the latest one?
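One possible refinement (my own sketch, not part of the original answer) is to track the highest counter seen per socket on the server and skip the callback for requests that are already stale, so old results are never sent over the wire at all. fetchResults below is a placeholder for the real database query:

exports = module.exports = function(io) {
    io.sockets.on('connection', function (socket) {
        var latestCounter = 0; // highest counter seen on this connection

        socket.on('result', function(data, fn) {
            var counter = data.counter;
            if (counter > latestCounter) {
                latestCounter = counter;
            }

            fetchResults(data, function(results) { // placeholder for the real query
                if (counter === latestCounter) {
                    fn({ filter: data.filter, counter: counter, results: results });
                }
                // otherwise a newer request has arrived in the meantime; drop this response
            });
        });
    });
};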
I have a REST service in Node.js where one specific request runs a bunch of DB commands and other file processing that could take 10-15 seconds. Since I didn't want to hold up the browser request, I wrote a separate .js script to do the work, called it with child_process.spawn() in my Node.js code, and immediately returned OK to the client. This works fine, but so does calling the same code (as a local function) with a simple setTimeout.
router.post("/longRequest", function(req, res) {
console.log("Started long request with id: " + req.body.id);
var longRunningFunction = function() {
// Usually runs a bunch of things that take time.
// Simulating a 10 sec delay for sample code.
setTimeout(function() {
console.log("Done processing for 10 seconds")
}, 10000);
}
// Below line used to be
// child_process.spawn('longRunningFunction.js'
setTimeout(longRunningFunction, 0);
res.json({status: "OK"})
})
So, this works for my purpose. But what's the downside? I probably can't monitor the background work as easily as with child_process.spawn, which would give me a process id. But does this cause problems in the long run? Will it hold up Node.js processing if the 10-second processing grows much longer in the future?
The actual longRunningFunction reads an Excel file, parses it, and does a bulk load into MS SQL Server using tedious.
var XLSX = require('xlsx');
var FileAPI = require('file-api'), File = FileAPI.File, FileList = FileAPI.FileList, FileReader = FileAPI.FileReader;
var Connection = require('tedious').Connection;
var Request = require('tedious').Request;
var TYPES = require('tedious').TYPES;

// fileName, tableName, columnsAndDataTypes and connection are assumed to be defined elsewhere
var importFile = function() {
    var file = new File(fileName);
    if (file) {
        var reader = new FileReader();
        reader.onload = function (evt) {
            var data = evt.target.result;
            var workbook = XLSX.read(data, {type: 'binary'});
            var ws = workbook.Sheets[workbook.SheetNames[0]];
            var headerNames = XLSX.utils.sheet_to_json(ws, { header: 1 })[0];
            var rows = XLSX.utils.sheet_to_json(ws);

            var bulkLoad = connection.newBulkLoad(tableName, function (error, rowCount) {
                if (error) {
                    console.log("bulk upload error: " + error);
                } else {
                    console.log('inserted %d rows', rowCount);
                }
                connection.close();
            });

            // setup your columns - always indicate whether the column is nullable
            Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                bulkLoad.addColumn(columnName, columnsAndDataTypes[columnName].dataType, { length: columnsAndDataTypes[columnName].len, nullable: true });
            });

            rows.forEach(function(row) {
                var addRow = {};
                Object.keys(columnsAndDataTypes).forEach(function(columnName) {
                    addRow[columnName] = row[columnName];
                });
                bulkLoad.addRow(addRow);
            });

            // execute
            connection.execBulkLoad(bulkLoad);
        };
        reader.readAsBinaryString(file);
    } else {
        console.log("No file!!");
    }
};
So, this works for my purpose. But what's the downside?
If you actually have a long-running task capable of blocking the event loop, then putting it on a setTimeout() does not stop it from blocking the event loop at all. That's the downside. It just moves the blocking from right now to the next tick of the event loop. The event loop will be blocked for the same amount of time either way.
If you just did res.json({status: "OK"}) before running your code, you'd get the exact same result.
If your long-running code (which you describe as file and database operations) is actually blocking the event loop even though it is properly written using async I/O operations, then the only way to stop blocking the event loop is to move that CPU-consuming work out of the node.js thread.
That is typically done by clustering, moving the work to worker processes or moving the work to some other server. You have to have this work done by another process or another server in order to get it out of the way of the event loop. A setTimeout() by itself won't accomplish that.
child_process.spawn() will accomplish that. So, if you have an actual event loop blocking problem to solve and the I/O is already as async optimized as possible, then moving it to a worker process is a typical node.js solution. You can communicate with that child process in a number of ways, but one possibility would be via stdin and stdout.
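As a rough sketch of that approach (file names and the message format are my own assumptions, not the poster's code), the Excel import could live in a worker script that is started with child_process.fork(), which also gives you a message channel and an exit event for monitoring:

// server.js
var childProcess = require('child_process');

router.post("/longRequest", function(req, res) {
    // importWorker.js is a hypothetical script that wraps the importFile() logic
    var child = childProcess.fork(__dirname + '/importWorker.js');
    child.send({ id: req.body.id, fileName: req.body.fileName });

    child.on('message', function(msg) {
        console.log('import finished for ' + req.body.id + ': ' + JSON.stringify(msg));
    });
    child.on('exit', function(code) {
        console.log('import worker exited with code ' + code);
    });

    res.json({status: "OK"}); // respond immediately; the worker keeps running
});

// importWorker.js
process.on('message', function(msg) {
    // run the Excel parse + bulk load here, then report back and exit
    importFile(msg.fileName, function(err, rowCount) { // hypothetical async version of importFile
        process.send({ err: err ? String(err) : null, rowCount: rowCount });
        process.exit(err ? 1 : 0);
    });
});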
I have an expressJS application that accepts a request which results in 1K to 50K fs.link() actions being executed (it might even hit 500K).
The request (a POST) is not held up while this occurs. I immediately fire off a res.send(), which makes the client happy.
But the server then "forks" the job below, which goes off and does all the fs.link() calls. They do happen async, but the amount of work (CPU, disk, etc.) means that the ExpressJS service is not very responsive to new requests during this time.
Is there some easy way (other than childProcess) to simulate forking a low-priority thread that would do this file linking?
Job.prototype.runJob = function (next) {
    var self = this;
    var max = this.files.length;
    var count = 0;

    async.each(this.files,
        function (file, step) {
            var src = path.join(self.sourcePath, file.path);
            var base = path.basename(src);
            var dest = path.join(self.root, base);
            fs.link(src, dest, function (err) {
                if (err) {
                    // logger.addLog('warn', "fs.link failed for file: %s", err.message, { file: src });
                    self.filesMissingList.push(src);
                    self.errors = true;
                    self.filesMissing++;
                } else {
                    self.filesFound++;
                }
                self.batch.update({ tilesCount: ++count, tilesMax: max, done: false });
                step(null);
            });
        },
        function (err) {
            self.batch.update({ tilesCount: count, tilesMax: max, done: true });
            next(null, "FalconView Linking of: " + self.type + " run completed");
        });
};
You could use the webworker-threads module, which is good for spinning CPU-intensive tasks onto other threads. Alternatively, you could abuse cluster, but it's really the wrong tool for the job. (The cluster module is really better for scaling up web services, not for doing intensive tasks.)
You can try to use async.eachLimit instead of async.each. This way you can limit how many fs.link() calls are in flight at once, which leaves the Express process room to handle other requests in between.
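A minimal sketch of that change, reusing the runJob from the question (the limit of 50 is an arbitrary number to tune against your disk and CPU):

Job.prototype.runJob = function (next) {
    var self = this;
    var max = this.files.length;
    var count = 0;

    // Only 50 fs.link() calls run concurrently; the rest wait their turn.
    async.eachLimit(this.files, 50,
        function (file, step) {
            var src = path.join(self.sourcePath, file.path);
            var dest = path.join(self.root, path.basename(src));
            fs.link(src, dest, function (err) {
                if (err) {
                    self.filesMissingList.push(src);
                    self.errors = true;
                    self.filesMissing++;
                } else {
                    self.filesFound++;
                }
                self.batch.update({ tilesCount: ++count, tilesMax: max, done: false });
                step(null);
            });
        },
        function (err) {
            self.batch.update({ tilesCount: count, tilesMax: max, done: true });
            next(null, "FalconView Linking of: " + self.type + " run completed");
        });
};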
I want to stop executing my async.queue after the first task error occurs. I need to perform several similar actions in parallel with a concurrency restriction, but stop all the actions after the first error. How can I do that, or what should I use instead?
Assume you fired 5 parallel functions, each taking 5 seconds. In the 3rd second, function 1 fails. How can you then stop the execution of the rest?
It depends on what those functions do; you may poll using setInterval. However, if your question is how to stop further tasks from being pushed to the queue, you may do this:
q.push(tasks, function (err) {
    if (err && !called) {
        // Will prevent async from pushing more tasks to the queue; however, please note that
        // whatever was already pushed to the queue will be processed anyway.
        q.kill();
        // This will not allow double calling of the final callback
        called = true;
        // This is the main process callback, the final callback
        main(err, results);
    }
});
Here is a full working example:
var async = require('async');

/*
  This function is the actual work you are trying to do.
  Please note, for example, that if you are running child processes
  here, doing q.kill() will not stop the execution of those
  processes, so you actually need to keep track of the spawned
  processes and kill them yourself when you call q.kill() in the
  'pushCb' function. In case of just a long-running function,
  you may poll using setInterval.
*/
function worker(task, wcb) {
    setTimeout(function workerTimeout() {
        if (task === 11 || task === 12 || task === 3) {
            return wcb('error in processing ' + task);
        }
        wcb(null, task + ' got processed');
    }, Math.floor(Math.random() * 100));
}

/*
  This function pushes the tasks to async.queue,
  which then hands them to your worker function.
*/
function process(tasks, concurrency, pcb) {
    var results = [], called = false;

    var q = async.queue(function qWorker(task, qcb) {
        worker(task, function wcb(err, data) {
            if (err) {
                return qcb(err); // Here is how we propagate the error to qcb
            }
            results.push(data);
            qcb();
        });
    }, concurrency);

    /*
      The trick is in this function. Note that checking q.tasks.length
      does not work. q.kill, introduced in async 0.7.0, just sets
      the drain function to null and the tasks length to zero.
    */
    q.push(tasks, function qcb(err) {
        if (err && !called) {
            q.kill();
            called = true;
            pcb(err, results);
        }
    });

    q.drain = function drainCb() {
        pcb(null, results);
    };
}

var tasks = [];
var concurrency = 10;
for (var i = 1; i <= 20; i += 1) {
    tasks.push(i);
}

process(tasks, concurrency, function pcb(err, results) {
    console.log(results);
    if (err) {
        return console.log(err);
    }
    console.log('done');
});
The async documentation on the GitHub page is either outdated or incorrect: when inspecting the queue object returned by async.queue(), I do not see a kill() method.
Nevertheless, there is a way around it. The queue object has a tasks property, which is an array; simply assigning an empty array to it did the trick for me.
queue.push(someTasks, function (err) {
    if (err) queue.tasks = [];
});
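Putting the two answers together, a minimal sketch (workerFn, concurrency and tasks are placeholders; which branch you need depends on your async version):

var q = async.queue(workerFn, concurrency);

q.push(tasks, function (err) {
    if (err) {
        // Stop handing out queued tasks; tasks that are already running will still finish.
        if (typeof q.kill === 'function') {
            q.kill();       // available since async 0.7.0
        } else {
            q.tasks = [];   // fallback: empty the pending task list directly
        }
    }
});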