Node.js library for executing long-running background tasks

I have an architecture with an Express.js web server that accepts new tasks over a REST API.
Furthermore, I must have another process that creates and supervises many other tasks on other servers (a distributed system). This process should run in the background for a very long time (months, years).
Now the question is:
1)
Should I create one single Node.js app with a task queue such as bull.js/Redis or Celery/Redis that basically launches this long-running task once at the beginning (sketched below)?
Or
2)
Should I have two processes, one for the REST API and another daemon process that schedules and manages the tasks in the distributed system?
I heavily lean towards solution 2).
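To make option 1) concrete, the queue part would look roughly like the following minimal sketch (the queue name and job payload are placeholders, and an existing Express app is assumed; this is an illustration, not a full implementation):

// Sketch of option 1): the REST handler enqueues work into a bull queue backed
// by Redis, and a processor registered on the same queue inside the same app
// picks it up. Queue name and payload fields are illustrative.
const Queue = require('bull');
const taskQueue = new Queue('long-running-tasks', 'redis://127.0.0.1:6379');

// Producer side (inside the REST handler): enqueue and return immediately.
app.post('/tasks', async (req, res) => {
  const job = await taskQueue.add({ payload: req.body });
  res.json({ jobId: job.id });
});

// Consumer side: runs in the same Node.js process and supervises the real work.
taskQueue.process(async job => {
  // dispatch/supervise the task on the other servers here
  return { done: true };
});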

I am facing the same problem now. As we know, Node.js runs JavaScript on a single thread, but we can create workers to run work in parallel, or to handle functions that take some time and that we don't want to affect our main server. Fortunately, Node.js supports multi-threading via worker threads.
Take a look at this example:
const {
  Worker, isMainThread, parentPort, workerData
} = require('worker_threads');

if (isMainThread) {
  // Main thread: export a function that offloads the parsing to a worker.
  module.exports = function parseJSAsync(script) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, {
        workerData: script
      });
      worker.on('message', resolve);
      worker.on('error', reject);
      worker.on('exit', (code) => {
        if (code !== 0)
          reject(new Error(`Worker stopped with exit code ${code}`));
      });
    });
  };
} else {
  // Worker thread: do the heavy parsing and post the result back.
  const { parse } = require('some-js-parsing-library');
  const script = workerData;
  parentPort.postMessage(parse(script));
}
https://nodejs.org/api/worker_threads.html
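For completeness, a caller on the main thread would use the exported function along these lines (illustrative only; the module path is a placeholder for wherever the example above lives):

// Hypothetical caller: the promise resolves with whatever the worker posts back
// via parentPort.postMessage().
const parseJSAsync = require('./parse-async');

parseJSAsync('const answer = 42;')
  .then(result => console.log('parsed:', result))
  .catch(err => console.error('worker failed:', err));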
Search for some articles about multi-threading in Node.js, but remember one thing: state cannot be shared directly between threads. You can use a message broker like Kafka, RabbitMQ (my recommendation), or Redis to handle such needs.
Kafka is quite difficult to configure in production.
RabbitMQ is good because you can also persist messages, queues, and so on to local storage. But personally I could not find any proper solution for load balancing these threads. Maybe this is not your answer, but I hope you get some clue here.

Related

worker thread won't respond after first message?

I'm making a server script and, to make it easier for both hosts and clients to do what they want, I made a customizable server script that runs using NW.js (with a visual interface). Said script was made using web workers, since NW.js was having problems with its support for worker threads.
Now that NW.js has fixed its problems with worker threads, I've been trying to move everything that was inside the web workers to worker threads, but there's a problem: when the main thread receives the answer from the second thread, the latter stops responding to any subsequent message.
For example, running the following code with either NW.js or Node.js itself will print "pong" only once:
const { Worker } = require('worker_threads');
const worker = new Worker('const { parentPort } = require("worker_threads");parentPort.once("message",message => parentPort.postMessage({ pong: message })); ', { eval: true });
worker.on('message', message => console.log(message));
worker.postMessage('ping');
worker.postMessage('ping');
How do I configure the worker so it will keep responding to whatever message it receives after the first one?
Because you use the EventEmitter.once() method. According to the documentation, this method does the following:
Adds a one-time listener function for the event named eventName. The next time eventName is triggered, this listener is removed and then invoked.
If you need your worker to process more than one event, then use EventEmitter.on():
const worker = new Worker(
  'const { parentPort } = require("worker_threads");' +
  'parentPort.on("message", message => parentPort.postMessage({ pong: message }));',
  { eval: true }
);

Node.js Cluster Architecture reading from a single Redis instance

I'm using the Node.js cluster module to have multiple workers running.
I created a basic architecture where there will be a single MASTER process, which is basically an Express server handling multiple requests; the main task of MASTER is writing incoming data from requests into a Redis instance. The other workers (numOfCPUs - 1) are non-master, i.e. they won't handle any requests, as they are just consumers. I have two features, namely ABC and DEF. I distributed the non-master workers evenly across the features by assigning them a type.
For example, on an 8-core machine:
1 will be the MASTER instance handling requests via the Express server.
The remaining (8 - 1 = 7) will be distributed evenly: 4 to feature:ABC and 3 to feature:DEF.
Non-master workers are basically consumers, i.e. they read from Redis, into which only the MASTER worker can write data.
Here's the code for the same:
if (cluster.isMaster) {
  // Fork workers.
  for (let i = 0; i < numCPUs - 1; i++) {
    ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
  }
  cluster.on('exit', function(worker) {
    console.log(`Worker ${worker.process.pid}::type(${worker.type}) died`);
    ClusteringUtil.removeWorkerFromList(worker.type);
    ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
  });
  // Start consuming on server-start
  ABCConsumer.start();
  DEFConsumer.start();
  console.log(`Master running with process-id: ${process.pid}`);
} else {
  console.log('CLUSTER type', cluster.worker.process.env.type, 'running on', process.pid);
  if (
    cluster.worker.process.env &&
    cluster.worker.process.env.type &&
    cluster.worker.process.env.type === ServerTypeEnum.EXPRESS
  ) {
    // worker for handling requests
    app.use(express.json());
    ...
  }
}
Everything works fine except the consumers reading from Redis.
Since there are multiple consumers for a particular feature, each one reads the same message and starts processing it individually, which is what I don't want. If there are 4 consumers, 1 is marked as busy and cannot consume until it is free again, and 3 are available. Once a message for that particular feature is written into Redis by MASTER, the problem is that all 3 available consumers of that feature start consuming it. This means that for a single message, the job is done as many times as there are available consumers.
const stringifedData = JSON.stringify(req.body);
const key = uuidv1();
const asyncHsetRes = await asyncHset(type, key, stringifedData);
if (asyncHsetRes) {
  await asyncRpush(FeatureKeyEnum.REDIS.ABC_MESSAGE_QUEUE, key);
  res.send({ status: 'success', message: 'Added to processing queue' });
} else {
  res.send({ error: 'failure', message: 'Something went wrong in adding to queue' });
}
A consumer simply accepts messages and stops when it is busy:
module.exports.startHeartbeat = startHeartbeat = async function(config = {}) {
  if (!config || !config.type || !config.listKey) {
    return;
  }
  heartbeatIntervalObj[config.type] = setInterval(async () => {
    await asyncLindex(config.listKey, -1).then(async res => {
      if (res) {
        await getFreeWorkerAndDoJob(res, config);
        stopHeartbeat(config);
      }
    });
  }, HEARTBEAT_INTERVAL);
};
Ideally, a message should be read by only one consumer of that particular feature. After consuming, it is marked as busy so it won't consume further until free (I have handled this). The next message should then be processed by only one of the other available consumers.
Please help me in tackling this problem. Again, I want one message to be read by only one free consumer, and the rest of the free consumers should wait for a new message.
Thanks
I'm not sure I fully get your Redis consumer architecture, but I feel like it contradicts the use case of Redis itself. What you're trying to achieve is essentially queue-based messaging with the ability to acknowledge a message once it's done.
Redis has its own pub/sub feature, but it is built on a fire-and-forget principle. It doesn't distinguish between consumers - it just sends the data to all of them, assuming that it's their logic to handle the incoming data.
I recommend you use a queue server like RabbitMQ. You can achieve your goal with some features that AMQP 0-9-1 supports: message acknowledgments, a consumer prefetch count, and so on. You can set up your cluster with very flexible configs like: I want to have X consumers, each can handle 1 unique (!) message at a time, and they will receive new ones only after they let the server (RabbitMQ) know that they successfully finished processing the message. This is highly configurable and robust.
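A rough sketch of that setup with the amqplib client (the queue name and the processMessage handler are illustrative, not from the question):

// Sketch: each consumer takes at most one unacknowledged message at a time,
// so a message is processed by exactly one free consumer.
// 'abc_message_queue' and processMessage() are placeholders.
const amqp = require('amqplib');

async function startConsumer() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('abc_message_queue', { durable: true });
  ch.prefetch(1); // don't deliver a new message until the previous one is acked
  ch.consume('abc_message_queue', async msg => {
    await processMessage(JSON.parse(msg.content.toString()));
    ch.ack(msg); // only now will RabbitMQ dispatch the next message to this consumer
  }, { noAck: false });
}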
However, if you want to go serverless with a fully managed service, so that you don't have to provision virtual machines or anything else to run a message queue server of your choice, you can use AWS SQS. It has a pretty similar API and feature list.
Hope it helps!

Multiple AWS Instances and Node events

I have an implementation in Node where an API, when called, does some processing and waits for an event from another function before returning the response. This works fine when run locally and when running on a single instance in AWS, but when multiple instances are involved there are some issues, which I'm assuming is because the API is being called on one instance and the event is being emitted on another instance. Is there any way to keep the listeners and emitters the same across all instances?
Update:
After some research I found that using an application load balancer with some logic for routing can help with this issue. I am marking the answer below as correct because, while it did not help me with AWS autoscaling, it did help me find an alternate solution to my problem.
AFAIU, you think that an event emitted from one process is being handled in a different process, but that would never be the case from what I know, because each process has its own memory and events are associated with that process only.
I have added sample code that demonstrates what I mean. Maybe if you post the code you are referring to, we could check what went wrong.
const cluster = require("cluster");
const EventEmitter = require("events");

if (cluster.isMaster) {
  cluster.fork();
  const myEE = new EventEmitter();
  myEE.on("foo", arg =>
    console.log("emitted from ", arg, "received in master")
  );
  setTimeout(() => {
    myEE.emit("foo", "master");
  }, 1000);
} else {
  const myEE = new EventEmitter();
  myEE.on("foo", arg => console.log("emitted from", arg, "received in worker"));
  setTimeout(() => {
    myEE.emit("foo", "client");
  }, 2000);
}
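If you do need an event raised on one instance to reach listeners on every instance, the usual approach is an external pub/sub channel rather than in-process EventEmitters. A minimal sketch, assuming a shared Redis instance and the ioredis client (neither is part of the answer above, just an illustration):

// Sketch: every instance subscribes to the same Redis channel, so an event
// published by any instance is delivered to all of them. The channel name is made up.
const Redis = require("ioredis");
const publisher = new Redis();   // connection used only for PUBLISH
const subscriber = new Redis();  // a subscribed connection cannot issue other commands

subscriber.subscribe("app-events");
subscriber.on("message", (channel, payload) => {
  const event = JSON.parse(payload);
  console.log("received on this instance:", event);
});

function emitClusterWide(name, data) {
  publisher.publish("app-events", JSON.stringify({ name, data }));
}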

Node.js main thread blocking

I'm doing artificial intelligence training, and that method takes too long (5 minutes). While my method is running, my backend stays blocked waiting for the method to finish. Is there any solution for this?
train: (answersIA, search, res) => {
  const path = `api/IA/${search.id}`;
  if (answersIA.length == 0) {
    return null;
  } else {
    answersIA = answersIA.filter(answer => answer.score);
    answersIA = answersService.setOutput(answersIA);
    answersIA = answersService.encodeAll(IAServiceTrain.adjustSize(answersIA));
    net.train(answersIA, {
      errorThresh: 0.005,
      iterations: 100,
      log: true,
      logPeriod: 10,
      learningRate: 0.3
    });
    mkdirp(path, (err) => {
      if (err) throw 'No permission to create trained-net file';
      fs.writeFileSync(`${path}/trained-net.js`, `${net.toFunction().toString()};`);
      res.json({
        message: 'Trained!'
      });
    });
  }
}
Use child_process to perform the training in a separate process. Here is how it can be implemented:
trainMaster.js
const fork = require('child_process').fork;

// Fork a worker process, hand it the training arguments and wait for the result.
exports.train = (args, callback) => {
  const worker = fork(`${__dirname}/trainWorker.js`);
  worker.send(args);
  worker.on('message', result => {
    callback(result);
    worker.kill();
  });
};
trainWorker.js
process.on('message', args => {
  const result = train(args);
  process.send(result);
});

function train(args) { /* your training logic */ }
So when you call train from trainMaster.js in your main application, it doesn't do the actual training and doesn't block the event loop. Instead it creates a new process, waits for it to perform all the heavy lifting, and then terminates it.
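For illustration, wiring this into a route might look roughly like this (a sketch; the route path and request shape are assumptions, not from the question):

// Hypothetical Express route using the forked trainer. The response is sent
// from the callback, so the event loop stays free while the child process trains.
const trainMaster = require('./trainMaster');

app.post('/api/IA/train', (req, res) => {
  trainMaster.train({ answersIA: req.body.answersIA, search: req.body.search }, result => {
    res.json(result);
  });
});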
This approach should work fine if the number of simultaneous training runs is less than the number of CPUs on your machine. From the Node.js docs:
It is important to keep in mind that spawned Node.js child processes are independent of the parent with exception of the IPC communication channel that is established between the two. Each process has its own memory, with their own V8 instances. Because of the additional resource allocations required, spawning a large number of child Node.js processes is not recommended.
Otherwise you will need a different approach, like spreading workers across multiple machines and using a message queue to distribute the work between them.
As https://stackoverflow.com/users/740553/mike-pomax-kamermans noted, JavaScript is single-threaded, meaning long-running synchronous operations will block your thread while executing. It's not clear what library you're using to perform your training, but check whether it has any asynchronous methods for training. Also, you're using the synchronous version of the fs.writeFile method, which again will block your thread while executing. To remedy this, use the async version:
fs.writeFile(`${path}/trained-net.js`, `${net.toFunction().toString()};`, function (err) {
  if (err) return console.log(err);
  console.log('File write completed');
});

Node.js Cluster: Managing Workers

We're diving deeper into Node.js architecture to fully understand how to scale our application.
The clear solution is to use the cluster module https://nodejs.org/api/cluster.html. Everything seems fine, apart from the description of worker management:
Node.js does not automatically manage the number of workers for you, however. It is your responsibility to manage the worker pool for your application's needs.
I was searching for how to really manage the workers, but most solutions just say:
Start as many workers as you have cores.
But I would like to dynamically scale the number of workers up or down depending on the current load on the server. So if there is load on the server and the queue is getting longer, I would like to start the next worker. Conversely, when there isn't as much load, I would like to shut workers down (and leave, for example, a minimum of 2 of them).
The ideal place for this, to me, would be the master process's queue and the event fired when a new request comes in to the master process. At that point we can decide whether we need the next worker.
Do you have any solution or experience with managing workers from the master process in a cluster? Starting and killing them dynamically?
Regards,
Radek
The following code will help you understand how to fork workers on a per-request basis.
This program will fork a new worker for every 10 requests.
Note: you need to open http://localhost:8000/ and refresh the page to increase the request count.
var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;
var numReqs = 0;
var initialRequest = 10;
var maxcluster = 10;
var totalcluster = 2;

if (cluster.isMaster) {
  // Count requests reported by any worker, including ones forked later.
  cluster.on('message', function(worker, msg) {
    if (msg.cmd && msg.cmd == 'notifyRequest') {
      numReqs++;
    }
  });
  // Fork the initial workers.
  for (var i = 0; i < 2; i++) {
    cluster.fork();
    console.log('cluster master');
  }
  // Every second, check whether the request count justifies another worker.
  setInterval(function() {
    console.log("numReqs =", numReqs);
    isNeedWorker(numReqs) && cluster.fork();
  }, 1000);
} else {
  console.log('cluster worker initialized');
  // Worker processes have an http server.
  http.Server(function(req, res) {
    res.writeHead(200);
    res.end("hello world\n");
    // Send message to master process
    process.send({ cmd: 'notifyRequest' });
  }).listen(8000);
}

function isNeedWorker(numReqs) {
  if (numReqs >= initialRequest && totalcluster < numCPUs) {
    initialRequest = initialRequest + 10;
    totalcluster = totalcluster + 1;
    return true;
  } else {
    return false;
  }
}
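The example above only scales up. Shutting idle workers down again, which the question also asks about, could look roughly like this sketch reusing the counters from the example (the check interval, the low-traffic threshold, and the minimum of 2 workers are assumptions):

// Sketch: periodically check whether traffic has dropped and, if so, retire one
// worker, never going below a minimum. Thresholds are illustrative.
// (This would go inside the cluster.isMaster branch of the example above.)
var MIN_WORKERS = 2;
var lastNumReqs = 0;

setInterval(function() {
  var reqsInWindow = numReqs - lastNumReqs;
  lastNumReqs = numReqs;
  var workerIds = Object.keys(cluster.workers);
  if (reqsInWindow < 5 && workerIds.length > MIN_WORKERS) {
    var victim = cluster.workers[workerIds[workerIds.length - 1]];
    victim.disconnect(); // close the server in that worker, then it exits gracefully
    totalcluster = totalcluster - 1;
  }
}, 10000);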
To manually manage your workers, you need a messaging layer to facilitate inter-process communication. With IPC, master and workers can communicate effectively; by default, and from an architectural standpoint, this behavior is already implemented natively in the process module. However, I find the native implementation not flexible or robust enough to handle horizontal scaling across networked machines.
One obvious solution is Redis as a message broker to facilitate this kind of master and worker communication. However, this solution also has its faults, namely the latency directly linked to each command and reply.
Further research led me to RabbitMQ, a great fit for distributing time-consuming tasks among multiple workers. The main idea behind Work Queues (aka Task Queues) is to avoid doing a resource-intensive task immediately and having to wait for it to complete. Instead, we schedule the task to be done later. We encapsulate a task as a message and send it to the queue. A worker process running in the background will pop the tasks and eventually execute the job. When you run many workers, the tasks will be shared between them.
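A minimal sketch of the producer side of such a work queue, assuming the amqplib client and an illustrative queue name (consumers would use acknowledgments and a prefetch count, as described in the RabbitMQ answer further up the page):

// Sketch: the master encapsulates a task as a message and pushes it onto a
// durable queue; any free worker will eventually pick it up. Names are made up.
const amqp = require('amqplib');

let channelPromise = null;
function getChannel() {
  if (!channelPromise) {
    channelPromise = amqp.connect('amqp://localhost')
      .then(conn => conn.createChannel())
      .then(async ch => {
        await ch.assertQueue('work_queue', { durable: true });
        return ch;
      });
  }
  return channelPromise;
}

async function publishTask(task) {
  const ch = await getChannel();
  // persistent: true asks RabbitMQ to write the message to disk
  ch.sendToQueue('work_queue', Buffer.from(JSON.stringify(task)), { persistent: true });
}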
To implement a robust server, read this link; it may give some insights. Link
