Node/Express: running specific CPU-instensive tasks in the background - node.js

I have a site that makes the standard data-bound calls, but then also have a few CPU-intensive tasks which are ran a few times per day, mainly by the admin.
These tasks include grabbing data from the db, running a few time-consuming different algorithms, then reuploading the data. What would be the best method for making these calls and having them run without blocking the event loop?
I definitely want to keep the calculations on the server so web workers wouldn't work here. Would a child process be enough here? Or should I have a separate thread running in the background handling all /api/admin calls?

The basic answer to this scenario in Node.js land is to use the core cluster module - https://nodejs.org/docs/latest/api/cluster.html
It is an acceptable API to :
easily launch worker node.js instances on the same machine (each instance will have its own event loop)
keep a live communication channel for short messages between instances
this way, any work done in the child instance will not block your master event loop.

Related

Will scheduling tasks on a node server block it processing requests?

I have a very simple express server with 1 endpoint listening on a certain port.
I'd like a separate scheduled script to make an API call every 10 minutes and save the data locally (in memory).
When the endpoint on my server is hit, I'd like to serve the locally cached data from memory.
How can I make sure that the scheduled cron job does not ever block the processing of requests coming in i.e. both continue to happen at the same time.
I've read about child processes and worker threads, but I'm not sure if this would be a good use case for it.
JavaScript is single-threaded, so worker threads and forking notwithstanding, every line of JavaScript code that executes blocks other lines of code from executing. Only one can execute at a time. So forgetting about the scheduled job for a moment, in an express server, multiple requests happening concurrently will compete with each other for CPU resources and block each other.
However, Node.js has asynchronous I/O. A typical request often looks a bit like this:
Receive request and run a little bit of JavaScript code (maybe 1ms).
Query an external REST API (let's say 50ms)
Run a little more JavaScript code and send a response (maybe 1ms).
In steps 1 and 3, all other requests are blocked from executing JavaScript code. They have to wait until the current request is done executing its code. In step 2 however, other requests are free to execute JavaScript code, one after the other, since the network request is non-blocking.
I explain all this because it sounds like your scheduled job might go through the same three steps mentioned above. If that's the case, then functionally it's not going to block anything any more than simultaneous requests to the server are already blocking each other, and there's little reason to worry about threading or multi-processing unless your server is under such heavy load that it cannot serve all incoming requests.
So this is not a good use case for child processes and worker threads unless the scheduled job has different characteristics than I'm assuming (for example if it spends multiple seconds crunching heavy computations in JavaScript, that would noticeably impact request response time when it runs, and might make sense to break out into a separate process or worker thread).

Distributing topics between worker instances with minimum overlap

I'm working on a Twitter project, using their streaming API, built on Heroku with Node.js.
I have a collection of topics that my app needs to process, which are pulled from MongoDB. I need to track each of these topics via the API, however it needs to be done such that each topic is tracked only once. As each worker process expires after approximately 1 hour, when a worker receives SIGTERM it needs to untrack each topic assigned, and release it back to the pool again.
I've been using RabbitMQ to communicate between app and worker processes, however with this I'm a little stuck. Are there any good examples, or advice you can offer on the correct way to do this?
Couldn't the worker just send a message via the messagequeue to the application when it receives a SIGTERM? According to the heroku docs on shutdown the process is allowed a couple of seconds (10) before it will be forecefully killed.
So you can do something like this:
// listen for SIGTERM sent by heroku
process.on('SIGTERM', function () {
// - notify app that this worker is shutting down
messageQueue.sendSomeMessageAboutShuttingDown();
// - shutdown process (might need to wait for async completion
// of message delivery to not prevent it from being delivered)
process.exit()
});
Alternatively you could break up your work in much smaller chunks and have workers only 'take' work that will run for a couple of minutes or even seconds max. Your main application should be the bookkeeper and if a process doesn't complete its task within a specified time assume it has gone missing and make the task available for another process to handle. You can probably also implement this behavior using confirms in rabbitmq.
RabbitMQ won't do this for you.
It will allow you to distribute the work to another process and/or computer, but it won't provide the kind of mechanism you need to prevent more than one process / computer from working on a particular topic.
What you want is a semaphore - a way to control access to a particular "resource" from multiple processes... a way to ensure only one process is working on a particular resource at a given time. In your case the "resource" will be the topic... but it will still be the resource that you want to control access to.
FWIW, there has been discussion of using RabbitMQ to implement a distributed semaphore in the past:
https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
https://aphyr.com/posts/315-call-me-maybe-rabbitmq
but the general consensus is that this is a bad idea. there are too many edge cases and scenarios in which RabbitMQ will fail to work as proper semaphore.
There are some node.js semaphore libraries available. I would recommend looking at them, and using one of them. Have a single process manage the semaphore and decide which other process can / cannot work on which topic.

How can I prevent similar queues from running at the same time?

We currently process a set of tasks using Queue workers in Laravel. When I am using multiple threads of php artisan queue:work jobs end up running together (async). We are using Beanstalkd as the queue driver.
The issue is that in the queue work we are polling an API that only allows one concurrent session for a particular agent_id. That is, only one API call with the same agent_id can run at a time.
We thought of spinning up multiple php artisan queue:work threads with a filter on the queue_name matching the agent_id but we have over 500 agents therefore we would need 500 threads so this is not ideal.
Is there anyway to implement a lock style feature for each agent_id so that if a job is already running for a particular agent_id it will send it back to the queue? Or are there any features of beanstalkd that would allow for this?
The other option could also be to gracefully handle the rejection from the API when the user is already logged in (and send the job back to the queue). But this could get messy and could clutter the logs.
You could either run only a single worker that is capable of running the fetch-from-API job, or use some sort of external marshalling/lock service.
The options for that, may be either an internal rate limiting system, or some kind of common atomically locking system. A memcached or redis server where a worker tries to set a lock-key, and only the agent that successfully sets it, gets to work on the task. An advantage of that may be that as soon as the API request has been completed, you can remove the lock, and then while the worker processes the results, a different worker can make a new request.

How does Node.js execute two different scripts?

I can't imagine how can node.js in one single thread execute two scripts with different code simultaneously. For example I have two different scripts A and B. What will happen if almost simultaneously several clients request A and B. For PHP it is understandable, for example, will be created five threads to handle A and five threads to handle B, and for each request script executes again. But what happens in Node.js? Thank you!
It uses the so called event loop, implemented by libuv
A very simple explanation would be: when a new event occurs, it will be put into a queue. Every now and then, the node process will interupt execution to process these events.
The main difference between PHP and node is that a node.js process is essentially a stand-alone web server (single threaded), while PHP is an interpreter that runs within a web server (i.e. Apache), which is responsible for creating new threads for each request.
Node.js is very good for network applications (like web sites) because in these applications most of the work is I/O which node.js handles asynchronously.
Even if two requests arrive at the same time and Node.js only has one single thread of execution, each one of the requests (in sequence) will be handed off to the operating system for I/O (via libuv as mihai pointed out) and the fact that there is only one JavaScript thread of execution becomes irrelevant. As the I/O completes, the JavaScript thread picks up the result and returns a response.

How node.js works?

I don't understand several things about nodejs. Every information source says that node.js is more scalable than standard threaded web servers due to the lack of threads locking and context switching, but I wonder, if node.js doesn't use threads how does it handle concurrent requests in parallel? What does event I/O model means?
Your help is much appreciated.
Thanks
Node is completely event-driven. Basically the server consists of one thread processing one event after another.
A new request coming in is one kind of event. The server starts processing it and when there is a blocking IO operation, it does not wait until it completes and instead registers a callback function. The server then immediately starts to process another event (maybe another request). When the IO operation is finished, that is another kind of event, and the server will process it (i.e. continue working on the request) by executing the callback as soon as it has time.
So the server never needs to create additional threads or switch between threads, which means it has very little overhead. If you want to make full use of multiple hardware cores, you just start multiple instances of node.js
Update
At the lowest level (C++ code, not Javascript), there actually are multiple threads in node.js: there is a pool of IO workers whose job it is to receive the IO interrupts and put the corresponding events into the queue to be processed by the main thread. This prevents the main thread from being interrupted.
Although Question is already explained before a long time, I'm putting my thoughts on the same.
Node.js is single threaded JavaScript runtime environment. Basically it's creator Ryan Dahl concern was that parallel processing using multiple threads is not the right way or too complicated.
if Node.js doesn't use threads how does it handle concurrent requests in parallel
Ans: It's completely wrong sentence when you say it doesn't use threads, Node.js use threads but in a smart way. It uses single thread to serve all the HTTP requests & multiple threads in thread pool(in libuv) for handling any blocking operation
Libuv: A library to handle asynchronous I/O.
What does event I/O model means?
Ans: The right term is non-blocking I/O. It almost never blocks as Node.js official site says. When any request goes to node server it never queues the request. It take request and start executing if it's blocking operation then it's been sent to working threads area and registered a callback for the same as soon as code execution get finished, it trigger the same callback and goes to event queue and processed by event loop again after that create response and send to the respective client.
Useful link:
click here
Node JS is a JavaScript runtime environment. Both browser and Node JS run on V8 JavaScript engine. Node JS uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Node JS applications uses single threaded event loop architecture to handle concurrent clients. Actually its' main event loop is single threaded but most of the I/O works on separate threads, because the I/O APIs in Node JS are asynchronous/non-blocking by design, in order to accommodate the main event loop. Consider a scenario where we request a backend database for the details of user1 and user2 and then print them on the screen/console. The response to this request takes time, but both of the user data requests can be carried out independently and at the same time. When 100 people connect at once, rather than having different threads, Node will loop over those connections and fire off any events your code should know about. If a connection is new it will tell you .If a connection has sent you data, it will tell you .If the connection isn’t doing anything ,it will skip over it rather than taking up precision CPU time on it. Everything in Node is based on responding to these events. So we can see the result, the CPU stay focused on that one process and doesn’t have a bunch of threads for attention.There is no buffering in Node.JS application it simply output the data in chunks.
Though its been answered , i would like to just share my understandings in simple terms
Nodejs uses a library called Libuv , so this Libuv is written in C
language which uses the concept of threads . These threads are called
as workers and these workers take care of the multiple requests from client.
Parallel processing in nodejs is achieved with the help of 2 concepts
Asynchronous
Non blocking IO

Resources