How to test master behaviour in a Node.js cluster?

Suppose you are running a cluster in Node.js and you wish to unit-test it. For instance, you'd like to make sure that if a worker dies the cluster takes some action, such as forking another worker and possibly some related job. Or that, under certain conditions, additional workers are spawned.
I suppose that in order to do this one must launch the cluster and somehow have access to its internal state; then (for instance) force workers to get stuck, and check the state after a delay. If so, how can the state be exported?

You'll have to architect your master to return a reference to its cluster object. In your tests, you can kill one of its workers with cluster.workers[2].kill(). The worker object also has a reference to the child's process object, which you can use to simulate various conditions. You may have to use a setTimeout to ensure the master has the time to do its thing.
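For example, a minimal sketch of that shape (master.js and its start() export are assumptions, not an existing module):

// master.js – a master that exposes its cluster object for tests
var cluster = require('cluster');

function start(numWorkers) {
  for (var i = 0; i < numWorkers; i++) cluster.fork();
  cluster.on('exit', function () { cluster.fork(); }); // respawn dead workers
  return cluster;
}
module.exports = { start: start };

// in a test: kill a worker, then assert after a delay
var assert = require('assert');
var c = require('./master').start(2);
c.workers[2].kill();
setTimeout(function () {
  assert.equal(Object.keys(c.workers).length, 2); // the master respawned one
}, 500);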
The above method, however, still creates real forks, which may be undesirable in a testing scenario. Your other option is to use a mocking library (SinonJS et al.) to mock out cluster's fork method and then spy on the number of calls it gets. You can simulate worker death by calling cluster.emit('exit') on the master's cluster object.
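A sketch of that mocking approach with SinonJS, reusing the hypothetical master.js from above:

var sinon = require('sinon');
var assert = require('assert');
var cluster = require('cluster');

var fork = sinon.stub(cluster, 'fork'); // no real child processes are created
require('./master').start(2);

// simulate a worker dying; the real 'exit' event passes (worker, code, signal)
cluster.emit('exit', { id: 1 }, 1, null);

assert.equal(fork.callCount, 3); // 2 initial forks + 1 respawn
fork.restore();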
Note: I'm not sure if this is an issue only with me, but cluster.emit always seems to emit twice for me, for some reason.

Related

Does NodeJS spin up a new process for a new request?

I have a backend Node.js API and I am trying to set a trace ID. What I have been thinking is that I would generate a UUID through a Singleton module and then use it throughout for logging. But since Node.js is single-threaded, would that mean that the UUID will always remain the same for all clients?
For example, if the API gets a request from https://www.example.com/client-1 and https://www.example-two.com/client-2, would it spin up a new process and thereby generate separate UUIDs? Or is it just one process running with a single thread? If it's just one process with one thread, then I think both client apps will get the same UUID assigned.
Is this understanding correct?
Node.js uses only a single thread to run all your JavaScript (unless you specifically create a WorkerThread or child_process). Node.js uses some threads internally in some of its library functions, but those aren't used for running your JavaScript and are transparent to you.
So, unlike some other environments, each new request runs in the same thread. There is no new process or thread created for an incoming request.
If you use some singleton, it will have the same value for every request.
But since Node.js is single-threaded, would that mean that the UUID will always remain the same for all clients?
Yes, the UUID would be the same for all requests.
For example, if the API gets a request from https://www.example.com/client-1 and https://www.example-two.com/client-2, would it spin up a new process and thereby generate separate UUIDs?
No, it would not spin up a new process and would not generate a new UUID.
Or is it just one process running with a single thread? If it's just one process with one thread, then I think both client apps will get the same UUID assigned.
One process. One thread. Same UUID from a singleton.
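To see why, consider a minimal sketch (the module name is hypothetical, and it assumes a Node version with the built-in crypto.randomUUID()). require() caches the module, so the UUID is computed once per process and shared by every request:

// uuid-singleton.js – evaluated once, then cached by require()
var crypto = require('crypto');
module.exports = crypto.randomUUID(); // same value for the process's lifetime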
If you're trying to put some request-specific UUID in every log statement, then there aren't many options. The usual option is to coin a new UUID for each new request in some middleware and attach it to the req object as a property such as req.uuid and then pass the req object or the uuid itself as a function argument to all code that might want to have access to it.
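A minimal sketch of that middleware, assuming Express and a Node version with the built-in crypto.randomUUID():

var express = require('express');
var crypto = require('crypto');
var app = express();

// coin a fresh trace id for every request and hang it on req
app.use(function (req, res, next) {
  req.uuid = crypto.randomUUID();
  next();
});

app.get('/client-1', function (req, res) {
  console.log('[' + req.uuid + '] handling request'); // pass req.uuid to anything that logs
  res.send('ok');
});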
There is also a technology called "async local storage" that could serve you here. Here's the doc. It can be used somewhat like "thread local storage" in other environments that do use a thread for each new request. It provides local storage tied to an execution context; each incoming request that is still being processed has such a context, which persists through asynchronous operations and even when control temporarily returns to the event loop.
As best I know, the async local storage interface has undergone several different implementations and is still considered experimental.
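A sketch of the pattern, reusing the Express app from the previous sketch; AsyncLocalStorage comes from the built-in async_hooks module:

var AsyncLocalStorage = require('async_hooks').AsyncLocalStorage;
var crypto = require('crypto');
var als = new AsyncLocalStorage();

// middleware: run the rest of this request inside a store holding its trace id
app.use(function (req, res, next) {
  als.run({ uuid: crypto.randomUUID() }, next);
});

// any code on the same async execution path can read the id without it
// being passed as an argument
function log(message) {
  var store = als.getStore();
  console.log('[' + (store ? store.uuid : '-') + '] ' + message);
}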
See this diagram to understand how a Node.js server handles requests compared to servers in other languages.
So in your case there won't be a separate thread.
And unless you are creating a separate process, by using pm2 to run your app or by explicitly creating one with the built-in modules, it won't be a separate process either.
Node.js is a single-threaded runtime environment, although internally it does use threads for operations that would otherwise block the event loop.
What I have been thinking is that I would generate a UUID through a Singleton module
Yes, it will generate the UUID only once, and every time a new request comes in it will reuse the same UUID; that is the main aim of the Singleton design pattern.
would it spin up a new process and thereby generate separate UUIDs? Or is it just one process running with a single thread?
A process is an instance of a computer program and can have one or more threads; in this case the process is Node.js. The event loop and the execution (call) stack both live on that process's single main thread. Every time a request is received, it goes through the event loop and its callback is then pushed onto the stack for execution.
You can create a separate process in Node.js using the child_process module.
Is this understanding correct?
Yes, your understanding of the UUID Singleton pattern is correct. I would recommend looking at how Node.js processes requests; this video helps you understand how the event loop works.

Dedicated workers library

I want to create a pool of N workers, and pass work to them over time.
I checked out workerpool and workerfarm, but from what I see, these create the worker when work is posted, i.e. time is lost initializing the worker.
I could just use the child_process module to do this myself, but would then have to set up fair distribution too. Is there an already existing module that does this type of stuff? (create N workers at start time and post work to them)
If you use the built in cluster module, you can spin up N workers and pass work between them through messages.
https://gist.github.com/jpoehls/2232358
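For example, a minimal pool sketch: fork N workers up front, then round-robin jobs to them over the built-in IPC channel (the job format here is made up):

var cluster = require('cluster');

if (cluster.isMaster) {
  var workers = [];
  for (var i = 0; i < 4; i++) workers.push(cluster.fork()); // pool is ready up front

  var next = 0;
  var post = function (job) {        // call this whenever work arrives
    workers[next].send(job);
    next = (next + 1) % workers.length; // fair, round-robin distribution
  };
  post({ task: 'example' });
} else {
  process.on('message', function (job) {
    // ... do the actual work here, then report back to the master
    process.send({ done: job.task });
  });
}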

Distributing topics between worker instances with minimum overlap

I'm working on a Twitter project, using their streaming API, built on Heroku with Node.js.
I have a collection of topics that my app needs to process, which are pulled from MongoDB. I need to track each of these topics via the API; however, it needs to be done such that each topic is tracked only once. As each worker process expires after approximately one hour, when a worker receives SIGTERM it needs to untrack each topic assigned and release it back to the pool again.
I've been using RabbitMQ to communicate between app and worker processes, however with this I'm a little stuck. Are there any good examples, or advice you can offer on the correct way to do this?
Couldn't the worker just send a message via the message queue to the application when it receives a SIGTERM? According to the Heroku docs on shutdown, the process is allowed a couple of seconds (10) before it will be forcefully killed.
So you can do something like this:
// listen for SIGTERM sent by heroku
process.on('SIGTERM', function () {
  // notify the app that this worker is shutting down
  messageQueue.sendSomeMessageAboutShuttingDown();
  // shut down the process (you might need to wait for async completion
  // of message delivery so as not to prevent it from being delivered)
  process.exit();
});
Alternatively, you could break up your work into much smaller chunks and have workers only 'take' work that will run for a couple of minutes or even seconds at most. Your main application should be the bookkeeper: if a process doesn't complete its task within a specified time, assume it has gone missing and make the task available for another process to handle. You can probably also implement this behavior using confirms in RabbitMQ.
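A sketch of that idea with the amqplib package (the queue name and trackTopic() are hypothetical): with prefetch(1) and manual acks, any message a dead worker never acknowledged is requeued for another worker.

var amqp = require('amqplib');

function startWorker() {
  return amqp.connect('amqp://localhost').then(function (conn) {
    return conn.createChannel();
  }).then(function (ch) {
    return ch.assertQueue('topics', { durable: true }).then(function () {
      ch.prefetch(1); // take only one task at a time
      return ch.consume('topics', function (msg) {
        trackTopic(msg.content.toString()); // do the (short) unit of work
        ch.ack(msg); // un-acked messages are requeued if this worker dies
      }, { noAck: false });
    });
  });
}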
RabbitMQ won't do this for you.
It will allow you to distribute the work to another process and/or computer, but it won't provide the kind of mechanism you need to prevent more than one process / computer from working on a particular topic.
What you want is a semaphore - a way to control access to a particular "resource" from multiple processes... a way to ensure only one process is working on a particular resource at a given time. In your case the "resource" will be the topic... but it will still be the resource that you want to control access to.
FWIW, there has been discussion of using RabbitMQ to implement a distributed semaphore in the past:
https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
https://aphyr.com/posts/315-call-me-maybe-rabbitmq
but the general consensus is that this is a bad idea. There are too many edge cases and scenarios in which RabbitMQ will fail to work as a proper semaphore.
There are some node.js semaphore libraries available. I would recommend looking at them, and using one of them. Have a single process manage the semaphore and decide which other process can / cannot work on which topic.
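A minimal sketch of the ledger such a manager process could keep (the function names are hypothetical; workers would call them via your message queue rather than directly):

var assigned = new Map(); // topic -> id of the worker tracking it

function tryAcquire(topic, workerId) {
  if (assigned.has(topic)) return false; // topic already being tracked
  assigned.set(topic, workerId);
  return true;
}

function release(topic, workerId) {
  if (assigned.get(topic) === workerId) assigned.delete(topic);
}
// on SIGTERM a worker releases all its topics, returning them to the pool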

Node/Express: running specific CPU-instensive tasks in the background

I have a site that makes the standard data-bound calls, but also a few CPU-intensive tasks which are run a few times per day, mainly by the admin.
These tasks include grabbing data from the db, running a few different time-consuming algorithms, then re-uploading the data. What would be the best method for making these calls and having them run without blocking the event loop?
I definitely want to keep the calculations on the server, so web workers wouldn't work here. Would a child process be enough? Or should I have a separate thread running in the background handling all /api/admin calls?
The basic answer to this scenario in Node.js land is to use the core cluster module - https://nodejs.org/docs/latest/api/cluster.html
Its API makes it easy to:
easily launch worker node.js instances on the same machine (each instance will have its own event loop)
keep a live communication channel for short messages between instances
This way, any work done in a child instance will not block your master's event loop.
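A minimal sketch of that shape (runHeavyJob() is a stand-in for your algorithms, not a real function):

var cluster = require('cluster');

if (cluster.isMaster) {
  var worker = cluster.fork(); // a dedicated instance for the heavy work
  // from your /api/admin handler, hand the job to the worker:
  worker.send({ job: 'recompute' });
  worker.on('message', function (msg) {
    console.log('job done:', msg.result);
  });
} else {
  process.on('message', function (msg) {
    var result = runHeavyJob(msg.job);  // CPU-bound work runs here,
    process.send({ result: result });   // off the master's event loop
  });
}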

How does the cluster module work in Node.js?

Can someone explain in detail how the core cluster module works in Node.js?
How are the workers able to listen on a single port?
As far as I know, the master process does the listening, but how can it know which ports to listen on, since workers are started after the master process? Do they somehow communicate that back to the master using the child_process.fork communication channel? And if so, how is an incoming connection on the port passed from the master to the worker?
Also, I'm wondering what logic is used to determine which worker an incoming connection is passed to.
I know this is an old question, but this is now explained at nodejs.org here:
The worker processes are spawned using the child_process.fork method, so that they can communicate with the parent via IPC and pass server handles back and forth.

When you call server.listen(...) in a worker, it serializes the arguments and passes the request to the master process. If the master process already has a listening server matching the worker's requirements, then it passes the handle to the worker. If it does not already have a listening server matching that requirement, then it will create one, and pass the handle to the worker.

This causes potentially surprising behavior in three edge cases:

server.listen({fd: 7}) - Because the message is passed to the master, file descriptor 7 in the parent will be listened on, and the handle passed to the worker, rather than listening to the worker's idea of what the number 7 file descriptor references.

server.listen(handle) - Listening on handles explicitly will cause the worker to use the supplied handle, rather than talk to the master process. If the worker already has the handle, then it's presumed that you know what you are doing.

server.listen(0) - Normally, this will cause servers to listen on a random port. However, in a cluster, each worker will receive the same "random" port each time they do listen(0). In essence, the port is random the first time, but predictable thereafter. If you want to listen on a unique port, generate a port number based on the cluster worker ID.

When multiple processes are all accept()ing on the same underlying resource, the operating system load-balances across them very efficiently. There is no routing logic in Node.js, or in your program, and no shared state between the workers. Therefore, it is important to design your program such that it does not rely too heavily on in-memory data objects for things like sessions and login.

Because workers are all separate processes, they can be killed or re-spawned depending on your program's needs, without affecting other workers. As long as there are some workers still alive, the server will continue to accept connections. Node does not automatically manage the number of workers for you, however. It is your responsibility to manage the worker pool for your application's needs.
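The handle-sharing mechanism is easy to observe. In this small demo, both workers call listen(8000); the master creates the socket once and shares the handle, so both workers accept connections on the same port:

var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
  cluster.fork();
  cluster.fork();
} else {
  http.createServer(function (req, res) {
    res.end('handled by worker ' + cluster.worker.id);
  }).listen(8000); // serialized and passed to the master behind the scenes
}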
Node.js uses a round-robin policy to load-balance between the child processes: incoming connections are handed to the workers in turn, based on the RR algorithm.
The children and the parent do not actually share anything; each worker executes the whole script from beginning to end. That is the main difference from a normal C fork, where the forked child continues executing from the instruction where it left off, not from the beginning as in Node.js. So if you want to share anything, you need to connect to a cache like Memcached or Redis.
So the code below prints 6 6 6 (no evil meant) to the console: once in the master and once in each of the two workers.
var cluster = require("cluster");

var a = 5;
a++;
console.log(a); // the whole script runs in the master and in both workers

if (cluster.isMaster) {
  cluster.fork();
  cluster.fork();
}
Here is a blog post that explains this
As an update to #OpenUserX03's answer: Node.js no longer uses the system's load balancing but a built-in one. From this post:
To fix that Node v0.12 got a new implementation using a round-robin algorithm to distribute the load between workers in a better way. This is the default approach Node uses since then including Node v6.0.0
