NodeJS Clustering and Worker Threads

NodeJS Clustering and Worker Threads - node.js

I am doing some research for a home project and I'm looking into the Cluster Module and Worker Threads.
I know the difference between Cluster and Worker Threads.
My question is:
In NodeJS is it possible to use Clustering and Worker Threads at the same time?

I'm guessing you're thinking of the worker threads module when you're referring to "worker threads". I'm making a clear distinction since NodeJS runtime comes with 4 worker threads by default. cluster (module) and worker_threads shouldn't have any problems working in parallel as clustering provides you with multiple independent NodeJS instances, as in, multiple NodeJS processes which have their own threads as stated before. Spawning more worker threads using the above mentioned worker_threads module spawns more threads which aren't independent (they can and do share memory), which can be good if you're doing some crunching, but they all run under a single Node process.
Processes can communicate using IPC, and Node worker threads can talk using the MessagePort class from the same module.
Therefore, yes, you can do that and my best guess is that, on the top of the "supervision tree" of sorts you spawn a couple of Node processes (using the cluster module) to distribute the load if you have them acting as servers (no clue about your use case), and then for each process you can use the worker_threads module to spawn additional threads if needed (to speed up some heavy processing etc).

Related

When is better using clustering or worker_threads?

I have been reading about multi-processing on NodeJS to get the best understanding and try to get a good performance in heavy environments with my code.
Although I understand the basic purpose and concept for the different ways to take profit of the resources to handle the load, some questions arise as I go deeper and it seems I can't find the particular answers in the documentation.
NodeJS in a single thread:
NodeJS runs a single thread that we call event loop, despite in background OS and Libuv are handling the default worker pool for I/O asynchronous tasks.
We are supossed to use a single core for the event-loop, despite the workers might be using different cores. I guess they are sorted in the end by OS scheduler.
NodeJS as multi-threaded:
When using "worker_threads" library, in the same single process, different instances of v8/Libuv are running for each thread. Thus, they share the same context and communicate among threads with "message port" and the rest of the API.
Each worker thread runs its Event loop thread. Threads are supposed to be wisely balanced among CPU cores, improving the performance. I guess they are sorted in the end by OS scheduler.
Question 1: When a worker uses I/O default worker pool, are the very same
threads as other workers' pool being shared somehow? or each worker has its
own default worker pool?
NodeJS in multi-processing:
When using "cluster" library, we are splitting the work among different processes. Each process is set on a different core to balance the load... well, the main event loop is what in the end is set in a different core, so it doesn't share core with another heavy event loop. Sounds smart to do it that way.
Here I would communicate with some IPC tactic.
Question 2: And the default worker pool for this NodeJS process? where
are they? balanced among the rest of cores as expected in the first
case? Then they might be on the same cores as the other worker pools
of the cluster I guess. Shouldn't it be better to say that we are balancing main threads (event loops) rather than "the process"?
Being all this said, the main question:
Question 3: Whether is better using clustering or worker_threads? If both are being used in the same code, how can both libraries agree the best performance? or they
just can simply get in conflict? or at the end is the OS who takes
control?

Each worker thread has its own main loop (libuv etc). So does each cloned Node.js process when you use clustering.
Clustering is a way to load-balance incoming requests to your Node.js server over several copies of that server.
Worker threads are a way for a single Node.js process to offload long-running functions to a separate thread, to avoid blocking its own main loop.
Which is better? It depends on the problem you're solving. Worker threads are for long-running functions. Clustering makes a server able to handle more requests, by handling them in parallel. You can use both if you need to: have each Node.js cluster process use a worker thread for long-running functions.
As a first approximation for your decision-making: only use worker threads when you know you have long-running functions.
The node processes (whether from clustering or worker threads) don't get tied to specific cores (or Intel processor threads) on the host machine; the host's OS scheduling assigns cores as needed. The host OS scheduler minimize context-switch overhead when assigning cores to runnable processes. If you have too many active Javascript instances (cluster instances + worker threads) the host OS will give them timeslices according to its scheduling algorithms. Other than avoiding too many Javascript instances, there's very little point in trying second-guess the OS scheduler.
Edit Each Node.js instance, with any worker threads, uses a single libuv thread pool. A main Node.js process shares a single libuv thread pool with all its worker threads. If your Node.js program uses many worker threads, you may, or may not, need to set the UV_THREADPOOL_SIZE environment variable to a value greater than the default 4.
Node.js's cluster functionality uses the underlying OS's fork/exec scheme to create a new OS process for each cluster instance. So, each cluster instance has its own libuv pool.
If you're running stuff at scale, lets say with more than ten host machines running your Node.js server, then you can spend time optimizing Javascript instances.
Don't forget nginx if you use it as a reverse proxy to handle your https work. It needs some processor time too, but it uses fine-grain multithreading so you won't have to worry about it unless you have huge traffic.

How do 'cluster' and 'worker_threads' work in Node.js?

Did I understand correctly: If I use cluster package, does it mean that
a new node instance is created for each created worker?
What is the difference between cluster and worker_threads packages?

Effectively what you are differing is process based vs thread based. Threads share memory (e.g. SharedArrayBuffer) whereas processes don't. Essentially they are the same thing categorically.
cluster
One process is launched on each CPU and can communicate via IPC.
Each process has it's own memory with it's own Node (v8) instance. Creating tons of them may create memory issues.
Great for spawning many HTTP servers that share the same port b/c the master main process will multiplex the requests to the child processes.
worker threads
One process total
Creates multiple threads with each thread having one Node instance (one event loop, one JS engine). Most Node API's are available to each thread except a few. So essentially Node is embedding itself and creating a new thread.
Shares memory with other threads (e.g. SharedArrayBuffer)
Great for CPU intensive tasks like processing data or accessing the file system. Because NodeJS is single threaded, synchronous tasks can be made more efficient with workers

Node.js multithreading and asynchronous

I'm a little confused with multithreading and asynchronous in js. What is the difference between a cluster, a stream, a child process, and a worker thread?

The first thing to remember about multithreading in Node.js is that in user-space, there exists no concept of threading, and as such you cannot write any code making use of threads. Any node program is always a single threaded program (in user-space).
Since a node program is a single thread, and runs as a single process, it uses only a single CPU. Most modern processors have multiple CPUs, and in order to make use of all of these CPUs and provide better throughput, you can start the same node program as a cluster.
The cluster module of node, allows you to start a node program, and the first instance launched is launched as the master instance. The master allows you to spawn new workers as separate processes (not threads) using cluster.fork() method. The actual work that is to be done by the node program is done by the workers. The example in the node docs demonstrates this perfectly.
A child process is a process that is spawned from the current process and has an established IPC channel between them to communicate with each other. The master and workers I described in cluster are an example of child processes. the child_process module in node allows you to spawn custom child processes as you require.
Streams are something that is not at all related to multi-threading or multiple processes. Streams are just a way to handle large amounts of data without loading all the data into the working memory at the same time. Ex: Consider you want to read a 10GB log file, and your server only has 4GB of memory. Trying to load the file using fs.readFile will crash your process. Instead you use fs.createReadStream and use that to process the file in smaller chunks that can be loaded into memory.
Hope this explains. For further details you really should read the node docs.

this is a little vague so I'm just gonna give an overview.
Streams are really just data streams like in any other language. Similar to iostreams in C and where you get user input, or other types of data. They're usually masked by another class so you don't know you're using a stream. You won't mess with these unless you're building a new type usually.
Child processes, worker threads, and clusters are all ways of utilizing multi-core processing in Node applications.
Worker threads are basic multithreading the Node way, with each thread having a way to communicate with the parent, and shared memory possible between each thread. You pass in a function and data, and can provide a callback for when the thread is done processing.
Clusters are more for network sharing. Often used behind a master listener port, a master app will listen for connections, then assign them in a round-robin manner to each cluster thread for use. They share the server port(s) across multiple processors to even out the load.
Child processes are a way to create a new process in a similar way to through popen. These can be asynchronous or synchronous (non-blocking or blocking the Node event loop), and can send to and receive from the parent process via stdout/stderr and stdin, respectively. The parent can register listeners to each child process for updates. You can pass a file, a function, or a module to a child process. Generally do not share memory.
I'd suggest reading the documentation yourself and coming back with any specific questions you have, you won't get much with vague questions like this, makes it seem like you didn't do your own part of the work beforehand.
Documentation:
Streams
Worker Threads
Clusters
Child Processes

NodeJS in MultiCore System

"Node.js is limited to a single thread". how the nodeJS will react when we are deploying in Multi-Core systems? will it boost the performance?

The JavaScript running in the Node.js V8 engine is single-threaded, but the underlying libuv multi-platform support library is multi-threaded and those threads will be distributed across the CPU cores by the operating system according to it's scheduling algorithm, so with your JavaScript application running asynchronously (and single-threaded) at the top level, you still benefit from multi-core under the covers.
As others have mentioned, the Node.js Cluster module is an excellent way to exploit multi-core for concurrency at the application (JavaScript V8) level, and since Express is cluster aware, you can have multiple worker processes executing concurrent server logic, without needing a unique listening port for each process. Impressive.
As others have mentioned, you will need Redis or equivalent to share data among the cluster worker processes. You will also want a logging facility that is cluster aware, so the cluster master and all worker processes can log to a single shared log file. The Node log4node module is a good choice here, and it works with logrotate.
Typical web examples show using the runtime detected number of cores as the number of cluster worker processes to fork, but I prefer to make that a configuration option in a config.yaml file so I can tune the number of worker processes running the main JavaScript application as needed.

Nodejs runs in one thread, but you can start multiple nodejs processes.
If you are, for example, building web server you can route every request to one of nodejs processes.
Edit: As hereandnow78 and vkurchatkin suggested, maybe the best way to use power of multi core system would be to use nodejs cluster module

cluster module is the solution.
But u need to know that, node.js cluster is, it invokes child process. It means each process cannot share the data.
To share data, u need to use Redis or other IMDG to share the data across the cluster nodes.

How does Cluster keeps up with Node's single thread concept?

When you fork, or start multiple workers using something like Cluster:
Are multiple threads or instances of Node process being created ? Does this breaks Node's single thread concept?
How are the request handled between workers? Does Cluster provides some intelligent mechanism to load balance all requests to multiple workers ?

Cluster uses fork, and yes, it gets balanced automatically:
The worker processes are spawned using the child_process.fork method, so that they can communicate with the parent via IPC and pass server handles back and forth.
[...]
When multiple processes are all accept()ing on the same underlying resource, the operating system load-balances across them very efficiently. There is no routing logic in Node.js, or in your program, and no shared state between the workers. Therefore, it is important to design your program such that it does not rely too heavily on in-memory data objects for things like sessions and login.
You might think that this breaks node.js single thread concept if you count a new node.js instance as another thread, however, keep in mind that all callbacks to a given request are going to be handled be the same node.js instance that accepted the original request. There are no race conditions, no shared data, only fairly safe interprocess communication.
See the Cluster documentation for more information.

Cluster was made developed to compensate of node.js's single thread architecture. Modern processors have multiple cores and a single threaded process will not be able to take advantage of the available cores. It does deviate from its single thread architecture, but it was never the plan to stick to it. The main concept was asynchronous, event-driven execution.
Cluster uses fork to create processes. A forked process really is its
own process with its own address space - there is nothing that the
child can do (normally) to affect its parent's or siblings address
space (unlike a thread). In addition to having all the methods in a
normal ChildProcess instance, the returned object has a communication
channel built-in. All forked processes can communicate using this
channel.
Notice the subtle difference here : it is not multi-threaded, it just forks to create new independent processes. See here Threads vs Processes in Linux to compare them. Each worker assumes single-threaded architecture like before. So it does not break node's single thread concept.
The balancing of load depends on your code itself (since each is independent) and the OS. The load is balanced equally among all forked processes and original process alike, by the OS.
But if you wish to do it differently, it is also possible. If you use master thread differently than worker, or each worker specializing different tasks(compressing/ffmpeg) you can do that.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string