Node.js multithreading and asynchronous - node.js

I'm a little confused with multithreading and asynchronous in js. What is the difference between a cluster, a stream, a child process, and a worker thread?

The first thing to remember about multithreading in Node.js is that in user-space, there exists no concept of threading, and as such you cannot write any code making use of threads. Any node program is always a single threaded program (in user-space).
Since a node program is a single thread, and runs as a single process, it uses only a single CPU. Most modern processors have multiple CPUs, and in order to make use of all of these CPUs and provide better throughput, you can start the same node program as a cluster.
The cluster module of node, allows you to start a node program, and the first instance launched is launched as the master instance. The master allows you to spawn new workers as separate processes (not threads) using cluster.fork() method. The actual work that is to be done by the node program is done by the workers. The example in the node docs demonstrates this perfectly.
A child process is a process that is spawned from the current process and has an established IPC channel between them to communicate with each other. The master and workers I described in cluster are an example of child processes. the child_process module in node allows you to spawn custom child processes as you require.
Streams are something that is not at all related to multi-threading or multiple processes. Streams are just a way to handle large amounts of data without loading all the data into the working memory at the same time. Ex: Consider you want to read a 10GB log file, and your server only has 4GB of memory. Trying to load the file using fs.readFile will crash your process. Instead you use fs.createReadStream and use that to process the file in smaller chunks that can be loaded into memory.
Hope this explains. For further details you really should read the node docs.

this is a little vague so I'm just gonna give an overview.
Streams are really just data streams like in any other language. Similar to iostreams in C and where you get user input, or other types of data. They're usually masked by another class so you don't know you're using a stream. You won't mess with these unless you're building a new type usually.
Child processes, worker threads, and clusters are all ways of utilizing multi-core processing in Node applications.
Worker threads are basic multithreading the Node way, with each thread having a way to communicate with the parent, and shared memory possible between each thread. You pass in a function and data, and can provide a callback for when the thread is done processing.
Clusters are more for network sharing. Often used behind a master listener port, a master app will listen for connections, then assign them in a round-robin manner to each cluster thread for use. They share the server port(s) across multiple processors to even out the load.
Child processes are a way to create a new process in a similar way to through popen. These can be asynchronous or synchronous (non-blocking or blocking the Node event loop), and can send to and receive from the parent process via stdout/stderr and stdin, respectively. The parent can register listeners to each child process for updates. You can pass a file, a function, or a module to a child process. Generally do not share memory.
I'd suggest reading the documentation yourself and coming back with any specific questions you have, you won't get much with vague questions like this, makes it seem like you didn't do your own part of the work beforehand.
Documentation:
Streams
Worker Threads
Clusters
Child Processes

Related

When a workerThread is created in nodejs, does it utilize the same core in which nodejs process is running?

Let's assume i have a nodejs serverProgram with one api and it does some manipulations on the video file, sent via the http request.
const saveVideoFile=(req,res)=>{
processAndSaveVideoFile(); // can run for minimum of 10 minutes
res.send({status: "video is being processed"})
}
i decided to to make use of a workerThread to do this processing as my machine has 3 cores (core1,core2,core3) and there is no hyperthreading enabled here
Assume that my nodejs program is running on core1. When i fire up a single workerThread, will the workerThread run on core2/core3 or core1?
i read that workerThread is not the same as childProcess. ChildProcess will fork a new process which will facilitate the childProcess to choose from available free cores (core2 or core3).
i read that workerThread shares memory with the mainThread. Let's assume that i create 2 workerThreads (wt1,wt2). Will my nodejs program, wt1, wt2 run on the same core i.e core1 ?
Also, in nodejs we have eventloop (mainthread) and otherThreads doing the background operations i.e I/O. is it correct to assume that all of these are utilizing the resources available in a single core (core1). if this is the case, is creating and using additional workerThread's an overkill on the nodejs server?
Below is an excerpt from this blog
We can run things in parallel in Node.js. However, we need not to
create threads. The operating system and the virtual machine
collectively run the I/O in parallel and the JS code then runs in a
single thread when it is time to send the data back to the JavaScript
code.
i keep reading this same information about nodejs in many articles and video presentations. But what i do not understand is this,
The operating system and the virtual machine collectively run the I/O in parallel
How can the operating system run the I/O requests from nodejs program in parallel without using any of the childProcess or threads spawned from nodejs? if those I/O requests from nodejs program is running in parallel, does it mean that all 3 cores (core1,core2,core3) will be utilized?
There are lot of contents on nodejs, but it doesn't clear doubts related to my above questions. if you have idea on how these things actually work, please share the detail.
A worker thread in node.js is an actual OS thread running in a different instance of V8. As such, it's totally up to the operating system to decide how to allocate it among available CPU cores. If there are cores with available time, then it will not generally be run on the same core as the main nodejs thread when that thread is busy because the OS will allocate busy threads across the various cores.
But, again this is entirely up to the OS and is not something that nodejs controls and the exact strategy for which cores are used will vary by OS. But, in all modern operating systems, the design goal is that available cores are used for threads that are currently executing. Now, if there are more threads active at once than there are cores, the threads will be time-sliced and all the cores will be active.
Also, in nodejs we have eventloop (mainthread) and otherThreads doing the background operations i.e I/O. is it correct to assume that all of these are utilizing the resources available in a single core (core1). if this is the case, is creating and using additional workerThread's an overkill on the nodejs server?
No, it is not correct to assume those threads all use the same core.
A workerThread in nodejs has its own event loop. For the most part, it does not share memory. In fact, if you want to share memory, you have to very specifically allocated SharedMemory and pass that to the workerThread.
Is it overkill? Well, it depends upon what you're doing. There are very useful things to do with workerThreads and there are things that they would not be necessary for.
The operating system and the virtual machine collectively run the I/O in parallel
I/O in node.js is either asynchronous at the OS level (such as networking) or run in separate threads (such as disk I/O). That means it runs separately from the main thread in node.js that runs your Javascript and can run in parallel with it, synchronizing only at the completion of an event. "Parallel" in this case means that both make progress at the same time. If there are multiple cores, then they can truly be running at exactly the same time. If there was only one core, then the OS will timeslice between the various threads and they will be both make progress (in an interleaved fashion that will seem to be parallel, but really they are taking turns).
How can the operating system run the I/O requests from nodejs program in parallel without using any of the childProcess or threads spawned from nodejs? if those I/O requests from nodejs program is running in parallel, does it mean that all 3 cores (core1,core2,core3) will be utilized?
The OS has its own threads for managing things like a network interface or a disk interface. The job of those threads is to interface with the hardware and bring data to an appropriate application or take data from the application and send it to the hardware. These are OS-level threads that exists independent of node.js. Yes, other cores can be used by those OS-level threads. It is important to realize that many operations such as networking are inherently non-blocking. Thus, if you're waiting for some data to arrive on a network interface, you don't need to have a thread doing something the whole time.
I want to add that it appears in your questions that you've combined questions about a several different things. Mentioned in your questions are:
Worker Threads
Internal node.js threads
Operating system threads
These are all different things.
A worker thread is a new thread you can start to run specific pieces of Javascript in another thread so you can have more than one Javascript thread running at the same time. In node.js, this is done by creating a whole new instance of V8, setting up a whole new global environment and loaded modules environment and using almost entirely separate memory.
Internal node.js threads are used by node.js as part of implementing its event loop and its standard library. Specifically, disk I/O and some crypto operations are run in internal native threads and they communicate with your Javascript via events/callbacks through the event loop.
Operating system threads are threads that the OS uses to implement it's own system APIs. Since the OS is responsible for lots of things, these threads ca have many different uses. Depending upon native implementations, they may be used to facilitate things like disk I/O or networking I/O. These threads are the responsibility of the OS to create and use and are not directly controlled by node.js.
Some additional questions asked in comments:
what is the difference b/w workerThread & childProcess concept in nodejs? is childProcess = workerThread without sharedMemory ?
A child process can be any type of program - it does not have to be a node.js program. A worker thread is node.js code.
A worker thread can share memory if sharedMemory is specifically allocated and shared with the worker thread and if it is carefully managed for concurrency issues.
It is more efficient to copy memory back and forth between worker thread and main thread than with child process.
If main program exits, worker threads will exit. If main program exits, child process can be configured to exit or to continue.
If worker thread calls process.exit(), the main thread will exit too. If child program exits, it cannot cause main program to exit without main program's cooperation.
how nodejs is able to magically interact with the os level thread without nodejs itself creating any threads?, i need additional details on this, your explanation is the common one present in most places including the blog i shared?
nodejs just calls an OS API. It's the OS API that manages communicating with its own threads (if threads are needed for that specific OS API). How it does that communication internally is implementation dependent and will vary by OS. It will even vary by OS which OS APIs use threads and which don't.

NodeJS Clustering and Worker Threads

I am doing some research for a home project and I'm looking into the Cluster Module and Worker Threads.
I know the difference between Cluster and Worker Threads.
My question is:
In NodeJS is it possible to use Clustering and Worker Threads at the same time?
I'm guessing you're thinking of the worker threads module when you're referring to "worker threads". I'm making a clear distinction since NodeJS runtime comes with 4 worker threads by default. cluster (module) and worker_threads shouldn't have any problems working in parallel as clustering provides you with multiple independent NodeJS instances, as in, multiple NodeJS processes which have their own threads as stated before. Spawning more worker threads using the above mentioned worker_threads module spawns more threads which aren't independent (they can and do share memory), which can be good if you're doing some crunching, but they all run under a single Node process.
Processes can communicate using IPC, and Node worker threads can talk using the MessagePort class from the same module.
Therefore, yes, you can do that and my best guess is that, on the top of the "supervision tree" of sorts you spawn a couple of Node processes (using the cluster module) to distribute the load if you have them acting as servers (no clue about your use case), and then for each process you can use the worker_threads module to spawn additional threads if needed (to speed up some heavy processing etc).

How do 'cluster' and 'worker_threads' work in Node.js?

Did I understand correctly: If I use cluster package, does it mean that
a new node instance is created for each created worker?
What is the difference between cluster and worker_threads packages?
Effectively what you are differing is process based vs thread based. Threads share memory (e.g. SharedArrayBuffer) whereas processes don't. Essentially they are the same thing categorically.
cluster
One process is launched on each CPU and can communicate via IPC.
Each process has it's own memory with it's own Node (v8) instance. Creating tons of them may create memory issues.
Great for spawning many HTTP servers that share the same port b/c the master main process will multiplex the requests to the child processes.
worker threads
One process total
Creates multiple threads with each thread having one Node instance (one event loop, one JS engine). Most Node API's are available to each thread except a few. So essentially Node is embedding itself and creating a new thread.
Shares memory with other threads (e.g. SharedArrayBuffer)
Great for CPU intensive tasks like processing data or accessing the file system. Because NodeJS is single threaded, synchronous tasks can be made more efficient with workers

How does Cluster keeps up with Node's single thread concept?

When you fork, or start multiple workers using something like Cluster:
Are multiple threads or instances of Node process being created ? Does this breaks Node's single thread concept?
How are the request handled between workers? Does Cluster provides some intelligent mechanism to load balance all requests to multiple workers ?
Cluster uses fork, and yes, it gets balanced automatically:
The worker processes are spawned using the child_process.fork method, so that they can communicate with the parent via IPC and pass server handles back and forth.
[...]
When multiple processes are all accept()ing on the same underlying resource, the operating system load-balances across them very efficiently. There is no routing logic in Node.js, or in your program, and no shared state between the workers. Therefore, it is important to design your program such that it does not rely too heavily on in-memory data objects for things like sessions and login.
You might think that this breaks node.js single thread concept if you count a new node.js instance as another thread, however, keep in mind that all callbacks to a given request are going to be handled be the same node.js instance that accepted the original request. There are no race conditions, no shared data, only fairly safe interprocess communication.
See the Cluster documentation for more information.
Cluster was made developed to compensate of node.js's single thread architecture. Modern processors have multiple cores and a single threaded process will not be able to take advantage of the available cores. It does deviate from its single thread architecture, but it was never the plan to stick to it. The main concept was asynchronous, event-driven execution.
Cluster uses fork to create processes. A forked process really is its
own process with its own address space - there is nothing that the
child can do (normally) to affect its parent's or siblings address
space (unlike a thread). In addition to having all the methods in a
normal ChildProcess instance, the returned object has a communication
channel built-in. All forked processes can communicate using this
channel.
Notice the subtle difference here : it is not multi-threaded, it just forks to create new independent processes. See here Threads vs Processes in Linux to compare them. Each worker assumes single-threaded architecture like before. So it does not break node's single thread concept.
The balancing of load depends on your code itself (since each is independent) and the OS. The load is balanced equally among all forked processes and original process alike, by the OS.
But if you wish to do it differently, it is also possible. If you use master thread differently than worker, or each worker specializing different tasks(compressing/ffmpeg) you can do that.

What are the effective differences between child_process.fork and cluster.fork?

I understand that cluster.fork will allow for multiple processes to listen on the same port(s), what I also want to know is how much additional overhead is there in supporting this when some of your workers are not listeners/handlers for the tcp service?
I have a service that I also want to launch a couple of workers.. ex: 2 web service listener processes, and 3 worker instances. Is it best to use cluster for them all, or would cluster for the 2 web services, and child_process for the workers be better?
I don't know the internals in node, but think it would be nice for myself and others to have a better understanding of which route to take given different needs. For now, I'm using cluster for all the processes.
cluster.fork is implemented on top of child_process.fork. The extra stuff that cluster.fork brings is that, it will enable you to listen on a shared port. If you don't want it, just use child_process.fork. So yeah, use cluster for web servers and child_process for workers.
Cluster is a module of Node.js that contains sets of functions and properties that helps the developers for forking processes through which they can take advantage of the multi-core system.
With the cluster module, the creation and sharing of child processes and several parts become easy. In a single thread, the individual instance of node.js runs specifically and to take advantage of various ecosystems, a cluster of node.js is launched, to distribute the load.
A developer can access Operating System functionalities by the child_process module, this happens by running any system command inside a child process. The child process input streams can be controlled and the developer can also listen to the output stream.
A child process can be easily spun using Node’s child_process module and these child processes can easily communicate with each other with the help of a messaging system

Resources