What's the limit of spawning child_processes? - node.js

I have to serve a calculation from an algorithm, and I've been advised to use a child process per opened socket. What I am about to do is something like this:
var spawn = require('child_process').spawn;
var child = spawn('node', ['algorithm.js']);
I know how to send arguments to the algorithm process and how to receive results.
What I am concerned about is: how many sockets (each socket will spawn a process) can I have?
How can I resolve this with my cloud hosting provider, so that my app gets auto-scaled?
What's the recommended Node.js cloud hosting provider?
Finally, is this a good approach in using child processes?

Yes, this is a fair approach when you have to do some heavy processing in Node. However, starting a new process introduces some overhead, so be aware of that. The number of sockets (file descriptors) you can open is limited by your operating system; on Linux, you can inspect the limits with, for example, the `ulimit` utility (`ulimit -n` shows the open-file limit).
One alternative approach, which would remove the worry about the number of sockets/processes, is to run a separate algorithm/computation server. This server could spawn N worker threads and listen on a socket. When a computation request is received, it can be queued and processed by the first available thread. An advantage of this approach is that your computation server can run on any machine, freeing up resources for your node instance.
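The queue-and-dispatch idea can be sketched like this. It is a minimal illustration of the scheduling logic only: plain callbacks stand in for real worker threads, and the socket plumbing is omitted.

```javascript
// Minimal sketch of the queue-and-dispatch idea: N workers, one FIFO queue.
// Here "workers" are just callbacks standing in for real threads/processes.
function makePool(n, runJob) {
  const queue = [];
  let idle = n;
  function dispatch() {
    while (idle > 0 && queue.length > 0) {
      const { job, done } = queue.shift();
      idle--;
      runJob(job, (result) => { // hand the job to a free worker
        idle++;
        done(result);
        dispatch();             // a worker freed up: pull the next queued job
      });
    }
  }
  return {
    submit(job, done) { queue.push({ job, done }); dispatch(); },
    pending: () => queue.length,
  };
}

// Usage: two "workers" computing squares synchronously for illustration.
const pool = makePool(2, (job, cb) => cb(job * job));
const results = [];
[1, 2, 3].forEach((n) => pool.submit(n, (r) => results.push(r)));
```

In a real computation server, `runJob` would post the job to a `worker_threads` worker (or a child process) and invoke the callback when the worker reports back.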

Related

Worker thread communication protocol

I stumbled upon worker threads in Node.js and started investigating, at a lower level of abstraction, how intercommunication and data sharing work between them, especially the postMessage function that is used to send message data between threads.
Looking at this line of code, const { Worker, isMainThread, parentPort } = require('worker_threads');, one might guess that it uses sockets to communicate, since the keyword "port" is used, but I found no open port connections when searching for them through the command prompt.
I want to understand which communication protocol the worker_threads mechanism uses. Is it TCP, or some other mechanism for sharing data and messages between threads? This is part of research I want to commit myself to in order to understand the efficiency of transmitting large amounts of data between worker threads versus IPC communication between child processes using memory sharing/TCP.
Workers don't communicate with their parents via TCP/IP or any other interprocess communication protocol. The messages passed between workers, and between workers and parents, pass data back and forth directly: workers and their parents live in a single process and address space (each worker runs its own V8 isolate within it).
.postMessage() looks like this, where transferList is optional.
port.postMessage(value, transferList)
The items in the first parameter are cloned, and the copies are passed from the sender to the receiver of the message.
The items in the second parameter are passed without cloning. Only certain array-like data types (such as ArrayBuffer) can be transferred this way. The sender loses access to these items and the receiver gains access. This saves the cloning time and makes it much quicker to pass large data structures like images. This sort of messaging works in both browser and Node.js code.
Child processes can pass data back and forth between the spawned process and the spawning process. That data is copied and pushed through the interprocess communication channel set up at spawn time. In many cases the IPC mechanism uses OS-level pipes; those are as efficient as any IPC mechanism, especially on UNIX-derived OSs. Child processes do not share memory with their parents, so they can't use a transferList change-of-ownership scheme.

nodejs cluster distributing connection

In nodejs api doc, it says
The cluster module supports two methods of distributing incoming
connections.
The first one (and the default one on all platforms except Windows),
is the round-robin approach, where the master process listens on a
port, accepts new connections and distributes them across the workers
in a round-robin fashion, with some built-in smarts to avoid
overloading a worker process.
The second approach is where the master process creates the listen
socket and sends it to interested workers. The workers then accept
incoming connections directly.
The second approach should, in theory, give the best performance. In
practice however, distribution tends to be very unbalanced due to
operating system scheduler vagaries. Loads have been observed where
over 70% of all connections ended up in just two processes, out of a
total of eight.
I know PM2 uses the first one, but why doesn't it use the second? Just because of the unbalanced distribution? Thanks.
The second approach may add CPU load, because every child process tries to 'grab' the listening socket the master sent whenever a connection arrives.

How does the cluster module work in Node.js?

Can someone explain in detail how the core cluster module works in Node.js?
How the workers are able to listen to a single port?
As far as I know, the master process does the listening, but how can it know which ports to listen on, since workers are started after the master process? Do they somehow communicate that back to the master using the child_process.fork communication channel? And if so, how is an incoming connection on the port passed from the master to the worker?
Also, I'm wondering what logic is used to determine which worker an incoming connection is passed to?
I know this is an old question, but this is now explained in the cluster documentation at nodejs.org:
The worker processes are spawned using the child_process.fork method,
so that they can communicate with the parent via IPC and pass server
handles back and forth.
When you call server.listen(...) in a worker, it serializes the
arguments and passes the request to the master process. If the master
process already has a listening server matching the worker's
requirements, then it passes the handle to the worker. If it does not
already have a listening server matching that requirement, then it
will create one, and pass the handle to the worker.
This causes potentially surprising behavior in three edge cases:
server.listen({fd: 7}) -
Because the message is passed to the master,
file descriptor 7 in the parent will be listened on, and the handle
passed to the worker, rather than listening to the worker's idea of
what the number 7 file descriptor references.
server.listen(handle) -
Listening on handles explicitly will cause the
worker to use the supplied handle, rather than talk to the master
process. If the worker already has the handle, then it's presumed that
you know what you are doing.
server.listen(0) -
Normally, this will cause servers to listen on a
random port. However, in a cluster, each worker will receive the same
"random" port each time they do listen(0). In essence, the port is
random the first time, but predictable thereafter. If you want to
listen on a unique port, generate a port number based on the cluster
worker ID.
When multiple processes are all accept()ing on the same underlying
resource, the operating system load-balances across them very
efficiently. There is no routing logic in Node.js, or in your program,
and no shared state between the workers. Therefore, it is important to
design your program such that it does not rely too heavily on
in-memory data objects for things like sessions and login.
Because workers are all separate processes, they can be killed or
re-spawned depending on your program's needs, without affecting other
workers. As long as there are some workers still alive, the server
will continue to accept connections. Node does not automatically
manage the number of workers for you, however. It is your
responsibility to manage the worker pool for your application's needs.
Node.js uses a round-robin decision to load-balance between the child processes: it hands each incoming connection to an idle process, based on the RR algorithm.
The children and the parent do not actually share anything; the whole script is executed from beginning to end in each process. That is the main difference from a traditional C fork(), where the forked child continues executing from the instruction where it left off, not from the beginning as in Node.js. So if you want to share anything, you need to connect to a cache like Memcached or Redis.
So the code below produces 6 6 6 (no evil means) on the console.
var cluster = require("cluster");

var a = 5;
a++;
console.log(a);

if (cluster.isMaster) {
  var worker = cluster.fork();
  worker = cluster.fork();
}
Here is a blog post that explains this
As an update to #OpenUserX03's answer: Node.js no longer uses the operating system's load balancing by default, but a built-in one. From this post:
To fix that, Node v0.12 got a new implementation using a round-robin algorithm to distribute the load between workers in a better way. This has been the default approach Node uses since then, including Node v6.0.0.

is node.js a one process server?

Is node.js a one-process server, or can it emulate Apache's pool of child processes, each serving a different request and each independent from the others (with cycling of child processes to avoid long-term memory leaks)?
Is it at all needed when using node.js?
Node.js is by default a one-process server. For most purposes that's all that's needed (I/O limits and memory limits are typically reached before CPU limits).
If you need more processes you can use http://learnboost.github.com/cluster/
It's single-process and single-threaded, because Node is non-blocking and event-based. This means a single process can handle many requests at the same time, sending each response back whenever it is ready.
The key point to note, is that Node is non-blocking.

Long-running computations in node.js

I'm writing a game server in node.js, and some operations involve heavy computation on part of the server. I don't want to stop accepting connections while I run those computations -- how can I run them in the background when node.js doesn't support threads?
I can't vouch for either of these, personally, but if you're hell-bent on doing the work in-process, there have been a couple of independent implementations of the WebWorkers API for node, as listed on the node modules page:
http://github.com/cramforce/node-worker
http://github.com/pgriess/node-webworker
At first glance, the second looks more mature. Both would allow you to essentially do threaded programming, but it's basically the actor model: everything is done with message passing, and you can't have shared data structures or anything like that.
Also, for what it's worth, the node.js team intends to implement precisely this API natively, eventually, so these tools, even if they're not perfect, may be a decent stopgap.
var spawn = require('child_process').spawn;
var listorwhatev = spawn('ls', ['-lh', '/usr']); // or whatever server action you need

// then you can attach events to that child process like this
listorwhatev.on('exit', function (code) {});
// or, in this ls example, as it streams info
listorwhatev.stdout.on('data', function (info) { console.log(info.toString()); });
Ensure the spawn happens once per application, then feed work into it and watch it for events per connection.
You should also check that listorwhatev is still running before handling it. We all love those uncaught errors in Node crashing the app, don't we ;)
When the spawned process (pid) has exited, through a kill or because something bad happened on your machine, and you didn't end the spawn gracefully in your code, your stream event handler can crash your app.
and some operations involve heavy
computation on part of the server
How did you write code that is computation-heavy in the first place? That's very hard to do in node.js.
how can I run them in the background
when node.js doesn't support threads
You could spawn a couple of worker (node) instances and have them communicate with the connection-accepting node instance using, for example, Redis's blocking pop. The Node.js redis library is non-blocking.