Node child_process vs Heroku worker - node.js

Trying to get my mind around workers vs threads on Node and Heroku. What happens when you call exec from within Node.js?
Is it correct to believe this runs on a separate thread, and does not block the main event loop?
require('child_process').exec(cmd, function (err, stdout, stderr) {
// ... do stuff
});
If on Heroku, is there an advantage to moving this to a separate worker? E.g.
If computationally intensive, would child_process slow the main app?
Do worker dynos get their own memory limit?
Would an uncaught error (heaven forbid) not crash the main app if in a worker?

In Node, a child process is a real, separate operating-system process, spawned as a child of your parent process (your Node.js script). The command itself runs entirely outside your Node process; only the callback runs back on your main event loop. This article explains it in much more depth: http://www.graemeboy.com/node-child-processes
What this means on Heroku is that if you use child_process to spawn a new child process, your Heroku dyno will actually be able to do 'more' total CPU work, as the child process code will (most likely) run on a separate physical CPU core (although this depends heavily on your application).
This can be a problem, however, because each Heroku dyno has only a limited amount of CPU and RAM.
So, for instance, if your dyno code (the web process, not a separate Heroku worker) is doing CPU-intensive work and using child_process heavily, you will exhaust the dyno's CPU resources and your code will start to block or hang in Node.
A much better idea (although slightly more expensive on Heroku) is to move all background / CPU-intensive code into a separate worker dyno and use that EXCLUSIVELY for the heavy processing. This keeps your main web dynos as fast and responsive as possible.
I personally like to use a queueing service like Amazon SQS to handle passing data between my web dynos and my worker dynos, as it's fast and inexpensive, but you have lots of options.
Every dyno you create (web or worker) gets its own resources, so each dyno has its own set amount of CPU and RAM. The dyno types available, and their resource limits, are explained here: https://devcenter.heroku.com/articles/dyno-types
In regards to error handling, if you don't catch an exception, it's tricky to say exactly what will happen. It is, however, very possible that your entire Node app will crash (and Heroku will then restart it). It really depends on your specific implementation of various things =/

Related

Kubernetes NodeJS consuming more than 1 CPU?

We are running a single NodeJS instance in a Pod with a request of 1 CPU, and no limit. Upon load testing, we observed the following:
NAME                                 CPU(cores)   MEMORY(bytes)
backend-deployment-5d6d4c978-5qvsh   3346m        103Mi
backend-deployment-5d6d4c978-94d2z   3206m        99Mi
If NodeJS only runs a single thread, how could it be consuming more than 1000m CPU, when running directly on a node it would only utilize a single core? Is Kubernetes somehow letting it borrow time across cores?
Although Node.js runs the main application code in a single thread, the Node.js runtime is multi-threaded. Node.js has an internal worker pool that is used to run background tasks, including I/O and certain CPU-intensive processing like crypto functions. In addition, if you use the worker_threads facility (not to be confused with the worker pool), then you would be directly accessing additional threads in Node.js.

Node Worker Threads vs Heroku Workers

I'm trying to understand the difference between Node worker threads and Heroku workers.
We have a single Dyno for our main API running Express.
Would it make sense to have a separate worker Dyno for our intensive tasks, such as processing a large file? E.g.
worker: npm run worker
Some files we process are up to 20 MB, and some processes take longer than Heroku's 30s request limit, which kills the connection before a response comes back.
Then could I add Node Worker Threads in the worker app to create child processes to handle the requests or is the Heroku worker enough on its own?
After digging much deeper into this and successfully implementing workers to solve the original issue, here is a summary for anyone who comes across the same scenario.
Node worker threads and Heroku workers are similar in that both aim to move work off the main thread so it doesn't block the event loop, but they operate at different levels: worker threads are extra threads inside a single Node process, while Heroku workers are entirely separate dynos. How you use and implement them differs and depends on your use case.
Node worker threads
These are a relatively new way to run JavaScript on additional threads inside a single Node process. You can follow the Node docs to create workers, or use something like microjob to make it much easier to set up and run separate Node threads for specific tasks.
https://github.com/wilk/microjob
This works great and is much more efficient, as jobs run on separate worker threads instead of blocking the main event loop with I/O-heavy or CPU-heavy work.
Using worker threads on Heroku in a web process did not solve my problem, however, as the web process is still subject to Heroku's 30s request timeout.
Important difference: Heroku workers are not!
Heroku Workers
These are separate virtual dyno containers on Heroku within a single app. They are separate processes that run without all the overhead the web process carries, such as HTTP.
Workers do not listen for HTTP requests. If you are using Express with Node, you need a web process to handle incoming HTTP requests and a worker to handle the jobs.
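For example, a minimal Procfile declaring both process types (the entry filenames here are just placeholders for your own scripts):

```
web: node server.js
worker: node worker.js
```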
The challenge was working out how to communicate between the web and worker processes. This is done using Redis and the Bull queue library together: Redis stores the job data and Bull manages the queue and messaging between the processes.
Finally, Throng makes it easier to create a clustered environment using a Procfile, so it is ideal for use with Heroku!
Here is a perfect example that implements all of the above in a starter project that Heroku has made available.
https://devcenter.heroku.com/articles/node-redis-workers
It may make more sense for you to keep a single dyno and scale it up, which means multiple instances will be running in parallel.
See https://devcenter.heroku.com/articles/scaling

How do 'cluster' and 'worker_threads' work in Node.js?

Did I understand correctly: if I use the cluster package, does it mean that a new Node instance is created for each worker?
What is the difference between the cluster and worker_threads packages?
Effectively what you are comparing is process-based vs. thread-based parallelism. Threads share memory (e.g. via SharedArrayBuffer) whereas processes don't; otherwise the two approaches are categorically similar.
cluster
One process is launched per CPU core, and the processes can communicate with each other via IPC.
Each process has its own memory and its own Node (V8) instance, so creating a large number of them may cause memory issues.
Great for spawning many HTTP servers that share the same port, because the primary process will multiplex incoming requests to the child processes.
worker threads
One process total.
Creates multiple threads, with each thread having its own Node instance (one event loop, one JS engine). Most Node APIs are available to each thread, with a few exceptions. Essentially, Node embeds itself in a new thread.
Shares memory with other threads (e.g. via SharedArrayBuffer).
Great for CPU-intensive tasks like processing data or accessing the file system. Because the main JS thread is single-threaded, synchronous CPU-bound work can be offloaded to workers to keep the event loop free.

Node.js child process limits

I know that Node is a single-threaded system, and I was wondering if a child process uses its own thread or its parent's. Say, for example, I have an AMD E-350 CPU with two threads. If I ran a Node server that spawned ten child instances which all work continuously, would it allow it, or would it fail because the hardware itself is not sufficient?
I can say from my own experience that I successfully spawned 150 child processes on an Amazon t2.micro with just one core.
The reason? I was DoS-ing myself for testing my core server's limits.
The attack stayed alive for 8 hours, until I gave up, but it could've been working for much longer.
My code was simply running an HTTP client pool, and as soon as one request finished, another one spawned. This doesn't need much CPU; it does need a lot of network, though.
Most of the time, the processes were just waiting for requests to finish.
However, in a high-concurrency application, performance will be awful with so many processes contending for the same memory and CPU.

Node.js Clusters with Additional Processes

We use clustering with our express apps on multi cpu boxes. Works well, we get the maximum use out of AWS linux servers.
We inherited an app we are fixing up. It's unusual in that it has two processes: an Express API portion to take incoming requests, and a process that acts on those requests. The latter can run for several minutes, so it was built as a separate background process (Node calling Python and Maya).
Originally the two were tightly coupled, with the Python script called directly by the upload request. This was of course suboptimal, as it left the client waiting for a response for however long the script took to run, so it was rewritten as a background process that runs in a loop, checking for new uploads and processing them sequentially.
So my question is this: if we have this separate Node process running in the background, and we run cluster, which starts up a process for each CPU, how is that going to work? Won't we get two Node processes competing for the same CPU? We were getting some weird behaviour and crashes yesterday, without many error messages (god I love Node), so it's a bit concerning. I'm assuming Linux will just swap the processes in and out as they are being used, but I wonder if it will be problematic, and I also wonder about someone's web session being starved for several minutes while the longer-running process runs.
The smart thing to do would be to rewrite this to run on two different servers, but the files that Maya uses/creates are on the server's file system, and we were not given the budget to rebuild it the way we should. So we're stuck with this architecture for now.
Any thoughts on possible problems and how to avoid them would be appreciated.
From an overall architecture perspective, spawning one Node.js process per core is a great way to go. You have a lot of interdependencies, though: the Node.js processes are calling Maya, which may itself use multiple threads (keep that in mind).
The part that concerns me is your random crashes and your "process that runs in a loop". If that process is just polling the file system, you probably have a race condition where the Node.js processes are competing to work on the same input/output files.
In theory, one Node.js process per core will work great and should help utilize all your CPU. Linux's scheduler swaps processes in and out constantly, so that is not an issue; you could even start multiple Node.js processes per core and still be fine.
One last note: keep an eye on your memory usage. Several Linux distributions on EC2 do not have a swap file enabled by default, and running out of memory can be another silent app killer, so it's best to add a swap file in case you run into memory issues.
