I'm trying to understand the difference between Node Worker Threads and Heroku Workers.
We have a single Dyno for our main API running Express.
Would it make sense to have a separate worker Dyno for our intensive tasks, such as processing large files? For example, with a Procfile entry like:
worker: npm run worker
Some files we process are up to 20 MB, and some jobs take longer than Heroku's 30-second request limit, so the connection is killed before a response comes back.
Could I then add Node Worker Threads in the worker app to handle the requests in parallel, or is the Heroku worker enough on its own?
After digging much deeper into this and successfully implementing workers to solve the original issue, here is a summary for anyone who comes across the same scenario.
Node worker threads and Heroku workers are similar in that both aim to move work off the main thread so that it does not block the event loop. How you use and implement them differs, though, and depends on the use case.
Node worker threads
These are the newer way to run JavaScript on separate threads within a single Node process. You can follow the Node docs to create workers, or use something like microjob to make it much easier to set up and run separate Node threads for specific tasks.
https://github.com/wilk/microjob
This works great and is much more efficient, as jobs run on separate worker threads and so do not block the main event loop.
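As a minimal sketch of the raw worker_threads API (the file names and job body here are illustrative, not from the original setup):

// main.js – spawn a worker thread for one CPU-heavy job
const { Worker } = require('worker_threads');

function runJob(filePath) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./job.js', { workerData: { filePath } });
    worker.on('message', resolve); // result posted back from the worker
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

runJob('/tmp/big-file.csv').then((result) => console.log(result));

// job.js – runs on its own thread with its own event loop
const { parentPort, workerData } = require('worker_threads');
// ... CPU-intensive processing of workerData.filePath ...
parentPort.postMessage({ done: true, file: workerData.filePath });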
Using worker threads on Heroku in a Web process did not solve my problem, however, as the Web process still times out once a request hits 30s.
Important difference: Heroku workers do not!
Heroku Workers
These are separate virtual dyno containers on Heroku within a single app. They are separate processes that run without the overhead the web process carries, such as HTTP.
Workers do not listen for HTTP requests. If you are using Express with Node, you need a web process to handle incoming HTTP requests and a worker to handle the jobs.
The challenge was working out how to communicate between the web and worker processes. This is done using Redis and Bull (a Redis-backed job queue) together to store job data and send messages between the processes.
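A hedged sketch of that hand-off (assuming the bull package and a REDIS_URL config var; the queue and route names are illustrative):

// web.js – the Express process enqueues the job and responds immediately
const express = require('express');
const Queue = require('bull');

const app = express();
app.use(express.json());
const fileQueue = new Queue('file-processing', process.env.REDIS_URL);

app.post('/process', async (req, res) => {
  const job = await fileQueue.add({ filePath: req.body.filePath });
  res.json({ jobId: job.id }); // respond well inside 30s; the worker does the heavy lifting
});

app.listen(process.env.PORT || 3000);

// worker.js – the worker dyno consumes jobs from the same queue
const Queue = require('bull');
const fileQueue = new Queue('file-processing', process.env.REDIS_URL);

fileQueue.process(async (job) => {
  // ... long-running processing of job.data.filePath ...
  return { done: true };
});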
Finally, Throng makes it easier to run a clustered set of these processes from a Procfile, so it is ideal for use with Heroku!
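For example, the worker entry point might look like this (a sketch against throng's options API; WEB_CONCURRENCY is the concurrency variable Heroku sets):

// worker.js – fork one consumer process per available core
const throng = require('throng');

function start() {
  // create the Bull consumer here, once per forked process
}

throng({ workers: parseInt(process.env.WEB_CONCURRENCY, 10) || 2, start });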
Here is a perfect example that implements all of the above in a starter project that Heroku has made available.
https://devcenter.heroku.com/articles/node-redis-workers
It may make more sense for you to keep a single dyno and scale it up, which means multiple instances will be running in parallel.
See https://devcenter.heroku.com/articles/scaling
Related
We are running a single NodeJS instance in a Pod with a request of 1 CPU, and no limit. Upon load testing, we observed the following:
NAME                                 CPU(cores)   MEMORY(bytes)
backend-deployment-5d6d4c978-5qvsh   3346m        103Mi
backend-deployment-5d6d4c978-94d2z   3206m        99Mi
If Node.js only runs a single thread, how could it be consuming more than 1000m of CPU? Running directly on a node, it would only utilize a single core. Is Kubernetes somehow letting it borrow time across cores?
Although Node.js runs the main application code in a single thread, the Node.js runtime is multi-threaded. Node.js has an internal worker pool that is used to run background tasks, including I/O and certain CPU-intensive processing like crypto functions. In addition, if you use the worker_threads facility (not to be confused with the worker pool), then you would be directly accessing additional threads in Node.js.
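You can see the effect with a small sketch like this, where the JavaScript runs on one thread but the hashing is dispatched to libuv's pool (the inputs are illustrative):

const crypto = require('crypto');

// Each pbkdf2 call runs on libuv's internal thread pool (default size 4,
// tunable via UV_THREADPOOL_SIZE), so these four hashes can occupy four
// cores at once while the single JS thread stays nearly idle.
for (let i = 0; i < 4; i++) {
  crypto.pbkdf2('secret', 'salt', 1e6, 64, 'sha512', () => {
    console.log(`hash ${i} done`);
  });
}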
I have been following the RabbitMQ tutorials to add a publisher and consumer to Node.js, but the documentation and the general tutorials on the internet don't give a proper production setup for using the RabbitMQ client with a Node.js cluster setup.
From the RabbitMQ tutorial, channel.consume() starts a consumer. Does this consumer start in the same thread Node.js is running in? If I run 4 Node.js child processes, that means it will create 4 consumers, right?
What would be the correct way to start a Node.js app that only runs RabbitMQ workers, taking the worker count from an environment variable?
From the RabbitMQ tutorial, channel.consume() starts a consumer. Does this consumer start in the same thread Node.js is running in?
Yes, consumers are also subject to the single-thread rule, so consuming synchronously can block your entire application.
If I run 4 Node.js child processes, that means it will create 4 consumers, right?
Yes
What would be the correct way to start a Node.js app that only runs RabbitMQ workers, taking the worker count from an environment variable?
I'm not sure what the logic behind this is, but I would strongly advise against arbitrarily limiting the number of consumers; quite the contrary.
In order to keep your queues empty you'd usually want to use as much consuming power as you can.
If you still want to limit the number of RabbitMQ consumers regardless of how many node processes are available, you'd have to write business logic involving communication between the master and its child processes, which is not a trivial affair.
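For completeness, the straightforward "one consumer per forked process" shape looks like this (a sketch assuming the amqplib package; WORKER_COUNT, AMQP_URL, and the queue name are illustrative):

const cluster = require('cluster');
const os = require('os');
const amqp = require('amqplib');

const count = parseInt(process.env.WORKER_COUNT, 10) || os.cpus().length;

if (cluster.isMaster) {
  // fork the requested number of consumer processes
  for (let i = 0; i < count; i++) cluster.fork();
} else {
  amqp.connect(process.env.AMQP_URL).then(async (conn) => {
    const ch = await conn.createChannel();
    await ch.assertQueue('jobs');
    ch.prefetch(1); // hold at most one unacknowledged message per consumer
    ch.consume('jobs', (msg) => {
      // ... process msg.content ...
      ch.ack(msg);
    });
  });
}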
Did I understand correctly: if I use the cluster package, is a new Node instance created for each worker?
What is the difference between the cluster and worker_threads packages?
Effectively, what you are comparing is process-based vs thread-based parallelism. Threads share memory (e.g. via SharedArrayBuffer) whereas processes don't; categorically, though, they solve the same problem.
cluster
One process is launched per CPU, and the processes can communicate with the master via IPC.
Each process has its own memory, with its own Node (V8) instance. Creating tons of them may create memory issues.
Great for spawning many HTTP servers that share the same port, because the master process will multiplex the requests to the child processes, as sketched below.
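A minimal sketch of that pattern (the port is arbitrary):

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // fork one worker process per CPU core
  os.cpus().forEach(() => cluster.fork());
} else {
  // each worker listens on the same port; the master distributes connections
  http.createServer((req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(3000);
}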
worker threads
One process total
Creates multiple threads, with each thread having one Node instance (one event loop, one JS engine). Most Node APIs are available to each thread, except a few. So essentially Node is embedding itself and creating a new thread.
Shares memory with other threads (e.g. SharedArrayBuffer)
Great for CPU-intensive tasks like processing data or accessing the file system. Because the main JavaScript thread is single-threaded, CPU-bound synchronous tasks can be made more efficient with workers (see the sketch below).
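A small sketch of the memory-sharing point (one file run as both main thread and worker; the counter is illustrative):

const { Worker, isMainThread, workerData } = require('worker_threads');

if (isMainThread) {
  const shared = new SharedArrayBuffer(4); // 4 bytes = one Int32
  const counter = new Int32Array(shared);
  const worker = new Worker(__filename, { workerData: shared });
  worker.on('exit', () => {
    // the worker's write is visible here without any copying
    console.log('counter =', Atomics.load(counter, 0));
  });
} else {
  const counter = new Int32Array(workerData);
  Atomics.add(counter, 0, 42); // shared memory, unlike cluster's separate heaps
}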
Trying to get my mind around workers vs threads on Node and Heroku. What happens when you call exec from within Node.js?
Is it correct to believe this runs on a separate thread, and does not block the main event loop?
require('child_process').exec(cmd, function (err, stdout, stderr) {
  // exec runs cmd in a shell in a separate OS process; this callback
  // fires back on the main event loop once the command finishes
  // ... do stuff
});
If on Heroku, is there an advantage to moving this to a separate worker? E.g.
If computational intensive, would child_process slow the main app?
Do worker dynos get their own memory limit?
Would an uncaught error (heaven forbid) not crash the main app if in a worker?
In Node, a child process is a real, separate process on the CPU, which is a child of your parent process (your Node.js script). This article explains it in much more depth: http://www.graemeboy.com/node-child-processes
What this means on Heroku is that if you use child_process to spawn a new child process, your Heroku dyno will actually be able to do 'more' total CPU work, as it will (most likely) run your child process code on a separate physical CPU (this depends on a lot of factors in your application, however).
This can be a problem, however, because each Heroku dyno only has a limited amount of CPU and RAM resources.
So, for instance, if your dyno code (the web bit, not a separate Heroku worker) is doing CPU-intensive stuff and using child_process a lot, you will use up all your CPU resources and your code will start to block / hang in Node.
A much better idea (although slightly more expensive on Heroku) is to put all worker / asynchronous code into a separate worker dyno, and use that EXCLUSIVELY for processing CPU-intensive stuff. This ensures your main web dynos will always be as fast and responsive as possible.
I personally like to use a queueing service like Amazon SQS to handle passing data between my web dynos and my worker dynos, as it's super fast and inexpensive, but you have lots of options.
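A rough sketch of that hand-off with the AWS SDK (v2-style API; the region, QUEUE_URL variable, and payload are illustrative):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });

// web dyno: enqueue the job and return immediately
sqs.sendMessage({
  QueueUrl: process.env.QUEUE_URL,
  MessageBody: JSON.stringify({ filePath: '/tmp/upload.csv' }),
}, (err) => { if (err) console.error(err); });

// worker dyno: poll for jobs
sqs.receiveMessage({
  QueueUrl: process.env.QUEUE_URL,
  WaitTimeSeconds: 20, // long polling keeps requests cheap
}, (err, data) => {
  if (err || !data.Messages) return;
  // ... process data.Messages[0].Body ...
  sqs.deleteMessage({
    QueueUrl: process.env.QUEUE_URL,
    ReceiptHandle: data.Messages[0].ReceiptHandle,
  }, () => {});
});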
Every dyno you create (web dynos and worker dynos) gets its own resources, so each dyno gets its own set amount of CPU and RAM. The types of dynos available, and their resource limits, are explained here: https://devcenter.heroku.com/articles/dyno-types
In regards to error handling, if you don't catch an exception, it's tricky to say what will happen. It is, however, very possible that your entire Node app will crash (and then Heroku will just restart it). It really depends on your specific implementation of various things =/
I'm looking for a solution to create a "worker farm" using Node.js. Basically, we have an app in Node and we need to send off "jobs" to be run across n worker servers. For example, let's say we have 5 servers that all run certain jobs; the jobs need to be distributed or queued until a worker has CPU available to process them.
One way to do this would be to have a worker server running on every separate machine. Each worker would pull from a queue based on its CPU utilization or queue availability. The application itself would add items to a queue (probably handled by Redis), and there would be no direct communication between the individual worker servers and the application. One problem I could see with this is if multiple workers grab the same job off the queue at the same time. The other method would be to somehow communicate with the worker servers from the application, find the worker with the least resources, and 'assign' the job to that particular worker or queue it up.
Does anyone know of a good solution for handling this?
Thank you!
I recommend kue, which runs on top of Redis. It gives you atomic queue operations, so two workers can never grab the same job, and each worker can get the next task from the queue. Take a look at resque for a more full-featured version of the same.
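A minimal sketch of kue's shape (assuming the kue package and a Redis instance shared by all machines; the job type and payload are illustrative):

const kue = require('kue');
const queue = kue.createQueue({ redis: process.env.REDIS_URL });

// producer – the main app enqueues work
queue.create('process-file', { filePath: '/tmp/big.csv' }).save();

// consumer – run this on each worker server; Redis pops jobs
// atomically, so each job goes to exactly one worker
queue.process('process-file', (job, done) => {
  // ... heavy work on job.data.filePath ...
  done();
});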