I had a few manually started processes (with p.start()) for dealing with some background tasks, and I communicated with them via multiprocessing.Pipe(). So far, so good.
Now I have to scale my application to a point where, following the same structure, too many processes would be started.
So, I'm trying to port my code from a few manually started multiprocessing.Process instances to a pool of processes. The problem is that multiprocessing.Pipe() does not seem to work with them. It seems that I have to use a queue instead.
Specifically, I was using the code suggested in this Stack Overflow answer to run some generators in the background, but the problem is that now I have many generators.
Many thanks.
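One way this could look is sketched below; it is not the code from the linked answer. Generators cannot be pickled, so each pool worker is handed a generator function plus its arguments and forwards the items through a Manager().Queue(), because a plain multiprocessing.Queue cannot be passed to pool workers as an argument. The names count_up, drain and run_in_pool are hypothetical, and the None sentinel assumes the generators never yield None themselves.

```python
import multiprocessing as mp

_DONE = None  # sentinel marking the end of one generator's output (assumes None is never yielded)


def count_up(n):
    """Hypothetical stand-in for one of the real background generators."""
    for i in range(n):
        yield i


def drain(gen_func, args, out_queue, tag):
    """Run gen_func(*args) in a worker and forward every item to out_queue."""
    try:
        for item in gen_func(*args):
            out_queue.put((tag, item))
    finally:
        # Always signal completion so the consumer never waits forever.
        out_queue.put((tag, _DONE))


def run_in_pool(jobs, processes=2):
    """jobs maps tag -> (generator function, args); returns tag -> list of items."""
    results = {tag: [] for tag in jobs}
    with mp.Manager() as manager, mp.Pool(processes) as pool:
        q = manager.Queue()  # proxy object, safe to pass to pool workers
        for tag, (gen_func, args) in jobs.items():
            pool.apply_async(drain, (gen_func, args, q, tag))
        pending = len(jobs)
        while pending:
            tag, item = q.get()
            if item is _DONE:
                pending -= 1
            else:
                results[tag].append(item)
    return results


if __name__ == "__main__":
    print(run_in_pool({"a": (count_up, (3,)), "b": (count_up, (2,))}))
```

The pool size caps how many generators run at once, however many are submitted; the tag on each item tells the consumer which generator produced it.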
I'm developing a lightweight framework to work as a coordinator for a robotics competition I compete in.
My idea is to have programs that are agnostic about the whole system, with just inputs that might trigger outputs. I then connect those outputs to inputs, and can get different behaviours from the same modules without much work.
I'm planning on doing this with Node.js and WebKit, to allow a nice UI for modifying the process. However, each "module" might not be code wrapped in some JavaScript class-like function; it might be a real thread, perhaps running native C++ code (without Node.js), or even a Python program.
What I need now is a fast and generic way to exchange data among processes. I have read about it, but haven't reached any conclusion...
Here are the 3 methods I found:
Local Socket: Uses the localhost to dispatch a broadcast to a port
Unix Socket: Maybe more efficient than the above (but using filesystem?)
Stdin/Out communication: When a process is launched by Node.js, its stdin and stdout can be bound and used to communicate with the program.
So, I have those 3 ways of doing it. Which should I use? I need things to communicate REALLY fast (data might go through 5 different processes, and I need that not to exceed 2ms)
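Since the 2ms budget is a measurable constraint, a small benchmark can help compare the options. The sketch below is only an illustration in Python: it times round trips over a connected socket pair (a Unix-domain pair on Linux and macOS), with a thread standing in for the peer process. The names recv_exact, echo and mean_round_trip, the message size and the message count are assumptions, and real cross-process numbers between Node.js, C++ and Python programs will differ.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly size bytes (stream sockets may return partial reads)."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock, n, size):
    """Echo n fixed-size messages back to the sender."""
    for _ in range(n):
        sock.sendall(recv_exact(sock, size))


def mean_round_trip(n=1000, size=256):
    """Return the mean round-trip time in seconds for n messages of size bytes."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo, args=(b, n, size))
    peer.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(n):
        a.sendall(payload)
        recv_exact(a, size)
    elapsed = time.perf_counter() - start
    peer.join()
    a.close()
    b.close()
    return elapsed / n


if __name__ == "__main__":
    print(f"mean round trip: {mean_round_trip() * 1e6:.1f} microseconds")
```

Repeating the same measurement with a TCP socket on localhost and with stdin/stdout pipes gives a like-for-like comparison; with 5 hops, each hop has to stay well under 0.4ms on average.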
I have been working with Node.js for a while, and it did a really good job when there was only I/O. Then I faced this challenge.
We have a game that on average 250 users play at any given time. Currently its back-end server runs on Java, but now we want to convert it to Node.js.
We were doing well until we reached the game engine, where there are many CPU-bound jobs. While one user is being served a CPU-bound request, all others are blocked. I know this is normal, so we tested all the solutions to this problem we could find before abandoning the project.
Used the following:
callback
thread and threadPool from node_module webWorker-threads
created separate js file for all CPU-Bound jobs and ran them in process.exec
cluster
created each thing in different module
But except for process.exec and cluster, all were in vain. Of these two, cluster is also too unpredictable: it can happen that multiple requests are assigned to one worker and the one in front is a CPU-bound job, in which case we hit the same issue again.
Only process.exec works well. But we have so many CPU-bound tasks that making a separate file for each of them would be a mess.
So I want to know whether this is possible in Node.js at all. If anyone in the Stack Overflow community has faced this issue and solved it, or wants to suggest a solution, a big thanks in advance...
We use clustering with our express apps on multi cpu boxes. Works well, we get the maximum use out of AWS linux servers.
We inherited an app we are fixing up. It's unusual in that it has two processes. It has an Express API portion, to take incoming requests. But the process that acts on those requests can run for several minutes, so it was built as a separate background process, node calling python and maya.
Originally the two were tightly coupled, with the python script called by the request to upload the data. But this of course was suboptimal, as it would leave the client waiting for a response for the time it took to run, so it was rewritten as a background process that runs in a loop, checking for new uploads, and processing them sequentially.
So my question is this: if we have this separate node process running in the background, and we run cluster, which starts up a process for each CPU, how is that going to work? Are we not going to get two node processes competing for the same CPU? We were getting a bit of weird behaviour and crashing yesterday, without a lot of error messages (god I love node), so it's a bit concerning. I'm assuming Linux will just swap the processes in and out as they are being used. But I wonder if it will be problematic, and I also wonder about someone getting their web session swapped out for several minutes while the longer running process runs.
The smart thing to do would be to rewrite this to run on two different servers, but the files that maya uses/creates are on the server's file system, and we were not given the budget to rebuild the way we should. So, we're stuck with this architecture for now.
Any thoughts on possible problems and how to avoid them would be appreciated.
From an overall architecture perspective, spawning 1 nodejs process per core is a great way to go. You have a lot of interdependencies though: the nodejs processes are calling maya, which may use multiple threads (keep that in mind).
The part that is concerning to me is your random crashes and your "process that runs in a loop". If that process is just checking the file system you probably have a race condition where the nodejs processes are competing to work on the same input/output files.
In theory, 1 nodejs process per core will work great and should help you use all of your CPU. Linux always swaps processes in and out, so that is not an issue. You could start multiple nodejs processes per core and still not have an issue.
One last note: be sure to keep an eye on your memory usage. Several linux distributions on EC2 do not have a swap file enabled by default, and running out of memory can be another silent app killer, so it is best to add a swap file in case you run into memory issues.
I have a simple nodejs webserver running, it:
Accepts requests
Spawns separate thread to perform background processing
Background thread returns results
App responds to client
Using Apache benchmark "ab -r -n 100 -c 10", performing 100 requests with 10 at a time.
Average response time of 5.6 seconds.
My logic for using nodejs is that it is typically quite resource efficient, especially when the bulk of the work is being done by another process. It seems like the most lightweight webserver option for this scenario.
The Problem
With 10 concurrent requests my CPU was maxed out, which is no surprise since there is CPU intensive work going on in the background.
Scaling horizontally is an easy thing to do, although I want to make the most out of each server for obvious reasons.
So how, with nodejs, either raw or with some framework, can one keep that under control so as not to overload the CPU?
Potential Approach?
Could I accept the request, store it in a db or some persistent storage, and have a separate process that uses an async library to process x at a time?
In your potential approach, you're basically describing a queue. You can store incoming messages (jobs) there and have each process get one job at a time, only getting the next one when processing the previous job has finished. You could spawn a number of processes working in parallel, like an amount equal to the number of cores in your system. Spawning more won't help performance, because multiple processes sharing a core will just run slower. Keeping one core free might be preferred to keep the system responsive for administrative tasks.
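The pull-based pattern itself is language-independent. A minimal sketch in Python, used here only for illustration (it is not Node.js, not Kue, and the queue is in memory rather than persistent): each worker takes one job, finishes it, and only then asks for the next, with one core left free. The names worker_count, cpu_heavy, worker and process_all are hypothetical.

```python
import multiprocessing as mp
import os

STOP = None  # sentinel telling a worker to exit


def worker_count(cores, reserved=1):
    """One worker per core, keeping `reserved` cores free, but at least one worker."""
    return max(1, cores - reserved)


def cpu_heavy(n):
    """Hypothetical stand-in for the real CPU-bound job."""
    return sum(i * i for i in range(n))


def worker(jobs, results):
    """Take one job at a time; fetch the next only after finishing the previous one."""
    while True:
        job = jobs.get()
        if job is STOP:
            break
        results.put((job, cpu_heavy(job)))


def process_all(job_list, n_workers):
    """Run every job on a fixed set of worker processes; returns (job, result) pairs."""
    jobs, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(jobs, results)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for job in job_list:
        jobs.put(job)
    for _ in procs:
        jobs.put(STOP)  # one stop marker per worker
    out = [results.get() for _ in job_list]  # collect before joining
    for p in procs:
        p.join()
    return out


if __name__ == "__main__":
    n = worker_count(os.cpu_count() or 1)
    print(process_all([10_000, 20_000, 30_000], n))
```

Because the number of workers is fixed, CPU use stays bounded however many requests arrive; extra jobs simply wait in the queue.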
Many different queues exist. A node-based one using redis for persistence that seems to be well supported is Kue (I have no personal experience using it). I found a tutorial for building an implementation with Kue here. Depending on the software your environment is running in though, another choice might make more sense.
Good luck and have fun!
I am in the process of beginning to write a worker queue for node using node's cluster API and mongoose.
I noticed that a lot of libs exist that already do this but using redis and forking. Is there a good reason to fork versus using the cluster API?
Edit: and now I have also found this: https://github.com/xk/node-threads-a-gogo -- too many options!
I would rather not add redis to the mix since I already use mongo. Also, my requirements are very loose, I would like persistence but could go without it for the first version.
Part two of the question:
What are the most stable/used nodejs worker queue libs out there today?
Wanted to follow up on this. My solution ended up being a roll-your-own cluster implementation where some of my cluster workers are dedicated job workers (i.e. they just have code to work on jobs).
I use agenda for job scheduling.
Cron-type jobs are scheduled by the cluster master. The rest of the jobs are created in the non-job-worker cluster processes as they are needed (verification emails etc.).
Before that I was using kue but dropped it because the rest of my app uses mongodb and I didn't like having to use redis just for job scheduling.
Have you tried https://github.com/rvagg/node-worker-farm?
It is very lightweight and doesn't require a separate server.
I personally am partial to cluster-master.
https://github.com/isaacs/cluster-master
The reason I like cluster-master is that it does very little besides adding logic for forking your process and giving you the ability to manage the number of processes you're running, with a little bit of logging/recovery to boot! I find overly bloated process management libraries tend to be unstable, and sometimes even slow things down.
This library will be good for you if the following are true:
Your module is largely asynchronous
You don't have a huge amount of different types of events triggering
The events that fire have small amounts of work to do, but you have lots of similar events firing (things like web servers)
Threads-a-gogo may be good for you for the opposite reasons. If you have a few spots in your code where there is a lot of work to do within your event loop, something like threads-a-gogo that launches a "thread" specifically for this work is awesome, because you aren't determining ahead of time how many workers to spawn, but rather spawning them to do work when needed. Note: this can also be bad if there is the potential for a lot of them to spawn; if you start launching too many, things can actually bog down, but I digress.
To summarize, if your module is largely asynchronous already, what you really want is a worker pool, to minimize the down time when your process is not listening for events and to maximize the amount of processor you can use. Unless you have a very busy synchronous call, a single node event loop will have trouble taking advantage of even a single core of a processor. Under this circumstance, you are best off with cluster-master. What I recommend is doing a little benchmarking to see how much of a single core your program can use under the "worst case scenario". Let's say this is 33% of one core. If you have a quad core machine, you then tell cluster-master to launch 12 workers (4 cores / 0.33 ≈ 12).
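The sizing rule above is plain arithmetic and can be written down as a small helper. This is an illustrative Python sketch; workers_for is a hypothetical name, not part of cluster-master, and the per-worker core fraction is whatever your own benchmark measures.

```python
def workers_for(cores, core_fraction_per_worker):
    """Workers needed to saturate `cores` when one worker uses the given fraction of a core."""
    if not 0 < core_fraction_per_worker <= 1:
        raise ValueError("core_fraction_per_worker must be in (0, 1]")
    # Round to the nearest whole worker, and always run at least one.
    return max(1, round(cores / core_fraction_per_worker))


if __name__ == "__main__":
    print(workers_for(4, 0.33))  # the quad core example from the answer
```

The resulting number is what you would pass to the process manager as its worker count; re-measure if the workload changes, since a worker that starts using a full core brings the figure back down to one per core.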
Hope this helped!