I notice that the following code uses multiple threads and keeps all CPU cores at about 100% while it is reading the file:
scala.io.Source.fromFile("huge_file.txt").toList
and I assume the following behaves the same way:
scala.io.Source.fromFile("huge_file.txt").foreach
I interrupted this code, run as a unit test under the Eclipse debugger on my dev machine (OS X 10.9.2), and it shows these threads: main, ReaderThread, and 3 daemon system threads. htop shows all cores are busy when I run this in a Scala console on a 24-core server machine (Ubuntu 12).
Questions:
How do I limit this code to using at most N threads?
To help me understand the system performance, can you explain what, why and how io.Source does this? Reading its source code didn't help.
I assume each line is read in sequence; however, since multiple threads are in use, is the foreach run on multiple threads? My debugger seems to tell me that the code still runs on the main thread.
Any insight would be appreciated.
As suggested, I put my findings here.
I used the following to test my dummy code with and without the -J-XX:+UseSerialGC option:
$ scala -J-XX:+UseSerialGC
scala> var c = 0
scala> scala.io.Source.fromFile("huge_file.txt").foreach(e => c += e)
Before using the option, all 24 cores on my server machine are busy during the file read. With the option, only two threads are busy.
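If giving up parallel GC entirely is too drastic, capping the number of GC threads is another option (a standard HotSpot flag, nothing specific to io.Source; 4 is just an example value):
$ scala -J-XX:ParallelGCThreads=4
This only limits the GC threads; as the debugger already showed, the foreach itself still runs on the main thread.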
Here is the memory profile I captured on my dev machine (not the server). I first performed a GC to get a baseline, then ran the above code several times. The Eden space got cleaned up periodically. The memory swing is about 20 MB, while the smaller file I read is about 200 MB, i.e. io.Source creates temporary objects amounting to roughly 10% of the file size on each run.
This characteristic will create trouble on a shared system. It also limits our ability to handle multiple big files at once. It stresses memory, I/O and CPU in a way that means I can't run my code alongside other production jobs; I have to run it separately to avoid that system-wide impact.
If you know a better way, or have a suggestion for handling this situation in a real shared production environment, please let me know.
Related
I would like to cache a large amount of data in a Flask application. Currently it runs on K8s pods with the following gunicorn.ini:
bind = "0.0.0.0:5000"
workers = 10
timeout = 900
preload_app = True
To avoid caching the same data in those 10 workers, I would like to know if Python supports a way to multi-thread instead of multi-process. This would be very easy in Java but I am not sure if it is possible in Python. I know that you can share a cache between Python instances using the file system or other methods. However, it would be a lot simpler if it were all shared in the same process space.
Edited:
There are a couple of posts suggesting threads are supported in Python: this comment by Filipe Correia, or this answer in the same question.
Based on the above comment, the Gunicorn design document talks about workers and threads:
Since Gunicorn 19, a threads option can be used to process requests in multiple threads. Using threads assumes use of the gthread worker.
Based on how Java works, to share some data among threads I would need one worker and multiple threads. Based on this other link, I know it is possible. So I assume I can change my gunicorn configuration as follows:
bind = "0.0.0.0:5000"
workers = 1
threads = 10
timeout = 900
preload_app = True
This should give me 1 worker and 10 threads, which should be able to process the same number of requests as the current configuration. However, the question is: would the cache still be instantiated once and shared among all the threads? How or where should I instantiate the cache to make sure it is shared among all the threads?
would like to ... multi-thread instead of multi-process.
I'm not sure you really want that. Python is rather different from Java.
workers = 10
One way to read that is "ten cores", sure. But another way is "wow, we get ten GILs!" The global interpreter lock must be held before the interpreter interprets a new bytecode instruction, so within a single process only one thread executes Python bytecode at a time. Ten interpreter processes offer significant parallelism, executing ten instructions simultaneously.
Now, there are workloads dominated by async I/O, or where the interpreter calls into a C extension to do the bulk of the work. If a C thread can keep running, doing useful work in the background, and the interpreter gathers the result later, terrific. But that's not most workloads.
tl;dr: You probably want ten GILs, rather than just one.
To avoid caching the same data in those 10 workers
Right! That makes perfect sense.
Consider pushing the cache into a storage layer, or a daemon like Redis. Or access a memory-resident cache, in the context of your own process, via mmap or shmat.
When running Flask under Gunicorn, you are certainly free to set threads greater than 1, though it's likely not what you want. YMMV. Measure and see.
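If you do go with workers = 1 and multiple gthread threads, a minimal sketch of a process-wide cache could look like the following (a plain module-level dict guarded by a lock; the names get_entry and load_value are illustrative, not from your code):
import threading

_lock = threading.Lock()
_cache = {}  # one dict per worker process, shared by all of its threads

def get_entry(key, load_value):
    # load_value is your expensive fetch/compute function; it runs once per key.
    with _lock:
        if key not in _cache:
            _cache[key] = load_value(key)
        return _cache[key]
With workers = 1 there is a single process, so every request thread sees the same _cache; with workers = 10 each process would build its own copy, which is exactly what you are trying to avoid. Holding the lock while load_value runs keeps the sketch simple but serializes cache misses, which is a trade-off worth measuring.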
Is there a way we could concurrently run functions on CPU and GPU (using Python)? I'm already using Numba to do thread level scheduling for compute intensive functions on the GPU, but I now also need to add parallelism between CPU-GPU. Once we ensure that the GPU shared memory has all the data to start processing, I need to trigger the GPU start and then in parallel run some functions on the host using the CPU.
I'm sure that the time taken by the GPU to return the data is much greater than the time the CPU needs to finish its task, so that once the GPU has finished processing, the CPU is already waiting to fetch the data back to the host. Is there a standard library/way to achieve this? Appreciate any pointers in this regard.
Thanks Robert and Ander. I was thinking along similar lines but wasn't very sure. I checked that, until I add some synchronization for task completion between the devices (for example cp.cuda.Device().synchronize() when using CuPy), I'm effectively running the GPU and CPU in parallel. Thanks again. A general flow with Numba/CuPy, to make gpu_function and cpu_function run in parallel, would be something like the following:
""" GPU has buffer full to start processing Frame N-1 """
tmp_gpu = cp.asarray(tmp_cpu)
gpu_function(tmp_gpu)
""" CPU receives Frame N over TCP socket """
tmp_cpu = cpu_function()
""" For instance we know cpu_function takes [a little] longer than gpu_function """
cp.cuda.Device().synchronize()
Of course, we could even do away with the time spent transferring tmp_cpu to tmp_gpu by employing a ping-pong buffer and an initial frame delay.
In Node.js, the main criticisms are based on its single-threaded event loop model.
The biggest disadvantage of Node.js is that one cannot perform CPU-intensive tasks in the application. For demonstration purposes, let's take the example of a while loop (which is perhaps analogous to a DB function returning hundreds of thousands of records and then processing those records in Node.js):
while (1) {
    x++
}
Code like this will block the main stack, and consequently all other tasks waiting in the event queue will never get the chance to execute (and in a web application, new users will not be able to connect to the app).
However, one could use a module like cluster to leverage a multi-core system and partially solve the above issue. The cluster module allows one to create a small network of separate processes which can share server ports, which gives the Node.js application access to the full power of the server. (However, one of the biggest disadvantages of using cluster is that state cannot be maintained in the application code.)
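For reference, here is a minimal cluster sketch (the port number and the one-worker-per-core choice are just illustrative defaults, not taken from the question):
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
    // Fork one worker process per core.
    for (let i = 0; i < os.cpus().length; i++) {
        cluster.fork();
    }
} else {
    // All workers share the same server port.
    http.createServer((req, res) => {
        res.end('handled by pid ' + process.pid);
    }).listen(3000);
}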
But again, there is a high possibility that we would end up in the same situation as described above if there is too much server load.
When I started learning the Go language and had a look at its architecture and goroutines, I thought it would solve the problem that arises from the single-threaded event loop model of Node.js, and that it would avoid the above scenario of CPU-intensive tasks, until I came across this interesting code, which blocks the whole Go application so that nothing happens, much like a while loop in Node.js:
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    var x int
    threads := runtime.GOMAXPROCS(0)
    for i := 0; i < threads; i++ {
        go func() {
            for {
                x++
            }
        }()
    }
    time.Sleep(time.Second)
    fmt.Println("x =", x)
}
// or perhaps even if we use some number of goroutines that is just greater than the number of threads.
So, the question is: if I have an application which is load-intensive and also has a lot of CPU-intensive tasks (where the DB returns a huge number of rows and the application then needs to process and modify something in those rows), I could probably get stuck in the above sort of scenario. Wouldn't the incoming users be blocked, and all other tasks as well?
So, how could the above problem be solved?
P.S.
Or perhaps the use cases I have mentioned do not make much sense? :)
Currently (Go 1.11 and earlier versions) your so-called tight loop will indeed clog the code. This would happen simply because currently the Go compiler inserts code which does "preemption checks" («should I yield to the scheduler so it runs another goroutine?») only in prologues of the functions it compiles (almost, but let's not digress). If your loop does not call any function, no preemption checks will be made.
The Go developers are well aware of this and are working on eventually alleviating this issue.
Still, note that your alleged problem is a non-issue in most real-world scenarios: code which performs long runs of CPU-intensive work without calling any function is few and far between.
In the cases where you really have such code, and you have detected that it really makes other goroutines starve (let me underline: you have detected that through profiling, as opposed to just assuming "it must be slow"), you may apply several techniques to deal with this:
Insert calls to runtime.Gosched() at certain key points of your long-running CPU-intensive code (see the sketch after this list). This will forcibly relinquish control to another goroutine while not actually suspending the caller goroutine (so it will run again as soon as it is scheduled).
Dedicate OS threads to the goroutines running those CPU hogs:
bound the set of such CPU hogs to, say, N "worker goroutines";
put a dispatcher in front of them (this is called "fan-out");
make sure that N is sensibly smaller than runtime.GOMAXPROCS, or raise the latter so that you have those N extra threads;
shovel units of work to those dedicated goroutines via the dispatcher.
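A minimal sketch of the first technique (the loop body, the yield interval and the GOMAXPROCS(1) setting are all illustrative choices, not part of the answer above):
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // Force a single OS thread so starvation is easy to observe
    // (Go 1.14+ adds asynchronous preemption, which masks it).
    runtime.GOMAXPROCS(1)

    go func() {
        for {
            fmt.Println("heartbeat")
            time.Sleep(100 * time.Millisecond)
        }
    }()

    x := 0
    for i := 0; i < 1000000000; i++ {
        x++
        if i%10000000 == 0 {
            runtime.Gosched() // yield so the heartbeat goroutine gets a turn
        }
    }
    fmt.Println("x =", x)
}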
Concise-ish problem explanation:
I'd like to be able to run multiple (we'll say a few hundred) shell commands, each of which starts a long running process and blocks for hours or days with at most a line or two of output (this command is simply a job submission to a cluster). This blocking is helpful so I can know exactly when each finishes, because I'd like to investigate each result and possibly re-run each multiple times in case they fail. My program will act as a sort of controller for these programs.
for all commands in parallel {
    submit_job_and_wait()
    tries = 1
    while ! job_was_successful and tries < 3 {
        resubmit_with_extra_memory_and_wait()
        tries++
    }
}
What I've tried/investigated:
I was so far thinking it would be best to create a thread for each submission which just blocks waiting for input. There is enough memory for quite a few waiting threads. But from what I've read, perl threads are closer to duplicate processes than in other languages, so creating hundreds of them is not feasible (nor does it feel right).
There also seem to be a variety of event-loop-ish cooperative systems like AnyEvent and Coro, but these seem to require you to rely on asynchronous libraries, otherwise you can't really do anything concurrently. I can't figure out how to run multiple shell commands with them. I've tried using AnyEvent::Util::run_cmd, but after I submit multiple commands, I have to specify the order in which I want to wait for them. I don't know in advance how long each submission will take, so I can't recv without sometimes getting very unlucky. This isn't really parallel.
my $cv1 = run_cmd("qsub -sync y 'sleep $RANDOM'");
my $cv2 = run_cmd("qsub -sync y 'sleep $RANDOM'");
# Now should I $cv1->recv first or $cv2->recv? Who knows!
# Out of 100 submissions, I may have to wait on the longest one before processing any.
My understanding of AnyEvent and friends may be wrong, so please correct me if so. :)
The other option is to run the job submission in its non-blocking form and have it communicate its completion back to my process, but the inter-process communication required to accomplish and coordinate this across different machines daunts me a little. I'm hoping to find a local solution before resorting to that.
Is there a solution I've overlooked?
You could instead use scientific workflow software such as Fireworks or Pegasus, which are designed to help scientists submit large numbers of computing jobs to shared or dedicated resources. They can also do much more, so they might be overkill for your problem, but they are still worth a look.
If your goal is to find the tightest memory requirements for your job, you could also simply submit it with a large amount of requested memory, and then extract the actual memory usage from accounting (qacct) or, cluster policy permitting, log on to the compute node(s) where your job is running and view the memory usage with top or ps.
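For example, with Grid Engine accounting enabled, something like the following should report the peak memory for a finished job (the job id is a placeholder):
$ qacct -j <job_id> | grep maxvmem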
There is this code:
async.series(tasks, function (err) {
    return callback({message: 'tasks execution error', error: err});
});
where tasks is an array of functions, each of which performs an HTTP request (using the request module) and calls the MongoDB API to store the data (to a MongoHQ instance).
With my current input (~200 tasks to execute), it takes
[normal mode] collection cycle: 1356.843 sec. (22.61405 mins.)
But simply changing from series to parallel gives a magnificent benefit: almost the same set of tasks runs in ~30 secs instead of ~23 mins.
But, knowing that nothing is free, I'm trying to understand the consequences of that change. Can I expect that the number of open sockets will be much higher, with more memory consumption and more load on the DB servers?
The machine I run the code on is an Ubuntu box with only 1 GB of RAM, and I have seen the app hang there once; could that be caused by a lack of resources?
Your intuition is correct that the parallelism doesn't come for free, but you certainly may be able to pay for it.
Using a load testing module (or collection of modules) like nodeload, you can quantify how this parallel operation is affecting your server to determine if it is acceptable.
Async.parallelLimit can be a good way of limiting server load if you need to, but first it is important to discover if limiting is necessary. Testing explicitly is the best way to discover the limits of your system (eachLimit has a different signature, but could be used as well).
Beyond this, common pitfalls using async.parallel include wanting more complicated control flow than that function offers (which, from your description doesn't seem to apply) and using parallel on too large of a collection naively (which, say, may cause you to bump into your system's file descriptor limit if you are writing many files). With your ~200 request and save operations on 1GB RAM, I would imagine you would be fine as long as you aren't doing much massaging in the event handlers, but if you are experiencing server hangs, parallelLimit could be a good way out.
Again, testing is the best way to figure these things out.
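For instance, a sketch of capping concurrency with parallelLimit, reusing the tasks array and callback from the question (the limit of 20 is an arbitrary starting point to tune by testing, not a recommendation):
var async = require('async');

// At most 20 of the ~200 tasks are in flight at any one time.
async.parallelLimit(tasks, 20, function (err, results) {
    if (err) {
        return callback({message: 'tasks execution error', error: err});
    }
    callback(null, results);
});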
I would point out that async.parallel executes multiple functions concurrently, not (completely) in parallel. It is more like virtual parallelism.
Executing concurrently is like running different programs on a single CPU core via multitasking/scheduling. True parallel execution would be running a different program on each core of a multi-core CPU. This is important because Node.js has a single-threaded architecture.
The best thing about node is that you don't have to worry about I/O. It handles I/O very efficiently.
In your case you are storing data to MongoDB, which is mostly I/O. So running the tasks concurrently will use up your network bandwidth, and if you are reading/writing from disk, disk bandwidth too. Your server will not hang because of CPU overload.
The consequence of this is that if you overburden your server, your requests may fail. You may get an EMFILE error (too many open files); each socket counts as a file. Usually connections are pooled, meaning that to establish a connection a socket is picked from the pool and, when finished, returned to the pool. You can increase the file descriptor limit with ulimit -n xxxx.
You may also get socket errors when overburdened, like ECONNRESET (Error: socket hang up), ECONNREFUSED or ETIMEDOUT, so handle them properly. Also check the maximum number of simultaneous connections for the MongoDB server.
Finally, the server can hang because of garbage collection. Garbage collection kicks in once your memory grows to a certain point, then runs periodically. The maximum heap memory V8 can have is around 1.5 GB, so expect GC to run frequently if memory usage is high. Node will crash with "process out of memory" if it asks for more than that limit. So fix the memory leaks in your program. You can look at these tools.
The main downside you'll see here is a spike in database server load. That may or may not be okay depending on your setup.
If your database server is a shared resource then you will probably want to limit the parallel requests by using async.eachLimit instead.
You'll notice the difference when multiple users connect: in that case the processor has to handle operations from several users, and async tries to treat the operations of the different users roughly equally.
T = task
U = user
(T1.U1 = task 1 of user 1)
T1.U1 => T1.U2 => T2.U1 => T8.U3 => T2.U2 => etc.
This is the opposite of atomicity (so maybe watch out for atomicity on certain DB operations, but that's another topic).
So it may be faster to run T2.U1 before T1.U1. That is no problem until T2.U1 depends on T1.U1, which is preventable by using callbacks (that is what callbacks are for).
...hope this is what you wanted to know... it's a bit late here.