I have a general question about dask.compute() that's motivated by a memory buildup I've been experiencing with it. I'm using dask.compute() and map_partitions() (I've tried dask.distributed and dask.multiprocessing, the latter both with pool=ThreadPool and with pool=multiprocessing.pool) to apply a function that performs a series of operations to chunks of a dask dataframe. The output of the function is a relatively small matrix, but the operations within the function involve really large intermediate matrices. Despite deleting these intermediates, I get a memory buildup over time that eventually causes my kernel to die. This makes me wonder whether dask allocates jobs based only on the expected size of the final output variable, and not on the large intermediate calculations within the function, so that too many jobs are sent at once and memory blows up. Is this possible? Thanks for any insight into what might be going wrong.
There are a number of similar issues around (e.g., https://github.com/dask/distributed/issues/1795 and elsewhere). As in that issue, you may want to run typical Python memory monitoring tools on the function first, to see whether this behaviour is intrinsic to it.
Essentially, people have been experiencing memory build-up when creating and deleting a large number of pandas dataframes, and this appears to be a pandas problem unrelated to dask, or maybe even a deeper-level malloc thing. You can do the typical things: make very sure that you do not keep references alive, and call gc.collect() within your code.
I have a simple node app with 1 function that defines 1000+ functions inside it (without running them).
When I call this function (the wrapper) around 200 times, the RSS memory of the process spikes from 100MB to 1000MB and immediately goes down. (The memory spike only happens after roughly 200 calls; none of the calls before that cause a memory spike, and none of the calls after it do either.)
This issue is happening to us in our node server in production, and I was able to reproduce it in a simple node app here:
https://github.com/gileck/node-v8-memory-issue
When I use --jitless or --no-opt the issue does not happen (no spikes), but obviously we do not want to disable all the v8 optimizations in production.
This issue must be caused by some specific v8 optimization. I tried a few other v8 flags, but none of them fix the issue (only --jitless and --no-opt fix it).
Anyone knows which v8 optimization could cause this?
Update:
We found that --no-concurrent-recompilation fixes this issue (no memory spikes at all),
but still, we can't explain it.
We are not sure why it happens or which code changes might fix it (without the flag).
As one of the answers suggests, moving all the 1000+ function definitions out of the main function will solve it, but then those functions will not be able to access the context of the main function, which is why they are defined inside it.
Imagine that you have a server and you want to handle a request.
Obviously, the request handler is going to run many times as the server gets a lot of requests from clients.
Would you define functions inside the request handler (so they can access the request context) or define them outside of the request handler and pass the request context as a parameter to all of them? We chose the first option... what do you think?
anyone knows which v8 optimization could cause this?
Load Elimination.
I guess it's fair to say that any optimization could cause lots of memory consumption in pathological cases (such as: a nearly 14 MB monster of a function as input, wow!), but Load Elimination is what causes it in this particular case.
You can see for yourself when you run with --turbo-stats (and optionally --turbo-filter=foo to zoom in on just that function).
You can disable Load Elimination if you feel that you must. A preferable approach would probably be to reorganize your code somewhat: defining 2,000 functions is totally fine, but the function that defines all these other functions probably doesn't need to run so often that it gets optimized? You'll not only avoid this particular issue, but get better efficiency in general, if you define each function only once.
There may or may not be room for improving Load Elimination in Turbofan to be more efficient for huge inputs; that's a longer investigation and I'm not sure it's worth it (compared to working on other things that likely show up more frequently in practice).
I do want to emphasize for any future readers of this that disabling optimization(s) is not generally a good rule of thumb for improving performance (or anything else), on the contrary; nor are any other "secret" flags needed to unlock "secret" performance: the default configuration is very carefully optimized to give you what's (usually) best. It's a very rare special case that a particular optimization pass interacts badly with a particular code pattern in an input function.
I'm struggling to find a word for a solution that is not streaming but instead processes everything in phases or stages, keeping everything in memory. "Processing in bulk" is not the right term.
To give you an example: we currently have a mechanism that works on a list of identifiers. The list has around a million entries. Either the whole list is loaded into memory, processed, and then the memory is freed, or, in a streaming solution, the list is read line by line and each line is processed immediately, so the memory footprint is much smaller than in the first solution.
So what is the word to describe the first algorithm/solution?
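To make the contrast concrete, here is a minimal C sketch of the two approaches (the file name "ids.txt" and the buffer sizes are invented for illustration):

#include <stdio.h>
#include <stdlib.h>

/* Phase/"all at once" style: the whole list lives in memory until processing is done. */
static void process_all_at_once(void)
{
    FILE *f = fopen("ids.txt", "r");
    if (!f) return;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *buf = malloc(size + 1);
    if (!buf) { fclose(f); return; }
    fread(buf, 1, size, f);
    buf[size] = '\0';
    /* ... process the complete buffer here ... */
    free(buf);      /* memory is only freed once everything is processed */
    fclose(f);
}

/* Streaming style: only one line is in memory at any time. */
static void process_streaming(void)
{
    FILE *f = fopen("ids.txt", "r");
    if (!f) return;
    char line[256];
    while (fgets(line, sizeof line, f))
    {
        /* ... process this single identifier immediately ... */
    }
    fclose(f);
}

int main(void)
{
    process_all_at_once();   /* first approach */
    process_streaming();     /* second approach */
    return 0;
}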
I’m facing a similar naming issue, and decided to settle on “buffered”.
I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code; there are some
// omissions around handling of tasks that do not run to completion that should be in production code
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory and the program being unable to cope (generally evidenced via an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly, notably the someMagicNumber multiplier, which, although tuned via profiling, may not be as optimal as it could be and isn't resilient to changes in the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes). The consequences of being too low are that the program might be slowed down. As long as you still have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit quicker than the task that follows it, so even with a fairly small buffer the following task is likely to always have some work available. The buffer size needed to get 90% of the benefits of a buffer is usually going to be quite small (a few dozen items maybe), whereas the size needed to get an OOM error is likely 6+ orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in) you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
My application scenario is like this: I want to evaluate the performance gain one can achieve on a quad-core machine for processing the same amount of data. I have following two configurations:
i) 1-Process: a program without any threading that processes data from 1M .. 1G, while the system is assumed to be running on only a single one of its 4 cores.
ii) 4-threads-Process: a program with 4 threads (all threads performing the same operation), each processing 25% of the input data.
In my program, for creating the 4 threads I used pthread's default options (i.e., without any specific pthread_attr_t). I believe the performance gain of the 4-thread configuration compared to the 1-Process configuration should be close to 400% (or somewhere between 350% and 400%).
I profiled the time spent in creation of threads just like this below:
timer_start(&threadCreationTimer);
pthread_create( &thread0, NULL, fun0, NULL );
pthread_create( &thread1, NULL, fun1, NULL );
pthread_create( &thread2, NULL, fun2, NULL );
pthread_create( &thread3, NULL, fun3, NULL );
threadCreationTime = timer_stop(&threadCreationTimer);
pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, in order not to increase the memory requirement of each thread, each thread reads data in small chunks, processes it, reads the next chunk, processes it, and so on. Hence, the structure of the code of the functions run by my threads is like this:
timer_start(&threadTimer[i]);
while (!dataFinished[i])
{
    threadTime[i] += timer_stop(&threadTimer[i]);   /* stop the clock while reading */
    data_source();                                  /* read the next small chunk */
    timer_start(&threadTimer[i]);                   /* restart the clock for processing */
    process();
}
threadTime[i] += timer_stop(&threadTimer[i]);
The variable dataFinished[i] is marked true by process() when it has received and processed all the data it needs. process() knows when to do that :-)
In the main function, I am calculating the time taken by 4-threaded configuration as below:
execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime.
And performance gain is calculated by simply
gain = execTime1process / execTime4Thread * 100
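For example (made-up numbers just to illustrate the formula): if execTime1process = 40 s and execTime4Thread = 11 s, then gain = 40 / 11 * 100 ≈ 364%.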
Issue:
On small data sizes, around 1M to 4M, the performance gain is generally good (between 350% and 400%). However, the performance gain decreases exponentially as the input size increases. It keeps decreasing until a data size of about 50M or so, and then stabilises at around 200%. Once it has reached that point, it remains almost stable even for 1GB of data.
My question is: can anybody suggest the main reason for this behaviour (i.e., the performance drop at the start which then stabilises later)?
And any suggestions on how to fix it?
For your information, I also investigated the behaviour of threadCreationTime and threadTime for each thread to see what's happening. For 1M of data the values of these variables are small, but with increasing data size both variables increase exponentially (although threadCreationTime should remain roughly the same regardless of data size, and threadTime should increase at a rate corresponding to the data being processed). After increasing up to 50M or so, threadCreationTime becomes stable, just as the performance gain does, and threadTime keeps increasing at a constant rate corresponding to the amount of data to be processed (which is understandable).
Do you think increasing the stack size of each thread, changing process priority, or setting custom values for other scheduler-related parameters (using pthread_attr_init) could help?
PS: The results were obtained while running the programs in Linux's fail-safe mode as root (i.e., with a minimal OS running, without GUI and networking stuff).
"Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, in order not to increase the memory requirement of each thread, each thread reads data in small chunks, processes it, reads the next chunk, processes it, and so on."
Just this, alone, can cause a drastic speed decrease.
If there is sufficient memory, reading one large chunk of input data will always be faster than reading data in small chunks, especially from each thread. Any I/O benefit from reading in large blocks (caching and read-ahead effects) disappears when you break the reads down into small pieces. Even allocating one big chunk of memory is much cheaper than allocating small chunks many, many times.
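As a rough illustration of the difference (a sketch with an invented file name and sizes, not the poster's code):

#include <stdio.h>
#include <stdlib.h>

#define TOTAL_SIZE (64 * 1024 * 1024)   /* 64 MB, made-up figure */
#define SMALL_CHUNK 4096                /* small per-iteration read */

/* One big allocation and one big read: malloc is called exactly once and
 * the OS gets one long sequential transfer to read ahead into. */
static void read_in_one_chunk(FILE *f)
{
    char *buf = malloc(TOTAL_SIZE);
    if (!buf) return;
    fread(buf, 1, TOTAL_SIZE, f);
    /* ... process buf ... */
    free(buf);
}

/* Many small allocations and reads: repeated allocator work and per-read
 * overhead add up, and the benefit of one big sequential transfer is lost. */
static void read_in_small_chunks(FILE *f)
{
    for (long done = 0; done < TOTAL_SIZE; done += SMALL_CHUNK)
    {
        char *buf = malloc(SMALL_CHUNK);
        if (!buf) return;
        if (fread(buf, 1, SMALL_CHUNK, f) == 0) { free(buf); return; }
        /* ... process this chunk ... */
        free(buf);
    }
}

int main(void)
{
    FILE *f = fopen("input.bin", "rb");  /* hypothetical input file */
    if (!f) return 1;
    read_in_one_chunk(f);
    rewind(f);
    read_in_small_chunks(f);
    fclose(f);
    return 0;
}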
As a sanity check, you can run htop to ensure that at least all your cores are being topped out during the run. If not, your bottleneck could be outside of your multi-threading code.
Within the threading:
- context switches due to many threads can cause sub-optimal speedup
- as mentioned by others, a cold cache due to not reading memory contiguously can cause slowdowns
But re-reading your OP, I suspect the slowdown has something to do with your data input/memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your thread?
Some algorithm in your worker threads is likely to be suboptimal/expensive.
Are your threads starting on creation? If that is the case, then the following will happen: while your parent thread is still creating threads, the threads already created will start to run. When you hit timer_stop(&threadCreationTimer), all four have already been running for a certain time. So threadCreationTime overlaps threadTime[i].
As it is now, you don't know what you are measuring. This won't solve your problem, because obviously you have a problem since threadTime does not grow linearly, but at least you won't be adding overlapping times.
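One way to avoid the overlap (a rough sketch, not necessarily how the original program is organised; the now_seconds() helper is just for illustration) is to take a single wall-clock measurement around the whole parallel phase, from just before the first pthread_create to just after the last pthread_join:

#include <pthread.h>
#include <time.h>

/* fun0..fun3 are the worker functions from the question. */
void *fun0(void *arg);
void *fun1(void *arg);
void *fun2(void *arg);
void *fun3(void *arg);

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

double run_parallel_phase(void)
{
    pthread_t thread0, thread1, thread2, thread3;
    double start = now_seconds();            /* covers creation + work + join */

    pthread_create(&thread0, NULL, fun0, NULL);
    pthread_create(&thread1, NULL, fun1, NULL);
    pthread_create(&thread2, NULL, fun2, NULL);
    pthread_create(&thread3, NULL, fun3, NULL);

    pthread_join(thread0, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);

    return now_seconds() - start;            /* one number, no overlapping timers */
}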
To get more info you can use the perf tool, if it is available on your distro.
For example:
perf stat -e cache-misses <your_prog>
and see what happens with a two-thread version, a three-thread version, etc.
There is nothing in the way the program uses this data which will cause the program to crash if it reads the old value rather than the new value. It will get the new value at some point.
However, I am wondering if reading and writing at the same time from multiple threads can cause problems for the OS?
I have yet to see such problems, if they exist. The program is developed on Linux using pthreads.
I am not interested in being told how to use mutexes/semaphores/locks/etc. Edit: how to make my program only get the new values is not what I'm asking.
No, the OS should not have any problem. The typical problem is that you don't want to read the old values, or a value that is halfway updated and thus not valid (which may crash your app, or, if the next value depends on the former, you can get a corrupted value and keep generating wrong values all the time), but if you don't care about that, the OS won't either.
Are the kernel/drivers reading that data for any reason (e.g. does it contain structures passed into kernel APIs)? If not, then there isn't any issue, since the OS will never ever look at your hot memory on its own.
Your own reads must ensure they are consistent, so that you don't read half of a value pre-update and half post-update and end up with a value that is neither the pre-update nor the post-update value.
There is no danger for the OS. Only your program's data integrity is at risk.
Imagine your data consists of a set (structure) of values which cannot be updated in a single atomic operation. The reading thread is bound to read inconsistent data at some point (data consisting of a mixture of old and new values). But you did not want to hear about mutexes...
Problems arise when multiple threads share access to data and accessing that data is not atomic. For example, imagine a struct with 10 interdependent fields. If one thread is writing and one is reading, the reading thread is likely to see a struct that is halfway between one state and another (for example, one where only half of its members have been set).
If, on the other hand, the data can be read and written with a single atomic operation, you will be fine. For example, imagine there is a global variable that contains a count... One thread is incrementing it on some condition, and another is reading it and taking some action... In this case, there is really no intermediate inconsistent state: it either has the new value or it has the old value.
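To make the single-atomic-operation case concrete, here is a small sketch (hypothetical writer and reader threads) using C11 <stdatomic.h>, where the counter is always updated and read as a whole:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* A counter that is always read and written as a whole.  A reader may see
 * a slightly stale count, but never a half-updated one. */
static atomic_long counter = 0;

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&counter, 1);                        /* single atomic update */
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++)
        printf("seen so far: %ld\n", atomic_load(&counter));  /* old or new, never torn */
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    printf("final: %ld\n", atomic_load(&counter));
    return 0;
}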
Logically, you can think of locking as a tool that lets you make arbitrary blocks of code atomic, at least as far as the other threads of execution are concerned.