Memory doesn't get freed in multiple threads - garbage-collection

I've been using Julia for multi-thread processing of large amounts of data and observed one interseting pattern. The memory usage (reported by htop) slowly grows until the process is killed by OS. The project is complex and it is hard to produce a suitable MWE, but I carried out a simple experiment:
using Base.Threads
f(n) = Threads.#threads for i=1:n
x = zeros(n)
end
Now, I called f(n) repeatedly for various values of n (somewhere between 10^4 and 10^5 on my 64 Gb machine). The result is that sometimes everything works as expected and memory gets freed after return, however sometimes this is not the case and an amount of used memory reported by htop hangs at a large value even though it seems no computations are made:
Explicit garbage collecting GC.gc() helps only a little, some memory is freed, but only a small chunk. Also, calling GC.gc() sometimes in the loop in function f helps, but the problem persists and, of course, performance is decreased. After exiting Julia, the allocated memory gets back to normal (probably freed by OS).
I've read about how julia manages its memory and how the memory is released only when memory tally is bigger than some value. But in my case, it results into process being killed by OS. It seems to me that GC somehow loses track of all allocated memory
Could anybody please explain this behaviour and how to prevent it without slowing down the code by repetitive calling of GC.gc()? And why is garbage collection broken in this way?
More details:
This happens only when processing large data (allocating large amount of memory) in multiple threads. I couldn't reproduce the same thing with only one thread.
I checked my code for everything I know about that can increase memory consumption (global variables, type stability, ...) with no positive results. As far as I know these problems result in higher memory allocation, my problem here is that memory doesn't get freed after returning from a function.
Here is my versioninfo output:
julia> versioninfo()
Julia Version 0.7.0
Commit a4cb80f3ed (2018-08-08 06:46 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Platinum 8124M CPU # 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 36

Since this was asked quite a while ago, hopefully this no longer occurs -- though I cannot test to be sure without an MWE.
One point which may be worth noting, however, is that the Julia garbage collector is single-threaded; i.e., there is only ever one garbage collector no matter how many threads you have generating garbage.
Consequently, if you are going to be generating a lot of garbage in a parallel workflow, it is generally good advice to use multiprocessing (i.e. MPI.jl or Distributed.jl) rather than multithreading. In multiprocessing, in contrast to multithreading, each process gets its own GC.

I have been experiencing similar issues when using Distributed.jl. Likewise, I have tried GC.gc() on each worker process, which helps alleviate memory consumption somewhat but for large tasks it eventually cripples the machine/process and the only fix is to restart Julia.
I could reproduce your MWE on Julia 1.7.1:
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC-Rome Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
Environment:
JULIA_NUM_THREADS = 32
I also have a MWE using CSV.jl, which will use all available CPUS to read in a large file:
##start julia such that multiple processes are available i.e. Threads.nthreads()>1
Threads.nthreads()
using CSV,DataFrames
## memory usage in GB after starting julia and loading packages:
used_mem() = (println("$(round((Sys.total_memory()-Sys.free_memory())/2^30 -9))G used"))
used_mem()
## create large file for CSV.jl to read (you can adjust n as appropriate for your machine, this maxes out at about 7.5GB on my machine)
n = 100000000
CSV.write("test.csv",DataFrame(repeat([(1,1,1)],n)))
used_mem()
##during the above process, memory peaks, but running garbage collection returns it to original state
GC.gc()
used_mem()
##now we load in test.csv, using all available CPUS
CSV.read("test.csv",DataFrame)
## there is now a very large dataframe in ans, so memory usage is high again
used_mem()
##clear ans and collect garbage
1+1
GC.gc()
### if Threads.nthreads()>1, memory usage is still very high, even with nothing running and no big variables to explain it
## If Threads.nthreads()=1 (i.e. start julia with `JULIA_NUM_THREADS=1`) then used_mem() is as on initialisation
varinfo()
used_mem()
I did not come up with a MWE for Distributed.jl usecases, but that is where I mainly run into problems in my application. Interestingly, some memory is freed when I kill all worker processes, but some memory holding continues after that as well.

Related

Mono 4.2.2 garbage collection really slow/leaking on Linux with multiple threads?

I have an app that processes 3+GB of data into 300MB of data. Run each independent dataset sequentially on the main thread, its memory usage tops out at about 3.5GB and it works fine.
If I run each dataset concurrently on 10 threads, I see the following:
Virtual memory usage climbs steadily until allocations fail and it crashes. I can see GC is trying to run in the stack trace)
CPU utilization is 1000% for periods, then goes down to 100% for minutes, and then cycles back up. The app is easily 10x slower when run with multiple threads, even though they are completely independent.
This is mono 4.2.2 build for Linux with large heap support, running on 128GB RAM with 40 logical processors. I am running mono-sgen and have tried all the custom GC settings I could think of (concurrent mark-sweep, max heap size, etc).
These problems do not happen on Windows. If I rewrite code to do significant object pooling, I get farther in the dataset before running OOM, but the fate is the same. I have verified that I have no memory leaks using multiple tools and good-old printf-debugging.
My best theory is that lots of allocations across lots of threads are a weak case for the GC, and most of that wall-clock time is spent with my work threads suspended.
Does anyone have any experience with this? Is there a way I can help the GC get out of that 100% rut it gets stuck in, and to not run out of memory?

vc++ 64bit program crashes on a new operator when consume several GBs of memory

I have written a 64bit program which consumes a large bunk of memory. When it consumes several GB of memory, it causes an error on a new operator. But in fact there is still several GB of free memory on this machine to spend. And other program runs correctly with much more memory than this one. And I have enabled compilation option /Zm2000 and link option /LARGEADDRESSAWARE.
What is the cause then?
Even if you still have several GB free, you have to remember that memory can become fragmented. If there is no contiguous block available to satisfy your request, then the allocation will fail with an exception.

CUDA Process Memory Usage [duplicate]

When I run my CUDA program which allocates only a small amount of global memory (below 20 M), I got a "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation) I try to understand this problem, and realize I have a couple of questions related to CUDA memory management.
Is there a virtual memory concept in CUDA?
If only one kernel is allowed to run on CUDA simultaneously, after its termination, will all of the memory it used or allocated released? If not, when these memory got free released?
If more than one kernel are allowed to run on CUDA, how can they make sure the memory they use do not overlap?
Can anyone help me answer these questions? Thanks
Edit 1: operating system: x86_64 GNU/Linux
CUDA version: 4.0
Device: Geforce 200, It is one of the GPUS attached to the machine, and I don't think it is a display device.
Edit 2: The following is what I got after doing some research. Feel free to correct me.
CUDA will create one context for each host thread. This context will keep information such as what portion of memory (pre allocated memory or dynamically allocated memory) has been reserved for this application so that other application can not write to it. When this application terminates (not kernel) , this portion of memory will be released.
CUDA memory is maintained by a link list. When an application needs to allocate memory, it will go through this link list to see if there is continuous memory chunk available for allocation. If it fails to find such a chunk, a "out of memory" error will report to the users even though the total available memory size is greater than the requested memory. And that is the problem related to memory fragmentation.
cuMemGetInfo will tell you how much memory is free, but not necessarily how much memory you can allocate in a maximum allocation due to memory fragmentation.
On Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory and WDDM will manage swapping data back to main memory.
New questions:
1. If the memory reserved in the context will be fully released after the application has been terminated, memory fragmentation should not exist. There must be some kind of data left in the memory.
2. Is there any way to restructure the GPU memory ?
The device memory available to your code at runtime is basically calculated as
Free memory = total memory
- display driver reservations
- CUDA driver reservations
- CUDA context static allocations (local memory, constant memory, device code)
- CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
- CUDA context user allocations (global memory, textures)
if you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory in the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocess on the device. This can run into hundreds of Mb of memory if a local memory heavy kernel is loaded on a device with a lot of multiprocessors.
The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run you problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call that will then give you the amount of memory your context is using. That might let you get a handle of where the memory is going. It is very unlikely that fragmentation is the problem if you are getting failure on the first cudaMalloc call.
GPU off-chip memory is separated in global, local and constant memory. This three memory types are a virtual memory concept. Global memory is free for all threads, local is just for one thread only (mostly used for register spilling) and constant memory is cached global memory (writable only from host code). Have a look at 5.3.2 from the CUDA C Programming Guide.
EDIT: removed
Memory allocated via cudaMalloc does never overlap. For the memory a kernel allocates during runtime should be enough memory available. If you are out of memory and try to start a kernel (only a guess from me) you should get the "unknown error" error message. The driver than was unable to start and/or executes the kernel.

Memory of type "root-set" reallocation Error - Erlang

I have been running a crypto-intensive application that was generating pseudo-random strings, with special structure and mathematical requirements. It has generated around 1.7 million voucher numbers per node in over the last 8 days. The generation process was CPU intensive, with very low memory requirements.
Mnesia running on OTP-14B02 was the storage database and the generation was done within each virtual machine. I had 3 nodes in the cluster with all mnesia tables disc_only_copies type. Suddenly, as activity on the Solaris boxes increased (other users logged on remotely and were starting webservers, ftp sessions, and other tasks), my bash shell started reporting afork: not enough space error.
My erlang Vms also, went down with this error below:
Crash dump was written to: erl_crash.dump
temp_alloc: Cannot reallocate 8388608 bytes of memory (of type "root_set").
Usually, we get memory allocation errors and not memory re-location errors and normally memory of type "heap" is the problem. This time, the memory type reported is type "root-set".
Qn 1. What is this "root-set" memory?
Qn 2. Has it got to do with CPU intensive activities ? (why am asking this is that when i start the task, the Machine reponse to say mouse or Keyboard interrupts is too slow meaning either CPU is too busy or its some other problem i cannot explain for now)
Qn 3. Can such an error be avoided? and how ?
The fork: not enough space message suggests this is a problem with the operating system setup, but:
Q1 - The Root Set
The Root Set is what the garbage collector uses as a starting point when it searches for data that is live in the heap. It usually starts off from the registers of the VM and off from the stack, if the stack has references to heap data that still needs to be live. There may be other roots in Erlang I am not aware of, but these are the basic stuff you start off from.
That it is a reallocation error of exactly 8 Megabyte space could mean one of two things. Either you don't have 8 Megabyte free in the heap, or that the heap is fragmented beyond recognition, so while there are 8 megabytes in it, there are no contiguous such space.
Q2 - CPU activity impact
The problem has nothing to do with the CPU per se. You are running out of memory. A large root set could indicate that you have some very deep recursions going on where you keep around a lot of pointers to data. You may be able to rewrite the code such that it is tail-calling and uses less memory while operating.
You should be more worried about the slow response times from the keyboard and mouse. That could indicate something is not right. Does a vmstat 1, a sysstat, a htop, a dstat or similar show anything odd while the process is running? You are also on the hunt to figure out if the kernel or the C libc is doing something odd here due to memory being constrained.
Q3 - How to fix
I don't know how to fix it without knowing more about what the application is doing. Since you have a crash dump, your first instinct should be to take the crash dump viewer and look at the dump. The goal is to find a process using a lot of memory, or one that has a deep stack. From there on, you can seek to limit the amount of memory that process is using. either by rewriting the code so it can give memory up earlier, by tuning the garbage collection setup for the process (see the spawn options in the erlang man pages for this), or by adding more memory to the system.

Does an Application memory leak cause an Operating System memory leak?

When we say a program leaks memory, say a new without a delete in c++, does it really leak? I mean, when the program ends, is that memory still allocated to some non-running program and can't be used, or does the OS know what memory was requested by each program, and release it when the program ends? If I run that program a lot of times, will I run out of memory?
No, in all practical operating systems, when a program exits, all its resources are reclaimed by the OS. Memory leaks become a more serious issue in programs that might continue running for an extended time and/or functions that may be called often from the same program.
On operating systems with protected memory (Mac OS 10+, all Unix-clones such as Linux, and NT-based Windows systems meaning Windows 2000 and younger), the memory gets released when the program ends.
If you run any program often enough without closing it in between (running more and more instances at the same time), you will eventually run out of memory, regardless of whether there is a memory leak or not, so that's also true of programs with memory leaks. Obviously, programs leaking memory will fill the memory faster than an identical program without memory leaks, but how many times you can run it without filling the memory depends much rather on how much memory that program needs for normal operation than whether there's a memory leak or not. That comparison is really not worth anything unless you are comparing two completely identical programs, one with a memory leak and one without.
Memory leaks become the most serious when you have a program running for a very long time. Classic examples of this is server software, such as web servers. With games or spreadsheet programs or word processors, for instance, memory leaks aren't nearly as serious because you close those programs eventually, freeing up the memory. But of course memory leaks are nasty little beasts which should always be tackled as a matter of principle.
But as stated earlier, all modern operating systems release the memory when the program closes, so even with a memory leak, you won't fill up the memory if you're continuously opening and closing the program.
Leaked memory is returned by the OS after the execution has stopped.
That's why it isn't always a big problem with desktop applications, but its a big problem with servers and services (they tend to run long times.).
Lets look at the following scenario:
Program A ask memory from the OS
The OS marks the block X as been used by A and returns it to the program.
The program should have a pointer to X.
The program returns the memory.
The OS marks the block as free. Using the block now results in a access violation.
Program A ends and all memory used by A is marked unused.
Nothing wrong with that.
But if the memory is allocated in a loop and the delete is forgotten, you run into real problems:
Program A ask memory from the OS
The OS marks the block X as been used by A and returns it to the program.
The program should have a pointer to X.
Goto 1
If the OS runs out of memory, the program probably will crash.
No. Once the OS finishes closing the program, the memory comes back (given a reasonably modern OS). The problem is with long-running processes.
When the process ends, the memory gets cleared as well. The problem is that if a program leaks memory, it will requests more and more of the OS to run, and can possibly crash the OS.
It's more leaking in the sense that the code itself has no more grip on the piece of memory.
The OS can release the memory when the program ends. If a leak exists in a program then it is just an issue whilst the program is running. This is a problem for long running programs such as server processes. Or for example, if your web browser had a memory leak and you kept it running for days then it would gradually consume more memory.
As far as I know, on most OS when a program is started it receives a defined segment of memory which will be completely liberated once the program is ended.
Memory leaks are one of the main reason why garbage collector algorithms were invented since, once plugged into the runtime, they become responsible in reclaiming the memory that is no longer accessible by a program.
Memory leaks don't persist past end of execution so a "solution" to any memory leak is to simply end program execution. Obviously this is more of an issue on certain types of software. Having a database server which needs to go offline every 8 hours due to memory leaks is more of an issue than a video game which needs to be restarted after 8 hours of continual play.
The term "leak" refers to the fact that over time memory consumption will grow without any increased benefit. The "leaked" memory is memory neither used by the program nor usable by the OS (and other programs).
Sadly memory leaks are very common in unmanaged code. I have had firefox running for a couple days now and memory usage is 424MB despite only having 4 tabs open. If I closed firefox and re-opened the same tabs memory usage would likely be <100MB. Thus 300+ MB has "leaked".

Resources