Understanding output of GHC's +RTS -t -RTS option

Understanding output of GHC's +RTS -t -RTS option - haskell

I'm benchmarking the memory consumption of a haskell programm compiled with GHC. In order to do so, I run the programm with the following command line arguments: +RTS -t -RTS. Here's an example output:
<<ghc: 86319295256 bytes, 160722 GCs, 53963869/75978648 avg/max bytes residency (386 samples), 191M in use, 0.00 INIT (0.00 elapsed), 152.69 MUT (152.62 elapsed), 58.85 GC (58.82 elapsed) :ghc>>.
According to the ghc manual, the output shows:
The total number of bytes allocated by the program over the whole run.
The total number of garbage collections performed.
The average and maximum "residency", which is the amount of live data in bytes. The runtime can only determine the amount of live data during a major GC, which is why the number of samples corresponds to the number of major GCs (and is usually relatively small).
The peak memory the RTS has allocated from the OS.
The amount of CPU time and elapsed wall clock time while initialising the runtime system (INIT), running the program itself (MUT, the mutator), and garbage collecting (GC).
Applied to my example, it means that my program shuffles 82321 MiB (bytes divided by 1024^2) around, performs 160722 garbage collections, has a 51MiB/72MiB average/maximum memory residency, allocates at most 191M memory in RAM and so on ...
Now I want to know, what »The average and maximum "residency", which is the amount of live data in bytes« is compared to »The peak memory the RTS has allocated from the OS«? And also: What uses the remaining space of roughly 120M?
I was pointed here for more information, but that does not state clearly, what I want to know. Another source (5.4.4 second item) hints that the 120M memory is used for garbage collection. But that is too vague – I need a quotable information source.
So please, is there anyone who could answer my questions with good sources as proofs?
Kind regards!

The "resident" size is how much live Haskell data you have. The amount of memory actually allocated from the OS may be higher.
The RTS allocates memory in "blocks". If your program needs 7.3 blocks of of RAM, the RTS has to allocate 8 blocks, 0.7 of which is empty space.
The default garbage collection algorithm is a 2-space collector. That is, when space A fills up, it allocates space B (which is totally empty) and copies all the live data out of space A and into space B, then deallocates space A. That means that, for a while, you're using 2x as much RAM as is actually necessary. (I believe there's a switch somewhere to use a 1-space algorithm which is slower but uses less RAM.)
There is also some overhead for managing threads (especially if you have lots), and there might be a few other things.
I don't know how much you already know about GC technology, but you can try reading these:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
http://www.mm-net.org.uk/workshop190404/GHC%27s_Garbage_Collector.ppt

Related

Peak heap memory usage of an OCaml program

I would like to compute the peak memory usage of my OCaml program when it is running in compiled form as native code. I considered using the stats API in the Gc module, but it seems to return a snapshot at the time it is called. Is there some information in the Gc module or some other module that I can use to get peak heap usage just before my program terminates?

You can get the current size of the major heap using Gc.stat, see the live_words field, multiply it by the word size in bytes to get the size in bytes (8 in a 64 bit system). It doesn't really matter but you can also add the size of the minor heap to the calculation, which is available via Gc.get () see the minor_heap_size field (again in words).
You can create an alarm with Gc.create_alarm to check the size of the heap after each major collection to get the maximum size ever used.

How to automate garbage Collection execution using java melody?

Im using java melody to monitor memory usage in production environment.
The requirement is memory not should exceed 256MB/512MB .
I have done maximum of code optimized but still the usage is 448MB/512MB but when i executed garbage collector in java melody manually the memory consumption is 109MB/512MB.

You can invoke the garbage collection using one of these two calls in your code (they're equivalent):
System.gc();
Runtime.getRuntime().gc();
It's better if you place the call in a Runnable that gets invoked periodically, depending on how fast your memory limit is reached.

Why you actually care about heap usage? as long as you set XMS (maximum heap) you are fine. Let java invoke GC when it seems fit. As long as you have free heap it is no point doing GC and freeing heap just for sake of having a lot of free heap.
If you want to limit memory allocated by process XMX is not enough. You should also limit native memory.
What you should care about is
Memory leaks, consecutive Full GCs, GC starvation
GC KPIs: Latency, throughput, footprint
Object creation rate, promotion rate, reclamation rate…
GC Pause time statistics: Duration distribution, average, count, average interval, min/max, standard deviation
GC Causes statistics: Duration, Percentage, min/max, total
GC phases related statistics: Each GC algorithm has several sub-phases. Example for G1: initial-mark, remark, young, full, concurrent mark, mixed
See https://blog.gceasy.io/2017/05/30/improving-your-performance-reports/ https://blog.gceasy.io/2017/05/31/gc-log-analysis-use-cases/ for more technical details. You could also analyze your GC logs using https://blog.gceasy.io/ it will help you understand how your JVM is using memory.

why is kdb/q showing a big difference between used and heap space after GC

As per this page, for versions 2.6 and 2.7 (http://www.timestored.com/kdb-guides/memory-management)
2.6 Unreferenced memory blocks over 32MB/64MB are returned immediately
2.7 Unreferenced memory blocks returned when memory full or .Q.gc[] called
But in both versions, there is a significant difference between used and heap space shown by .Q.w[]. This difference only grows as I run the function again. In 2.6, a difference could occur due to fragmentation (allocating many small objects) but I am not confident it accounts for this big of a difference. In 2.7, even after running .Q.gc[], it shows a significant difference. I would like to understand fundamentally the reason for this difference in the two versions as highlighted below.
This is behavior I am seeing in 2.6 and 2.7:
2.6:
used| 11442889952
heap| 28588376064
2.7 (after running .Q.gc[])
used| 11398025856
heap| 16508780544

Automatic Garbage collection doesn't clear small objects (<32MB) . In that case manual GC call is required. If your process has lot of unreferenced small objects then that will add up in heap size and not to used size.
Second, since KDB allocates memory in power of 2, that makes the difference between used and heap memory. For ex. if a vector requires 64000 bytes, it will be assigned a memory block of size 2^16 = 65536 bytes. And boundary cases makes this difference huge, for ex. if vector requires 33000 bytes (just over 2^15) it will be allocated 65536 bytes (2^16).
Following site has good explanation of GC behavior:
http://www.aquaq.co.uk/q/garbage-collection-kdb/

How much memory did Linux give to malloc()?

This is a Linux system question, not a coding question. When I use "top" to check the memory usage of my program, it reports a value 3-4 times as large as the actual heap allocation as given by Valgrind's Massif, a memory profiler. It's a large program, and the difference is hundreds of megabytes. The Valgrind manual gives only a partial explanation:
(Massif) does not directly measure memory allocated with
lower-level system calls such as mmap, mremap, and brk.
Heap allocation functions such as malloc are built on top of these
system calls. For example, when needed, an allocator will typically
call mmap to allocate a large chunk of memory, and then hand over
pieces of that memory chunk to the client program in response to calls
to malloc et al. Massif directly measures only these higher-level
malloc et al calls, not the lower-level system calls.
Fine, but how much memory am I really taking away from the system? I need to be able to run as many instances of this program as possible on one machine, so I need to know how much of that memory is still available. Page alignment etc. cannot explain a difference of hundreds of megabytes in reported memory usage.
Also, what determines the block size of the underlying mmap() call? I'm seeing blocks of 64MB at a time being taken according to top, which seems bizarrely large.

Any malloc implementation will be optimised for applications with huge memory requirements, because apps with low requirements run just fine anyway, and virtual memory is cheap.
For example, you will find malloc implementations that use a block of memory for up to 1024 mallocs of up to 16 bytes, another block for up to 1024 mallocs of up to 32 bytes, and so on. With a few mallocs this is inefficient but still cheap. With gazillions of mallocs, it makes malloc very efficient.
So saying "4 times as much" can be completely pointless. Tell us how many megabytes more than you thought.

How much extra memory does garbage collection require?

I heard once that for a language to implement and run garbage collection correctly there is on average of 3x more memory required. I am not sure if this is assuming the application is small, large or either.
So i wanted to know if theres any research or actually numbers of garbage collection overhead. Also i want to say GC is a very nice feature.

The amount of memory headroom you need depends on the allocation rate within your program. If you have a high allocation rate, you need more room for growth while the GC works.
The other factor is object lifetime. If your objects typically have a very short lifetime, then you may be able to manage with slightly less headroom with a generational collector.
There are plenty of research papers that may interest you. I'll edit a bit later to reference some.
Edit (January 2011):
I was thinking of a specific paper that I can't seem to find right now. The ones below are interesting and contain some relevant performance data. As a rule of thumb, you are usually ok with about twice as much memory available as your program residency. Some programs need more, but other programs will perform very well even in constrained environments. There are lots of variables that influence this, but allocation rate is the most important one.
Immix: a mark-region garbage collector with space efficiency, fast collection, and mutator performance
Myths and realities: the performance impact of garbage collection
Edit (February 2013): This edit adds a balanced perspective on a paper cited, and also addresses objections raised by Tim Cooper.
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management, as noted by Natan Yellin, is actually the reference I was first trying to remember back in January 2011. However, I don't think the interpretation Natan has offered is correct. That study does not compare GC against conventional manual memory management. Rather it compares GC against an oracle which does perfect explicit releases. In otherwords, it leaves us not know how well conventional manual memory management compares to the magic oracle. It is also very hard to find this out because the source programs are either written with GC in mind, or with manual memory management in mind. So any benchmark retains in inherent bias.
Following Tim Cooper's objections, I'd like to clarify my position on the topic of memory headroom. I do this mainly for posterity, as I believe Stack Overflow answers should serve as a long-term resource for many people.
There are many memory regions in a typical GC system, but three abstract kinds are:
Allocated space (contains live, dead, and untraced objects)
Reserved space (from which new objects are allocated)
Working region (long-term and short-term GC data structures)
What is headroom anyway? Headroom is the minimum amount of reserved space needed to maintain a desired level of performance. I believe that is what the OP was asking about. You can also think of the headroom as memory additional to the actual program residency (maximum live memory) neccessary for good performance.
Yes -- increasing the headroom can delay garbage collection and increase throughput. That is important for offline non-critical operations.
In reality most problem domains require a realtime solution. There are two kinds of realtime, and they are very different:
hard-realtime concerns worst case delay (for mission critical systems) -- a late response from the allocator is an error.
soft-realtime concerns either average or median delay -- a late response from the allocator is ok, but shouldn't happen often.
Most state of the art garbage collectors aim for soft-realtime, which is good for desktop applications as well as for servers that deliver services on demand. If one eliminates realtime as a requirement, one might as well use a stop-the-world garbage collector in which headroom begins to lose meaning. (Note: applications with predominantly short-lived objects and a high allocation rate may be an exception, because the survival rate is low.)
Now suppose that we are writing an application that has soft-realtime requirements. For simplicity let's suppose that the GC runs concurrently on a dedicated processor. Suppose the program has the following artificial properties:
mean residency: 1000 KB
reserved headroom: 100 KB
GC cycle duration: 1000 ms
And:
allocation rate A: 100 KB/s
allocation rate B: 200 KB/s
Now we might see the following timeline of events with allocation rate A:
T+0000 ms: GC cycle starts, 100 KB available for allocations, 1000 KB already allocation
T+1000 ms:
0 KB free in reserved space, 1100 KB allocated
GC cycle ends, 100 KB released
100 KB free in reserve, 1000 KB allocated
T+2000 ms: same as above
The timeline of events with allocation rate B is different:
T+0000 ms: GC cycle starts, 100 KB available for allocations, 1000 KB already allocation
T+0500 ms:
0 KB free in reserved space, 1100 KB allocated
either
delay until end of GC cycle (bad, but sometimes mandatory), or
increase reserved size to 200 KB, with 100 KB free (assumed here)
T+1000 ms:
0 KB free in reserved space, 1200 KB allocated
GC cycle ends, 200 KB released
200 KB free in reserve, 1000 KB allocated
T+2000 ms:
0 KB free in reserved space, 1200 KB allocated
GC cycle ends, 200 KB released
200 KB free in reserve, 1000 KB allocated
Notice how the allocation rate directly impacts the size of the headroom required? With allocation rate B, we require twice the headroom to prevent pauses and maintain the same level of performance.
This was a very simplified example designed to illustrate only one idea. There are plenty of other factors, but it does show what was intended. Keep in mind the other major factor I mentioned: average object lifetime. Short lifetimes cause low survival rates, which work together with the allocation rate to influence the amount of memory required to maintain a given level of performance.
In short, one cannot make general claims about the headroom required without knowing and understanding the characteristics of the application.

According to the 2005 study Quantifying the Performance of Garbage Collection vs. Explicit Memory Management (PDF), generational garbage collectors need 5 times the memory to achieve equal performance. The emphasis below is mine:
We compare explicit memory management to both copying and non-copying garbage collectors across a range of benchmarks, and include real (non-simulated) runs that validate our results. These results quantify the time-space tradeoff of garbage collection: with five times as much memory, an Appel-style generational garbage collector with a non-copying mature space matches the performance of explicit memory management. With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection
degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.

I hope the original author clearly marked what they regard as correct usage of garbage collection and the context of their claim.
The overhead certainly depends on many factors; e.g., the overhead is larger if you run your garbage collector less frequently; a copying garbage collector has a higher overhead than a mark and sweep collector; and it is much easier to write a garbage collector with lower overhead in a single-threaded application than in the multi-threaded world, especially for anything that moves objects around (copying and/or compacting gc).

So i wanted to know if theres any research or actually numbers of garbage collection overhead.
Almost 10 years ago I studied two equivalent programs I had written in C++ using the STL (GCC on Linux) and in OCaml using its garbage collector. I found that the C++ used 2x more memory on average. I tried to improve it by writing custom STL allocators but was never able to match the memory footprint of the OCaml.
Furthermore, GCs typically do a lot of compaction which further reduces the memory footprint. So I would challenge the assumption that there is a memory overhead compared to typical unmanaged code (e.g. C++ using what are now the standard library collections).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string