I heard once that for a language to implement and run garbage collection correctly, on average 3x more memory is required. I am not sure if this assumes the application is small, large, or either.
So I wanted to know if there is any research or actual numbers on garbage collection overhead. Also, I want to say GC is a very nice feature.
The amount of memory headroom you need depends on the allocation rate within your program. If you have a high allocation rate, you need more room for growth while the GC works.
The other factor is object lifetime. If your objects typically have a very short lifetime, then you may be able to manage with slightly less headroom with a generational collector.
There are plenty of research papers that may interest you. I'll edit a bit later to reference some.
Edit (January 2011):
I was thinking of a specific paper that I can't seem to find right now. The ones below are interesting and contain some relevant performance data. As a rule of thumb, you are usually ok with about twice as much memory available as your program residency. Some programs need more, but other programs will perform very well even in constrained environments. There are lots of variables that influence this, but allocation rate is the most important one.
Immix: a mark-region garbage collector with space efficiency, fast collection, and mutator performance
Myths and realities: the performance impact of garbage collection
Edit (February 2013): This edit adds a balanced perspective on a paper cited, and also addresses objections raised by Tim Cooper.
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management, as noted by Natan Yellin, is actually the reference I was first trying to remember back in January 2011. However, I don't think the interpretation Natan has offered is correct. That study does not compare GC against conventional manual memory management. Rather, it compares GC against an oracle which performs perfect explicit releases. In other words, it leaves us not knowing how well conventional manual memory management compares to the magic oracle. It is also very hard to find this out because the source programs are written either with GC in mind or with manual memory management in mind, so any benchmark retains an inherent bias.
Following Tim Cooper's objections, I'd like to clarify my position on the topic of memory headroom. I do this mainly for posterity, as I believe Stack Overflow answers should serve as a long-term resource for many people.
There are many memory regions in a typical GC system, but three abstract kinds are:
Allocated space (contains live, dead, and untraced objects)
Reserved space (from which new objects are allocated)
Working region (long-term and short-term GC data structures)
What is headroom anyway? Headroom is the minimum amount of reserved space needed to maintain a desired level of performance. I believe that is what the OP was asking about. You can also think of the headroom as the memory, additional to the actual program residency (maximum live memory), necessary for good performance.
Yes -- increasing the headroom can delay garbage collection and increase throughput. That is important for offline non-critical operations.
In reality most problem domains require a realtime solution. There are two kinds of realtime, and they are very different:
hard-realtime concerns worst case delay (for mission critical systems) -- a late response from the allocator is an error.
soft-realtime concerns either average or median delay -- a late response from the allocator is ok, but shouldn't happen often.
Most state of the art garbage collectors aim for soft-realtime, which is good for desktop applications as well as for servers that deliver services on demand. If one eliminates realtime as a requirement, one might as well use a stop-the-world garbage collector in which headroom begins to lose meaning. (Note: applications with predominantly short-lived objects and a high allocation rate may be an exception, because the survival rate is low.)
Now suppose that we are writing an application that has soft-realtime requirements. For simplicity let's suppose that the GC runs concurrently on a dedicated processor. Suppose the program has the following artificial properties:
mean residency: 1000 KB
reserved headroom: 100 KB
GC cycle duration: 1000 ms
And:
allocation rate A: 100 KB/s
allocation rate B: 200 KB/s
Now we might see the following timeline of events with allocation rate A:
T+0000 ms: GC cycle starts, 100 KB available for allocations, 1000 KB already allocated
T+1000 ms:
0 KB free in reserved space, 1100 KB allocated
GC cycle ends, 100 KB released
100 KB free in reserve, 1000 KB allocated
T+2000 ms: same as above
The timeline of events with allocation rate B is different:
T+0000 ms: GC cycle starts, 100 KB available for allocations, 1000 KB already allocated
T+0500 ms:
0 KB free in reserved space, 1100 KB allocated
either
delay until end of GC cycle (bad, but sometimes mandatory), or
increase reserved size to 200 KB, with 100 KB free (assumed here)
T+1000 ms:
0 KB free in reserved space, 1200 KB allocated
GC cycle ends, 200 KB released
200 KB free in reserve, 1000 KB allocated
T+2000 ms:
0 KB free in reserved space, 1200 KB allocated
GC cycle ends, 200 KB released
200 KB free in reserve, 1000 KB allocated
Notice how the allocation rate directly impacts the size of the headroom required? With allocation rate B, we require twice the headroom to prevent pauses and maintain the same level of performance.
This was a very simplified example designed to illustrate only one idea. There are plenty of other factors, but it does show what was intended. Keep in mind the other major factor I mentioned: average object lifetime. Short lifetimes cause low survival rates, which work together with the allocation rate to influence the amount of memory required to maintain a given level of performance.
In short, one cannot make general claims about the headroom required without knowing and understanding the characteristics of the application.
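To make the arithmetic behind the timelines above explicit, here is a minimal sketch (in Go, purely for illustration; the numbers are the made-up ones from the example): with a concurrent collector, the headroom needed to avoid pauses is roughly the allocation rate multiplied by the GC cycle duration.
```go
package main

import "fmt"

func main() {
	// Made-up numbers taken from the example above.
	const gcCycleSeconds = 1.0 // duration of one concurrent GC cycle

	// To a first approximation, the mutator never has to wait for the
	// collector if reserved headroom >= allocation rate * GC cycle duration.
	for _, rateKBps := range []float64{100, 200} {
		headroomKB := rateKBps * gcCycleSeconds
		fmt.Printf("allocation rate %.0f KB/s -> headroom of at least %.0f KB\n",
			rateKBps, headroomKB)
	}
}
```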
According to the 2005 study Quantifying the Performance of Garbage Collection vs. Explicit Memory Management (PDF), generational garbage collectors need 5 times the memory to achieve equal performance. The emphasis below is mine:
We compare explicit memory management to both copying and non-copying garbage collectors across a range of benchmarks, and include real (non-simulated) runs that validate our results. These results quantify the time-space tradeoff of garbage collection: with five times as much memory, an Appel-style generational garbage collector with a non-copying mature space matches the performance of explicit memory management. With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.
I hope the original author clearly marked what they regard as correct usage of garbage collection and the context of their claim.
The overhead certainly depends on many factors; e.g., the overhead is larger if you run your garbage collector less frequently; a copying garbage collector has a higher overhead than a mark and sweep collector; and it is much easier to write a garbage collector with lower overhead in a single-threaded application than in the multi-threaded world, especially for anything that moves objects around (copying and/or compacting gc).
So I wanted to know if there is any research or actual numbers on garbage collection overhead.
Almost 10 years ago I studied two equivalent programs I had written in C++ using the STL (GCC on Linux) and in OCaml using its garbage collector. I found that the C++ version used 2x more memory on average. I tried to improve it by writing custom STL allocators but was never able to match the memory footprint of the OCaml version.
Furthermore, GCs typically do a lot of compaction which further reduces the memory footprint. So I would challenge the assumption that there is a memory overhead compared to typical unmanaged code (e.g. C++ using what are now the standard library collections).
I have been trying to profile the heap usage of a CLI tool built with cobra.
The pprof tool shows the following:
Flat Flat% Sum% Cum Cum% Name Inlined?
1.58GB 49.98% 49.98% 1.58GB 49.98% os.ReadFile
1.58GB 49.98% 99.95% 1.58GB 50.02% github.com/bytedance/sonic.(*frozenConfig).Unmarshal
0 0.00% 99.95% 3.16GB 100.00% runtime.main
0 0.00% 99.95% 3.16GB 100.00% main.main
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).execute
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).ExecuteC
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).Execute (inline)
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/misc.ParseUcpNodesInspect
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.glob..func3
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.getInfos
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.Execute
0 0.00% 99.95% 1.58GB 50.02% github.com/bytedance/sonic.Unmarshal
But at the end, ps shows that it consumes almost 6752.23 MB (RSS).
Also, I am putting defer profile.Start(profile.MemProfileHeap).Stop() in the last function that gets executed. Putting the profiler in func main doesn't show anything. So I traced through the functions and found the considerable memory usage in the last one.
My question is, how do I find the missing ~3 GB of memory?
There are multiple problems (with your question):
ps (and top etc.) show multiple memory readings. The only one of interest here is typically called RES or RSS. You didn't say which one you were looking at.
Basically, looking at the reading typically named VIRT is not interesting.
As Volker said, pprof does not measure memory consumption, it measures (in the mode you have run it) memory allocation rate—in the sense of "how much", not "how frequently".
To understand what it means, consider how pprof works.
During profiling, the heap profiler samples allocations: roughly speaking, every time the program has allocated some amount of memory (a rate controlled by runtime.MemProfileRate), the profiler records the call stack that performed the allocation, and the allocated bytes are attributed to the functions in that stack.
This means that if your process calls, say, os.ReadFile, which, by its contract, allocates a slice of bytes long enough to contain the whole contents of the file to be read, 100 times to read a 1 GiB file each time, and the profiler happens to sample each of these 100 calls (it can miss some of them, as it is sampling), os.ReadFile will be attributed as having allocated 100 GiB.
But if your program is not written in such a way that it holds each of the slices returned by these calls, but rather does something with those slices and throws them away after processing, the slices from the past calls will likely be already collected by the GC by the time the newer ones are allocated.
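To see both views of the same profile (cumulative allocations vs. memory still in use), you can write a heap profile yourself and switch the sample index when inspecting it. A minimal sketch using only the standard runtime/pprof package (the output file name is arbitrary):
```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// ... do the actual work of your program here ...

	f, err := os.Create("heap.pprof") // arbitrary output file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	runtime.GC() // run a collection so the "in use" numbers are up to date
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}

	// Inspect with:
	//   go tool pprof -sample_index=alloc_space heap.pprof   (everything ever allocated)
	//   go tool pprof -sample_index=inuse_space heap.pprof   (what was live when the profile was taken)
}
```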
While not required by the spec, the two "standard" contemporary implementations of Go, the one originally dubbed "gc" (which most people think of as the implementation) and gccgo, the GCC frontend, feature a garbage collector that runs concurrently with your own code. The moments at which it actually collects the garbage produced by your process are governed by a set of complicated heuristics (start here if interested) which try to balance spending CPU time on GC against spending RAM on not doing it ;-). This means that for short-lived processes the GC might not kick in even a single time, so your process will end with all the generated garbage still floating around, and all that memory will be reclaimed by the OS in the usual way when the process exits.
When the GC collects garbage, the freed memory is not returned to the OS immediately. Instead, a two-stage process is involved:
First, the freed regions are returned to the memory manager which is part of the Go runtime powering your running program.
This is sensible because in a typical program memory churn is high enough that freed memory will likely be allocated again soon.
Second, memory pages that stay free long enough are marked to let the OS know it can use them for its own needs.
Basically this means that even after the GC frees some memory, you won't see it from outside the running Go process, as this memory is first returned to the process' own pool.
Different versions of Go (again, I mean the "gc" implementation) have implemented different policies for returning freed pages to the OS: first they were marked with madvise(2) as MADV_DONTNEED, then as MADV_FREE, and then again as MADV_DONTNEED.
If you happen to use a version of Go whose runtime marks freed memory as MADV_FREE, the readings of RSS make even less sense, because memory marked that way still counts against the process' RSS even though the OS has been hinted that it can reclaim that memory when needed.
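You can watch this pool from inside the process. A minimal sketch using runtime.ReadMemStats (the field meanings in the comments are paraphrased from the runtime package documentation):
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// HeapAlloc:    bytes of heap objects that are still allocated.
	// HeapIdle:     heap memory held by the runtime but not currently used for objects.
	// HeapReleased: the part of HeapIdle that has already been returned to the OS
	//               (via madvise), even though it may still show up in RSS.
	// Sys:          total bytes obtained from the OS, roughly what OS-level
	//               tools account against the process.
	fmt.Printf("HeapAlloc    = %8d KiB\n", ms.HeapAlloc/1024)
	fmt.Printf("HeapIdle     = %8d KiB\n", ms.HeapIdle/1024)
	fmt.Printf("HeapReleased = %8d KiB\n", ms.HeapReleased/1024)
	fmt.Printf("Sys          = %8d KiB\n", ms.Sys/1024)
}
```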
To recap.
This topic is complex enough and you appear to be drawing certain conclusions too fast ;-)
An update.
I've decided to expand on memory management a bit because I feel like certain bits and pieces may be missing from the big picture of this stuff in your head, and because of this you might find the comments to your question to be moot and dismissive.
The reasoning behind the advice not to measure the memory consumption of programs written in Go using ps, top and friends is rooted in the fact that the memory management implemented in the runtime environments powering programs written in contemporary high-level programming languages is quite far removed from the down-to-the-metal memory management implemented in the OS kernels and the hardware they run on.
Let's consider Linux to have concrete tangible examples.
You certainly can ask the kernel directly to allocate memory for you: mmap(2) is a syscall which does that.
If you call it with MAP_PRIVATE (and usually also with MAP_ANONYMOUS), the kernel will make sure the page table of your process has one or more new entries for as many pages of memory to contain the contiguous region of as many bytes as you have requested, and return the address of the first page in the sequence.
At this point you might think that the RSS of your process has grown by that number of bytes, but it has not: the memory was "reserved" but not actually allocated. For a memory page to really get allocated, the process has to "touch" any byte within the page, by reading or writing it: this generates the so-called "page fault" on the CPU, and the in-kernel handler asks the hardware to actually allocate a real "hardware" memory page. Only after that does the page actually count against the process' RSS.
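If you want to observe this yourself, here is a minimal Linux-only sketch that maps an anonymous region and only later touches its pages; watching the process with ps or /proc/<pid>/status from another terminal should show RSS growing only after the touch loop (the region size and sleep durations are arbitrary):
```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

func main() {
	const size = 64 << 20 // 64 MiB, arbitrary

	// Ask the kernel for an anonymous private mapping: this only creates
	// page-table entries, it does not yet back them with real memory.
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
	if err != nil {
		fmt.Fprintln(os.Stderr, "mmap:", err)
		os.Exit(1)
	}
	fmt.Printf("pid %d: mapped 64 MiB; RSS should be nearly unchanged\n", os.Getpid())
	time.Sleep(15 * time.Second)

	// Touch one byte in every 4 KiB page: each first touch causes a page
	// fault, after which the page counts against RSS.
	for i := 0; i < size; i += 4096 {
		mem[i] = 1
	}
	fmt.Println("touched every page; RSS should now be ~64 MiB larger")
	time.Sleep(15 * time.Second)
}
```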
OK, that's fun, but you probably can see a problem: it's not too convenient to operate on complete pages (which can be of different sizes on different systems; typically it's 4 KiB on systems of the x86 lineage): when you program in a high-level language, you don't think about memory at such a low level; instead, you expect the running program to somehow materialize "objects" (I do not mean OOP here; just pieces of memory containing values of some language- or user-defined types) as you need them.
These objects may be of any size, most of the time way smaller than a single memory page, and, what is more important, most of the time you do not even think about how much space these objects consume when allocated.
Even when programming in a language like C, which these days is considered to be quite low-level, you're usually accustomed to using memory management functions in the malloc(3) family provided by the standard C library, which allow you to allocate regions of memory of arbitrary size.
A way to solve this sort of problem is to have a higher-level memory manager on top of what the kernel can do for your program, and the fact is, every single general-purpose program written in a high-level language (even C and C++!) uses one: for interpreted languages (such as Perl, Tcl, Python, POSIX shell etc.) it is provided by the interpreter; for byte-compiled languages such as Java, it is provided by the process which executes that code (such as the JRE for Java); for languages which compile down to machine (CPU) code, such as the "stock" implementation of Go, it is provided by the "runtime" code included in the resulting executable image file or linked into the program dynamically when it is being loaded into memory for execution.
Such memory managers are usually quite complicated as they have to deal with many complex problems such as memory fragmentation, and they usually have to avoid talking to the kernel as much as possible because syscalls are slow.
The latter requirement naturally means process-level memory managers try to cache the memory they have once taken from the kernel, and are reluctant to release it back.
All this means that, say, in a typical active Go program you might have crazy memory churn — hordes of small objects being allocated and deallocated all the time which has next to no effect on the values of RSS monitored "from the outside" of the process: all this churn is handled by the in-process memory manager and—as in the case of the stock Go implementation—the GC which is naturally tightly integrated with the MM.
Because of that, to get a useful, actionable idea of what is happening in a long-running production-grade Go program, such a program usually provides a set of continuously updated metrics (delivering, collecting and monitoring them is called telemetry). For Go programs, the part of the program tasked with producing these metrics can either make periodic calls to runtime.ReadMemStats and runtime/debug.ReadGCStats or directly use what the runtime/metrics package has to offer. Looking at such metrics in a monitoring system such as Zabbix, Grafana etc. is quite instructive: you can literally see how the amount of free memory available to the in-process MM increases after each GC cycle while the RSS stays roughly the same.
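A minimal sketch of such a metrics producer, assuming the runtime/metrics package (Go 1.16+); the three metric names below are taken from that package's documented set, and the 10-second period is arbitrary:
```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func main() {
	// Metric names as documented by the runtime/metrics package.
	samples := []metrics.Sample{
		{Name: "/memory/classes/heap/objects:bytes"}, // live (and not yet swept) objects
		{Name: "/memory/classes/heap/free:bytes"},    // free space kept by the runtime
		{Name: "/gc/cycles/total:gc-cycles"},         // completed GC cycles
	}

	for range time.Tick(10 * time.Second) {
		metrics.Read(samples)
		fmt.Printf("heap objects: %d B, heap free: %d B, GC cycles: %d\n",
			samples[0].Value.Uint64(),
			samples[1].Value.Uint64(),
			samples[2].Value.Uint64())
		// A real program would push these values to its telemetry system
		// instead of printing them.
	}
}
```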
Also note that you might consider running your Go program with various GC-related debugging settings in a special environment variable GODEBUG described here: basically, you make the Go runtime powering your running program emit detailed information on how the GC is working (also see this).
Hope this makes you curious enough to explore these matters further ;-)
You might find this to be a good introduction on memory management implemented by the Go runtime—in connection with the kernel and the hardware; recommended read.
When I read the book The Garbage Collection Handbook, chapter 9 implies that "object lifetimes are better measured by the number of bytes of heap space allocated between their birth and death." I don't quite understand this sentence. Why can a lifetime be measured by the allocated bytes? I tried to Google for that, but I got no answer.
Can anyone explain that to me? Thanks!
By measuring object lifetimes in terms of bytes allocated between instantiation and death, it is easier for the GC algorithm to adapt to program behaviour.
If the rate of object allocation is very slow, a simple time measurement would show long pauses between collections, which would appear to be good. However, if the byte allocation measurement of object lifetimes is high, objects may be getting promoted to a survivor space or the old generation too quickly. By measuring the byte allocation the collector could optimise heap sizes more efficiently by expanding the young generation to increase the number of objects that become garbage before a minor collection occurs. Just using time as this measure would not make the need for the heap resizing obvious.
As the book points out, with multi-threaded applications it is hard to measure byte allocation for individual threads so collectors tend to measure lifetimes in terms of how many collections an object survives. This is a simpler number to monitor and requires less space to record.
"Time" is only a scale that allows us to bring an order to events. There are many possible units, even in the real world. Inside the computer, for the purpose of garbage collection, no real-world time unit is needed; all the garbage collector usually wants to know is which object is older than the other.
For this purpose, just assigning an ascending number to each allocated object would be sufficient, but this would imply maintaining an additional counter. In contrast, the number of allocated bytes comes for free. It’s important that we accumulate the allocated bytes only, never subtracting deallocated bytes, so we have an always growing number.
In generational memory management, this number doesn't need to be updated on every allocation, as objects are allocated contiguously in a dedicated space, so their addresses represent their relative age within this memory region, whereas the start of the region is associated with the last garbage collection. Only when the garbage collector runs and moves the surviving objects does it have to merge this information into an absolute age, if needed.
Implementations like the HotSpot JVM simplify this further. For surviving objects, it maintains a small counter holding the number of garbage collection cycles they have survived. After surviving a configurable number of collection cycles, an object gets promoted to the old generation, and beyond that point its age becomes irrelevant.
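To make the idea concrete, here is a toy sketch (in Go, purely for illustration; not how any real collector is written) of a "clock" that is simply the total number of bytes ever allocated; an object's age is then the number of bytes allocated since its birth:
```go
package main

import "fmt"

// object remembers how many bytes had been allocated when it was created.
type object struct {
	birth uint64
	size  uint64
}

// heap keeps a monotonically growing allocation counter; it never goes
// down, so it orders allocation events like a clock.
type heap struct {
	totalAllocated uint64
}

func (h *heap) alloc(size uint64) *object {
	o := &object{birth: h.totalAllocated, size: size}
	h.totalAllocated += size
	return o
}

// ageInBytes measures the object's lifetime so far in bytes of heap space
// allocated since its birth, rather than in wall-clock time.
func (h *heap) ageInBytes(o *object) uint64 {
	return h.totalAllocated - o.birth
}

func main() {
	h := &heap{}
	a := h.alloc(64)
	for i := 0; i < 1000; i++ {
		h.alloc(32) // other allocations "advance the clock"
	}
	fmt.Println("age of a, in allocated bytes:", h.ageInBytes(a))
}
```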
I'm using Java Melody to monitor memory usage in a production environment.
The requirement is that memory should not exceed 256MB/512MB.
I have optimized the code as much as I can, but the usage is still 448MB/512MB. However, when I execute the garbage collector manually in Java Melody, the memory consumption drops to 109MB/512MB.
You can invoke the garbage collection using one of these two calls in your code (they're equivalent):
System.gc();
Runtime.getRuntime().gc();
It's better if you place the call in a Runnable that gets invoked periodically, depending on how fast your memory limit is reached.
Why do you actually care about heap usage? As long as you set -Xmx (maximum heap) you are fine. Let Java invoke the GC when it sees fit. As long as you have free heap, there is no point in doing a GC and freeing heap just for the sake of having a lot of free heap.
If you want to limit the memory allocated by the process, -Xmx is not enough. You should also limit native memory.
What you should care about is
Memory leaks, consecutive Full GCs, GC starvation
GC KPIs: Latency, throughput, footprint
Object creation rate, promotion rate, reclamation rate…
GC Pause time statistics: Duration distribution, average, count, average interval, min/max, standard deviation
GC Causes statistics: Duration, Percentage, min/max, total
GC phases related statistics: Each GC algorithm has several sub-phases. Example for G1: initial-mark, remark, young, full, concurrent mark, mixed
See https://blog.gceasy.io/2017/05/30/improving-your-performance-reports/ https://blog.gceasy.io/2017/05/31/gc-log-analysis-use-cases/ for more technical details. You could also analyze your GC logs using https://blog.gceasy.io/ it will help you understand how your JVM is using memory.
Context: 64 bit Oracle Java SE 1.8.0_20-b26
For over 11 hours, my running Java 8 app has been accumulating objects in the Tenured generation (close to 25%). So, I manually clicked on the Perform GC button in jconsole and you can see the precipitous drop in heap memory on the right of the chart. I don't have any special VM options turned on except for -XX:NewRatio=2.
Why does the GC not clean up the tenured generation ?
This is a fully expected and desirable behavior. The JVM has been successfully avoiding a Major GC by performing timely Minor GC's all along. A Minor GC, by definition, does not touch the Tenured Generation, and the key idea behind generational garbage collectors is that precisely this pattern will emerge.
You should be very satisfied with how your application is humming along.
The throughput collector's primary goal is, as its name says, throughput (via GCTimeRatio). Its secondary goal is pause times (MaxGCPauseMillis). Only as a tertiary goal does it consider keeping the memory footprint low.
If you want to achieve a low heap size you will have to relax the other two goals.
You may also want to lower MaxHeapFreeRatio to allow the JVM to yield back memory to the OS.
Why does the GC not clean up the tenured generation ?
Because it doesn't need to.
It looks like your application is accumulating tenured garbage at a relatively slow rate, and there was still plenty of space for tenured objects. The "throughput" collector generally only runs when a space fills up. That is the most efficient in terms of CPU usage ... which is what the throughput collector optimizes for.
In short, the GC is working as intended.
If you are concerned by the amount of memory that is being used (because the tenured space is not being collected), you could try running the application with a smaller heap. However, the graph indicates that the application's initial behavior may be significantly different to its steady-state behavior. In other words, your application may require a large heap to start with. If that is the case, then reducing the heap size could stop the application working, or at least make the startup phase a lot slower.
I'm benchmarking the memory consumption of a Haskell program compiled with GHC. In order to do so, I run the program with the following command line arguments: +RTS -t -RTS. Here's an example output:
<<ghc: 86319295256 bytes, 160722 GCs, 53963869/75978648 avg/max bytes residency (386 samples), 191M in use, 0.00 INIT (0.00 elapsed), 152.69 MUT (152.62 elapsed), 58.85 GC (58.82 elapsed) :ghc>>.
According to the ghc manual, the output shows:
The total number of bytes allocated by the program over the whole run.
The total number of garbage collections performed.
The average and maximum "residency", which is the amount of live data in bytes. The runtime can only determine the amount of live data during a major GC, which is why the number of samples corresponds to the number of major GCs (and is usually relatively small).
The peak memory the RTS has allocated from the OS.
The amount of CPU time and elapsed wall clock time while initialising the runtime system (INIT), running the program itself (MUT, the mutator), and garbage collecting (GC).
Applied to my example, it means that my program shuffles 82321 MiB (bytes divided by 1024^2) around, performs 160722 garbage collections, has a 51MiB/72MiB average/maximum memory residency, allocates at most 191M memory in RAM and so on ...
Now I want to know, what »The average and maximum "residency", which is the amount of live data in bytes« is compared to »The peak memory the RTS has allocated from the OS«? And also: What uses the remaining space of roughly 120M?
I was pointed here for more information, but that does not clearly state what I want to know. Another source (5.4.4, second item) hints that the 120M memory is used for garbage collection. But that is too vague; I need a quotable information source.
So please, is there anyone who could answer my questions with good sources as proofs?
Kind regards!
The "resident" size is how much live Haskell data you have. The amount of memory actually allocated from the OS may be higher.
The RTS allocates memory in "blocks". If your program needs 7.3 blocks of RAM, the RTS has to allocate 8 blocks, 0.7 of which is empty space.
The default garbage collection algorithm is a 2-space collector. That is, when space A fills up, it allocates space B (which is totally empty) and copies all the live data out of space A and into space B, then deallocates space A. That means that, for a while, you're using 2x as much RAM as is actually necessary. (I believe there's a switch somewhere to use a 1-space algorithm which is slower but uses less RAM.)
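For illustration only, here is a toy sketch of the two-space idea (written in Go just to have some executable pseudocode; a real copying collector traces the object graph, copies the bytes and fixes up pointers, which this toy omits): live objects are evacuated from one space to the other, and while that happens both spaces exist at once, hence the roughly 2x memory.
```go
package main

import "fmt"

// A toy "object": just a payload plus a liveness flag. In a real collector
// liveness is discovered by tracing from the roots, not stored on the object.
type object struct {
	payload []byte
	live    bool
}

// A semispace heap: new objects go into fromSpace; a collection evacuates
// the live ones into toSpace and then the two spaces swap roles.
type semispaceHeap struct {
	fromSpace []*object
	toSpace   []*object
}

func (h *semispaceHeap) alloc(size int) *object {
	o := &object{payload: make([]byte, size)}
	h.fromSpace = append(h.fromSpace, o)
	return o
}

func (h *semispaceHeap) collect() {
	h.toSpace = nil
	for _, o := range h.fromSpace {
		if o.live {
			h.toSpace = append(h.toSpace, o) // "evacuate" the survivor
		}
	}
	// Swap roles: the old fromSpace, with all its garbage, is dropped wholesale.
	h.fromSpace, h.toSpace = h.toSpace, nil
}

func main() {
	h := &semispaceHeap{}
	keep := h.alloc(1 << 10)
	keep.live = true
	h.alloc(1 << 20) // becomes garbage immediately
	h.collect()
	fmt.Println("objects surviving the collection:", len(h.fromSpace))
}
```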
There is also some overhead for managing threads (especially if you have lots), and there might be a few other things.
I don't know how much you already know about GC technology, but you can try reading these:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
http://www.mm-net.org.uk/workshop190404/GHC%27s_Garbage_Collector.ppt