Difference between background and concurrent garbage collection?

I read that in .NET Framework 4 the existing garbage collection implementation has been replaced:
The .NET Framework 4 provides background garbage collection. This feature replaces concurrent garbage collection in previous versions and provides better performance.
This page explains how it works, but I am not sure I understood it.
In a practical, real-world application, what is the benefit of this new GC implementation? Is it a feature that could be used to push for a transition from 3.5 or earlier to 4.0?

Here, Microsoft uses the names "concurrent" and "background" to describe two versions of the GC it uses in .NET. In the .NET world, the "background collector" is an enhancement over the "concurrent collector" in that it places fewer restrictions on what application threads can do while the collector is running.
A basic GC uses a "stop-the-world" strategy: application threads allocate memory blocks from a common heap. When the GC must run (e.g. too many blocks have been allocated and some cleanup is needed), all application (managed) threads stop. The last thread to stop runs the GC, and unblocks all the other threads when it has finished. A stop-the-world GC is simple to implement but induces pauses which can be perceptible at the user level.
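To make the stop-the-world idea concrete, here is a minimal Python sketch of a mark-and-sweep collector over a toy heap (a dict from object id to the ids it references). Everything in it is hypothetical: `Heap`, `allocate` and `collect` are illustrative names, and modelling "all threads stopped" as a single lock that allocating threads queue on is a simplification; a real runtime suspends its managed threads at safe points instead.

```python
import threading

class Heap:
    """Toy heap: object id -> list of ids it references (hypothetical model)."""

    def __init__(self):
        self.objects = {}
        self.roots = set()                   # ids the program can reach directly
        self.world_lock = threading.Lock()   # held by the collector = world stopped

    def allocate(self, obj_id, refs=()):
        # Mutators block here for as long as a collection holds the lock.
        with self.world_lock:
            self.objects[obj_id] = list(refs)

    def collect(self):
        # Stop the world for the whole mark + sweep.
        with self.world_lock:
            marked = set()
            stack = list(self.roots)
            while stack:                     # mark: everything reachable from roots
                obj = stack.pop()
                if obj in marked or obj not in self.objects:
                    continue
                marked.add(obj)
                stack.extend(self.objects[obj])
            garbage = [o for o in self.objects if o not in marked]
            for o in garbage:                # sweep: drop unreachable objects
                del self.objects[o]
            return garbage

heap = Heap()
heap.allocate("a", ["b"])
heap.allocate("b")
heap.allocate("c")
heap.roots.add("a")
print(heap.collect())   # ['c']
```

The whole collection runs while holding the lock, which is exactly why the pause grows with the size of the heap.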
Microsoft's "concurrent GC" is generational: it uses the stop-the-world strategy for only a limited part of the heap (what they call "generations 0 and 1"). Since that part remains small, pauses remain short (e.g. below 50ms), so the user will not notice them. The rest of the heap is collected by a dedicated GC thread, which can run concurrently with the application threads (hence the name).
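A sketch of the generational part, under the same toy-heap assumptions: only the young generation is collected in the short pause, and survivors are promoted. The class and method names are hypothetical, the young-generation limit is an arbitrary object count rather than a byte budget, and the old generation (which the concurrent GC handles on its own thread) is never collected here.

```python
class GenerationalHeap:
    """Toy two-generation heap: object id -> list of referenced ids."""

    def __init__(self, young_limit=4):
        self.young = {}
        self.old = {}
        self.roots = set()
        self.young_limit = young_limit

    def allocate(self, obj_id, refs=()):
        if len(self.young) >= self.young_limit:
            self.minor_collect()             # short stop-the-world pause, young gen only
        self.young[obj_id] = list(refs)

    def minor_collect(self):
        # Roots for a young collection: program roots plus old->young pointers.
        # Scanning every old object is a simplification; real collectors track
        # old->young pointers with a write barrier (card table) instead.
        roots = set(self.roots)
        for refs in self.old.values():
            roots.update(refs)
        live, stack = set(), [r for r in roots if r in self.young]
        while stack:
            obj = stack.pop()
            if obj in live:
                continue
            live.add(obj)
            stack.extend(r for r in self.young[obj] if r in self.young)
        freed = [o for o in self.young if o not in live]
        for o in live:
            self.old[o] = self.young[o]      # survivors are promoted
        self.young.clear()
        return freed
```

The pause only has to trace the young objects, which is why it stays short as long as the young generation stays small.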
The concurrent GC has some limitations. Namely, there are moments when the GC thread must take somewhat exclusive control of the heap. During such times, application threads may allocate blocks only from small thread-specific areas. Threads with bigger needs will soon run into the main heap, which, at that time, is locked by the GC thread. The allocating thread must then block until the GC thread has finished its lock-the-heap phase. This again induces pauses: fewer pauses than with a stop-the-world GC, and these pauses do not affect all threads, but pauses nonetheless.
The "background GC" is an enhanced GC in which the GC thread does not need to lock the heap. This removes the extra pauses described in the previous paragraph; only the short pauses incurred when the young generations are collected remain (what Microsoft calls "a foreground collection").
Note: there are "hidden costs" with both the concurrent GC and the background GC. For these GCs to operate properly, memory accesses from application threads must be done in some very specific ways, which have a slight impact on performance. Also, the GC thread may have an adverse effect on cache memory, thus indirectly degrading performance. For a purely computational task with no need for user interaction, a stop-the-world collector may, on average, yield somewhat better performance (e.g. a twenty-hour computation will complete in nineteen hours). But this is an edge case, and in most situations the concurrent and background GCs are better.

Here is the real-world explanation, without the fluff and the overinflated sense of self-importance:
In concurrent GC you were allowed to allocate while in a GC, but you were not allowed to start another GC while in a GC. This in turn means that the maximum you could allocate while in a GC is whatever space is left on one segment (currently 16 MB in workstation mode), minus anything already allocated there.
The difference in Background mode is that you are allowed to start a new GC (gen 0+1) while in a full background GC, and this allows you to even create a new segment to allocate in if necessary. In short, the blocking that could occur before when you allocated all you could in one segment won’t happen anymore.
From Tess da Man! http://blogs.msdn.com/b/tess/archive/2009/05/29/background-garbage-collection-in-clr-4-0.aspx
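The segment limit described above can be expressed as a small toy model. This is a hypothetical sketch, not CLR code: it tracks only the free space on the current segment and assumes, for the background case, that a foreground gen 0+1 GC or a fresh segment always makes a full segment available again.

```python
SEGMENT_MB = 16   # workstation-mode segment size quoted in the answer above

def first_blocking_allocation(allocations_mb, used_mb, background):
    """Return the index of the first allocation that must wait for the
    in-progress full GC to finish, or None if none has to wait.

    Toy model: ignores large objects, thread-local allocation contexts and
    everything else except "how much room is left on the current segment".
    """
    free = SEGMENT_MB - used_mb
    for i, size in enumerate(allocations_mb):
        if size > SEGMENT_MB:
            raise ValueError("larger than a segment: out of scope for this sketch")
        if size > free:
            if not background:
                return i            # concurrent GC: cannot start another GC, so block
            free = SEGMENT_MB       # background GC: run a gen 0+1 GC or take a new segment
        free -= size
    return None

print(first_blocking_allocation([4] * 5, used_mb=4, background=False))  # 3
print(first_blocking_allocation([4] * 5, used_mb=4, background=True))   # None
```

With 4 MB already used, the concurrent model lets 12 MB through and blocks on the fourth 4 MB request; the background model never blocks.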

The primary benefit will be fewer application freezes due to garbage collection, which in itself could be considered a significant improvement. For most apps this difference will not be noticeable unless you have a HUGE number of long-lived objects in memory.
This change also makes .NET slightly more viable for building timing-sensitive apps (where response times are important). The extreme example is car airbags: you don't want your software to be busy doing garbage collection when they need to be inflated. The changes in 4.0 reduce the number and length of freezes due to GC but do not remove them entirely.

Related

Will Julia's GC continue to stop the world when parallelism is introduced?

Julia will have multi-threading soon. I'm curious to know the impact on its GC algorithm.
At one point in the thread, Stefan says that "Memory allocation will always be thread-local." If Julia will always use thread-local storage, doesn't that mean it could GC on threads independently, preventing most/all stop-the-world scenarios, similar to Erlang's BEAM?
Initially it will still be stop-the-world: there will be a barrier stopping all threads, the threads all mark in parallel, then there will be another barrier and the threads all sweep in parallel; as soon as they are done they can continue without further synchronization. In the future, however, there could be more concurrent GC implementations, maybe even as the default. That would be a pretty significant bit of work to implement, however.
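The barrier scheme described in that answer can be sketched with Python's `threading.Barrier`. The function name and the toy heap are hypothetical, and Python threads only illustrate the synchronization structure; because of the GIL they do not actually mark in parallel. Freed objects are removed after all workers finish so that the dict is not mutated concurrently.

```python
import threading

def parallel_stw_collect(objects, roots_per_thread):
    """Stop all threads, mark in parallel, barrier, sweep in parallel.

    objects: object id -> list of referenced ids (toy heap)
    roots_per_thread: one list of root ids per worker thread
    Returns the sorted list of freed ids.
    """
    n = len(roots_per_thread)
    stop_the_world = threading.Barrier(n)    # every thread reaches a safepoint
    marking_done = threading.Barrier(n)      # nobody sweeps before marking ends
    marked, mark_lock = set(), threading.Lock()
    ids = list(objects)
    freed = [[] for _ in range(n)]

    def worker(k):
        stop_the_world.wait()
        stack = list(roots_per_thread[k])
        while stack:                         # parallel mark from this thread's roots
            obj = stack.pop()
            with mark_lock:
                if obj in marked:
                    continue
                marked.add(obj)
            stack.extend(objects[obj])
        marking_done.wait()
        for obj in ids[k::n]:                # parallel sweep of this thread's slice
            if obj not in marked:
                freed[k].append(obj)
        # Here the thread would resume user code, with no further synchronization.

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    dead = sorted(obj for part in freed for obj in part)
    for obj in dead:                         # apply frees once all workers finished
        del objects[obj]
    return dead
```

Only the first two phases need everyone in lockstep; once a thread has swept its own slice it can continue independently, as the answer describes.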

Why does the GC (garbage collector) freeze currently executing threads?

I was reading Chapter 12: Garbage collection of C# in a Nutshell, where the section about concurrent and background collection says:
The GC must freeze (block) your execution threads for periods during a collection. This includes the entire period during which a Gen0 or Gen1 collection takes place.
One thing I understand is that it is probably trying to prevent any new memory allocation at that point in time.
Is there any other specific reason why the GC needs to block currently executing threads?
The MSDN documentation claims that generation 0 and 1 collections are always performed non-concurrently because they happen very fast.
Performing a concurrent garbage collection pass will take longer than a non-concurrent one since access to data that is being processed must be synchronized between the GC thread and other threads. This adds overhead which probably outweighs the benefits of concurrency in gen 0 and 1 collections since they typically run very fast.
Beyond removing unreachable (unmarked) objects from memory, the GC also tries to compact the heap after performing a pass. This means that objects may move in memory as a result of a GC pass. For this reason a concurrent pass requires extra overhead to synchronize data access between the GC thread and the other threads of the process.
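Why compaction forces the threads to stop can be seen in a toy sliding compactor: every pointer to a moved object, including the ones sitting in application threads' stacks and registers (the `roots` below), has to be rewritten. This is a hypothetical sketch with integer "addresses"; a thread that kept running with an old address during compaction would end up pointing at the wrong object.

```python
def compact(heap, live_addrs, roots):
    """Sliding compaction over a toy heap.

    heap[addr] is the list of addresses that object points to; live_addrs is
    the set found reachable by the mark phase; roots are addresses currently
    held by application threads (stack slots, registers).
    Returns (new_heap, new_roots).
    """
    forward = {}
    for old in sorted(live_addrs):           # slide live objects toward address 0
        forward[old] = len(forward)
    new_heap = [None] * len(forward)
    for old, new in forward.items():
        new_heap[new] = [forward[r] for r in heap[old]]   # fix pointers inside objects
    new_roots = [forward[r] for r in roots]               # fix pointers held by threads
    return new_heap, new_roots

heap = [[2], [], [], [0]]      # address 1 is garbage
print(compact(heap, {0, 2, 3}, roots=[3]))   # ([[1], [], [0]], [2])
```

The thread that held address 3 must now use address 2; if it were running while the forwarding happened, it could read or write through the stale address.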

Java 7 G1GC strange behaviour

Recently I tried to use G1GC from jdk1.7.0-17 in my Java processor, which processes a lot of similar messages received from an MQ (about 15-20 req/sec). Every message is processed in a separate thread (about 100 threads in steady state) managed by a bounded Java thread pool. Surprisingly, I detected the strange behaviour - as soon as GC starts the full gc cycle, it begins to use significant processing time (up to 100% CPU and even more). I refactored the code several times with the goal of optimizing it and making it more lightweight, but without any significant result; the behaviour is the same. I use a 4-core 64-bit machine with Debian (2.6.32-5 kernel). Can someone help me understand and resolve the situation?
Below are some illustrations of the issue described above.
Surprisingly, I detected the strange behaviour - as soon as GC starts the full gc cycle...
Unfortunately, this is not a surprise, because the G1 GC implementation in this JVM uses just one hardware thread (vCPU) to execute a Full GC, so the idea is to minimize the number of Full GCs. Please keep in mind that this collector is recommended for configurations with several cores (of course this does not help the Full GC, but it does help allocation and parallel collections) and big heaps, I think bigger than 8GB.
According to Oracle:
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html
The Garbage-First (G1) garbage collector is a server-style garbage collector, targeted for multiprocessor machines with large memories. It attempts to meet garbage collection (GC) pause time goals with high probability while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with the application threads. This prevents interruptions proportional to heap or live-data size.
This article explains the single-threaded Full GC in this collector:
https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector
Finally and unfortunately, G1 also has to deal with the dreaded Full GC. While G1 is ultimately trying to avoid Full GC’s, they are still a harsh reality especially in improperly tuned environments. Given that G1 is targeting larger heap sizes, the impact of a Full GC can be catastrophic to in-flight processing and SLAs. One of the primary reasons is that Full GCs are still a single-threaded operation in G1. Looking at causes, the first, and most avoidable, is related to Metaspace.
By the way, it seems the newest version of Java (10) will include a G1 capable of executing Full GCs in parallel.
https://www.opsian.com/blog/java-10-with-g1/
Java 10 reduces Full GC pause times by iteratively improving on its existing algorithm. Until Java 10 G1 Full GCs ran in a single thread. That’s right - your 32 core server and its 128GB will stop and pause until a single thread takes out the garbage.
Perhaps you should tune the metaspace, increase the heap, or use another collector such as the parallel GC.

What kind of Garbage Collection does Go use?

Go is a garbage collected language:
http://golang.org/doc/go_faq.html#garbage_collection
Here it says that it's a mark-and-sweep garbage collector, but it doesn't delve into details, and a replacement is in the works... yet, this paragraph seems not to have been updated much since Go was released.
Is it still mark-and-sweep? Is it conservative or precise? Is it generational?
Plans for Go 1.4+ garbage collector:
hybrid stop-the-world/concurrent collector
stop-the-world part limited by a 10ms deadline
CPU cores dedicated to running the concurrent collector
tri-color mark-and-sweep algorithm
non-generational
non-compacting
fully precise
incurs a small cost if the program is moving pointers around
lower latency, but most likely also lower throughput, than Go 1.3 GC
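The tri-color mark-and-sweep item can be illustrated with a short sketch. White objects are not yet reached, grey ones are reached but not scanned, black ones are fully scanned; whatever is still white at the end is garbage. The function names are hypothetical and the marking runs to completion in one go, whereas Go's collector drains the grey worklist concurrently with the program and relies on a write barrier to keep the result correct.

```python
WHITE, GREY, BLACK = "white", "grey", "black"

def tricolor_mark(objects, roots):
    """objects: id -> list of referenced ids. Returns id -> final color."""
    color = {obj: WHITE for obj in objects}
    grey = []                                # worklist: reached but not yet scanned
    for r in roots:
        if color[r] == WHITE:
            color[r] = GREY
            grey.append(r)
    while grey:
        # A concurrent collector processes this worklist a little at a time while
        # the program keeps running; here it simply runs to completion.
        obj = grey.pop()
        for child in objects[obj]:
            if color[child] == WHITE:
                color[child] = GREY
                grey.append(child)
        color[obj] = BLACK                   # scanned: all children are grey or black
    return color

def sweep(objects, color):
    dead = [obj for obj, c in color.items() if c == WHITE]
    for obj in dead:
        del objects[obj]
    return dead

heap = {"a": ["b"], "b": ["a"], "c": []}
print(sweep(heap, tricolor_mark(heap, ["a"])))   # ['c']
```

The cycle between `a` and `b` is handled naturally, since an object is only scanned once it turns grey.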
Go 1.3 garbage collector updates on top of Go 1.1:
concurrent sweep (results in smaller pause times)
fully precise
Go 1.1 garbage collector:
mark-and-sweep (parallel implementation)
non-generational
non-compacting
mostly precise (except stack frames)
stop-the-world
bitmap-based representation
zero-cost when the program is not allocating memory (that is: shuffling pointers around is as fast as in C, although in practice this runs somewhat slower than C because the Go compiler is not as advanced as C compilers such as GCC)
supports finalizers on objects
there is no support for weak references
Go 1.0 garbage collector:
same as Go 1.1, but instead of being mostly precise the garbage collector is conservative. The conservative GC is able to ignore objects such as []byte.
Replacing the GC with a different one is controversial, for example:
except for very large heaps, it is unclear whether a generational GC would be faster overall
package "unsafe" makes it hard to implement fully precise GC and compacting GC
(For Go 1.8 - Q1 2017, see below)
The Go 1.5 concurrent garbage collector involves being able to "pace" the GC.
Here is a proposal, presented in this paper, which might make it into Go 1.5 and which also helps in understanding the GC in Go.
You can see the state before 1.5 (Stop The World: STW)
Prior to Go 1.5, Go has used a parallel stop-the-world (STW) collector.
While STW collection has many downsides, it does at least have predictable and controllable heap growth behavior.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
The sole tuning knob for the STW collector was “GOGC”, the relative heap growth between collections. The default setting, 100%, triggered garbage collection every time the heap size doubled over the live heap size as of the previous collection:
GC timing in the STW collector.
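The GOGC rule above is simple arithmetic: the next collection starts when the heap reaches live * (1 + GOGC/100). A minimal sketch with a hypothetical function name; in Go itself the same knob is the `GOGC` environment variable or `debug.SetGCPercent` from `runtime/debug`, and `GOGC=off` disables collection.

```python
def next_gc_target(live_heap_bytes, gogc=100):
    """Heap size that triggers the next collection under the STW pacing rule."""
    return live_heap_bytes + live_heap_bytes * gogc // 100

MiB = 1 << 20
print(next_gc_target(4 * MiB) // MiB)            # 8: default GOGC=100 doubles the heap
print(next_gc_target(4 * MiB, gogc=50) // MiB)   # 6
print(next_gc_target(4 * MiB, gogc=200) // MiB)  # 12
```

Lower GOGC values collect more often with less memory overhead; higher values trade memory for fewer collections.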
Go 1.5 introduces a concurrent collector.
This has many advantages over STW collection, but it makes heap growth harder to control because the application can allocate memory while the garbage collector is running.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
To achieve the same heap growth limit the runtime must start garbage collection earlier, but how much earlier depends on many variables, many of which cannot be predicted.
Start the collector too early, and the application will perform too many garbage collections, wasting CPU resources.
Start the collector too late, and the application will exceed the desired maximum heap growth.
Achieving the right balance without sacrificing concurrency requires carefully pacing the garbage collector.
GC pacing aims to optimize along two dimensions: heap growth, and CPU utilized by the garbage collector.
The design of GC pacing consists of four components:
an estimator for the amount of scanning work a GC cycle will require,
a mechanism for mutators to perform the estimated amount of scanning work by the time heap allocation reaches the heap goal,
a scheduler for background scanning when mutator assists underutilize the CPU budget, and
a proportional controller for the GC trigger.
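The mutator-assist component can be sketched as a ratio: remaining scan work divided by the bytes that may still be allocated before the heap goal, charged to each allocation. This is a simplified model of the idea, with hypothetical names and made-up units; it ignores the background scan workers that also reduce the remaining work.

```python
def assist_ratio(scan_work_remaining, heap_live, heap_goal):
    """Units of scan work a mutator must do per byte it allocates."""
    heap_distance = heap_goal - heap_live
    if heap_distance <= 0:
        return float("inf")   # at or past the goal: assist as much as possible
    return scan_work_remaining / heap_distance

def simulate_assists(scan_work, heap_live, heap_goal, chunk):
    """A mutator allocates `chunk` bytes at a time and pays its assist debt."""
    while heap_live < heap_goal and scan_work > 0:
        ratio = assist_ratio(scan_work, heap_live, heap_goal)
        scan_work -= min(scan_work, chunk * ratio)
        heap_live += chunk
    return scan_work, heap_live

print(simulate_assists(scan_work=1000, heap_live=6000, heap_goal=8000, chunk=200))
# (0.0, 8000): scanning finishes exactly as the heap reaches its goal
```

Because the ratio is recomputed on every allocation, a mutator that allocates faster simply pays more scan work per unit of time, which is what keeps heap growth bounded.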
The design balances two different views of time: CPU time and heap time.
CPU time is like standard wall clock time, but passes GOMAXPROCS times faster.
That is, if GOMAXPROCS is 8, then eight CPU seconds pass every wall second and GC gets two seconds of CPU time every wall second.
The CPU scheduler manages CPU time.
The passage of heap time is measured in bytes and moves forward as mutators allocate.
The relationship between heap time and wall time depends on the allocation rate and can change constantly.
Mutator assists manage the passage of heap time, ensuring the estimated scan work has been completed by the time the heap reaches the goal size.
Finally, the trigger controller creates a feedback loop that ties these two views of time together, optimizing for both heap time and CPU time goals.
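Both views of time, and the trigger feedback loop, can be put into a few lines. The 25% CPU goal follows from the GOMAXPROCS = 8 example (2 of 8 CPU-seconds per wall second). The error term is my reading of the proportional controller in the Go 1.5 pacer design and should be checked against that document; the gain `K_T = 0.5` is an assumption, the names are hypothetical, and later Go releases redesigned the pacer.

```python
U_GOAL = 0.25   # GC's target share of CPU time: 2 of 8 CPU-seconds per wall second
K_T = 0.5       # controller gain (assumed value for this sketch)

def gc_cpu_seconds_per_wall_second(gomaxprocs, u_goal=U_GOAL):
    """CPU time passes GOMAXPROCS times faster than wall time."""
    return gomaxprocs * u_goal

def next_trigger(h_t, h_g, h_a, u_a, u_goal=U_GOAL, k=K_T):
    """One step of a proportional controller for the GC trigger.

    h_t: heap growth ratio at which this cycle was triggered
    h_g: heap growth goal ratio (GOGC / 100)
    h_a: heap growth ratio actually reached when the cycle finished
    u_a: CPU utilization the GC actually used during the cycle
    """
    # Zero error when the cycle hit the heap goal while using exactly u_goal CPU.
    error = (h_g - h_t) - (u_a / u_goal) * (h_a - h_t)
    return h_t + k * error

print(gc_cpu_seconds_per_wall_second(8))           # 2.0
print(next_trigger(0.7, 1.0, h_a=1.2, u_a=0.25))   # about 0.6: heap overshot, trigger earlier
```

Overshooting the heap goal or spending more than the CPU budget both push the next trigger earlier; finishing under budget lets it move later.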
This is the implementation of the GC:
https://github.com/golang/go/blob/master/src/runtime/mgc.go
From the docs in the source:
The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple GC threads to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is non-generational and non-compacting. Allocation is done using size segregated per P allocation areas to minimize fragmentation while eliminating locks in the common case.
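The last sentence of that comment (size-segregated, per-P allocation areas without locks in the common case) can be sketched as follows. The size-class table, batch size and class names are hypothetical; the structure loosely mirrors the idea of a per-P cache refilled from a shared, locked pool, not the runtime's actual code.

```python
import bisect
import itertools
import threading

SIZE_CLASSES = [8, 16, 32, 48, 64, 128, 256, 512]   # illustrative, not Go's real table

def size_class(nbytes):
    """Round a request up to the smallest size class that fits it."""
    i = bisect.bisect_left(SIZE_CLASSES, nbytes)
    if i == len(SIZE_CLASSES):
        raise ValueError("large object: handled separately, out of scope here")
    return SIZE_CLASSES[i]

class Central:
    """Shared pool of free slots; the only place that takes a lock."""

    def __init__(self, batch=4):
        self.lock = threading.Lock()
        self.addresses = itertools.count()
        self.batch = batch
        self.refills = 0

    def refill(self, cls):
        with self.lock:
            self.refills += 1
            return [(cls, next(self.addresses)) for _ in range(self.batch)]

class PCache:
    """Per-P cache: one free list per size class, used only by its own P."""

    def __init__(self, central):
        self.central = central
        self.free = {cls: [] for cls in SIZE_CLASSES}

    def alloc(self, nbytes):
        cls = size_class(nbytes)
        if not self.free[cls]:                          # slow path: lock the central pool
            self.free[cls].extend(self.central.refill(cls))
        return self.free[cls].pop()                     # fast path: no lock at all
```

Rounding every request up to a fixed class keeps same-sized objects together (less fragmentation), and only one allocation in every `batch` touches the lock.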
Go 1.8 GC might evolve again, with the proposal "Eliminate STW stack re-scanning"
As of Go 1.7, the one remaining source of unbounded and potentially non-trivial stop-the-world (STW) time is stack re-scanning.
We propose to eliminate the need for stack re-scanning by switching to a hybrid write barrier that combines a Yuasa-style deletion write barrier [Yuasa '90] and a Dijkstra-style insertion write barrier [Dijkstra '78].
Preliminary experiments show that this can reduce worst-case STW time to under 50µs, and this approach may make it practical to eliminate STW mark termination altogether.
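The hybrid barrier from that proposal can be sketched as below. As I recall, the proposal's pseudocode is `shade(*slot); if current stack is grey: shade(ptr); *slot = ptr`; verify against the proposal itself. The Python names are hypothetical, and the `stack_is_grey` flag stands in for the runtime's knowledge of whether the current goroutine's stack has not yet been scanned.

```python
class Marker:
    """Just enough collector state to show what the barrier does."""

    def __init__(self):
        self.color = {}    # object -> "grey" (absent means white)
        self.grey = []     # worklist the concurrent marker drains

    def shade(self, obj):
        if obj is not None and obj not in self.color:
            self.color[obj] = "grey"
            self.grey.append(obj)

def write_pointer(marker, fields, slot, ptr, stack_is_grey):
    """Hybrid barrier run on every pointer store `fields[slot] = ptr` during marking."""
    marker.shade(fields.get(slot))   # Yuasa-style deletion barrier: shade the old target
    if stack_is_grey:
        marker.shade(ptr)            # Dijkstra-style insertion barrier: shade the new target
    fields[slot] = ptr

m = Marker()
node = {"next": "old"}
write_pointer(m, node, "next", "new", stack_is_grey=True)
print(m.grey)   # ['old', 'new']
```

Shading the overwritten pointer means an object can never become unreachable from the heap while the marker still needs to find it, which is what removes the need to re-scan stacks at the end.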
The announcement is here and you can see the relevant source commit is d70b0fe and earlier.
I'm not sure, but I think the current (tip) GC is already a parallel one, or at least that work is in progress. Thus the stop-the-world property no longer applies, or will stop applying in the near future. Perhaps someone else can clarify this in more detail.

What is Gambit-C's GC mechanism?

What is Gambit-C's GC mechanism? I'm curious about this for making an interactive app. I want to know whether it can avoid bursts of GC activity.
According to these threads:
https://mercure.iro.umontreal.ca/pipermail/gambit-list/2005-December/000521.html
https://mercure.iro.umontreal.ca/pipermail/gambit-list/2008-September/002645.html
Gambit had a traditional stop-the-world GC at least until September 2008. People in the thread recommended using pre-allocated object pools to avoid GC operations altogether. I couldn't find out about the current implementation.
* I find it hard to agree with that conversation, because I can't pool objects that I didn't write myself, and eventually a full GC will happen anyway because of accumulated small, non-pooled temporary objects. But the method mentioned by @Gregory may help to avoid this problem. However, I wish incremental GC were added to Gambit :)
According to http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Debugging#Garbage_collection_threshold Gambit has some controls:
Garbage collection threshold
Pay attention to the runtime options h (maximum heap size in kilobytes) and l (livepercent). See the reference manual for more information. Setting livepercent to five means that garbage collection will take place when there is nineteen times more memory allocated for objects that should be garbage collected than for objects that should not. The reason the livepercent option exists is to give a way to control how sparing/generous the garbage collector should be about memory consumption vs. how heavy/light it should be in CPU load.
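The "nineteen times" figure is just (100 - livepercent) / livepercent. A sketch of that arithmetic, assuming (as the wiki text implies) that a collection happens exactly when live data would make up livepercent of the heap; Gambit's real heap-sizing logic and the `h` maximum-heap cap may change the picture, and the function names are hypothetical.

```python
def gc_heap_size(live_bytes, livepercent):
    """Heap size at which the GC runs if live data should be livepercent% of it."""
    return live_bytes * 100 / livepercent

def garbage_per_live_byte(livepercent):
    """Bytes of garbage tolerated per byte of live data before collecting."""
    return (100 - livepercent) / livepercent

print(garbage_per_live_byte(5))               # 19.0, the "nineteen times" in the quote
print(gc_heap_size(10_000_000, 5))            # 200000000.0: 10 MB live -> GC at about 200 MB
print(round(garbage_per_live_byte(90), 3))    # 0.111: frequent, small collections
```

A high livepercent therefore means little garbage accumulates between runs, so each collection has less to do.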
You can always force garbage collection by (##gc).
If you force garbage collection after some small number of operations, or schedule it near continuously, or set livepercent to something like 90, then presumably the GC will run frequently and not do very much on each run. This is likely to be more expensive overall, but it avoids bursts of expense. You can then fairly easily budget for that expense to keep the service fast despite it.
