What's the Gambit-C's GC mechanism? - garbage-collection

What's the Gambit-C's GC mechanism? I'm curious about this for making interactive app. I want to know whether it can avoid burst GC operation or not.

According to these threads:
https://mercure.iro.umontreal.ca/pipermail/gambit-list/2005-December/000521.html
https://mercure.iro.umontreal.ca/pipermail/gambit-list/2008-September/002645.html
Gambit has traditional stop-the-world GC at least until September 2008. People in thread recommended using pre-allocated object pooling to avoid GC operation itself. I couldn't find out about current implementation.
*It's hard to agree with the conversation. Because I can't pool object not written by myself and finally full-GC will happen at sometime by accumulated small/non-pooled temporary objects. But the method mentioned by #Gregory may help to avoid this problem. However, I wish incremental GC added to Gambit :)

According to http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Debugging#Garbage_collection_threshold gambit has some controls:
Garbage collection threshold
Pay attention to the runtime options h (maximum heapsize in kilobytes) and l (livepercent). See the reference manual for more information. Setting livepercent to five means that garbage collection will take place at the time that there are nineteen times more memory allocated for objects that should be garbage collected, than there is memory allocated for objects that should not. The reason the livepercent option is there, is to give a way to control how sparing/generous the garbage collector should be about memory consumption, vs. how heavy/light it should be in CPU load.
You can always force garbage collection by (##gc).
If you force garbage collection after some small number of operations, or schedule it near continuously, or set the livepercent to like 90 then presumably the gc will run frequently and not do very much on each run. This is likely to be more expensive overall, but avoid bursts of expense. You can then fairly easily budget for that expense to make the service fast despite.

Related

Why is garbage collection necessary?

Suppose that an object on the heap goes out of scope. Why can't the program free the memory right after the scope ends? Or, if we have a pointer to an object that is replaced by the address to a new object, why can't the program deallocate the old one before assigning the new one? I'm guessing that it's faster not to free it immediately and instead have the freeing be done asynchronously at a later point in time, but I'm not really sure.
Why is garbage collection necessary?
It is not strictly necessary. Given enough time and effort you can always translate a program that depends on garbage collection to one that doesn't.
In general, garbage collection involves a trade-off.
On the one hand, garbage collection allows you to write an application without worrying about the details of memory allocation and deallocation. (And the pain of debugging crashes and memory leaks caused by getting the deallocation logic wrong.)
The downside of garbage collection is that you need more memory. A typical garbage collector is not efficient if it doesn't have plenty of spare space1.
By contrast, if you do manual memory management, you can code your application to free up heap objects as soon as they are no longer used. Furthermore, you don't get awkward "pauses" while the GC is doing its thing.
The downside of manual memory management is that you have to write the code that decides when to call free, and you have to get it correct. Furthermore, if you try to manage memory by reference counting:
you have the cost of incrementing and decrementing ref counts whenever pointers are assign or variables go out of scope,
you have to deal with cycles in your data structures, and
it is worse when your application is multi-threaded and you have to deal with memory caches, synchronization, etc.
For what it is worth, if you use a decent garbage collector and tune it appropriately (e.g. give it enough memory, etc) then the CPU costs of GC and manual storage management are comparable when you apply them to a large application.
Reference:
"The measured cost of conservative garbage collection" by Benjamin Zorn
1 - This is because the main cost of a modern collector is in traversing and dealing with the non-garbage objects. If there is not a lot of garbage because you are being miserly with the heap space, the GC does a lot of work for little return. See https://stackoverflow.com/a/2414621/139985 for an analysis.
It's more complicated, but
1) what if there is memory pressure before the scope is over? Scope is only a language notion, not related to reachability. So an object can be "freed" before it goes out of scope ( java GCs do that on regular basis). Also, if you free objects after each scope is done, you might be doing too little work too often
2) As far as the references go, you are not considering that the reference might have hierarchies and when you change one, there has to be code that traverses those. It might not be the right time to do it when that happens.
In general, there is nothing wrong with such a proposal that you describer, as a matter of fact this is almost exactly how Rust programming language works, from a high level point of view.

Garbage collector in Node.js

According to google, V8 uses an efficient garbage collection by employing a "stop-the-world, generational, accurate, garbage collector". Part of the claim is that the V8 stops program execution when performing a garbage collection cycle.
An obvious question is how can you have an efficient GC when you pause program execution?
I was trying to find more about this topic as I would be interested to know how does the GC impacts the response time when you have possibly tens of thounsands requests per second firing your node.js server.
Any expert help, personal experience or links would be greatly appreciated
Thank you
"Efficient" can mean several things. Here it probably refers to high throughput. When looking at response time, you're more interested in latency, which could indeed be worse than with alternative GC strategies.
The main alternatives to stop-the-world GCs are
incremental GCs, which need not finish a collection cycle before handing back control to the mutator1 temporarily, and
concurrent GCs which (virtually) operate at the same time as the mutator, interrupting it only very briefly (e.g. to scan the stack).
Both need to perform extra work to be correct in the face of concurrent modification of the heap (e.g. if a new object is created and attached to an already-scanned object, this new reference must be noticed). This impacts total throughput, i.e., it takes longer to actually clean the entire heap. The upside is that they do not (usually) interrupt the program for very long, if at all, so latency is low(er).
Although the V8 documentation still mentions a stop-the-world collector, it seems that an the V8 GC is incremental since 2011. So while it does stop program execution once in a while, it does not 2 stop the program for however long it takes to scan the entire heap. Instead it can scan for, say, a couple milliseconds, and let the program resume.
1 "Mutator" is GC terminology for the program whose heap is garbage collected.
2 At least in principle, this is probably configurable.

What kind of Garbage Collection does Go use?

Go is a garbage collected language:
http://golang.org/doc/go_faq.html#garbage_collection
Here it says that it's a mark-and-sweep garbage collector, but it doesn't delve into details, and a replacement is in the works... yet, this paragraph seems not to have been updated much since Go was released.
It's still mark-and-sweep? Is it conservative or precise? Is it generational?
Plans for Go 1.4+ garbage collector:
hybrid stop-the-world/concurrent collector
stop-the-world part limited by a 10ms deadline
CPU cores dedicated to running the concurrent collector
tri-color mark-and-sweep algorithm
non-generational
non-compacting
fully precise
incurs a small cost if the program is moving pointers around
lower latency, but most likely also lower throughput, than Go 1.3 GC
Go 1.3 garbage collector updates on top of Go 1.1:
concurrent sweep (results in smaller pause times)
fully precise
Go 1.1 garbage collector:
mark-and-sweep (parallel implementation)
non-generational
non-compacting
mostly precise (except stack frames)
stop-the-world
bitmap-based representation
zero-cost when the program is not allocating memory (that is: shuffling pointers around is as fast as in C, although in practice this runs somewhat slower than C because the Go compiler is not as advanced as C compilers such as GCC)
supports finalizers on objects
there is no support for weak references
Go 1.0 garbage collector:
same as Go 1.1, but instead of being mostly precise the garbage collector is conservative. The conservative GC is able to ignore objects such as []byte.
Replacing the GC with a different one is controversial, for example:
except for very large heaps, it is unclear whether a generational GC would be faster overall
package "unsafe" makes it hard to implement fully precise GC and compacting GC
(For Go 1.8 - Q1 2017, see below)
The next Go 1.5 concurrent Garbage Collector involve being able to "pace" said gc.
Here is a proposal presented in this paper which might make it for Go 1.5, but also helps understand the gc in Go.
You can see the state before 1.5 (Stop The World: STW)
Prior to Go 1.5, Go has used a parallel stop-the-world (STW) collector.
While STW collection has many downsides, it does at least have predictable and controllable heap growth behavior.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
The sole tuning knob for the STW collector was “GOGC”, the relative heap growth between collections. The default setting, 100%, triggered garbage collection every time the heap size doubled over the live heap size as of the previous collection:
GC timing in the STW collector.
Go 1.5 introduces a concurrent collector.
This has many advantages over STW collection, but it makes heap growth harder to control because the application can allocate memory while the garbage collector is running.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
To achieve the same heap growth limit the runtime must start garbage collection earlier, but how much earlier depends on many variables, many of which cannot be predicted.
Start the collector too early, and the application will perform too many garbage collections, wasting CPU resources.
Start the collector too late, and the application will exceed the desired maximum heap growth.
Achieving the right balance without sacrificing concurrency requires carefully pacing the garbage collector.
GC pacing aims to optimize along two dimensions: heap growth, and CPU utilized by the garbage collector.
The design of GC pacing consists of four components:
an estimator for the amount of scanning work a GC cycle will require,
a mechanism for mutators to perform the estimated amount of scanning work by the time heap allocation reaches the heap goal,
a scheduler for background scanning when mutator assists underutilize the CPU budget, and
a proportional controller for the GC trigger.
The design balances two different views of time: CPU time and heap time.
CPU time is like standard wall clock time, but passes GOMAXPROCS times faster.
That is, if GOMAXPROCS is 8, then eight CPU seconds pass every wall second and GC gets two seconds of CPU time every wall second.
The CPU scheduler manages CPU time.
The passage of heap time is measured in bytes and moves forward as mutators allocate.
The relationship between heap time and wall time depends on the allocation rate and can change constantly.
Mutator assists manage the passage of heap time, ensuring the estimated scan work has been completed by the time the heap reaches the goal size.
Finally, the trigger controller creates a feedback loop that ties these two views of time together, optimizing for both heap time and CPU time goals.
This is the implementation of the GC:
https://github.com/golang/go/blob/master/src/runtime/mgc.go
From the docs in the source:
The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple GC thread to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is non-generational and non-compacting. Allocation is done using size segregated per P allocation areas to minimize fragmentation while eliminating locks in the common case.
Go 1.8 GC might evolve again, with the proposal "Eliminate STW stack re-scanning"
As of Go 1.7, the one remaining source of unbounded and potentially non-trivial stop-the-world (STW) time is stack re-scanning.
We propose to eliminate the need for stack re-scanning by switching to a hybrid write barrier that combines a Yuasa-style deletion write barrier [Yuasa '90] and a Dijkstra-style insertion write barrier [Dijkstra '78].
Preliminary experiments show that this can reduce worst-case STW time to under 50µs, and this approach may make it practical to eliminate STW mark termination altogether.
The announcement is here and you can see the relevant source commit is d70b0fe and earlier.
I'm not sure, but I think the current (tip) GC is already a parallel one or at least it's a WIP. Thus the stop-the-world property doesn't apply any more or will not in the near future. Perhaps someone other can clarify this in more detail.

How can garbage collectors be faster than explicit memory deallocation?

I was reading this html generated, (may expire, Here is the original ps file.)
GC Myth 3: Garbage collectors are always slower than explicit memory deallocation.
GC Myth 4: Garbage collectors are always faster than explicit memory deallocation.
This was a big WTF for me. How would GC be faster then explicit memory deallocation? isnt it essentially calling a explicit memory deallocator when it frees the memory/make it for use again? so.... wtf.... what does it actually mean?
Very small objects & large sparse
heaps ==> GC is usually cheaper,
especially with threads
I still don't understand it. Its like saying C++ is faster then machine code (if you don't understand the wtf in this sentence please stop programming. Let the -1 begin). After a quick google one source suggested its faster when you have a lot of memory. What i am thinking is it means it doesn't bother will the free at all. Sure that can be fast and i have written a custom allocator that does that very thing, not free at all (void free(void*p){}) in ONE application that doesnt free any objects (it only frees at end when it terminates) and has the definition mostly in case of libs and something like stl. So... i am pretty sure this will be faster the GC as well. If i still want free-ing i guess i can use an allocator that uses a deque or its own implementation thats essentially
if (freeptr < someaddr) {
*freeptr=ptr;
++freeptr;
}
else
{
freestuff();
freeptr = freeptrroot;
}
which i am sure would be really fast. I sort of answered my question already. The case the GC collector is never called is the case it would be faster but... i am sure that is not what the document means as it mention two collectors in its test. i am sure the very same application would be slower if the GC collector is called even once no matter what GC used. If its known to never need free then an empty free body can be used like that one app i had.
Anyways, i post this question for further insight.
How would GC be faster then explicit memory deallocation?
GCs can pointer-bump allocate into a thread-local generation and then rely upon copying collection to handle the (relatively) uncommon case of evacuating the survivors. Traditional allocators like malloc often compete for global locks and search trees.
GCs can deallocate many dead blocks simultaneously by resetting the thread-local allocation buffer instead of calling free on each block in turn, i.e. O(1) instead of O(n).
By compacting old blocks so more of them fit into each cache line. The improved locality increases cache efficiency.
By taking advantage of extra static information such as immutable types.
By taking advantage of extra dynamic information such as the changing topology of the heap via the data recorded by the write barrier.
By making more efficient techniques tractable, e.g. by removing the headache of manual memory management from wait free algorithms.
By deferring deallocation to a more appropriate time or off-loading it to another core. (thanks to Andrew Hill for this idea!)
One approach to make GC faster then explicit deallocation is to deallocate implicitly :the heap is divided in partitions, and the VM switches between the partitions from time to time (when a partition gets too full for example). Live objects are copied to the new partition and all the dead objects are not deallocated - they are just left forgotten. So the deallocation itself ends up costing nothing. The additional benefit of this approach is that the heap defragmentation is a free bonus.Please note this is a very general description of the actual processes.
The trick is, that the underlying allocator for garbage collector can be much simpler than the explicit one and take some shortcuts that the explicit one can't.
If the collector is copying (java and .net and ocaml and haskell runtimes and many others actually use one), freeing is done in big blocks and allocating is just pointer increment and cost is payed per object surviving collection. So it's faster especially when there are many short-lived temporary objects, which is quite common in these languages.
Even for a non-copying collector (like the Boehm's one) the fact that objects are freed in batches saves a lot of work in combining the adjacent free chunks. So if the collection does not need to be run too often, it can easily be faster.
And, well, many standard library malloc/free implementations just suck. That's why there are projects like umem and libraries like glib have their own light-weight version.
A factor not yet mentioned is that when using manual memory allocation, even if object references are guaranteed not to form cycles, determining when the last entity to hold a reference has abandoned it can be expensive, typically requiring the use of reference counters, reference lists, or other means of tracking object usage. Such techniques aren't too bad on single-processor systems, where the cost of an atomic increment may be essentially the same as an ordinary one, but they scale very badly on multi-processor systems, where atomic-increment operations are comparatively expensive.

Which counter can I use in performance monitor to see how much memory is waiting for the GC?

I am trying to profile a specific page of my ASP.NET site to optimize memory usage, but the nature of .NET as a Garbage Collected language is making it tough to get a true picture of what how memory is used and released in the program.
Is there a perfmon counter or other method for profiling that will allow me to see not only how much memory is allocated, but also how much has been released by the program and is just waiting for garbage collection?
Actually nothing in the machine really knows what is waiting for garbage collection: garbage collection is precisely the process of figuring that out and releasing the memory corresponding to dead objects. At best, the GC will have that information only on some very specific instants in its cycle. The detection and release parts are often interleaved (this depends on the GC technology) so it is possible that the GC never has a full count of what could be freed.
For most GC, obtaining such an information is computationally expensive. If you are ready to spend a bit of CPU time on it (it will not be transparent to the application) then you can use GC.Collect() to force the GC to run, immediately followed by a call to GC.GetTotalMemory() to know how much memory has survived the GC. Note that forcing the GC could induce a noticeable pause, and may also decrease overall performance.
This is the "homemade" method; for a more serious analysis, try a dedicated profiler.
The best way that I have been able to profile memory is to use ANTS Profiler from RedGate. You can view a snapshot, what stage of the lifecycle it is in and more. Including actual object values.

Resources