What techniques do modern garbage collectors (as in CLR, JVM) use to tell which heap objects are referenced from the stack?
Specifically how can a VM work back from knowing where the stack starts to interpreting all local references to heap objects?
In Java (and likely in the CLR although I know its internals less well), the bytecode is typed with object vs primitive information. As a result, there are data structures in the bytecode that describe which variables in each stack frame are objects and which are primitives. When the GC needs to scan the root set, it uses these StackMapTables to differentiate between references and non-references.
CLR and Java have to have some mechanism like this because they are exact collectors. There are conservative collectors like the boehm collector that treat every offset on the stack as a possible pointer. They look to see if the value (when treated as a pointer) is an offset into the heap, and if so, they mark it as alive.
Take a look at this Artima article from August 1996, Java's Garbage-Collected Heap; especially page 2.
Any garbage collection algorithm must do two basic things. First, it must detect garbage objects. Second, it must reclaim the heap space used by the garbage objects and make it available to the program. Garbage detection is ordinarily accomplished by defining a set of roots and determining reachability from the roots. An object is reachable if there is some path of references from the roots by which the executing program can access the object. The roots are always accessible to the program. Any objects that are reachable from the roots are considered live. Objects that are not reachable are considered garbage, because they can no longer affect the future course of program execution.
In a JVM the root set is implementation dependent but would always include any object references in the local variables. In the JVM, all objects reside on the heap. The local variables reside on the Java stack, and each thread of execution has its own stack. Each local variable is either an object reference or a primitive type, such as int, char, or float. Therefore the roots of any JVM garbage-collected heap will include every object reference on every thread's stack. Another source of roots are any object references, such as strings, in the constant pool of loaded classes. The constant pool of a loaded class may refer to strings stored on the heap, such as the class name, superclass name, superinterface names, field names, field signatures, method names, and method signatures.
Any object referred to by a root is reachable and is therefore a live object. Additionally, any objects referred to by a live object are also reachable. The program is able to access any reachable objects, so these objects must remain on the heap. Any objects that are not reachable can be garbage collected because there is no way for the program to access them.
The article continues to explore different garbage collection strategies, including reference counting collectors, tracing collectors, compacting collectors and copying collectors.
Though this article is old, it still applies today; not much has really changed. There have been performance improvements to the different collection strategies, but no new major advancements.
The Oracle HotSpot JVM, for example, has a new Garbage-First Garbage Collector which is a copying collector with performance tweaks for multi-core processors and large heap sizes (see this answer for more on the G1 Garbage Collector).
Interesting documentation on this topic posted up by the .Net team shortly after the they made CoreCLR open source: Stack Walking
Related
I am taking a course on programming language design, and one of the topics is Garbage Collection. I understood from the material that RC can be used for GC, but that it also has other uses, and that some languages implement RC but not GC.
What exactly is the use of RC if not for GC?
(RC - reference counting. GC - garbage collection)
Reference counting may be used, eg, to close unreferenced file handles, or to somehow "archive" currently unreferenced data (that can potentially be rereferenced through some indirect path in the future).
I can provide a specific example of reference counting used independently of garbage collection. Objective-C uses reference counting to manage the lifetime of its objects, without the presence of a garbage collector in most cases.
This is done through balanced calls to -retain and -release when dealing with objects. Basically, an object is created with a retain count of 1, and every object that needs to hold on to a reference to an object should increase its retain count by 1 when being passed it initially, then decrement the retain count by 1 when done with it. The final -release call that causes the object's retain count to drop to 0 (no one should have a need for it anymore) triggers the internal mechanisms of the object type to deallocate itself.
No garbage collector process is required for this to take place. In fact, there wasn't a garbage collector for this on Apple's platforms (who are by far the largest users of Objective-C) until fairly recently, and that's not even used for its iOS mobile devices (and is now deprecated on the Mac desktop).
The reference counting in Objective-C is by default manual in nature, requiring developers to follow certain conventions to make sure that you safely balance retain and release calls to avoid leaks or premature deallocations. A newer system was just implemented within the LLVM compiler that automates this, and adds the appropriate calls at compile-time. This automatic reference counting removes much of the effort of managing memory while removing the need for a garbage collector process to sweep object graphs.
One specific condition that a garbage collector can handle that reference counting cannot is the detection and removal of retain cycles. Objects that hold strong references to that point back upon themselves in a cycle will never be deallocated under standard reference counting, even if all objects referencing those in the cycle release the references to those within it. A garbage collector will see that this cycle is not rooted in the larger object graph and will be able to remove the entire cycle when it performs a sweep.
If I have a garbage collector that tracks every object allocated and deallocates them as soon as they no longer have usable references to them can you still have a memory leak?
Considering a memory leak is allocations without any reference isn't that impossible or am I missing something?
Edit: So what I'm counting as a memory leak is allocations which you no longer have any reference to in the code. Large numbers of accumulating allocations which you still have references to aren't the leaks I'm considering here.
I'm also only talking about normal state of the art G.C., It's been a while but I know cases like cyclical references don't trip them up. I don't need a specific answer for any language, this is just coming from a conversation I was having with a friend. We were talking about Actionscript and Java but I don't care for answers specific to those.
Edit2: From the sounds of it, there doesn't seem to be any reason code can completely lose the ability to reference an allocation and not have a GC be able to pick it up, but I'm still waiting for more to weigh in.
If your question is really this:
Considering a memory leak is allocations without any reference isn't
that impossible or am I missing something?
Then the answer is "yes, that's impossible" because a properly implemented garbage collector will reclaim all allocations that don't have active references.
However, you can definitely have a "memory leak" in (for example) Java. My definition of a "memory leak" is an allocation that still has an active reference (so that it won't be reclaimed by the garbage collector) but the programmer doesn't know that the object isn't reclaimable (ie: for the programmer, this object is dead and should be reclaimed). A simple example is something like this:
ObjectA -> ObjectB
In this example, ObjectA is an object in active use in the code. However, ObjectA contains a reference to ObjectB that is effectively dead (ie: ObjectB has been allocated and used and is now, from the programmer's perspective, dead) but the programmer forgot to set the reference in ObjectA to null. In this case, ObjectB has been "leaked".
Doesn't sound like a big problem, but there are situations where these leaks are cumulative. Let's imagine that ObjectA and ObjectB are actually instances of the same class. And this problem that the programmer forgot to set the reference to null happens every time such an instance is used. Eventually you end up with something like this:
ObjectA -> ObjectB -> ObjectC -> ObjectD -> ObjectE -> ObjectF -> ObjectG -> ObjectH -> etc...
Now ObjectB through ObjectH are all leaked. And problems like this will (eventually) cause your program to crash. Even with a properly implemented garbage collector.
To decide whether a program has a memory leak, one must first define what a leak is. I would define a program as having a memory leak if there exists some state S and series of inputs I such that:
If the program is in state `S` and it receives inputs `I`, it will still be in state `S` (if it doesn't crash), but...
The amount of memory required to repeat the above sequence `N` times will increase without bound.
It is definitely possible for programs that run entirely within garbage-collected frameworks to have memory leaks as defined above. A common way in which that can occur is with event subscriptions.
Suppose a thread-safe collection exposes a CollectionModified event, and the IEnumerator<T> returned by its IEnumerable<T>.GetEnumerator() method subscribes to that event on creation, and unsubscribes on Dispose; the event is used to allow enumeration to proceed sensibly even when the collection is modified (e.g. ensuring that objects are in the collection continuously throughout the enumeration will be returned exactly once; those that exist during part of it will be returned no more than once). Now suppose a long-lived instance of that collection class is created, and some particular input will cause it to be enumerated. If the CollectionModified event holds a strong reference to every non-disposed IEnumerator<T>, then repeatedly enumerating the collection will create and subscribe an unbounded number of enumerator objects. Memory leak.
Memory leaks don't just depend how efficient a garbage collection algorithm is, if your program holds on to object references which have long life time, say in an instance variable or static variable without being used, your program will have memory leaks.
Reference count have a known problem of cyclic refernces meaning
Object 1 refers to Object 2 and Object 2 refers to Object 1
but no one else refers to object 1 or Object 2, reference count algorithm will fail in this scenario.
Since you are working on garbage collector itself, its worth reading about different implementation strategies.
You can have memory leaks with a GC in another way: if you use a conservative garbage collector that naively scans the memory and for everything that looks like a pointer, doesn't free the memory it "points to", you may leave unreachable memory allocated.
Could anyone point me to a good source on how to implement garbage collection? I am making a lisp-like interpreted language. It currently uses reference counting, but of course that fails at freeing circularly dependent objects.
I've been reading of mark and sweep, tricolor marking, moving and nonmoving, incremental and stop-the-world, but... I don't know what the best way to keep the objects neatly separated into sets while keeping per-object memory overhead at a minimum, or how to do things incrementally.
I've read some languages with reference counting use circular reference detection, which I could use. I am aware I could use freely available collectors like Boehm, but I would like to learn how to do it myself.
I would appreciate any online material with some sort of tutorial or help for people with no experience on the topic like myself.
Could anyone point me to a good source on how to implement garbage collection?
There's a lot of advanced material about garbage collection out there. The Garbage Collection Handbook is great. But I found there was precious little basic introductory information so I wrote some articles about it. Prototyping a mark-sweep garbage collector describes a minimal mark-sweep GC written in F#. The Very Concurrent Garbage Collector describes a more advanced concurrent collector. HLVM is a virtual machine I wrote that includes a stop-the-world collector that handles threading.
The simplest way to implement a garbage collector is:
Make sure you can collate the global roots. These are the local and global variables that contain references into the heap. For local variables, push them on to a shadow stack for the duration of their scope.
Make sure you can traverse the heap, e.g. every value in the heap is an object that implements a Visit method that returns all of the references from that object.
Keep the set of all allocated values.
Allocate by calling malloc and inserting the pointer into the set of all allocated values.
When the total size of all allocated values exceeds a quota, kick off the mark and then sweep phases. This recursively traverses the heap accumulating the set of all reachable values.
The set difference of the allocated values minus the reachable values is the set of unreachable values. Iterate over them calling free and removing them from the set of allocated values.
Set the quota to twice the total size of all allocated values.
Check out the following page. It has many links. http://lua-users.org/wiki/GarbageCollection
As suggested by delnan, I started with a very naïve stop-the-world tri-color mark and sweep algorithm. I managed to keep the objects in the sets by making them linked-list nodes, but it does add a lot of data to each object (the virtual pointer, two pointers to nodes, one enum to hold the color). It works perfectly, no memory lost on valgrind :) From here I might try to add a free list for recycling, or some sort of thing that detects when it is convenient to stop the world, or an incremental approach, or a special allocator to avoid fragmentation, or something else. If you can point me where to find info or advice (I don't know whether you can comment on an answered question) on how to do these things or what to do, I'd be very thankful. I'll be checking Lua's GC in the meantime.
I have implemented a Cheney-style copying garbage collector in C in about 400 SLOC. I did it for a statically-typed language and, to my surprise, the harder part was actually communicating the information which things are pointers and which things aren't. In a dynamically typed language this is probably easier since you must already use some form of tagging scheme.
There also is a new version of the standard book on garbage collection coming out: "The Garbage Collection Handbook: The Art of Automatic Memory Management" by Jones, Hosking, Moss. (The Amazon UK site says 19 Aug 2011.)
One thing I haven't yet seen mentioned is the use of memory handles. One may avoid the need to double-up on memory (as would be needed with the Cheney-style copying algorithm) if each object reference is a pointer to a structure which contains the real address of the object in question. Using handles for memory objects will make certain routines a little slower (one must reread the memory address of an object any time something might have happened that would move it) but for single-threaded systems where garbage collection will only happen at predictable times, this isn't too much of a problem and doesn't require special compiler support (multi-threaded GC systems will are likely to require compiler-generated metadata whether they use handles or direct pointers).
If one uses handles, and uses one linked list for live handles (the same storage can be used to hold a linked list for dead handles needing reallocation), one can, after marking the master record for each handle, proceed through the list of handles, in allocation order, and copy the block referred to by that handle to the beginning of the heap. Because handles will be copied in order, there will be no need to use a second heap area. Further, generations may be supported by keeping track of some top-of-heap pointers. When compactifying memory, start by just compactifying items added since the last GC. If that doesn't free up enough space, compactify items added since the last level 1 GC. If that doesn't free up enough space, compactify everything. The marking phase would probably have to act upon objects of all generations, but the expensive compactifying stage would not.
Actually, using a handle-based approach, if one is marking things of all generations, one could if desired compute on each GC pass the amount of space that could be freed in each generation. If half the objects in Gen2 are dead, it may be worthwhile to do a Gen2 collection so as to reduce the frequency of Gen1 collections.
Garbage collection implementation in Lisp
Building LISP | http://www.lwh.jp/lisp/
Arcadia | https://github.com/kimtg/arcadia
Read Memory Management: Algorithms and Implementations in C/C++. It's a good place to start.
I'm doing similar work for my postscript interpreter. more info via my question. I agree with Delnan's comment that a simple mark-sweep algorithm is a good place to start. You'll need functions to set-mark, check-mark, clear-mark, and iterators for all your containers. One easy optimization is to clear-mark whenever allocating a new object, and clear-mark during the sweep; otherwise you'll need an entire pass to clear marks before you start setting them.
Garbage collection involves walking through a list of allocated objects (either all objects or objects in a particular generation) and determining which are reachable.
How is this list maintained? Do runtimes for GC languages keep a giant list of all objects?
Also, from what I understand, GC involves walking the call stack to look for object references - how does the algorithm distinguish between GC-able pointers and primitive data?
The memory management system keeps track of the size of each allocated object, just like it does in C or C++. One way this is commonly done is for the memory management system to allocate an extra size_t before each allocation, that keeps track of the size of each objecct. The memory manager likewise has to keep track of the size of each free block, so that it can reuse blocks to allocate them.
The garbage collector works in two phases: the mark phase, and the sweep phase. In the mark phase, the garbage collector starts walks object references in order to find objects that are still reachable. The garbage collector starts at a few basic places where the object references are stored and given names (the stack, and global storage, and static storage), and then traverses references in the objects.
In the sweep phase, the garbage collector walks the heap from bottom to top, jumping from allocation to allocation based on those size_ts, and frees anything that isn't marked.
Some languages (like Ruby) tag all of the primitives so that they can be identified separately from the object references at runtime. Other garbage collectors are ver conservative and follow primatives as through they were object references (though some checks must be performed to make sure that the garbage collector doesn't stick a mark in the middle of some other object). Still other languages use runtime type information to be more precise about whether they follow primatives.
Ruby's garbage collector sometimes called "conservative" because it doesn't check whether the space on the stack is actually being used, so it sometimes keeps dead objects alive by following ghost references on the stack. But since it always knows exactly whether the data it's looking at is a reference or a primative, I don't call it conservative here.
Garbage collection involves walking through a list of allocated objects (either all objects or objects in a particular generation) and determining which are reachable.
Not really. GCs are categorized into tracing and reference counting (see A unified theory of garbage collection). Tracing GCs start from a set of global roots and trace all objects reachable from them. Reference counting GCs count the number of references to each object and reclaim it when the count reaches zero. Neither require a list including unreachable objects.
How is this list maintained? Do runtimes for GC languages keep a giant list of all objects?
Pedagogical solutions like the one in HLVM can keep a list of all objects because it is simple but this is rare.
Also, from what I understand, GC involves walking the call stack to look for object references - how does the algorithm distinguish between GC-able pointers and primitive data?
Again, there are many different strategies. Conservative GCs are unable to distinguish between pointers and non-pointers so they conservatively consider that non-pointers might be pointers. Pedagogical GCs like the one in HLVM can use algorithms like Henderson's Accurate GC in an uncooperative environment. Production GCs store enough information in the OS thread stack to determine exactly which words are pointers (and which stack frames to skip because they are not affiliated with managed code) and then use a stack walker to find them.
Note that you also have to find local references held in registers as well as on the stack.
This site ( How Java’s Garbage Collector Works? ) has a good, brief explanation on how garbage collectors work, not just the default Java one.
Imagine that you call from a language with GC repetitively a function from another language (e.g., Fortran 95). The Fortran function leaves something allocated in the memory between calls which might be seen from the caller language as unreferenced rubbish.
Could GC from the caller language access the memory allocated in Fortran and consider it as rubbish and free it?
I guess that it won't happen. The memory allocated by the Fortran function should have its own memory management separated from the memory managed by GC, however, I would be happy if anyone could confirm that.
Why do I need it? (if anyone is interested)
As described above, I need to write a function in F95 which allocates its own memory, is called several times and it needs to keep the reference to the allocated memory between calls. The problem is that Fortran pointers are incompatibile with the outside world so I can't just pass something as 'void *' from Fortran. Therefore the Fortran function would store the pointer not as a pointer but would cast it (for example) as an integer array for the outside world. However, if GC could anyhow interfere memory from Fortran, it might not understand that the reference is kept in the integer array and might want to free the memory allocated in Fortran, which would be bad.
No, unless the language explicitely integrated with the host langauge (using the garbage collector). In .NET... a C++ application can use C++/CLI to allocate .NET objects and return those - and those naturally are garbage collected. I do that in a number of projects.
But a pure C++ object... the garbage colelctor knows nothing about and does not know how to handle.
There probably isn't a single answer to this question that's guaranteed to be correct. As a rule, however, a garbage collector will be associated with some sort of heap allocator, and can/will only collect memory within the heap it controls. Since your Fortran function will (presumably) allocate its memory completely separately, it probably won't be affected by the garbage collector.
Without knowing exactly what garbage collector you're talking about, it's probably not possible to say with certainty though.