If I have a garbage collector that tracks every object allocated and deallocates each one as soon as it no longer has any usable references, can you still have a memory leak?
Considering that a memory leak is an allocation without any reference, isn't that impossible, or am I missing something?
Edit: So what I'm counting as a memory leak is allocations which you no longer have any reference to in the code. Large numbers of accumulating allocations which you still have references to aren't the leaks I'm considering here.
I'm also only talking about a normal, state-of-the-art GC. It's been a while, but I know cases like cyclical references don't trip them up. I don't need a specific answer for any language; this is just coming from a conversation I was having with a friend. We were talking about ActionScript and Java, but I'm not looking for answers specific to those.
Edit 2: From the sounds of it, there doesn't seem to be any way for code to completely lose the ability to reference an allocation without the GC being able to pick it up, but I'm still waiting for more people to weigh in.
If your question is really this:
Considering that a memory leak is an allocation without any reference,
isn't that impossible, or am I missing something?
Then the answer is "yes, that's impossible" because a properly implemented garbage collector will reclaim all allocations that don't have active references.
However, you can definitely have a "memory leak" in (for example) Java. My definition of a "memory leak" is an allocation that still has an active reference (so it won't be reclaimed by the garbage collector), but the programmer doesn't realize the object is still reachable (i.e., as far as the programmer is concerned, this object is dead and should be reclaimed). A simple example is something like this:
ObjectA -> ObjectB
In this example, ObjectA is an object in active use in the code. However, ObjectA contains a reference to ObjectB that is effectively dead (i.e., ObjectB has been allocated and used and is now, from the programmer's perspective, dead), but the programmer forgot to set the reference in ObjectA to null. In this case, ObjectB has been "leaked".
Doesn't sound like a big problem, but there are situations where these leaks are cumulative. Let's imagine that ObjectA and ObjectB are actually instances of the same class, and that the programmer forgets to set the reference to null every time such an instance is used. Eventually you end up with something like this:
ObjectA -> ObjectB -> ObjectC -> ObjectD -> ObjectE -> ObjectF -> ObjectG -> ObjectH -> etc...
Now ObjectB through ObjectH are all leaked. And problems like this will (eventually) cause your program to crash. Even with a properly implemented garbage collector.
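In Java, the pattern described above might look roughly like this (a hedged sketch; the Node class and its fields are made up for illustration, not taken from the answer):

```java
// Hypothetical sketch of the "forgotten reference" leak described above.
class Node {
    Node previous;                          // reference back to the node that created this one
    byte[] payload = new byte[1024 * 1024]; // stands in for "real" data

    Node(Node previous) {
        this.previous = previous;
    }
}

public class LeakDemo {
    public static void main(String[] args) {
        Node current = new Node(null);
        while (true) {
            // The programmer considers the old node dead, but forgets to clear
            // the back-reference, so every node stays reachable through the
            // chain and the GC can never reclaim any of them.
            current = new Node(current);
            // Fix: set the old node's reference to null (or pass null here)
            // once it is logically dead.
        }
    }
}
```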
To decide whether a program has a memory leak, one must first define what a leak is. I would define a program as having a memory leak if there exists some state `S` and series of inputs `I` such that:
If the program is in state `S` and it receives inputs `I`, it will still be in state `S` (if it doesn't crash), but...
The amount of memory required to repeat the above sequence `N` times will increase without bound.
It is definitely possible for programs that run entirely within garbage-collected frameworks to have memory leaks as defined above. A common way in which that can occur is with event subscriptions.
Suppose a thread-safe collection exposes a CollectionModified event, and the IEnumerator<T> returned by its IEnumerable<T>.GetEnumerator() method subscribes to that event on creation and unsubscribes on Dispose; the event is used to allow enumeration to proceed sensibly even when the collection is modified (e.g. ensuring that objects that are in the collection continuously throughout the enumeration will be returned exactly once, and those that exist during only part of it will be returned no more than once). Now suppose a long-lived instance of that collection class is created, and some particular input causes it to be enumerated. If the CollectionModified event holds a strong reference to every non-disposed IEnumerator<T>, then repeatedly enumerating the collection will create and subscribe an unbounded number of enumerator objects. Memory leak.
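The answer above is phrased in .NET terms. As a rough Java analogue (a sketch only; ObservableList, LeakyIterator and the rest are hypothetical names, not a real API), the same shape of leak could look like this:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical Java analogue of the enumerator-subscription leak described
// above: each iterator registers itself with the collection so it could be
// notified of modifications, but nothing ever unregisters it.
class ObservableList<T> implements Iterable<T> {
    private final List<T> items = new ArrayList<>();
    // Strong references to every iterator ever created: the leak.
    private final List<LeakyIterator> liveIterators = new ArrayList<>();

    public void add(T item) {
        items.add(item);
        // (A real implementation would notify liveIterators here.)
    }

    @Override
    public Iterator<T> iterator() {
        LeakyIterator it = new LeakyIterator();
        liveIterators.add(it); // subscribed, never unsubscribed
        return it;
    }

    private class LeakyIterator implements Iterator<T> {
        private int index = 0;
        @Override public boolean hasNext() { return index < items.size(); }
        @Override public T next() { return items.get(index++); }
    }
}

public class SubscriptionLeakDemo {
    public static void main(String[] args) {
        ObservableList<String> list = new ObservableList<>();
        list.add("a");
        // Every enumeration leaves another iterator strongly reachable from
        // the long-lived collection: unbounded growth, i.e. a memory leak.
        while (true) {
            for (String s : list) { /* no-op */ }
        }
    }
}
```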
Memory leaks don't just depend on how efficient a garbage collection algorithm is. If your program holds on to object references that have a long lifetime, say in an instance variable or static variable that is never used again, your program will have memory leaks.
Reference counting has a known problem with cyclic references, meaning:
Object 1 refers to Object 2 and Object 2 refers to Object 1,
but no one else refers to Object 1 or Object 2. A reference-counting algorithm will fail in this scenario.
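A minimal Java sketch of that cycle (the class and field names are made up for illustration): Java's tracing GC reclaims both objects once the local variables are cleared, but a pure reference counter would not, since each object still holds a count of one.

```java
// Two objects that refer to each other and nothing else.
class CycleNode {
    CycleNode partner;
}

public class CycleDemo {
    public static void main(String[] args) {
        CycleNode a = new CycleNode();
        CycleNode b = new CycleNode();
        a.partner = b;   // Object 1 refers to Object 2
        b.partner = a;   // Object 2 refers to Object 1
        a = null;
        b = null;
        // No outside references remain, but the two objects still reference
        // each other; a reference counter alone would never free them, while
        // a tracing collector will.
        System.out.println("cycle is now unreachable");
    }
}
```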
Since you are working on the garbage collector itself, it's worth reading about different implementation strategies.
You can have memory leaks with a GC in another way: if you use a conservative garbage collector that naively scans memory and, for anything that looks like a pointer, refuses to free the memory it "points to", you may leave unreachable memory allocated.
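A toy simulation of that effect in Java (hedged: this is not a real conservative collector, just an illustration of the address-range check; all names and addresses are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of conservative scanning: every word found on the "stack"
// is treated as a possible pointer, so a plain integer that happens to fall
// inside a dead allocation's address range keeps that allocation alive.
public class ConservativeScanDemo {
    record Allocation(long start, long size) {
        boolean contains(long address) {
            return address >= start && address < start + size;
        }
    }

    public static void main(String[] args) {
        List<Allocation> heap = List.of(
                new Allocation(0x1000, 64),   // genuinely dead allocation
                new Allocation(0x2000, 64));  // live allocation

        // Simulated stack words: 0x2010 is a real pointer into the live
        // block, 42 looks like nothing, and 0x1008 is just an integer that
        // happens to look like a pointer into the dead block.
        long[] stackWords = { 0x2010, 42, 0x1008 };

        List<Allocation> retained = new ArrayList<>();
        for (Allocation a : heap) {
            for (long word : stackWords) {
                if (a.contains(word)) {
                    retained.add(a);   // conservatively kept alive
                    break;
                }
            }
        }
        // Both allocations are retained, even though the 0x1000 block is
        // unreachable: the kind of "leak" a conservative collector can cause.
        System.out.println("retained: " + retained);
    }
}
```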
Related
Suppose that an object on the heap goes out of scope. Why can't the program free the memory right after the scope ends? Or, if we have a pointer to an object that is replaced by the address of a new object, why can't the program deallocate the old one before assigning the new one? I'm guessing that it's faster not to free it immediately and instead have the freeing done asynchronously at a later point in time, but I'm not really sure.
Why is garbage collection necessary?
It is not strictly necessary. Given enough time and effort you can always translate a program that depends on garbage collection to one that doesn't.
In general, garbage collection involves a trade-off.
On the one hand, garbage collection allows you to write an application without worrying about the details of memory allocation and deallocation. (And the pain of debugging crashes and memory leaks caused by getting the deallocation logic wrong.)
The downside of garbage collection is that you need more memory. A typical garbage collector is not efficient if it doesn't have plenty of spare space¹.
By contrast, if you do manual memory management, you can code your application to free up heap objects as soon as they are no longer used. Furthermore, you don't get awkward "pauses" while the GC is doing its thing.
The downside of manual memory management is that you have to write the code that decides when to call free, and you have to get it correct. Furthermore, if you try to manage memory by reference counting (see the sketch after this list):
you have the cost of incrementing and decrementing ref counts whenever pointers are assigned or variables go out of scope,
you have to deal with cycles in your data structures, and
it is worse when your application is multi-threaded and you have to deal with memory caches, synchronization, etc.
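For illustration only, here is a hedged sketch of what manual reference counting involves (the RefCounted wrapper is invented for this example): every copy or drop of a reference has to touch the count, atomically if the program is multi-threaded, and cycles are never freed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Invented sketch of manual reference counting: every retain/release updates
// a counter, and the last release is where deallocation would happen.
public class RefCounted<T> {
    private final T value;
    private final AtomicInteger count = new AtomicInteger(1); // atomic for thread safety

    public RefCounted(T value) { this.value = value; }

    public T get() { return value; }

    // Called whenever another reference to the object is created.
    public RefCounted<T> retain() {
        count.incrementAndGet();
        return this;
    }

    // Called whenever a reference goes away; the last release "frees" it.
    public void release() {
        if (count.decrementAndGet() == 0) {
            // In a manually managed language this is where the object would
            // actually be deallocated.
            System.out.println("freeing " + value);
        }
    }

    public static void main(String[] args) {
        RefCounted<String> s = new RefCounted<>("hello");
        RefCounted<String> alias = s.retain(); // a second reference
        s.release();       // one reference gone
        alias.release();   // last reference gone, prints "freeing hello"
    }
}
```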
For what it is worth, if you use a decent garbage collector and tune it appropriately (e.g. give it enough memory, etc) then the CPU costs of GC and manual storage management are comparable when you apply them to a large application.
Reference:
"The measured cost of conservative garbage collection" by Benjamin Zorn
¹ This is because the main cost of a modern collector is in traversing and dealing with the non-garbage objects. If there is not a lot of garbage because you are being miserly with the heap space, the GC does a lot of work for little return. See https://stackoverflow.com/a/2414621/139985 for an analysis.
It's more complicated, but:
1) What if there is memory pressure before the scope is over? Scope is only a language notion, not related to reachability, so an object can be "freed" before it goes out of scope (Java GCs do that on a regular basis). Also, if you free objects as soon as each scope ends, you might be doing too little work too often.
2) As far as the references go, you are not considering that references may form hierarchies, and when you change one there has to be code that traverses those. The moment the assignment happens might not be the right time to do it.
In general, there is nothing wrong with the proposal you describe; as a matter of fact, this is almost exactly how the Rust programming language works, from a high-level point of view.
When searching for an answer, I found this question; however, there is no mention of static-lifetime objects. Can the method mentioned in this answer (calling drop() on the object) be used for static-lifetime objects?
I was imagining a situation like a linked list. You need to keep nodes of the list around for (potentially) the entire lifetime of the program, however you also may remove items from the list. It seems wasteful to leave them in memory for the entire execution of the program.
Thanks!
No. The very point of a static is that it's static: it has a fixed address in memory and can't be moved from there. As a consequence, everybody is free to hold a reference to that object, because it's guaranteed to be there as long as the program is executing. That's why you only get to use a static in the form of a &'static reference and can never claim ownership.
Besides, doing this for the purpose of memory conservation is pointless: The object is baked into the executable and mapped to memory on access. All that could happen is for the OS to relinquish the memory mapping. Yet, since the memory is never allocated from the heap in the first place, there is no saving to be had.
The only thing you could do is to replace the object using unsafe mutable access. This is both dangerous (because the compiler is free to assume that the object does not in fact change) and pointless, due to the fact that the memory can't be freed, as it's part of the executable's memory mapping.
With reference counting, objects can be reclaimed right after they're no longer referenced. It doesn't require running a separate thread for GC.
Other GC methods, such as mark-and-sweep, run on their own thread, and we cannot determine when they run. Maybe the youngest generation is reclaimed when the function returns, but some other objects that have been pushed to the next generation might be garbage as well.
Is there any other GC method which reclaims the objects at determined time?
If by "at determined time" you mean "as soon as possible, in the absence of cycles", then no. You need naive reference counting for that, with all its problems, you can't even use any of the optimizations of reference counting (such as deferred reference counting).
If times like "at the end of a scope" are acceptable, then yes, it's possible (though not advisable). You just run whatever form of GC you have at that time. This is, of course, highly inefficient, which is one reason nobody does it (the other being that the only advantage, deterministic cleanup, is rarely needed and better handled explicitly). Incremental GCs may alleviate this a bit, but I'm not sure how much.
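For what it's worth, here is a hedged Java illustration of "running whatever form of GC you have" at the end of a scope: System.gc() is only a hint to the JVM, and, as the answer notes, doing this routinely would be highly inefficient.

```java
public class ScopeEndGcDemo {
    static void doWork() {
        byte[] scratch = new byte[16 * 1024 * 1024]; // temporary allocation
        System.out.println("scratch length: " + scratch.length);
        // 'scratch' becomes unreachable when the method returns.
    }

    public static void main(String[] args) {
        doWork();
        // "Run the GC" at the end of the scope. System.gc() is merely a hint;
        // the JVM may ignore it, and calling it after every scope would be
        // very inefficient.
        System.gc();
    }
}
```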
What techniques do modern garbage collectors (as in CLR, JVM) use to tell which heap objects are referenced from the stack?
Specifically how can a VM work back from knowing where the stack starts to interpreting all local references to heap objects?
In Java (and likely in the CLR although I know its internals less well), the bytecode is typed with object vs primitive information. As a result, there are data structures in the bytecode that describe which variables in each stack frame are objects and which are primitives. When the GC needs to scan the root set, it uses these StackMapTables to differentiate between references and non-references.
CLR and Java have to have some mechanism like this because they are exact collectors. There are also conservative collectors, like the Boehm collector, that treat every word on the stack as a possible pointer. They look to see whether the value (when treated as a pointer) points into the heap, and if so, they mark it as alive.
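If you want to see the StackMapTable attribute the first answer mentions, a small hypothetical example like the one below (any method with a branch should do) can be compiled with javac and then inspected with javap -v; the frames describe which locals and stack slots hold references and which hold primitives.

```java
// Hypothetical example: compile with `javac StackMapExample.java` and run
// `javap -v StackMapExample` to see the StackMapTable attribute, whose frames
// record, at branch targets, which slots hold references (e.g. java/lang/Object)
// and which hold primitives (e.g. int).
public class StackMapExample {
    static Object pick(int n) {
        Object result = null;        // reference-typed local
        int counter = 0;             // primitive-typed local
        if (n > 0) {                 // a branch target forces a stack map frame
            result = new Object();
            counter = n;
        }
        System.out.println(counter);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pick(1));
    }
}
```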
Take a look at this Artima article from August 1996, Java's Garbage-Collected Heap; especially page 2.
Any garbage collection algorithm must do two basic things. First, it must detect garbage objects. Second, it must reclaim the heap space used by the garbage objects and make it available to the program. Garbage detection is ordinarily accomplished by defining a set of roots and determining reachability from the roots. An object is reachable if there is some path of references from the roots by which the executing program can access the object. The roots are always accessible to the program. Any objects that are reachable from the roots are considered live. Objects that are not reachable are considered garbage, because they can no longer affect the future course of program execution.
In a JVM the root set is implementation dependent but would always include any object references in the local variables. In the JVM, all objects reside on the heap. The local variables reside on the Java stack, and each thread of execution has its own stack. Each local variable is either an object reference or a primitive type, such as int, char, or float. Therefore the roots of any JVM garbage-collected heap will include every object reference on every thread's stack. Another source of roots are any object references, such as strings, in the constant pool of loaded classes. The constant pool of a loaded class may refer to strings stored on the heap, such as the class name, superclass name, superinterface names, field names, field signatures, method names, and method signatures.
Any object referred to by a root is reachable and is therefore a live object. Additionally, any objects referred to by a live object are also reachable. The program is able to access any reachable objects, so these objects must remain on the heap. Any objects that are not reachable can be garbage collected because there is no way for the program to access them.
The article continues to explore different garbage collection strategies, including reference counting collectors, tracing collectors, compacting collectors and copying collectors.
Though this article is old, it still applies today; not much has really changed. There have been performance improvements to the different collection strategies, but no new major advancements.
The Oracle HotSpot JVM, for example, has a new Garbage-First Garbage Collector which is a copying collector with performance tweaks for multi-core processors and large heap sizes (see this answer for more on the G1 Garbage Collector).
Interesting documentation on this topic was posted by the .NET team shortly after they made CoreCLR open source: Stack Walking
Garbage collection involves walking through a list of allocated objects (either all objects or objects in a particular generation) and determining which are reachable.
How is this list maintained? Do runtimes for GC languages keep a giant list of all objects?
Also, from what I understand, GC involves walking the call stack to look for object references - how does the algorithm distinguish between GC-able pointers and primitive data?
The memory management system keeps track of the size of each allocated object, just like it does in C or C++. One common way to do this is for the memory management system to allocate an extra size_t before each allocation that records the object's size. The memory manager likewise has to keep track of the size of each free block, so that it can reuse blocks when allocating.
The garbage collector works in two phases: the mark phase and the sweep phase. In the mark phase, the garbage collector walks object references in order to find objects that are still reachable. It starts at a few basic places where object references are stored and given names (the stack, global storage, and static storage), and then traverses references within the objects it reaches.
In the sweep phase, the garbage collector walks the heap from bottom to top, jumping from allocation to allocation based on those size_ts, and frees anything that isn't marked.
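A minimal, hedged sketch of those two phases in Java, simulating the heap as a list of allocation records instead of walking raw memory with size_t headers (all names here are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Toy mark-and-sweep collector. The "heap" is a list of allocation records;
// a real collector would walk actual memory using per-allocation size headers.
public class MarkSweepDemo {
    static class Obj {
        final String name;
        final List<Obj> references = new ArrayList<>(); // outgoing references
        boolean marked;

        Obj(String name) { this.name = name; }
    }

    // Mark phase: start from the roots and mark everything reachable.
    static void mark(List<Obj> roots) {
        for (Obj root : roots) {
            markFrom(root);
        }
    }

    static void markFrom(Obj obj) {
        if (obj.marked) return;
        obj.marked = true;
        for (Obj ref : obj.references) {
            markFrom(ref);
        }
    }

    // Sweep phase: walk the whole heap and "free" anything unmarked.
    static void sweep(List<Obj> heap) {
        heap.removeIf(obj -> {
            if (!obj.marked) {
                System.out.println("freeing " + obj.name);
                return true;
            }
            obj.marked = false; // reset for the next collection cycle
            return false;
        });
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c");
        a.references.add(b);              // a -> b; c is unreachable

        List<Obj> heap = new ArrayList<>(List.of(a, b, c));
        List<Obj> roots = List.of(a);     // the stack and globals would go here

        mark(roots);
        sweep(heap);                      // prints "freeing c"
    }
}
```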
Some languages (like Ruby) tag all of the primitives so that they can be identified separately from the object references at runtime. Other garbage collectors are very conservative and follow primitives as though they were object references (though some checks must be performed to make sure that the garbage collector doesn't stick a mark in the middle of some other object). Still other languages use runtime type information to be more precise about whether they follow primitives.
Ruby's garbage collector is sometimes called "conservative" because it doesn't check whether the space on the stack is actually being used, so it sometimes keeps dead objects alive by following ghost references on the stack. But since it always knows exactly whether the data it's looking at is a reference or a primitive, I wouldn't call it conservative here.
Garbage collection involves walking through a list of allocated objects (either all objects or objects in a particular generation) and determining which are reachable.
Not really. GCs are categorized into tracing and reference counting (see A Unified Theory of Garbage Collection). Tracing GCs start from a set of global roots and trace all objects reachable from them. Reference-counting GCs count the number of references to each object and reclaim it when the count reaches zero. Neither requires a list that includes unreachable objects.
How is this list maintained? Do runtimes for GC languages keep a giant list of all objects?
Pedagogical solutions like the one in HLVM can keep a list of all objects because it is simple, but this is rare.
Also, from what I understand, GC involves walking the call stack to look for object references - how does the algorithm distinguish between GC-able pointers and primitive data?
Again, there are many different strategies. Conservative GCs are unable to distinguish between pointers and non-pointers so they conservatively consider that non-pointers might be pointers. Pedagogical GCs like the one in HLVM can use algorithms like Henderson's Accurate GC in an uncooperative environment. Production GCs store enough information in the OS thread stack to determine exactly which words are pointers (and which stack frames to skip because they are not affiliated with managed code) and then use a stack walker to find them.
Note that you also have to find local references held in registers as well as on the stack.
This site (How Java’s Garbage Collector Works?) has a good, brief explanation of how garbage collectors work, not just the default Java one.