Most languages that use garbage collectors (possibly all of them) have one major issue related to parallel computation: the garbage collector has to stop all running threads in order to delete unused objects. Haskell has a garbage collector too, but due to purity, Haskell guarantees that no computation changes the original data; it produces a copy instead and then makes the changes. I suppose that, because of this, the GC would not have to stop all threads to do its job. I'm just curious: does Haskell have the same issue with garbage collection or not?
GHC's garbage collector is parallel but not concurrent. That means that it can use all threads to perform garbage collection, but it has to stop all threads to do that. Concurrent garbage collection is much harder to implement (and usually has a larger performance overhead).
It is somewhat ironic that Haskell does indeed use lots of mutable objects, namely thunks (unevaluated expressions). Mutable objects cannot freely be duplicated (and even for immutable objects too much duplication should be kept in check).
For programs running on multiple cores, having a truly concurrent collector would be nice, but you can also get decent benefits by doing heap-local garbage collection. The idea is that data that is not shared between multiple CPUs can be collected by the owning CPU alone. This is typically the case for short-lived data. The Simons have done some recent work in this area. See their paper "Multicore Garbage Collection with Local Heaps" (PDF). This paper also shows some ways you can exploit immutability in a way similar to what you propose.
Edit: I forgot to mention that Erlang basically does exactly what you propose. Each Erlang process has its own heap and sending a message copies the data from one process to the other. For this reason each Erlang process can do its own GC independently of all the other processes. (The downside is that Erlang doesn't give you shared memory concurrency.)
Several GC implementations may run in parallel without "stopping the world" as you suggest (I think the latest Oracle JVM doesn't stop the world and is multi-threaded, for example; and most JVMs are not "stop-the-world").
Recall that GC implementations may vary widely from one language implementation to another.
There is considerable literature about GC, and there are still many research papers on parallel garbage collection.
Start with a good book like the GC Handbook (by Richard Jones, Antony Hosking, Eliot Moss), or at least the Wikipedia page on garbage collection.
Purely functional languages like Haskell rely heavily on a very performant GC. Other languages have different constraints (for instance, write barriers matter less with Haskell than with Java, but Haskell programs probably allocate more than Java programs).
For parallel GC, the devil is very much in the details.
Some newer languages implement ARC (automatic reference counting) in their compilers (Swift and Rust, to name a couple). As I understand it, this achieves the same thing as runtime GC (taking the burden of manual deallocation away from the programmer) while being significantly more efficient.
I understand that ARC could become a complex process, but with the complexity of modern garbage collectors it seems like it would be no more complex to implement ARC. However, there are still tons of languages and frameworks using GC for memory management, and even the Go language, which targets systems programming, uses GC.
I really cannot understand why GC would be preferable to ARC. Am I missing something here?
There's a bunch of tradeoffs involved here; it's a complex topic. Here are the big ones, though:
GC pros:
Tracing garbage collectors can handle cycles in object graphs. Automatic reference counting will leak memory unless cycles are manually broken either by removing a reference or figuring out which edge of the graph should be weak. This is quite a common problem in practice in reference counted apps.
Tracing garbage collectors can actually be moderately faster (in terms of throughput) than reference counting, by doing work concurrently, by batching work up, by deferring work, and by not dirtying caches with reference-count updates in hot loops.
Copying collectors can compact the heap, reclaiming fragmented pages to reduce footprint.
ARC pros:
Because object destruction happens immediately when the reference count hits 0, object lifetimes can be used to manage non-memory resources. With garbage collection, lifetimes are non-deterministic, so this isn't safe.
Collection work is typically more spread out, resulting in much shorter pauses (it's still possible to get a pause if you deallocate a large subgraph of objects)
Because memory is collected synchronously, it's not possible to "outrun the collector" by allocating faster than it can clean up. This is particularly important when VM paging comes into play, since there are degenerate cases where the GC thread hits a page that's been paged out, and falls far behind.
On a related note, tracing garbage collectors have to walk the entire object graph, which forces unnecessary page-ins (there are mitigations for this like https://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf, but they're not widely deployed)
Tracing garbage collectors typically need more "scratch space" than reference counting if they want to hit their full throughput
My personal take on this is that the only two points that really matter for most cases are:
ARC doesn't collect cycles (a minimal example follows at the end of this answer)
GC doesn't have deterministic lifetimes
I feel that both of these issues are deal breakers, but in the absence of a better idea, you just have to pick which horrifying problem sounds worse to you.
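To make the cycle point concrete, here is a minimal sketch using C++'s shared_ptr as a stand-in for reference counting (the Parent/Child types are hypothetical). With both edges strong, the two objects keep each other alive forever; marking the back-edge weak is exactly the kind of manual fix ARC requires, whereas a tracing GC would reclaim the cycle on its own.

#include <memory>

struct Child;

struct Parent {
    std::shared_ptr<Child> child;      // strong edge: Parent owns Child
};

struct Child {
    std::shared_ptr<Parent> parent;    // strong back-edge: this is what leaks
    // std::weak_ptr<Parent> parent;   // the "weak edge" fix that breaks the cycle
};

int main() {
    auto p = std::make_shared<Parent>();
    p->child = std::make_shared<Child>();
    p->child->parent = p;              // cycle: neither count can ever reach zero
    return 0;
}                                      // p goes out of scope, but both objects leak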
Relating to my other question, "Haskell collections with guaranteed worst-case bounds for every single operation?", I'm curious: how long can the pauses caused by garbage collection be?
Does Haskell use some kind of incremental garbage collection so that a program is stopped only for small periods at a time, or can it stop for several seconds in an extreme case?
I found two of SPJ's papers on the subject:
https://research.microsoft.com/en-us/um/people/simonpj/papers/non-stop/index.htm
But I couldn't find any indication of whether these ideas were actually adopted by GHC (or other Haskell implementations).
GHC is designed for computational throughput, not latency. As a result, GHC uses a generational, multi-threaded garbage collector with thread-local heaps.
Garbage collection of thread-local objects does not stop other threads. The occasional major GC of the global heap will pause all threads.
Typically the pauses are on the order of a few milliseconds; however, there is no latency guarantee.
You can control the frequency of GC via several runtime flags (e.g. the -I interval flag).
I was reading this generated HTML (which may expire; here is the original PS file).
GC Myth 3: Garbage collectors are always slower than explicit memory deallocation.
GC Myth 4: Garbage collectors are always faster than explicit memory deallocation.
This was a big WTF for me. How would GC be faster than explicit memory deallocation? Isn't it essentially calling an explicit memory deallocator when it frees the memory and makes it available for use again? So... what does it actually mean?
Very small objects & large sparse heaps ==> GC is usually cheaper, especially with threads
I still don't understand it. It's like saying C++ is faster than machine code (if you don't understand the WTF in this sentence, please stop programming; let the -1s begin). After a quick Google, one source suggested it's faster when you have a lot of memory. What I think it means is that it doesn't bother with freeing at all. Sure, that can be fast, and I have written a custom allocator that does exactly that, never freeing (void free(void*p){}), in ONE application that doesn't free any objects (it only frees at the end when it terminates) and has that definition mostly for the sake of libraries and things like the STL. So I am pretty sure this would be faster than GC as well. If I still want freeing, I guess I can use an allocator that uses a deque, or its own implementation that is essentially
void *free_buffer[4096];              /* pending pointers, released in one batch       */
void **freeptr = free_buffer;
void **buffer_end = free_buffer + 4096;

void deferred_free(void *ptr) {
    if (freeptr < buffer_end) {
        *freeptr = ptr;               /* just record the pointer, do no work yet       */
        ++freeptr;
    } else {
        freestuff();                  /* release everything recorded so far in one go  */
        freeptr = free_buffer;
        *freeptr++ = ptr;
    }
}
which I am sure would be really fast. I sort of answered my own question already: the case where the GC is never invoked is the case where it would be faster, but I am sure that is not what the document means, as it mentions two collectors in its test. I am sure the very same application would be slower if the GC were invoked even once, no matter which GC is used. If it's known that nothing ever needs to be freed, then an empty free body can be used, like in that one app I had.
Anyway, I'm posting this question for further insight.
How would GC be faster than explicit memory deallocation?
GCs can pointer-bump allocate into a thread-local generation and then rely upon copying collection to handle the (relatively) uncommon case of evacuating the survivors. Traditional allocators like malloc often compete for global locks and search trees. (A rough sketch of this appears after this list.)
GCs can deallocate many dead blocks simultaneously by resetting the thread-local allocation buffer instead of calling free on each block in turn, i.e. O(1) instead of O(n).
By compacting old blocks so more of them fit into each cache line. The improved locality increases cache efficiency.
By taking advantage of extra static information such as immutable types.
By taking advantage of extra dynamic information such as the changing topology of the heap via the data recorded by the write barrier.
By making more efficient techniques tractable, e.g. by removing the headache of manual memory management from wait free algorithms.
By deferring deallocation to a more appropriate time or off-loading it to another core. (thanks to Andrew Hill for this idea!)
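As a rough illustration of the first two points, here is a minimal sketch (the names and sizes are made up) of a thread-local nursery: allocation is a pointer bump, and discarding every dead object in it is a single reset, which is the O(1) "free" mentioned above.

#include <cstddef>

constexpr std::size_t NURSERY_BYTES = 1 << 20;
thread_local char        nursery[NURSERY_BYTES];
thread_local std::size_t bump = 0;

void* nursery_alloc(std::size_t n) {
    n = (n + 15) & ~std::size_t(15);               // keep 16-byte alignment
    if (bump + n > NURSERY_BYTES) return nullptr;  // a real GC would collect/evacuate here
    void* p = nursery + bump;
    bump += n;
    return p;                                      // no locks, no free-list search
}

void nursery_reset() {                             // O(1): every dead object vanishes at once
    bump = 0;                                      // survivors would have been evacuated first
}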
One approach to making GC faster than explicit deallocation is to deallocate implicitly: the heap is divided into partitions, and the VM switches between the partitions from time to time (when a partition gets too full, for example). Live objects are copied to the new partition, and the dead objects are not deallocated at all; they are just left behind and forgotten. So the deallocation itself ends up costing nothing. The additional benefit of this approach is that heap defragmentation comes as a free bonus. Please note this is a very general description of the actual process.
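Here is a very rough sketch of that copy-and-abandon scheme over a toy heap of two-field cells (everything below is made up for illustration): live cells are evacuated to the other partition, the partitions are swapped, and the dead cells are reclaimed without ever being visited or freed.

#include <cstddef>
#include <utility>
#include <vector>

struct Cell {                     // a toy object: just two references
    Cell* car = nullptr;
    Cell* cdr = nullptr;
    Cell* forward = nullptr;      // forwarding address, set once evacuated
};

constexpr std::size_t HEAP_CELLS = 1 << 16;
static Cell space_a[HEAP_CELLS], space_b[HEAP_CELLS];
static Cell* from_space = space_a;
static Cell* to_space   = space_b;
static std::size_t top  = 0;      // bump pointer into from_space

static Cell* evacuate(Cell* c, std::size_t& free_idx) {
    if (!c) return nullptr;
    if (c->forward) return c->forward;          // already copied
    Cell* dst = &to_space[free_idx++];
    *dst = *c;                                  // copy the live cell
    dst->forward = nullptr;
    c->forward = dst;
    return dst;
}

void collect(std::vector<Cell*>& roots) {
    std::size_t free_idx = 0, scan = 0;
    for (Cell*& r : roots) r = evacuate(r, free_idx);
    while (scan < free_idx) {                   // scan the copies and fix up their fields
        Cell& c = to_space[scan++];
        c.car = evacuate(c.car, free_idx);
        c.cdr = evacuate(c.cdr, free_idx);
    }
    std::swap(from_space, to_space);            // dead cells are simply abandoned
    top = free_idx;
}

Cell* alloc(std::vector<Cell*>& roots) {
    if (top == HEAP_CELLS) collect(roots);      // (out-of-space-after-collection check omitted)
    from_space[top] = Cell{};                   // hand out a fresh, zeroed cell
    return &from_space[top++];
}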
The trick is that the underlying allocator for a garbage collector can be much simpler than an explicit one and can take shortcuts that the explicit one can't.
If the collector is copying (the Java, .NET, OCaml and Haskell runtimes, and many others, actually use one), freeing is done in big blocks, allocating is just a pointer increment, and the cost is paid per object surviving a collection. So it's faster, especially when there are many short-lived temporary objects, which is quite common in these languages.
Even for a non-copying collector (like Boehm's), the fact that objects are freed in batches saves a lot of work in coalescing adjacent free chunks. So if the collection does not need to run too often, it can easily be faster.
And, well, many standard library malloc/free implementations just suck. That's why there are projects like umem, and why libraries like glib ship their own lightweight version.
A factor not yet mentioned is that when using manual memory allocation, even if object references are guaranteed not to form cycles, determining when the last entity to hold a reference has abandoned it can be expensive, typically requiring the use of reference counters, reference lists, or other means of tracking object usage. Such techniques aren't too bad on single-processor systems, where the cost of an atomic increment may be essentially the same as an ordinary one, but they scale very badly on multi-processor systems, where atomic-increment operations are comparatively expensive.
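A small sketch of the cost being described, using a hypothetical intrusive reference count: every copy or destruction of a handle is an atomic read-modify-write, which is nearly free on one core but contends badly when several cores hammer the same counter.

#include <atomic>

struct RefCounted {
    std::atomic<int> refs{1};
    virtual ~RefCounted() = default;
};

struct Handle {
    RefCounted* obj;
    explicit Handle(RefCounted* o) : obj(o) {}
    Handle(const Handle& other) : obj(other.obj) {
        obj->refs.fetch_add(1, std::memory_order_relaxed);    // atomic increment on every copy
    }
    Handle& operator=(const Handle&) = delete;                 // assignment omitted for brevity
    ~Handle() {
        if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete obj;                                        // last reference frees the object
    }
};

// Usage: Handle h(new RefCounted); Handle h2 = h;   // the copy pays an atomic RMW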
I've heard and experienced it myself: Lua's garbage collector can cause serious FPS drops in games as their scripted part grows.
This is, as I found out, related to the garbage collector: for example, every Vector() userdata object created temporarily lies around until it gets garbage collected.
I know that Python uses reference counting, and that is why it doesn't need any huge, performance-eating steps like the ones Lua's GC has to perform.
Why doesn't Lua use reference counting to get rid of garbage?
Because reference counting garbage collectors can easily leak objects.
Trivial example: a doubly-linked list. Each node has a pointer to the next node - and is itself pointed to by the next one. If you just un-reference the list itself and expect it to be collected, you just leaked the entire list - none of the nodes have a reference count of zero, and hence they'll all keep each other alive. With a reference counting garbage collector, any time you have a cyclic object, you basically need to treat that as an unmanaged object and explicitly dispose of it yourself when you're finished.
Note that Python uses a proper garbage collector in addition to reference counting.
While others have explained why you need a garbage collector, keep in mind that you can configure the garbage collection cycles in Lua to either be smaller, less frequent, or on demand. If you have a lot of memory allocated and are busy drawing frames, then make the thresholds very large to avoid a collection cycle until there is a break in the game.
Lua 5.1 Manual on garbage collection
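For what it's worth, the same knobs are available from a host program through the Lua 5.1 C API. This is just a sketch of the idea described above; the lua_gc entry points are standard, but the specific numbers are arbitrary.

#include <lua.hpp>

void configure_gc(lua_State* L) {
    lua_gc(L, LUA_GCSETPAUSE, 400);    // wait until memory use grows ~4x before a new cycle
    lua_gc(L, LUA_GCSETSTEPMUL, 100);  // how much work each incremental step performs
}

void during_frame(lua_State* L) {
    lua_gc(L, LUA_GCSTOP, 0);          // suspend collection while drawing a frame
    // ... render ...
    lua_gc(L, LUA_GCRESTART, 0);
}

void between_levels(lua_State* L) {
    lua_gc(L, LUA_GCCOLLECT, 0);       // run a full cycle during a break in the game
}

From Lua itself the same controls are exposed through collectgarbage("setpause", ...), collectgarbage("stop"), collectgarbage("restart"), and collectgarbage("collect").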
Reference Counting alone is not enough for a garbage collector to work correctly because it does not detect cycles. Even Python does not use reference counting alone.
Imagine that objects A and B each hold a reference to the other. Even once you, the programmer, no longer hold a reference to either object, reference counting will still say that objects A and B have references pointing to them.
There are many different garbage collecting schemes out there and some will work better in some circumstances and some will work better in other circumstances. It is up to the language designers to try and choose a garbage collector that they think will work best for their language.
What version of Lua is being used in the games you are basing this claim on? When World of Warcraft switched from Lua 5.0 to 5.1, all the performance issues caused by garbage collection were severely diminished.
With Lua 5.0's garbage collection, the amount of time spent collecting garbage (and blocking anything else from happening at the same time) was proportional to the amount of memory currently in use, leading to lots of effort to minimize the memory usage of WoW addons.
With Lua 5.1's garbage collection, the collector changed to being incremental so it doesn't lock up the game while collecting garbage like it previously did. Now garbage collection has a very minimal impact on performance compared to the larger issue of horribly inefficient code in the majority of user created addons.
In general, reference counting isn't an exact substitute for garbage collection because of the potential of circular references. You might want to read this page on why garbage collection is preferred to reference counting.
You might also be interested in the Lua Gem about optimization which also has a part that handles garbage collection.
Take a look at some of the CPython sources. A good portion of the C code is Py_DECREF and Py_INCREF. That nasty, tedious and error-prone bookkeeping just goes away in Lua.
If required, there's nothing to stop you writing Lua modules in C that manage any heavy, private allocations manually.
It's a tradeoff. People have explained some reasons some languages (this really has nothing to do with Lua) use collectors, but haven't touched on the drawbacks.
Some languages, notably ObjC, use reference counting exclusively. The huge advantage of this is that deallocation is deterministic--as soon as you let go of the last reference, it's guaranteed that the object will be freed immediately. This is critical when you have memory constraints. With Lua's allocator, if memory constraints require predictable deallocation, you have to add methods to force the underlying storage to be freed immediately, which defeats the point of having garbage collection.
"WuHoUnited" is wrong in saying you can't do this--it works extremely well with ObjC on iOS, and with shared_ptr in C++. You just have to understand the environment you're in, to avoid cycles or break them when necessary.
Most modern languages have built-in garbage collection (GC), e.g. Java, the .NET languages, Ruby, etc. GC certainly simplifies application development in many ways.
I am interested to know the limitations / disadvantages of writing applications in GC'd languages. Assuming the GC implementation is optimal, I am wondering whether we may be limited by GC when making certain optimization decisions.
The main disadvantages to using a garbage collector, in my opinion, are:
Non-deterministic cleanup of resources. Sometimes, it is handy to say "I'm done with this, and I want it cleaned NOW". With a GC, this typically means forcing the GC to clean up everything, or just waiting until it's ready - both of which take away some control from you as a developer. (A sketch of the deterministic alternative follows after this list.)
Potential performance issues which arise from non-deterministic operation of the GC. When the GC collects, it's common to see (small) hangs, etc. This can be particularly problematic for things such as real-time simulations or games.
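For contrast, here is a minimal sketch of the deterministic, scope-bound cleanup that a GC cannot promise (the File wrapper and file name are hypothetical): the resource is released at a known point, not whenever the collector gets around to it.

#include <cstdio>

class File {
    std::FILE* f;
public:
    explicit File(const char* path) : f(std::fopen(path, "w")) {}
    ~File() { if (f) std::fclose(f); }      // runs at a known, deterministic point
    File(const File&) = delete;
    File& operator=(const File&) = delete;
    std::FILE* handle() const { return f; }
};

void write_log() {
    File log("run.log");                    // hypothetical file name
    std::fputs("done\n", log.handle());
}                                           // the file is closed here, every time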
Take it from a C programmer ... it is about cost/benefit and appropriate use
With garbage collection algorithms such as tri-color mark-and-sweep, there is often significant latency between a resource being 'lost' and the physical resource being freed. In some runtimes the GC will actually pause execution of the program to perform garbage collection.
Being a long time C programmer, I can tell you:
a) Manual memory management with free() is hard -- there is usually a higher error rate in humans placing free() calls than in GC algorithms.
b) Manual memory management with free() costs time -- does the time spent debugging outweigh the millisecond pauses of a GC? It may make more sense to use garbage collection if you are writing a game than, say, an embedded kernel.
But, when you can't afford the runtime disadvantage (right resources, real-time constraints) then performing manual resource allocation is probably better. It may take time but can be 100% efficient.
Try to imagine an OS kernel written in Java, or running on the .NET runtime with GC... Just look at how much memory the JVM accumulates when running simple programs. I am aware that projects like this exist... they just make me feel a bit sick.
Just bear in mind, my Linux box does much the same things today with 3 GB of RAM as it did when it had 512 MB years ago. The only difference is that I have mono/jvm/firefox etc. running.
The business case for GC is clear, but it still makes me uncomfortable a lot of the time.
Good books:
Dragon book (recent edition), Modern Compiler Implementation in C
For .NET, there are two disadvantages that I can see.
1) People assume that the GC knows best, but that's not always the case. If you make certain types of allocations, you can cause yourself to experience some really nasty program deaths without direct invocation of the GC.
2) Objects larger than 85k go onto the LOH, or Large Object Heap. That heap is currently NEVER compacted, so again, your program can experience out-of-memory exceptions when really the LOH is not compacted enough for you to make another allocation.
Both of these bugs are shown in code that I posted in this question:
How do I get .NET to garbage collect aggressively?
I am interested to know the limitations / disadvantages of writing applications in GC'd languages. Assuming the GC implementation is optimal, I am wondering whether we may be limited by GC when making certain optimization decisions.
My belief is that automatic memory management imposes a glass ceiling on efficiency but I have no evidence to substantiate that. In particular, today's GC algorithms offer only high throughput or low latency but not both simultaneously. Production systems like .NET and the HotSpot JVM incur significant pauses precisely because they are optimized for throughput. Specialized GC algorithms like Staccato offer much lower latency but at the cost of much lower minimum mutator utilisation and, therefore, low throughput.
If you are confident (and good) in your memory management skills, there is no advantage.
The concept was introduced to minimize development time and to make up for the lack of programmers who thoroughly understood memory.
The biggest problem when it comes to performance (especially on real-time systems) is that your program may experience unexpected delays when the GC kicks in. However, modern GCs try to avoid this and can be tuned for real-time purposes.
Another obvious thing is that you cannot manage your memory yourself (for instance, allocating in NUMA-local memory), which you may need to do when you implement low-level software.
It is almost impossible to make a non-GC memory manager work in a multi-CPU environment without requiring a lock to be acquired and released every time memory is allocated or freed. Each lock acquisition or release will require a CPU to coordinate its actions with other CPUs, and such coordination tends to be rather expensive. A garbage-collection-based system can allow many memory allocations to occur without requiring any locks or other inter-CPU coordination. This is a major advantage. The disadvantage is that many steps in garbage collection require that the CPUs coordinate their actions, and getting good performance generally requires that such steps be consolidated to a significant degree (there's not much benefit to eliminating the requirement of CPU coordination on each memory allocation if the CPUs have to coordinate before each step of garbage collection). Such consolidation will often cause all tasks in the system to pause for varying lengths of time during collection; in general, the longer the pauses one is willing to accept, the less total time will be needed for collection.
If processors were to return to a descriptor-based handle/pointer system (similar to what the 80286 used, though nowadays one wouldn't use 16-bit segments anymore), it would be possible for garbage collection to be done concurrently with other operations (if a handle was being used when the GC wanted to move it, the task using the handle would have to be frozen while the data was copied from its old address to its new one, but that shouldn't take long). Not sure if that will ever happen, though (Incidentally, if I had my druthers, an object reference would be 32 bits, and a pointer would be an object reference plus a 32-bit offset; I think it will be awhile before there's a need for over 2 billion objects, or for any object over 4 gigs. Despite Moore's Law, if an application would have over 2 billion objects, its performance would likely be improved by using fewer, larger, objects. If an application would need an object over 4 gigs, its performance would likely be improved by using more, smaller, objects.)
Typically, garbage collection has certain disadvantages:
Garbage collection consumes computing resources in deciding what memory is to be freed, reconstructing facts that may have been known to the programmer. The penalty for the convenience of not annotating object lifetime manually in the source code is overhead, often leading to decreased or uneven performance. Interaction with memory hierarchy effects can make this overhead intolerable in circumstances that are hard to predict or to detect in routine testing.
The point when the garbage is actually collected can be unpredictable, resulting in stalls scattered throughout a session. Unpredictable stalls can be unacceptable in real-time environments such as device drivers, in transaction processing, or in interactive programs.
Memory may leak despite the presence of a garbage collector, if references to unused objects are not themselves manually disposed of. This is described as a logical memory leak.[3] For example, recursive algorithms normally delay release of stack objects until after the final call has completed. Caching and memoizing, common optimization techniques, commonly lead to such logical leaks (see the sketch at the end of this answer). The belief that garbage collection eliminates all leaks leads many programmers not to guard against creating such leaks.
In virtual memory environments typical of modern desktop computers, it can be difficult for the garbage collector to notice when collection is needed, resulting in large amounts of accumulated garbage, a long, disruptive collection phase, and other programs' data swapped out.
Perhaps the most significant problem is that programs that rely on garbage collectors often exhibit poor locality (interacting badly with cache and virtual memory systems), occupy more address space than the program actually uses at any one time, and touch otherwise idle pages. These may combine in a phenomenon called thrashing, in which a program spends more time copying data between various grades of storage than performing useful work. They may make it impossible for a programmer to reason about the performance effects of design choices, making performance tuning difficult. They can lead garbage-collecting programs to interfere with other programs competing for resources.
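To make the "logical leak" point about caching and memoization concrete, here is a minimal sketch (the cache and types are made up): every result ever computed stays reachable from the cache, so no collector, however good, is allowed to reclaim it.

#include <map>
#include <memory>
#include <vector>

using Result = std::vector<double>;

std::map<int, std::shared_ptr<Result>> cache;    // grows forever unless entries are evicted

std::shared_ptr<Result> expensive(int key) {
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;    // reuse the memoized result
    auto r = std::make_shared<Result>(1000, key * 0.5);
    cache.emplace(key, r);                       // keeps the result reachable indefinitely
    return r;
}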