D garbage collector and realtime applications

I'm thinking about learning D (basically "C++ done right, with garbage collection and message passing between threads"). I talked to a colleague who has been a long-time C++ programmer, and he complained that the garbage collector as such has severe timing issues even in soft-realtime applications.
It's not that I need to write a realtime app - far from it - but I'm curious how problematic the GC would be when developing, say, a database (setting aside the additional memory overhead that the GC statistically seems to impose).
(Now, I know the GC can be turned off in D, but that's like saying you can get rid of car-related problems by getting rid of the car - true, but not the solution I'd like to choose.)
Is this true? How severe are such issues in practice? Is developing, say, a device driver in D with the GC enabled practical/sensible/good practice?

While D has a GC, it does not force you to use it for everything. D also has structs, which act like C++ classes & structs (minus the polymorphism).
In modern managed languages, the GC is not a problem as long as you have enough memory. This is also true for unmanaged languages like C++ - but in C++, running out of memory means you can't allocate any more memory, while in Java running out of memory means a delay while the GC kicks in.
So, if you are planning to allocate tons of objects then yes - the GC can be a problem. But you probably don't really need to allocate so many objects. In Java, you have to use objects to store things like strings and dates and coordinates - and that can really fill up your heap and invoke the GC (luckily, modern JVMs use generational GC to optimize those kinds of objects). In D, you'll just use structs for these things, and only use classes for cases that actually require the GC.
As a rule of thumb, you'll usually want to use structs wherever you can, but if you find yourself doing something special to take care of deallocation or to prevent copying & destructing (though that's really fast in D) - make that type a class without a second thought.
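For illustration, here is a minimal sketch of that split (the type names are invented): a struct is a plain value with no GC involvement, while class instances are allocated on the GC heap.

import std.stdio;

// Value type: lives on the stack or inline in its container; the GC is never involved.
struct Point
{
    double x, y;
}

// Reference type: instances are allocated on the GC heap with `new`.
class Sprite
{
    Point position;
    this(Point p) { position = p; }
}

void main()
{
    auto p = Point(1.0, 2.0); // no heap allocation at all
    auto s = new Sprite(p);   // GC allocation
    writeln(s.position.x);
}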

I personally don't really approve of the statement "as long as you have enough memory, a GC is not a problem". It basically means you go ahead and waste your memory instead of taking proper care of it, and when it runs out you suddenly have to wait more than a second for the GC to collect everything.
For one thing, that only happens if it's a really bad GC. The GC in C#, for example, collects objects extremely fast and often. You won't get a problem even if you allocate in a frequently called function, and it won't wait until you run out of memory to do a collection.
I am not fully up to date on the current features of the D2 GC (we use D1), but the behavior at the time was that it would allocate a pool of memory and hand out pieces of it for each of your allocations. When it had given out about 90% and you needed more, it would start a collection and/or allocate more from the system (or something like that). For D1 there is also a concurrent GC that starts collections earlier but runs them in the background; however, it is Linux-only, as it uses the fork syscall.
So, I think the current D GC can cause small but noticeable freezes if not used with care. But you can disable and re-enable it: e.g. disable it when you do something realtime-critical, and enable it again when that critical part of the code is over.
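A minimal sketch of that pattern, assuming the critical section can tolerate its own GC allocations simply not being collected until afterwards:

import core.memory : GC;

void realtimeCriticalWork()
{
    GC.disable();              // no collection cycles from here on
    scope (exit) GC.enable();  // re-enable even if an exception is thrown

    // ... latency-sensitive code; GC allocations still succeed,
    // they just no longer trigger a collection pause ...
}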
For a database, I don't think the D GC is ready yet. I would heavily re-use memory and not rely on the GC at all for that kind of application.

Related

Why is garbage collection necessary?

Suppose that an object on the heap goes out of scope. Why can't the program free the memory right after the scope ends? Or, if we have a pointer to an object that is replaced by the address of a new object, why can't the program deallocate the old one before assigning the new one? I'm guessing that it's faster not to free it immediately and instead have the freeing done asynchronously at a later point in time, but I'm not really sure.
Why is garbage collection necessary?
It is not strictly necessary. Given enough time and effort you can always translate a program that depends on garbage collection to one that doesn't.
In general, garbage collection involves a trade-off.
On the one hand, garbage collection allows you to write an application without worrying about the details of memory allocation and deallocation. (And the pain of debugging crashes and memory leaks caused by getting the deallocation logic wrong.)
The downside of garbage collection is that you need more memory. A typical garbage collector is not efficient if it doesn't have plenty of spare space [1].
By contrast, if you do manual memory management, you can code your application to free up heap objects as soon as they are no longer used. Furthermore, you don't get awkward "pauses" while the GC is doing its thing.
The downside of manual memory management is that you have to write the code that decides when to call free, and you have to get it correct. Furthermore, if you try to manage memory by reference counting:
you have the cost of incrementing and decrementing ref counts whenever pointers are assigned or variables go out of scope (a minimal sketch of this overhead follows the list),
you have to deal with cycles in your data structures, and
it is worse when your application is multi-threaded and you have to deal with memory caches, synchronization, etc.
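To make that bookkeeping concrete, here is a minimal, hypothetical ref-counted handle (not a real library type): every copy and every destruction touches a shared counter, atomically in case handles cross threads.

import core.atomic : atomicOp;

struct Control
{
    shared int refs;   // shared by every handle to the same object
    void* payload;
}

struct Handle
{
    Control* ctrl;

    this(Control* c) { ctrl = c; atomicOp!"+="(ctrl.refs, 1); }

    this(this)         // postblit: runs on every copy of the handle
    {
        if (ctrl) atomicOp!"+="(ctrl.refs, 1);
    }

    ~this()            // runs whenever a handle goes out of scope
    {
        if (ctrl && atomicOp!"-="(ctrl.refs, 1) == 0)
        {
            // last owner: this is where the payload (and ctrl) would be freed
        }
    }
}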
For what it is worth, if you use a decent garbage collector and tune it appropriately (e.g. give it enough memory, etc) then the CPU costs of GC and manual storage management are comparable when you apply them to a large application.
Reference:
"The measured cost of conservative garbage collection" by Benjamin Zorn
[1] This is because the main cost of a modern collector is in traversing and dealing with the non-garbage objects. If there is not a lot of garbage because you are being miserly with the heap space, the GC does a lot of work for little return. See https://stackoverflow.com/a/2414621/139985 for an analysis.
It's more complicated, but
1) What if there is memory pressure before the scope is over? Scope is only a language notion, not related to reachability, so an object can be "freed" before it goes out of scope (Java GCs do that on a regular basis). Also, if you free objects after each scope is done, you might be doing too little work too often.
2) As far as references go, you are not considering that a reference may sit on top of a whole hierarchy of objects; when you replace it, there has to be code that traverses everything it kept alive, and the moment of the assignment might not be the right time to do that.
In general, there is nothing wrong with the approach you describe; as a matter of fact, this is almost exactly how the Rust programming language works, from a high-level point of view.

Could calling core.memory's GC.collect consistently make for a more consistent framerate?

I'm looking into making a real time game with OpenGL and D, and I'm worried about the garbage collector. From what I'm hearing, this is a possibility:
10 frames run
Garbage collector kicks in automatically and runs for 10ms
10 frames run
Garbage collector kicks in automatically and runs for 10ms
10 frames run
and so on
This could be bad because it causes stuttering. However, if I force the garbage collector to run consistently, like with GC.collect, will it make my game smoother? Like so:
1 frame runs
Garbage collector runs for 1-2ms
1 frame runs
Garbage collector runs for 1-2ms
1 frame runs
and so on
Would this approach actually work and make my framerate more consistent? I'd like to use D but if I can't make my framerate consistent then I'll have to use C++11 instead.
I realize that it might not be as efficient, but the important thing is that it will be smoother, at a more consistent framerate. I'd rather have a smooth 30 fps than a stuttering 35 fps, if you know what I mean.
Yes, but it will likely not make a dramatic difference.
The bulk of the time spent in a GC cycle is the "mark" stage, where the GC transitively visits every allocated memory block that is known to contain pointers, starting from the root areas (static data, TLS, stack and registers).
There are several approaches to optimize an application's memory so that D's GC makes a smaller impact on performance:
Use bulk allocation (allocate objects in bulk as arrays)
Use custom allocators (std.allocator is on its way, but you could use your own or third party solutions)
Use manual memory management, like in C++ (you can use RefCounted as you would use shared_ptr)
Avoid memory allocation entirely during gameplay, and preallocate everything beforehand instead
Disable the GC, and run collections manually when it is more convenient (see the sketch below)
Generally, I would not recommend being concerned about the GC before writing any code. D provides the tools to avoid the bulk of GC allocations. If you keep the managed heap small, GC cycles will likely not take long enough to interfere with your application's responsiveness.
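A minimal sketch of that last point applied to the question's frame loop (the callback names are placeholders, not an engine API): automatic collections are switched off, and a collection is triggered explicitly at the same point in every frame.

import core.memory : GC;

void runGameLoop(bool delegate() stillRunning, void delegate() renderOneFrame)
{
    GC.disable();              // no automatic collections in the middle of a frame
    scope (exit) GC.enable();

    while (stillRunning())
    {
        renderOneFrame();
        GC.collect();          // pay the collection cost at the same point every frame
    }
}

void main()
{
    int frames;
    runGameLoop(() => frames < 1000, () { ++frames; /* update + draw here */ });
}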
If you were to run the GC every frame, you still would not get a smooth run, because you could have different amounts of garbage every frame.
You're left then with two options, both of which involve turning off the GC:
Use (and re-use) pre-allocated memory (structs, classes, arrays, whatever) so that you do not allocate during a frame, and do not need to.
Just run and eat up memory.
For both these, you would do a GC.disable() before you start your frames and then a GC.enable() after you're finished with all your frames (at the end of the battle or whatever).
The first option is the one which most high performance games use anyway, regardless of whether they're written in a language with a GC. They simply do not allocate or de-allocate during the main frame run. (Which is why you get the "loading" and "unloading" before and after battles and the like, and there are usually hard limits on the number of units.)
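As a rough sketch of that first option (all names here are invented): one up-front allocation at load time, pure reuse during frames, and an O(1) reset between battles or levels.

struct Particle { float x, y, vx, vy; }

struct ParticlePool
{
    Particle[] slots;
    size_t live;

    this(size_t capacity) { slots = new Particle[capacity]; } // one allocation at load time

    Particle* spawn()            // hand out the next free slot, or null if exhausted
    {
        return live < slots.length ? &slots[live++] : null;
    }

    void reset() { live = 0; }   // "frees" every particle at once, no GC involved
}

You would construct the pool once during loading, call spawn() during frames, and reset() at the next loading screen.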

Using D for a realtime application?

I am considering using D for my ongoing graphics engine. The one thing that puts me off is the GC.
I am still a young programmer and I probably have a lot of misconceptions about GC's and I hope you can clarify some concerns.
I am aiming for low latency, and timing in general is crucial. From what I know, GCs are pretty unpredictable; for example, my application could render a frame every 16.6 ms, and when the GC kicks in a frame could take an arbitrary amount longer, say 30 ms, because it is not deterministic, right?
I read that you can turn off the GC in D, but then you can't use the majority of D's standard library, and the GC is not completely off. Is this true?
Do you think it makes sense to use D in a timing crucial application?
Short answer: it requires a lot of customization and can be really difficult if you are not an experienced D developer.
List of issues:
Memory management itself is not that big a problem. In real-time applications you never, ever want to allocate memory in the main loop. Having pre-allocated memory pools for all main data is pretty much the de facto standard way to build such applications. In that sense, D is no different - you still call C malloc directly to get some heap for your pools, and this memory won't be managed by the GC; it won't even know about it.
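A minimal sketch of such a pool, carved out of C-malloc'ed memory that the D GC never scans or collects (the type and its names are illustrative, not a library API):

import core.stdc.stdlib : malloc, free;

struct FixedPool(T)
{
    T* slots;          // raw C heap: invisible to the D GC
    size_t capacity;
    size_t used;

    this(size_t n)
    {
        slots = cast(T*) malloc(T.sizeof * n); // note: slots come back uninitialized
        capacity = n;
    }

    T* acquire()                  // next free slot, or null when the pool is exhausted
    {
        return used < capacity ? &slots[used++] : null;
    }

    void reset() { used = 0; }    // recycle the whole pool between phases

    void release() { free(slots); slots = null; }
}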
However, certain language features and large parts of Phobos do use GC automagically. For example, you can't really concatenate slices without some form of automatically managed allocation. And Phobos simply has not had a strong policy about this for quite a long time.
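For instance, an innocent-looking helper like this one (a made-up example) allocates from the GC heap on every call:

int[] appendOne(int[] xs)
{
    return xs ~ 1;   // slice concatenation allocates a new GC-managed array
}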
A few language-triggered allocations won't be a problem on their own, as most memory is managed via pools anyway. However, there is a killer issue for real-time software in stock D: the default D garbage collector is stop-the-world. Even if there is almost no garbage, your whole program will hit a latency spike when a collection cycle runs, as all threads get blocked.
What can be done:
1) Use GC.disable(); to switch off collection cycles. It will solve the stop-the-world issue, but now your program may start to leak memory in some cases, as GC-based allocations still work but are never collected.
2) Dump hidden GC allocations. There was a pull request for a -vgc compiler switch which I can't find right now, but in its absence you can compile your own druntime version that prints a backtrace upon each gc_malloc() call. You may want to run this as part of an automatic test suite.
3) Avoid Phobos entirely and use something like https://bitbucket.org/timosi/minlibd as an alternative.
Doing all this should be enough to meet the soft real-time requirements typical of game dev, but as you can see it is not simple at all and requires stepping outside the stock D distribution.
Future alternative:
Once Leandro Lucarella ports his concurrent garbage collector to D2 (which is planned, but not scheduled), the situation will become much simpler. A small amount of GC-managed memory plus a concurrent implementation will make it possible to meet soft real-time requirements even without disabling the GC. Even Phobos could be used once it is stripped of its most annoying allocations. But I don't think that will happen any time soon.
But what about hard real-time?
You better not even try. But that is yet another story to tell.
If you do not like GC - disable it.
Here is how:
import core.memory;

void main(string[] args)
{
    GC.disable;
    // your code here
}
Naturally, you will then have to do the memory management yourself. It is doable, and there are several articles about it. It has been discussed here too; I just do not remember the thread.
dlang.org also has useful information about this. This article, http://dlang.org/memory.html , touches on the topic of real-time programming, and you should read it.
Yet another good article: http://3d.benjamin-thaut.de/?p=20 .

How can garbage collectors be faster than explicit memory deallocation?

I was reading this HTML-generated document (it may expire; here is the original PS file).
GC Myth 3: Garbage collectors are always slower than explicit memory deallocation.
GC Myth 4: Garbage collectors are always faster than explicit memory deallocation.
This was a big WTF for me. How could a GC be faster than explicit memory deallocation? Isn't it essentially calling an explicit memory deallocator when it frees memory / makes it available for reuse? So... WTF... what does it actually mean?
Very small objects & large sparse heaps ==> GC is usually cheaper, especially with threads
I still don't understand it. It's like saying C++ is faster than machine code (if you don't understand the WTF in this sentence, please stop programming. Let the -1 begin). After a quick Google search, one source suggested it's faster when you have a lot of memory. What I'm thinking is that it means it doesn't bother with the free at all. Sure, that can be fast, and I have written a custom allocator that does exactly that, never freeing at all (void free(void* p) {}), in ONE application that doesn't free any objects (it only frees at the end when it terminates) and has the definition mostly for the sake of libraries and things like the STL. So I'm pretty sure this will be faster than the GC as well. If I still want freeing, I guess I can use an allocator that uses a deque, or its own implementation that's essentially:
void free(void* ptr) {
    if (freeptr < someaddr) {   // room left in the deferred-free buffer
        *freeptr = ptr;
        ++freeptr;
    }
    else {
        freestuff();            // flush: actually release the buffered pointers
        freeptr = freeptrroot;
    }
}
which I'm sure would be really fast. I sort of answered my own question already: the case where the GC is never invoked is the case where it would be faster, but I'm sure that is not what the document means, since it mentions two collectors in its tests. I'm sure the very same application would be slower if the collector were invoked even once, no matter which GC is used. If it's known that nothing ever needs to be freed, then an empty free body can be used, like in that one app I had.
Anyway, I'm posting this question for further insight.
How can a GC be faster than explicit memory deallocation?
GCs can pointer-bump allocate into a thread-local generation and then rely upon copying collection to handle the (relatively) uncommon case of evacuating the survivors (see the sketch after this list). Traditional allocators like malloc often compete for global locks and have to search trees.
GCs can deallocate many dead blocks simultaneously by resetting the thread-local allocation buffer instead of calling free on each block in turn, i.e. O(1) instead of O(n).
By compacting old blocks so more of them fit into each cache line. The improved locality increases cache efficiency.
By taking advantage of extra static information such as immutable types.
By taking advantage of extra dynamic information such as the changing topology of the heap via the data recorded by the write barrier.
By making more efficient techniques tractable, e.g. by removing the headache of manual memory management from wait-free algorithms.
By deferring deallocation to a more appropriate time or off-loading it to another core. (thanks to Andrew Hill for this idea!)
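A minimal sketch of the first two points (the names are invented): allocation is a pointer increment, and "freeing" every dead block in the region is a single reset once the survivors have been copied out.

struct BumpRegion
{
    ubyte[] buffer;   // the thread-local allocation buffer / young generation
    size_t top;

    void[] allocate(size_t size)
    {
        if (top + size > buffer.length)
            return null;                      // out of room: this is where a collection would run
        auto block = buffer[top .. top + size];
        top += size;                          // allocation is just a pointer bump
        return block;
    }

    // After survivors are evacuated elsewhere, all dead blocks vanish at once, in O(1).
    void reset() { top = 0; }
}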
One approach to making GC faster than explicit deallocation is to deallocate implicitly: the heap is divided into partitions, and the VM switches between the partitions from time to time (when a partition gets too full, for example). Live objects are copied to the new partition, and all the dead objects are not deallocated - they are just left behind and forgotten. So the deallocation itself ends up costing nothing. The additional benefit of this approach is that heap defragmentation comes as a free bonus. Please note this is a very general description of the actual processes.
The trick is that the underlying allocator for a garbage collector can be much simpler than an explicit one and take shortcuts that the explicit one can't.
If the collector is a copying one (the Java, .NET, OCaml and Haskell runtimes, among many others, actually use one), freeing is done in big blocks, allocating is just a pointer increment, and the cost is paid per object surviving a collection. So it's faster, especially when there are many short-lived temporary objects, which is quite common in these languages.
Even for a non-copying collector (like Boehm's), the fact that objects are freed in batches saves a lot of work in combining adjacent free chunks. So if the collection does not need to run too often, it can easily be faster.
And, well, many standard library malloc/free implementations just suck. That's why there are projects like umem, and libraries like glib have their own lightweight versions.
A factor not yet mentioned is that when using manual memory allocation, even if object references are guaranteed not to form cycles, determining when the last entity to hold a reference has abandoned it can be expensive, typically requiring the use of reference counters, reference lists, or other means of tracking object usage. Such techniques aren't too bad on single-processor systems, where the cost of an atomic increment may be essentially the same as an ordinary one, but they scale very badly on multi-processor systems, where atomic-increment operations are comparatively expensive.

Lua's GC and realtime game

As far as I know, a tracing GC can't avoid blocking threads during a full collection.
I used XNA + C#, and the GC pauses were impossible to remove. So I switched to the lower-level language C, but I realized I need a scripting language. I'm considering Lua, but I'm worried about Lua's GC mechanism. Lua uses an incremental tracing GC, and thread blocking seems unavoidable there too.
So how should I handle this in realtime game?
The power of Lua is that it gets out of your way. Want classes? They can be built with metatables. Want sandboxing? Use lua_setfenv.
As for the garbage collector: use it as is first. If you later find performance issues, use lua_gc to fine-tune its behavior.
Some examples:
Disable the garbage collector during those times when the slow down would be a problem.
Leave the garbage collector disabled and only step it when game logic says you've got some head room on your FPS count. You could pre-tune the step size, or discover the optimal step size at runtime.
Disable the collector and perform a full collection at stopping points, i.e. a load screen, a cut scene, or a turn change in a hot-seat game.
You might also consider an alternative scripting language. Squirrel tries very hard to be a second-generation Lua: it tries to keep all of Lua's good features while ditching its design mistakes. One of the big differences between the two is that Squirrel uses reference counting instead of garbage collection. It turns out that reference counting can be a bit slower than garbage collection, but it is very deterministic (i.e. realtime-friendly).
The correct way to handle this is:
Write a small prototype with just the core things that you want to test.
Profile it a lot, reproducing the different scenarios that could happen in your game (lots of memory available, little memory available, different numbers of threads, that kind of thing)
If you don't find a visible bottleneck, you can use Lua. Otherwise, you will have to look for alternative solutions (maybe Lisp or Javascript)
You can patch the Lua GC so as to time-limit each collection cycle, e.g.: http://www.altdevblogaday.com/2011/07/23/predictable-garbage-collection-with-lua/
I believe that it is still possible to have long GC step times when collecting very large tables. Therefore you need to adopt a programming style that avoids large tables.
The following article discusses two strategies for using Lua for real-time robot control (1. don't generate garbage, or 2. use an O(1) allocator and tune when GC collection runs):
https://www.osadl.org/?id=1117

Resources