Garbage collector in Node.js - node.js

According to Google, V8 achieves efficient garbage collection by employing a "stop-the-world, generational, accurate garbage collector". Part of the claim is that V8 stops program execution while performing a garbage collection cycle.
An obvious question is: how can a GC be efficient when it pauses program execution?
I have been trying to find out more about this topic, as I would like to know how the GC affects response time when a Node.js server is handling possibly tens of thousands of requests per second.
Any expert help, personal experience or links would be greatly appreciated.
Thank you

"Efficient" can mean several things. Here it probably refers to high throughput. When looking at response time, you're more interested in latency, which could indeed be worse than with alternative GC strategies.
The main alternatives to stop-the-world GCs are
incremental GCs, which need not finish a collection cycle before temporarily handing control back to the mutator [1], and
concurrent GCs which (virtually) operate at the same time as the mutator, interrupting it only very briefly (e.g. to scan the stack).
Both need to perform extra work to be correct in the face of concurrent modification of the heap (e.g. if a new object is created and attached to an already-scanned object, this new reference must be noticed). This impacts total throughput, i.e., it takes longer to actually clean the entire heap. The upside is that they do not (usually) interrupt the program for very long, if at all, so latency is low(er).
Although the V8 documentation still mentions a stop-the-world collector, it seems that the V8 GC has been incremental since 2011. So while it does stop program execution once in a while, it does not [2] stop the program for however long it takes to scan the entire heap. Instead it can scan for, say, a couple of milliseconds, and then let the program resume.
[1] "Mutator" is GC terminology for the program whose heap is garbage collected.
[2] At least in principle; this is probably configurable.
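To make the "extra work" of an incremental collector concrete, here is a minimal Python sketch of tri-color marking with a write barrier. It is a toy model, not V8's implementation: the names Obj, IncrementalMarker, step and write_barrier are invented for this illustration, and the "budget" counts objects rather than milliseconds.

```python
from collections import deque


class Obj:
    """A heap object in the toy model (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.refs = []
        self.color = "white"  # white = not yet reached by the marker


class IncrementalMarker:
    def __init__(self, roots):
        self.gray = deque()  # reached, but children not scanned yet
        for root in roots:
            self._shade(root)

    def _shade(self, obj):
        if obj.color == "white":
            obj.color = "gray"
            self.gray.append(obj)

    def step(self, budget):
        """Scan at most `budget` objects, then hand control back to the mutator.

        Returns True once there is nothing left to mark.
        """
        while budget > 0 and self.gray:
            obj = self.gray.popleft()
            for child in obj.refs:
                self._shade(child)
            obj.color = "black"  # fully scanned
            budget -= 1
        return not self.gray

    def write_barrier(self, parent, child):
        """The mutator stores a reference while marking is in progress.

        If the parent was already scanned (black), the new child must be
        shaded, otherwise the marker would never notice it.
        """
        parent.refs.append(child)
        if parent.color == "black":
            self._shade(child)


root, a, b = Obj("root"), Obj("a"), Obj("b")
root.refs.append(a)            # b is never attached, so it is garbage

marker = IncrementalMarker([root])
marker.step(budget=1)          # scans root only, then "pauses"

new = Obj("new")
marker.write_barrier(root, new)  # attached to an already-scanned object

while not marker.step(budget=1):
    pass                       # the mutator would run between these steps
```

Without the barrier, `new` would stay white and be treated as garbage even though it is reachable; the barrier is the throughput cost that buys the shorter pauses.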

Related

Could calling core.memory's GC.collect consistently make for a more consistent framerate?

I'm looking into making a real time game with OpenGL and D, and I'm worried about the garbage collector. From what I'm hearing, this is a possibility:
10 frames run
Garbage collector kicks in automatically and runs for 10ms
10 frames run
Garbage collector kicks in automatically and runs for 10ms
10 frames run
and so on
This could be bad because it causes stuttering. However, if I force the garbage collector to run at regular intervals, for example with GC.collect, will it make my game smoother? Like so:
1 frame runs
Garbage collector runs for 1-2ms
1 frame runs
Garbage collector runs for 1-2ms
1 frame runs
and so on
Would this approach actually work and make my framerate more consistent? I'd like to use D but if I can't make my framerate consistent then I'll have to use C++11 instead.
I realize that it might not be as efficient, but the important thing is that it will be smoother, with a more consistent framerate. I'd rather have a smooth 30 fps than a stuttering 35 fps, if you know what I mean.
Yes, but it will likely not make a dramatic difference.
The bulk of the time in a GC cycle is spent in the "mark" stage, where the GC transitively visits every allocated memory block that is known to contain pointers, starting from the root areas (static data, TLS, stack and registers).
There are several approaches to optimize an application's memory so that D's GC makes a smaller impact on performance:
Use bulk allocation (allocate objects in bulk as arrays)
Use custom allocators (std.allocator is on its way, but you could use your own or third party solutions)
Use manual memory management, like in C++ (you can use RefCounted as you would use shared_ptr)
Avoid memory allocation entirely during gameplay, and preallocate everything beforehand instead
Disable the GC, and run collections manually when it is more convenient
Generally, I would not recommend worrying about the GC before writing any code. D provides the tools to avoid the bulk of GC allocations. If you keep the managed heap small, GC cycles will likely not take long enough to interfere with your application's responsiveness.
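The last item in the list above (disable the GC and collect manually at a convenient moment) can be sketched with Python's standard gc module, used here only as a stand-in for D's core.memory.GC. Note the difference: Python's gc module only handles reference cycles, while D's collector manages all GC allocations. The helper names run_frames and make_cycle are invented for this illustration.

```python
import gc


def run_frames(n_frames, update):
    """Run the frame loop with automatic collection switched off.

    Returns the number of unreachable objects found by the single
    manual collection performed after the loop.
    """
    gc.disable()                 # no automatic collection cycles from here on
    try:
        for frame in range(n_frames):
            update(frame)        # may create garbage; it just accumulates
    finally:
        gc.enable()              # restore automatic collection
    return gc.collect()          # collect once, at a moment we chose


def make_cycle(frame):
    # A self-referencing dict: only the cyclic collector can reclaim it.
    node = {}
    node["self"] = node


collected = run_frames(100, make_cycle)
```

The cost has not gone away; it has only been moved to a point where a pause is acceptable, such as a loading screen.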
If you were to run the GC every frame, you still would not get a smooth run, because you could have different amounts of garbage every frame.
You're left then with two options, both of which involve turning off the GC:
Use (and re-use) pre-allocated memory (structs, classes, arrays, whatever) so that you do not allocate during a frame, and do not need to.
Just run and eat up memory.
For both of these, you would call GC.disable() before you start your frames and then GC.enable() after you're finished with all your frames (at the end of the battle or whatever).
The first option is the one which most high performance games use anyway, regardless of whether they're written in a language with a GC. They simply do not allocate or de-allocate during the main frame run. (Which is why you get the "loading" and "unloading" before and after battles and the like, and there are usually hard limits on the number of units.)
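As a rough illustration of the first option, here is a minimal object pool in Python. Everything is allocated up front, objects are reused during the frame loop, and the pool size acts as the hard limit on the number of units. The names Bullet and BulletPool are hypothetical and chosen only for this sketch.

```python
class Bullet:
    """A reusable game object (hypothetical example type)."""

    __slots__ = ("x", "y", "active")

    def __init__(self):
        self.x = 0.0
        self.y = 0.0
        self.active = False


class BulletPool:
    def __init__(self, capacity):
        # All allocation happens here, during "loading".
        self._free = [Bullet() for _ in range(capacity)]

    def acquire(self, x, y):
        """Hand out a pre-allocated bullet, or None if the hard limit is hit."""
        if not self._free:
            return None
        bullet = self._free.pop()
        bullet.x, bullet.y, bullet.active = x, y, True
        return bullet

    def release(self, bullet):
        """Return a bullet to the pool instead of letting it become garbage."""
        bullet.active = False
        self._free.append(bullet)


pool = BulletPool(capacity=2)
first = pool.acquire(1.0, 2.0)
second = pool.acquire(3.0, 4.0)
third = pool.acquire(5.0, 6.0)   # pool exhausted: the unit limit in action
pool.release(first)
reused = pool.acquire(7.0, 8.0)  # same object as `first`, with new state
```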

Using D for a realtime application?

I am considering using D for my ongoing graphics engine. The one thing that puts me off is the GC.
I am still a young programmer and I probably have a lot of misconceptions about GCs, so I hope you can clarify some concerns.
I am aiming for low latency, and timing in general is crucial. From what I know, GCs are pretty unpredictable: for example, my application could render a frame every 16.6ms, and when the GC kicks in a frame could take any amount of time, like 30ms, because it is not deterministic, right?
I read that you can turn off the GC in D, but then you can't use the majority of D's standard library, and the GC is not completely off. Is this true?
Do you think it makes sense to use D in a timing-critical application?
Short answer: it requires a lot of customization and can be really difficult if you are not an experienced D developer.
List of issues:
Memory management itself is not that big a problem. In real-time applications you never, ever want to allocate memory in the main loop. Having pre-allocated memory pools for all main data is pretty much the de facto standard way to write such applications. In that sense D is no different: you still call C malloc directly to get some heap for your pools, and this memory won't be managed by the GC, which won't even know about it.
However, certain language features and large parts of Phobos do use the GC automagically. For example, you can't really concatenate slices without some form of automatically managed allocation. And Phobos simply has not had a strong policy about this for quite a long time.
A few language-triggered allocations won't be a problem on their own, as most of the memory used is managed via pools anyway. However, there is a killer issue for real-time software in stock D: the default D garbage collector is stop-the-world. Even if there is almost no garbage, your whole program will hit a latency spike when a collection cycle is run, as all threads get blocked.
What can be done:
1) Use GC.disable(); to switch off collection cycles. It will solve the stop-the-world issue, but now your program will start to leak memory in some cases, as GC-based allocations still work.
2) Dump hidden GC allocations. There was a pull request for a -vgc switch, which I can't find right now, but in its absence you can compile your own druntime version that prints a backtrace upon each gc_malloc() call. You may want to run this as part of an automated test suite.
3) Avoid Phobos entirely and use something like https://bitbucket.org/timosi/minlibd as an alternative.
Doing all this should be enough to target the soft real-time requirements typical for game dev, but as you can see it is not simple at all and requires stepping outside the stock D distribution.
Future alternative:
Once Leandro Lucarella ports his concurrent garbage collector to D2 (which is planned, but not scheduled), the situation will become much simpler. A small amount of GC-managed memory plus a concurrent implementation will make it possible to meet soft real-time requirements even without disabling the GC. Even Phobos can be used once it is stripped of its most annoying allocations. But I don't think it will happen any time soon.
But what about hard real-time?
You had better not even try. But that is yet another story to tell.
If you do not like GC - disable it.
Here is how:
import core.memory;

void main(string[] args)
{
    GC.disable(); // switch off automatic collection cycles
    // your code here
}
Naturally, you will then have to do the memory management yourself. It is doable, and there are several articles about it. It has been discussed here too; I just do not remember the thread.
dlang.org also has useful information about this. This article, http://dlang.org/memory.html , touches the topic of real-time programming and you should read it.
Yet another good article: http://3d.benjamin-thaut.de/?p=20 .

What is Gambit-C's GC mechanism?

What is Gambit-C's GC mechanism? I'm curious about this because I want to make an interactive app. I want to know whether it can avoid bursts of GC activity or not.
According to these threads:
https://mercure.iro.umontreal.ca/pipermail/gambit-list/2005-December/000521.html
https://mercure.iro.umontreal.ca/pipermail/gambit-list/2008-September/002645.html
Gambit had a traditional stop-the-world GC at least until September 2008. People in the threads recommended pre-allocated object pooling to avoid GC operations altogether. I couldn't find out anything about the current implementation.
*It's hard to agree with that conversation, because I can't pool objects that were not written by me, and eventually a full GC will happen at some point due to accumulated small, non-pooled temporary objects. But the method mentioned by @Gregory may help to avoid this problem. Still, I wish an incremental GC were added to Gambit :)
According to http://dynamo.iro.umontreal.ca/~gambit/wiki/index.php/Debugging#Garbage_collection_threshold gambit has some controls:
Garbage collection threshold
Pay attention to the runtime options h (maximum heapsize in kilobytes) and l (livepercent). See the reference manual for more information. Setting livepercent to five means that garbage collection will take place at the time that there are nineteen times more memory allocated for objects that should be garbage collected, than there is memory allocated for objects that should not. The reason the livepercent option is there, is to give a way to control how sparing/generous the garbage collector should be about memory consumption, vs. how heavy/light it should be in CPU load.
You can always force garbage collection by (##gc).
If you force garbage collection after some small number of operations, or schedule it nearly continuously, or set livepercent to something like 90, then presumably the GC will run frequently and not do very much on each run. This is likely to be more expensive overall, but it avoids bursts of expense. You can then fairly easily budget for that expense and keep the service fast regardless.
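The arithmetic behind the livepercent figures can be checked with a few lines of Python. This only restates the ratio described in the quoted wiki text (live memory as a percentage of the total); it is not Gambit's actual triggering logic, and the function name garbage_to_live_ratio is made up for the example.

```python
def garbage_to_live_ratio(livepercent):
    """How many times more garbage than live data there is when a
    collection is triggered, if live data makes up `livepercent`
    percent of the allocated memory."""
    if not 0 < livepercent <= 100:
        raise ValueError("livepercent must be in the range (0, 100]")
    return (100 - livepercent) / livepercent


low = garbage_to_live_ratio(5)    # the "nineteen times" case from the quote
high = garbage_to_live_ratio(90)  # frequent, small collections
```

With livepercent at 5 the collector waits until garbage outweighs live data 19 to 1; at 90 it runs as soon as garbage reaches about a ninth of the live data, which is why each run has little to do.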

How can garbage collectors be faster than explicit memory deallocation?

I was reading this HTML-generated document (the link may expire; here is the original PS file).
GC Myth 3: Garbage collectors are always slower than explicit memory deallocation.
GC Myth 4: Garbage collectors are always faster than explicit memory deallocation.
This was a big WTF for me. How would a GC be faster than explicit memory deallocation? Isn't it essentially calling an explicit memory deallocator when it frees the memory and makes it available for use again? So... WTF... what does it actually mean?
Very small objects & large sparse heaps ==> GC is usually cheaper, especially with threads
I still don't understand it. It's like saying C++ is faster than machine code (if you don't understand the WTF in this sentence, please stop programming. Let the -1s begin). After a quick Google search, one source suggested it's faster when you have a lot of memory. What I am thinking is that it means it doesn't bother with the free at all. Sure, that can be fast, and I have written a custom allocator that does that very thing, never freeing at all (void free(void*p){}), in ONE application that doesn't free any objects (memory is only released when it terminates); the definition is there mostly for the sake of libraries and things like the STL. So... I am pretty sure this would be faster than a GC as well. If I still want freeing, I guess I can use an allocator that uses a deque, or its own implementation that's essentially
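    if (freeptr >= someaddr) {   /* deferred-free buffer is full */
        freestuff();             /* release everything queued so far */
        freeptr = freeptrroot;   /* rewind to the start of the buffer */
    }
    *freeptr = ptr;              /* queue this pointer for a later batch free */
    ++freeptr;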
which I am sure would be really fast. I have sort of answered my own question already: the case where the collector is never called is the case where it would be faster, but... I am sure that is not what the document means, as it mentions two collectors in its test. I am sure the very same application would be slower if the collector were called even once, no matter which GC is used. If it's known that nothing ever needs to be freed, then an empty free body can be used, as in that one app I had.
Anyway, I am posting this question for further insight.
How would GC be faster than explicit memory deallocation?
GCs can pointer-bump allocate into a thread-local generation and then rely upon copying collection to handle the (relatively) uncommon case of evacuating the survivors. Traditional allocators like malloc often compete for global locks and search trees.
GCs can deallocate many dead blocks simultaneously by resetting the thread-local allocation buffer instead of calling free on each block in turn, i.e. O(1) instead of O(n).
By compacting old blocks so more of them fit into each cache line. The improved locality increases cache efficiency.
By taking advantage of extra static information such as immutable types.
By taking advantage of extra dynamic information such as the changing topology of the heap via the data recorded by the write barrier.
By making more efficient techniques tractable, e.g. by removing the headache of manual memory management from wait free algorithms.
By deferring deallocation to a more appropriate time or off-loading it to another core. (thanks to Andrew Hill for this idea!)
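The first two points (pointer-bump allocation and releasing many dead blocks at once) can be sketched in Python as a toy arena. This is only a model: the class name BumpArena is invented, offsets into a bytearray stand in for addresses, and a real collector would evacuate survivors before resetting.

```python
class BumpArena:
    """A toy allocation buffer: allocating bumps a pointer, freeing is a reset."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.top = 0  # offset of the next free byte

    def alloc(self, nbytes):
        """Allocate by bumping `top`; no free lists, no searching, no locks."""
        if self.top + nbytes > len(self.buffer):
            raise MemoryError("arena full; a real GC would collect here")
        offset = self.top
        self.top += nbytes
        return offset

    def reset(self):
        """Release every block in the arena at once, independent of their number."""
        self.top = 0


arena = BumpArena(64)
first = arena.alloc(16)
second = arena.alloc(16)
arena.reset()              # both blocks are gone after a single assignment
third = arena.alloc(8)     # reuses the space that `first` occupied
```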
One approach to making GC faster than explicit deallocation is to deallocate implicitly: the heap is divided into partitions, and the VM switches between partitions from time to time (when a partition gets too full, for example). Live objects are copied to the new partition, and the dead objects are not deallocated at all; they are simply left behind and forgotten. So the deallocation itself ends up costing nothing. An additional benefit of this approach is that heap defragmentation comes as a free bonus. Please note that this is a very general description of the actual processes.
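The partition-switching idea can be modelled in a few lines of Python. This is a simplified sketch in the style of a Cheney copying collector, with Python lists standing in for heap partitions; the names Cell and copy_collect are hypothetical. The work done is proportional to the number of live objects, and dead objects are never touched.

```python
class Cell:
    """A heap object in the toy model (hypothetical, for illustration only)."""

    def __init__(self, value, refs=()):
        self.value = value
        self.refs = list(refs)


def copy_collect(roots):
    """Copy everything reachable from `roots` into a fresh partition.

    Returns (new_roots, to_space). Unreachable objects are simply
    never visited, so "freeing" them costs nothing.
    """
    forwarded = {}   # old object -> its copy (forwarding table)
    to_space = []    # the new partition, filled contiguously

    def forward(obj):
        if obj not in forwarded:
            clone = Cell(obj.value, obj.refs)  # refs are fixed up by the scan below
            forwarded[obj] = clone
            to_space.append(clone)
        return forwarded[obj]

    new_roots = [forward(root) for root in roots]
    scan = 0
    while scan < len(to_space):          # to_space doubles as the work queue
        cell = to_space[scan]
        cell.refs = [forward(child) for child in cell.refs]
        scan += 1
    return new_roots, to_space


live = Cell("live")
root = Cell("root", [live])
live.refs.append(root)       # a cycle, which copying handles without trouble
dead = Cell("dead")          # unreachable: never copied, never visited

new_roots, to_space = copy_collect([root])
```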
The trick is that the underlying allocator for a garbage collector can be much simpler than an explicit one and can take some shortcuts that an explicit one can't.
If the collector is copying (the Java, .NET, OCaml and Haskell runtimes and many others actually use one), freeing is done in big blocks, allocating is just a pointer increment, and the cost is paid per object surviving a collection. So it's faster especially when there are many short-lived temporary objects, which is quite common in these languages.
Even for a non-copying collector (like Boehm's), the fact that objects are freed in batches saves a lot of work in combining adjacent free chunks. So if the collection does not need to run too often, it can easily be faster.
And, well, many standard library malloc/free implementations just suck. That's why there are projects like umem, and why libraries like glib have their own lightweight versions.
A factor not yet mentioned is that when using manual memory allocation, even if object references are guaranteed not to form cycles, determining when the last entity to hold a reference has abandoned it can be expensive, typically requiring the use of reference counters, reference lists, or other means of tracking object usage. Such techniques aren't too bad on single-processor systems, where the cost of an atomic increment may be essentially the same as an ordinary one, but they scale very badly on multi-processor systems, where atomic-increment operations are comparatively expensive.

Which counter can I use in performance monitor to see how much memory is waiting for the GC?

I am trying to profile a specific page of my ASP.NET site to optimize memory usage, but the nature of .NET as a garbage-collected platform is making it tough to get a true picture of how memory is used and released in the program.
Is there a perfmon counter or other method for profiling that will allow me to see not only how much memory is allocated, but also how much has been released by the program and is just waiting for garbage collection?
Actually nothing in the machine really knows what is waiting for garbage collection: garbage collection is precisely the process of figuring that out and releasing the memory corresponding to dead objects. At best, the GC will have that information only at some very specific instants in its cycle. The detection and release parts are often interleaved (this depends on the GC technology), so it is possible that the GC never has a full count of what could be freed.
For most GCs, obtaining such information is computationally expensive. If you are ready to spend a bit of CPU time on it (it will not be transparent to the application), then you can use GC.Collect() to force the GC to run, immediately followed by a call to GC.GetTotalMemory() to find out how much memory has survived the GC. Note that forcing the GC could induce a noticeable pause, and may also decrease overall performance.
This is the "homemade" method; for a more serious analysis, try a dedicated profiler.
The best way that I have found to profile memory is to use ANTS Profiler from RedGate. You can view a snapshot, see what stage of the lifecycle each object is in, and more, including actual object values.
