Why does Concurrent-Mark-Sweep (CMS) remark phase need to re-examine the thread-stacks instead of just looking at the mutator's write-queues? - garbage-collection

The standard CMS algorithm starts by making the application undergo an STW pause to calculate the GC-root-set. It then resumes the mutator threads, and application and collector threads run concurrently until the marking is done. Any pointer store performed by a mutator-thread is protected by a write-barrier that adds that pointer reference to a write-queue.
When the marking phase is done we proceed to the remarking phase: it must look into this write-queue and mark anything it finds there that was not already marked.
All of this makes sense. What I fail to understand is why we would need to:
Have this remarking phase recalculate the GC-root-set from scratch (including all thread stacks) -- does not doing this result in an incorrect algorithm, in the sense of marking actually live and reachable objects as garbage to be reclaimed?;
Have this remarking phase be another STW event (maybe this is because of having to analyse all the thread-stacks?)
When reading one of the original papers on CMS, "A Generational Mostly-concurrent Garbage Collector", one can see:
The original mostly-concurrent algorithm, proposed by Boehm et al. [5], is a concurrent "tricolor" collector [9]. It uses a write barrier to cause updates of fields of heap objects to shade the containing object gray. Its main innovation is that it trades off complete concurrency for better throughput, by allowing root locations (globals, stacks, registers), which are usually updated more frequently than heap locations, to be written without using a barrier to maintain the tricolor invariant.
This makes it look like it is just a trade-off emanating from a conscious decision not to involve what's happening on the stack in the write-barriers?
Thanks

Have this remarking phase recalculate the GC-root-set from scratch (including all thread stacks) -- does not doing this result in an incorrect algorithm, in the sense of marking actually live and reachable objects as garbage to be reclaimed?
No, tricolor marking marks live objects (objects still unmarked by the time the "grey" set is exhausted are unreachable). Remark adds the rediscovered root objects to the "grey" set, together with all references caught by the write-barrier, so more objects can be marked as live.
In summary, after the CMS remark all live objects are marked, though some dead objects may be marked too.
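To make that concrete, here is a minimal, purely illustrative sketch of the remark step (C#; the Obj, Shade, and colour names are invented for this example and are not taken from any real collector):

using System.Collections.Generic;

class TricolorMarker
{
    public enum Color { White, Grey, Black }

    public class Obj
    {
        public Color Color = Color.White;           // all objects start white
        public List<Obj> Fields = new List<Obj>();  // outgoing references
    }

    // Remark: rescan the roots plus everything the write-barrier caught,
    // then drain the grey set. Marks are only ever added, never removed,
    // so a live object can never end up classified as garbage by this step.
    public static void Remark(IEnumerable<Obj> roots, IEnumerable<Obj> barrierCaught)
    {
        var grey = new Stack<Obj>();
        foreach (var o in roots) Shade(o, grey);          // rediscovered roots
        foreach (var o in barrierCaught) Shade(o, grey);  // mutated references

        while (grey.Count > 0)                            // exhaust the grey set
        {
            var o = grey.Pop();
            foreach (var child in o.Fields) Shade(child, grey);
            o.Color = Color.Black;                        // fully scanned
        }
        // Anything still white is unreachable; some dead objects may have been
        // marked too (floating garbage), but no live object was missed.
    }

    static void Shade(Obj o, Stack<Obj> grey)
    {
        if (o.Color == Color.White) { o.Color = Color.Grey; grey.Push(o); }
    }
}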
Have this remarking phase be another STW event (maybe this is because of having to analyse all the thread-stacks?)
Yes, remark is an STW pause in the CMS algorithm in the HotSpot JVM (you can read more about CMS phases here).
And answering the question from the title:
Why does Concurrent-Mark-Sweep (CMS) remark phase need to re-examine the thread-stacks instead of just looking at the mutator's write-queues?
CMS does not use "mutator's write-queues"; it utilizes a card-marking write barrier (shared with the young-generation copying collector).
Generally, all algorithms using write barriers need an STW pause to avoid the "turtle and arrow" paradox (as in Zeno's paradox: the collector would forever chase the mutators' updates without ever catching up).
CMS starts with an initial tricolor marking. Once that has completed, "some" live objects are marked, but due to concurrent modifications the marking could have missed certain objects. The write-barrier captures all mutations, so the "preclean" phase adds all mutated references to the "grey" set and resumes marking to reach the missed objects. For this process to converge, though, a final remark with the mutators stopped is required.
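To illustrate, here is a hypothetical sketch of what a card-marking write barrier boils down to (C#; HotSpot's real card table lives inside the VM and differs in many details - the CardTable class and the 512-byte card size here are assumptions made for the example):

using System.Collections.Generic;

class CardTable
{
    const int CardShift = 9;          // 512-byte cards, a typical choice
    readonly byte[] cards;            // one dirty flag per card
    readonly long heapBase;

    public CardTable(long heapBase, long heapSize)
    {
        this.heapBase = heapBase;
        cards = new byte[(heapSize >> CardShift) + 1];
    }

    // The write barrier: executed by mutator threads on every reference store.
    // Note that it records the mutated region, not the individual reference,
    // which is why there is no per-mutation "write-queue".
    public void WriteBarrier(long fieldAddress)
    {
        cards[(fieldAddress - heapBase) >> CardShift] = 1;
    }

    // Used by the collector during preclean/remark: only dirty cards need to
    // be rescanned, not the whole heap.
    public IEnumerable<long> DirtyCardAddresses()
    {
        for (long i = 0; i < cards.Length; i++)
            if (cards[i] == 1)
                yield return heapBase + (i << CardShift);
    }
}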

Related

what do v8's IncrementalMarking and ProcessWeakCallbacks garbage collections do?

I have implemented v8's garbage collection callbacks (prologue and epilogue) and am recording the time taken by garbage collection as well as the counts of each type. Everything I've read on v8 talks about major GCs (Mark/Sweep/Compact) and minor GCs (Scavenge). But there are two additional types, both of which generate callbacks as well. From the v8 code:
enum GCType {
  kGCTypeScavenge = 1 << 0,
  kGCTypeMarkSweepCompact = 1 << 1,
  kGCTypeIncrementalMarking = 1 << 2,
  kGCTypeProcessWeakCallbacks = 1 << 3,
  kGCTypeAll = kGCTypeScavenge | kGCTypeMarkSweepCompact |
               kGCTypeIncrementalMarking | kGCTypeProcessWeakCallbacks
};
One odd thing about IncrementalMarking and ProcessWeakCallbacks is that their callbacks are always called the exact same number of times as the MarkSweepCompact callback.
My question is what are the IncrementalMarking and ProcessWeakCallbacks garbage collections? And also, why are they always invoked the same number of times as the MarkSweepCompact garbage collection (should they be considered part of that collection type)?
(V8 developer here.) Yes, "IncrementalMarking" and "ProcessWeakCallbacks" are not types of GC, but phases of major GC cycles. (I don't know why that enum happens to be called GCType, probably for historical reasons.)
I am recording the time taken by garbage collection as well as the counts of each type
Note that the GC callbacks are neither intended nor suitable for time measurements. In particular, incremental marking (as the name implies) happens in many tiny incremental steps, but you only get one invocation of the callback before the first of these steps happens; after that incremental marking steps and program execution will be interleaved until marking is done.
Further, note that the team is working on moving as much of the GC work as possible into background threads, which makes the whole question of "how much time did it take?" somewhat ill-defined.
For offline investigation purposes, your best bet is the --trace-gc flag, which should provide accurate and complete timing information.
For online bookkeeping (as in the question V8 garbage collector callbacks for measuring GC activity - see also my detailed answer there), I'm afraid there is no good solution.

Lockless game engine with complete separation of update and render

I apologize up front for this long post, but as you can probably see I have been thinking about this for quite some time, and I feel I need some input from other people before my head explodes :-)
I have been experimenting for some time now with various ways of building a game engine which satisfies all the following criteria:
Complete separation of object updating and object rendering
Full determinism
Updating and rendering at individual speeds
No blocking on shared resources
Complete separation of object updating and object rendering
Separation of object updating and object rendering seems to be vital to ensure optimal usage of resources while sending data to the graphics API and swapping buffers.
Even if you want to ensure full parallelism to use multiple cores of a CPU, it seems that this separation must still be managed.
Full determinism
Many game types, and especially multiplayer versions, must ensure full determinism. Otherwise players will experience different states of the same game effectively breaking the game logic. Determinism is required for game replays as well. And it is useful for other purposes where it is important that each run of a simulation produces the same result every time given the same starting conditions and inputs.
Updating and rendering at individual speeds
This is really a prerequisite for full determinism, as you cannot have the simulation depend on rendering speeds (i.e. the various monitor refresh rates, graphics adapter speed, etc.). Under optimal conditions the update speed should be set at a certain fixed interval (e.g. 25 updates per second - maybe less depending on the update type), and the rendering speed should be whatever the client's monitor refresh rate / graphics adapter allows.
This implies that a rendering speed higher than the update speed should be allowed. And while that sounds like a waste, there are known tricks to ensure that the added rendering cycles are not wasted (interpolation / extrapolation), which means that faster monitors / adapters are rewarded with a more visually pleasing experience, as they should be.
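For what it's worth, the interpolation trick usually boils down to something like this (a minimal sketch; the State type and Interpolate helper are invented for the example):

struct State { public float X, Y; }

static class Renderer
{
    // Blend the two most recent fixed-rate simulation states; alpha in [0,1]
    // is the fraction of the fixed update interval elapsed since the newer
    // state was produced (0 shows the previous state, 1 the current one).
    public static State Interpolate(State previous, State current, float alpha)
    {
        return new State
        {
            X = previous.X + (current.X - previous.X) * alpha,
            Y = previous.Y + (current.Y - previous.Y) * alpha,
        };
    }
}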
Rendering speeds lower than the update speed must also be allowed, though, even if this does in fact result in wasted updating cycles - at least the added updating cycles are not all presented to the user. This is necessary to ensure a smooth multiplayer experience even if the rendering in one of the clients slows to a sudden crawl for one reason or another.
No blocking on shared resources
If the other criteria mentioned above are to be implemented, it must also follow that we cannot allow rendering to wait for updating or vice versa. Of course it is painfully obvious that when 2 different threads share access to resources, and one thread is updating some of these resources, it is impossible to guarantee that blocking will never take place. It is, however, possible to keep this blocking at an absolute minimum - for example when switching pointer references between a queue of updated objects and a queue of previously rendered objects.
So...
My question to all you skilled people in here is: Am I asking for too much?
I have been reading about ideas of these various topics on many sites. But always it seems that one part or the other is left out from the suggestions I've seen. And maybe the reason is that you cannot have it all without compromise.
I started this seemingly common quest a long time ago when I was putting my thoughts about it in this thread:
Thoughts about rendering loop strategies
Back then my first naive assumption was that it shouldn't matter if updating and reading happened simultaneously, since the variation in object state was so small that you shouldn't notice if one object was occasionally a step ahead of another.
Now I am somewhat wiser, but still confused at times.
The most promising and detailed description of a method that would allow for all my wishes to come through was this:
http://blog.slapware.eu/game-engine/programming/multithreaded-renderloop-part1/
A three-state model that will ensure that the renderer can always choose a new queue for rendering without any wait (except perhaps a micro-second while switching pointer-references). At the same time the updater can always gain access to the 2 queues required for building the next state tree (1 queue for creating/updating the next state, and 1 queue for reading the previous one - which can be done even while the renderer reads it as well).
I recently found time to make a sample implementation of this, and it works very well, but for two issues.
One is a minor issue of having to deal with multiple references to all involved objects.
The other is more serious (unless I'm just being too needy). And that is the fact that extrapolation - as opposed to interpolation - is used to maintain a visually pleasing representation of the states given a fast screen refresh rate. While both methods do the job of showing states deviating from the solidly calculated object states, extrapolation seems to me to produce much more visible artifacts when the predictions fail to represent reality. My position seems to be supported by this:
http://gafferongames.com/networked-physics/snapshots-and-interpolation/
And it is not possible to implement interpolation in the three-state design as far as I can tell, since it requires the renderer to have read-access to 2 queues at all times to calculate the intermediate state between two known states.
So I was toying with extending the three-state model suggested on the slapware blog to utilize interpolation instead of extrapolation - and at the same time trying to simplify the multi-reference structure. While it seems to me to be possible, I am wondering if the price is too high. In order to meet all my goals I would need to have:
2 queues (or states) exclusively held by the renderer (they could be used by another thread for read-only purposes, but never updated or switched during rendering)
1 queue (or state) with the newest updated state ready to switch over to the renderer, when it is done rendering the current scene
1 queue (or state) with the next frame being built/updated by the updater
1 queue (or state) containing a copy of the frame last built/updated. This is the same state as last sent to the renderer, so this queue/state should be accessible by both the updater for reading the previous state and the renderer for rendering the state.
So that would mean that I would have to keep 4 copies of render states at all times to keep this design running smoothly, locklessly and deterministically.
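For the pointer-switching part, the hand-off I have in mind boils down to an atomic exchange on a shared slot, roughly like this (a hypothetical sketch; StateMailbox and its method names are my own invention):

using System.Threading;

class StateMailbox<TState> where TState : class
{
    private TState latest;   // most recently published state, not yet consumed

    // Updater thread: publish a finished state. Returns the state it displaced
    // (if any) so the updater can recycle it back into its pool.
    public TState Publish(TState newState) =>
        Interlocked.Exchange(ref latest, newState);

    // Renderer thread: take the newest state, or null if nothing new arrived,
    // in which case the renderer keeps interpolating between the two states
    // it already holds. Neither thread ever blocks on the other.
    public TState TryTake() =>
        Interlocked.Exchange(ref latest, null);
}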
I fear that I'm overthinking this. So if any of you have advice to pull me back down to earth, suggestions for what can be improved, critique of the design, or perhaps references to good resources explaining how these goals can be achieved, or why this is or isn't a good idea - please hit me with them :-)

Garbage collect certain object

Is it possible to garbage collect a certain object in Pharo?
E.g. I know that a certain object is not (should not be) referenced by any other object. And it takes a lot of space. Does it make sense to just run a general garbage collect on the system? Or is it possible to remove just a specific object/tree from the heap?
Smalltalk garbage collectors can't garbage-collect just a single object.
There are two basic techniques used - generation scavenging, and mark and sweep. Generation scavenging works on new and relatively new objects by copying the used objects into another unused space and ignoring all the garbage. Objects that get copied a lot of times are moved to "old space". Old space is garbage collected by a mark-and-sweep algorithm. This algorithm loops through all Smalltalk objects and marks them as "unmarked". It then traverses all accessible objects and marks them as "marked". In the final sweep, anything that's still marked as "unmarked" is freed.
There's no way to run either algorithm on a single object.
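To illustrate why it is all-or-nothing, here is a toy version of the mark-and-sweep pass described above (a C# sketch for illustration only, not Pharo's actual collector; Node and Collect are invented names):

using System.Collections.Generic;

class Node
{
    public bool Marked;
    public List<Node> References = new List<Node>();
}

static class MarkSweep
{
    public static void Collect(List<Node> heap, List<Node> roots)
    {
        // Mark phase: everything starts "unmarked"...
        foreach (var n in heap) n.Marked = false;

        // ...then traverse from the roots, marking every reachable object.
        var pending = new Stack<Node>(roots);
        while (pending.Count > 0)
        {
            var n = pending.Pop();
            if (n.Marked) continue;
            n.Marked = true;
            foreach (var r in n.References) pending.Push(r);
        }

        // Sweep phase: anything still unmarked is freed. There is no way to
        // ask "collect just this object" - reachability is a whole-heap
        // question, so the traversal has to run over everything regardless.
        heap.RemoveAll(n => !n.Marked);
    }
}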
No, it does not make sense, and it is not possible.
Also, it does not make sense to manually run the garbage collector (which you can do, of course)... the system should run the GC when needed, and you will get that space back.
The whole purpose of a gc is that you do not have to take care about that.
I think you're looking for a reference list (i.e. which object is keeping your object from being garbage collected). It might be a global variable somewhere. Or something in a class variable...

Threads - Message Passing

I was trying to find some resources on best performance and scaling with message passing. I heard that message passing by value instead of by reference can give better scalability, as it works well with NUMA-style setups and reduces contention for a given memory address.
I would assume value-based message passing only works with "smaller" messages. What would "smaller" be defined as? At what point would references be better? Would one do stream processing this way?
I'm looking for some helpful tips or resources for these kinds of questions.
Thanks :-)
P.S. I work in C#, but I don't think that matters so much for these kind of design questions.
Some factors to add to the excellent advice of Jeremy:
1) Passing by value only works efficiently for small messages. If the data has a [cache-line-size] unused area at the start to avoid false sharing, you are already approaching the size where passing by reference is more efficient.
2) Wider queues mean more space taken up by the queues, impacting memory use.
3) Copying data into/out of wide queue structures takes time. Apart from the actual CPU use while moving data, the queue remains locked during the copying. This increases contention on the queue, leading to an overall performance hit that is queue-width dependent. If there is any deadlock potential in your code, keeping locks for extended periods will not help matters.
4) Passing by value tends to lead to code that is specific to the data size, i.e. is fixed at compile time. Apart from a nasty infestation of templates, this makes it very difficult to tune buffer sizes etc. at run time.
5) If the messages are passed by reference and malloced/freed/newed/disposed/GC'd, this can lead to excessive contention on the memory manager and frequent, wasteful GC. I usually use fixed pools of messages, allocated at startup, specifically to avoid this (see the sketch after this list).
6) Handling byte-streams can be awkward when passing by reference. If a byte-stream is characterized by frequent delivery of single bytes, pass-by-reference is only sensible if the bytes are chunked-up. This can lead to the need for timeouts to ensure that partially-filled messages are dispatched to the next thread in a timely manner. This introduces complication and latency.
7) Pass-by-reference designs are inherently more likely to leak. This can lead to extended test times and overdosing on valgrind - a particularly painful addiction, (another reason I use fixed-size message object pools).
8) Complex messages, eg. those that contain references to other objects, can cause horrendous problems with ownership and lifetime-management if passed by value. Example - a server socket object has a reference to a buffer-list object that contains an array of buffer-instances of varying size, (real example from IOCP server). Try passing that by value..
9) Many OS calls cannot handle anything but a pointer. You cannot PostMessage, (that's a Windows API, for all you happy-feet), even a 256-byte structure by value with one call, (you have just the 2 wParam, lParam integers). Calls that set up asynchronous callbacks often allow 'context data' to be sent to the callback - almost always just one pointer. Any app that is going to use such OS functionality is almost forced to resort to pass by reference.
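To illustrate point 5, a fixed message pool can be as simple as this (a sketch under stated assumptions: MessagePool is an invented name, and BlockingCollection stands in for whatever bounded queue you prefer):

using System.Collections.Concurrent;

class MessagePool<TMessage> where TMessage : class, new()
{
    private readonly BlockingCollection<TMessage> free;

    public MessagePool(int capacity)
    {
        free = new BlockingCollection<TMessage>(capacity);
        for (int i = 0; i < capacity; i++)
            free.Add(new TMessage());    // allocate the whole pool at startup
    }

    // Producer: take a free message. Blocks if the pool is exhausted, which
    // doubles as natural back-pressure on producers that outrun consumers.
    public TMessage Rent() => free.Take();

    // Consumer: hand the message back once it has been fully processed, so
    // steady-state traffic causes no allocations and no GC pressure.
    public void Return(TMessage msg) => free.Add(msg);
}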
Jeremy Friesner's comment seems to be the best as this is a new area, although Martin James's points are also good. I know Microsoft is looking into message passing for their future kernels as we gain more cores.
There seems to be a framework that deals with message passing, and it claims much better performance than the current .Net producer/consumer generics. I'm not sure how it will compare to .Net's Dataflow in 4.5:
https://github.com/odeheurles/Disruptor-net

How does the Garbage Collector decide when to kill objects held by WeakReferences?

I have an object, which I believe is held only by a WeakReference. I've traced its reference holders using SOS and SOSEX, and both confirm that this is the case (I'm not an SOS expert, so I could be wrong on this point).
The standard explanation of WeakReferences is that the GC ignores them when doing its sweeps. Nonetheless, my object survives an invocation of GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced).
Is it possible for an object that is only referenced with a WeakReference to survive that collection? Is there an even more thorough collection that I can force? Or, should I re-visit my belief that the only references to the object are weak?
Update and Conclusion
The root cause was that there was a reference on the stack that was rooting the object (keeping it alive). It is unclear why neither SOS nor SOSEX was showing that reference. User error is always a possibility.
In the course of diagnosing the root cause, I did several experiments that demonstrated that WeakReferences to 2nd-generation objects can stick around a surprisingly long time. However, a weakly referenced 2nd-gen object will not survive GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced).
As per Wikipedia: "An object referenced only by weak references is considered unreachable (or "weakly reachable") and so may be collected at any time. Weak references are used to avoid keeping memory referenced by unneeded objects."
I am not sure if your case is about weak references...
Try calling GC.WaitForPendingFinalizers() right after GC.Collect().
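A minimal sketch of that pattern (the second collect is deliberate: it reclaims objects that were kept alive only so their finalizers could run):

using System;

GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
GC.WaitForPendingFinalizers();   // let the finalizer thread drain its queue
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);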
Another possible option: don't ever use a WeakReference for any purpose. In the wild, I've only ever seen them used as a mechanism for lowering an application's memory footprint (i.e. a form of caching). As the mighty MSDN says:
Avoid using weak references as an automatic solution to memory management problems. Instead, develop an effective caching policy for handling your application's objects.
I recommend you check for "other" references to the weakly referenced objects, because if there is another reference still alive, the objects won't be GCed.
Weakly referenced objects do get removed by garbage collection.
I've had the pleasure of debugging event systems where events were not getting fired... It turned out to be because the subscriber was only weakly referenced and so after some eventual random delay the GC would eventually collect it. At which point the UI stopped updating. :)
Yes, it is possible. If the weakly referenced object is located in a generation other than the one being collected - for example, if it is in the 2nd generation and the GC only does a Gen 0 collection - it will survive. It should not survive a full 2nd-gen collection that completes and where all finalizers run, however.
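If you want to see the generation effect for yourself, a small experiment along these lines should show it (a sketch; JIT and GC behaviour can vary, so treat the first printed result as a likelihood rather than a guarantee):

using System;

class WeakGenDemo
{
    static WeakReference MakeWeak()
    {
        // Allocate inside a helper so no stack reference roots the object
        // after we return.
        var obj = new object();
        GC.Collect(); GC.Collect();   // survive two GCs => promoted to Gen 2
        return new WeakReference(obj);
    }

    static void Main()
    {
        var weak = MakeWeak();
        GC.Collect(0);                     // Gen 0 only: the target sits in Gen 2
        Console.WriteLine(weak.IsAlive);   // often still True
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        Console.WriteLine(weak.IsAlive);   // False
    }
}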
