What do V8's IncrementalMarking and ProcessWeakCallbacks garbage collections do?

I have implemented V8's garbage collection callbacks (prologue and epilogue) and am recording the time taken by garbage collection as well as the counts of each type. Everything I've read on V8 talks about major GCs (Mark/Sweep/Compact) and minor GCs (Scavenge). But there are two additional types, both of which generate callbacks as well. From the V8 code:
enum GCType {
  kGCTypeScavenge = 1 << 0,
  kGCTypeMarkSweepCompact = 1 << 1,
  kGCTypeIncrementalMarking = 1 << 2,
  kGCTypeProcessWeakCallbacks = 1 << 3,
  kGCTypeAll = kGCTypeScavenge | kGCTypeMarkSweepCompact |
               kGCTypeIncrementalMarking | kGCTypeProcessWeakCallbacks
};
One odd thing about IncrementalMarking and ProcessWeakCallbacks is that their callbacks are always called the exact same number of times as the MarkSweepCompact callback.
My question is what are the IncrementalMarking and ProcessWeakCallbacks garbage collections? And also, why are they always invoked the same number of times as the MarkSweepCompact garbage collection (should they be considered part of that collection type)?

(V8 developer here.) Yes, "IncrementalMarking" and "ProcessWeakCallbacks" are not types of GC, but phases of major GC cycles. (I don't know why that enum happens to be called GCType, probably for historical reasons.)
I am recording the time taken by garbage collection as well as the counts of each type
Note that the GC callbacks are neither intended nor suitable for time measurements. In particular, incremental marking (as the name implies) happens in many tiny incremental steps, but you only get one invocation of the callback before the first of these steps happens; after that, incremental marking steps and program execution are interleaved until marking is done.
Further, note that the team is working on moving as much of the GC work as possible into background threads, which makes the whole question of "how much time did it take?" somewhat ill-defined.
For offline investigation purposes, your best bet is the --trace-gc flag, which should provide accurate and complete timing information.
For online bookkeeping (as in the question "V8 garbage collector callbacks for measuring GC activity"; see also my detailed answer there), I'm afraid there is no good solution.

Related

Why does Concurrent-Mark-Sweep (CMS) remark phase need to re-examine the thread-stacks instead of just looking at the mutator's write-queues?

The standard CMS algorithm starts by making the application undergo an STW pause to calculate the GC-root-set. It then resumes mutator threads, and both application and collector threads run concurrently until the marking is done. Any pointer store updated by a mutator thread is protected by a write barrier that will add that pointer reference to a write-queue.
When the marking phase is done, we proceed to the remarking phase: it must look into this write-queue and mark anything it finds there that was not already marked.
All of this makes sense. What I fail to understand is why would we need to:
1) Have this remarking phase recalculate the GC-root-set from scratch (including all thread stacks) -- does not doing this result in an incorrect algorithm, in the sense of it marking actually live and reachable objects as garbage to be reclaimed?
2) Have this remarking phase be another STW event (maybe this is because of having to analyse all the thread-stacks?)
When reading one of the original papers on CMS, "A Generational Mostly-concurrent Garbage Collector", one can see:
The original mostly-concurrent algorithm, proposed by Boehm et al. [5], is a concurrent "tricolor" collector [9]. It uses a write barrier to cause updates of fields of heap objects to shade the containing object gray. Its main innovation is that it trades off complete concurrency for better throughput, by allowing root locations (globals, stacks, registers), which are usually updated more frequently than heap locations, to be written without using a barrier to maintain the tricolor invariant.
It makes it look like this is just a trade-off emanating from a conscious decision not to cover what's happening on the stack with the write barriers?
Thanks
Have this remarking phase recalculate the GC-root-set from scratch (including all thread stacks) -- does not doing this result in an incorrect algorithm, in the sense of it marking actually live and reachable objects as garbage to be reclaimed?
No. Tricolor marking marks live objects: objects still unmarked by the time the "grey" set is exhausted are unreachable. Remark adds rediscovered root objects to the "grey" set, together with all references caught by the write barrier, so more objects can be marked as live.
In summary, after the CMS remark all live objects are marked, though some dead objects may be marked too.
Have this remarking phase be another STW event (maybe this is because of having to analyse all the thread-stacks?)
Yes, remark is an STW pause in the CMS algorithm in the HotSpot JVM (you can read more about the CMS phases here).
And answering the question from the title:
Why does Concurrent-Mark-Sweep (CMS) remark phase need to re-examine the thread-stacks instead of just looking at the mutator's write-queues?
CMS does not use "mutator write-queues"; it utilizes a card-marking write barrier (shared with the young-generation copying collector).
Generally, all algorithms using write barriers need an STW pause to avoid an "Achilles and the tortoise" paradox: while marking chases mutations, the mutator keeps producing new ones.
CMS starts with an initial tricolor marking. Once it completes, "some" live objects are marked, but due to concurrent modifications, marking may have missed certain objects. The write barrier captures all mutations, so "preclean" adds all mutated references to the "gray" set and resumes marking to reach the missed objects. For this process to converge, though, a final remark with the mutator stopped is required.
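To make the convergence issue concrete, here is a toy, deliberately single-threaded C# sketch of tricolor marking with a write barrier (all names are mine; this is not HotSpot's card-marking implementation). The barrier records mutated objects so a later remark pass can re-scan them:

using System.Collections.Generic;

// A toy heap object; Marked is the "black" bit, unmarked is "white".
class Node
{
    public bool Marked;
    public List<Node> Children = new List<Node>();
}

class ToyCollector
{
    // Objects mutated while concurrent marking was running.
    private readonly HashSet<Node> dirty = new HashSet<Node>();

    // The mutator must route every pointer store through this barrier.
    public void WriteBarrier(Node obj, Node newChild)
    {
        obj.Children.Add(newChild);
        dirty.Add(obj); // remember the mutation for preclean/remark
    }

    // Drain a gray worklist: blacken each object, gray its children.
    public void Mark(IEnumerable<Node> roots)
    {
        var gray = new Stack<Node>(roots);
        while (gray.Count > 0)
        {
            var n = gray.Pop();
            if (n.Marked) continue;
            n.Marked = true;
            foreach (var c in n.Children) gray.Push(c);
        }
    }

    // Preclean/remark: re-scan roots (they are not barrier-protected)
    // and everything the barrier caught. While mutators keep running,
    // "dirty" can keep refilling; stopping them lets this loop finish.
    public void Remark(IEnumerable<Node> roots)
    {
        Mark(roots);
        while (dirty.Count > 0)
        {
            var caught = new List<Node>(dirty);
            dirty.Clear();
            Mark(caught);
        }
    }
}

Once Remark returns with the mutator stopped, every unmarked object is provably unreachable, which is exactly the "all live objects are marked, some dead objects may be too" guarantee described above.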

Do I need synchronization to read and write a common cache file in a multithread environment?

Consider the following algorithm, which is running on multiple threads at the same time:
for (i = 0; i < 10000; i++) {
    z = rand(0, 50000);
    if (isset(cache[z])) {
        results[z] = cache[z];
    } else {
        result = z * 100;
        cache[z] = result;
        results[z] = result;
    }
}
The cache and results are both shared variables among the threads. If this algorithm runs as it is, without synchronization, what kind of errors can occur? If two threads try to write concurrently to cache[z] or results[z], can data be lost, or will the data simply be that of whichever thread won the race?
A more concrete example of the question: let's say Thread A and Thread B both try to write the number 1000 to cache[10] at the same time, and at the same time Thread C tries to read the data that is in cache[10]. Can Thread C's read observe an intermediate state, say 100, and then continue working with that incorrect data?
USE CASE: A real-life use case for which I am asking this question is hashtable-based caches. If all of the threads use the same hashtable cache, and they read and write data from and to it, and the data written to a specific key is always the same, do I need to synchronize these read and write operations?
Nobody could possibly know. Different languages, compilers, CPUs, platforms, and threading standards could handle this in entirely different ways. There's no way anyone can know what some future compiler, CPU, or platform might do. Unless the documentation or specification for the language or threading standard says what will happen in this case, there is absolutely no way to know what might happen. Of course, if something you're using guarantees particular behavior in this case, then what is guaranteed to happen will happen (unless it's broken).
At one time, there didn't exist any CPUs that buffered writes such that they could be visible out-of-order. But if you wrote code under the assumption that this meant that writes would never become visible out-of-order, that code would be broken on pretty much every modern platform.
This sad tale repeated over and over with numerous compiler optimizations that people never expected compilers to make but that compilers later made. Some of the aliasing fiascos come to mind.
Making decisions that require you to imagine correctly possible future evolutions of computing seems extremely unwise and has failed repeatedly, sometimes catastrophically, in the past.
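That said, if the concrete platform happens to be C# (as in the later questions on this page), the usual way out is to use a structure whose documentation does guarantee behavior. A minimal sketch using ConcurrentDictionary.GetOrAdd, which is thread-safe by contract (the value factory may run on several threads at once, but only one value is ever stored, and every caller observes it):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class CacheDemo
{
    static void Main()
    {
        var cache = new ConcurrentDictionary<int, int>();
        var results = new ConcurrentDictionary<int, int>();

        Parallel.For(0, 10000, i =>
        {
            int z = Random.Shared.Next(0, 50000); // thread-safe RNG (.NET 6+)
            // The factory may execute concurrently on several threads,
            // but only one result is stored; all callers see that value.
            results[z] = cache.GetOrAdd(z, key => key * 100);
        });

        Console.WriteLine($"{cache.Count} distinct keys cached");
    }
}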

Limiting work in progress of parallel operations of a streamed resource

I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code; there are some
// omissions around handling of tasks that do not run to completion that should
// be in production code.
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory and the program being unable to cope (generally evidenced via an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly -- notably the someMagicNumber multiplier, which, although tuned via profiling, may not be as optimal as it could be and isn't resilient to changes to the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes). The consequences of being too low are that the program might be slowed down. As long as you still have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit quicker than the task that follows it, so even a fairly small buffer is likely to always ensure the task following it has some work. The buffer size needed to get 90% of the benefits of a buffer is usually going to be quite small (a few dozen items, maybe), whereas the size needed to get an OOM error is likely 6+ orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in) you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
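For what it's worth, a common shape for the pattern in the question (a sketch only; TResult, doWorkAsync, and maxInFlight stand in for the question's stream element type, DoWorkAsync, and the tuned magic number) releases the semaphore in a finally block, which covers the "tasks that do not run to completion" omission the code comment mentions:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class Throttled
{
    public static async Task ProcessAsync<TResult>(
        IEnumerable<TResult> results,      // e.g. StreamResults() from the question
        Func<TResult, Task> doWorkAsync,   // e.g. DoWorkAsync from the question
        int maxInFlight)                   // the tuned magic number
    {
        using var semaphore = new SemaphoreSlim(maxInFlight);
        var inFlight = new List<Task>();

        foreach (var result in results)
        {
            await semaphore.WaitAsync(); // stalls the producer, bounding memory use
            inFlight.Add(RunOne(result));
        }
        await Task.WhenAll(inFlight);

        async Task RunOne(TResult r)
        {
            try { await doWorkAsync(r); }
            finally { semaphore.Release(); } // released even if the work faults
        }
    }
}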

Threads - Message Passing

I was trying to find some resources on best performance and scaling with message passing. I heard that message passing by value instead of by reference can give better scalability, as it works well with NUMA-style setups and reduces contention for a given memory address.
I would assume value based message passing only works with "smaller" messages. What would "smaller" be defined as? At what point would references be better? Would one do stream processing this way?
I'm looking for some helpful tips or resources for these kinds of questions.
Thanks :-)
P.S. I work in C#, but I don't think that matters so much for these kind of design questions.
Some factors to add to the excellent advice of Jeremy:
1) Passing by value only works efficiently for small messages. If the data has a [cache-line-size] unused area at the start to avoid false sharing, you are already approaching the size where passing by reference is more efficient.
2) Wider queues mean more space taken up by the queues, impacting memory use.
3) Copying data into/out of wide queue structures takes time. Apart from the actual CPU use while moving data, the queue remains locked during the copying. This increases contention on the queue, leading to an overall performance hit that is queue-width dependent. If there is any deadlock potential in your code, keeping locks for extended periods will not help matters.
4) Passing by value tends to lead to code that is specific to the data size, ie. is fixed at compile-time. Apart from a nasty infestation of templates, this makes it very difficult to tune buffer-sizes etc. at run-time.
5) If the messages are passed by reference and malloced/freed/newed/disposed/GC'd, this can lead to excessive contention on the memory-manager and frequent, wasteful GC. I usually use fixed pools of messages, allocated at startup, specifically to avoid this (see the sketch after this list).
6) Handling byte-streams can be awkward when passing by reference. If a byte-stream is characterized by frequent delivery of single bytes, pass-by-reference is only sensible if the bytes are chunked-up. This can lead to the need for timeouts to ensure that partially-filled messages are dispatched to the next thread in a timely manner. This introduces complication and latency.
7) Pass-by-reference designs are inherently more likely to leak. This can lead to extended test times and overdosing on valgrind - a particularly painful addiction, (another reason I use fixed-size message object pools).
8) Complex messages, eg. those that contain references to other objects, can cause horrendous problems with ownership and lifetime-management if passed by value. Example - a server socket object has a reference to a buffer-list object that contains an array of buffer-instances of varying size, (real example from IOCP server). Try passing that by value..
9) Many OS calls cannot handle anything but a pointer. You cannot PostMessage, (that's a Windows API, for all you happy-feet), even a 256-byte structure by value with one call, (you have just the 2 wParam,lParam integers). Calls that set up asynchronous callbacks often allow 'context data' to be sent to the callback - almost always just one pointer. Any app that is going to use such OS functionality is almost forced to resort to pass by reference.
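To illustrate point 5, here is a bare-bones fixed pool in C# (hypothetical types, not from any particular framework): all messages are allocated at startup and recycled through a bounded queue, so steady-state passing does no allocation and puts no pressure on the GC.

using System.Collections.Concurrent;

// A message with a fixed-size payload buffer, reused for its whole lifetime.
class Message
{
    public readonly byte[] Payload = new byte[1024];
    public int Length;
}

class MessagePool
{
    private readonly BlockingCollection<Message> free;

    public MessagePool(int capacity)
    {
        free = new BlockingCollection<Message>(new ConcurrentQueue<Message>(), capacity);
        for (int i = 0; i < capacity; i++)
            free.Add(new Message()); // allocate everything up front
    }

    // Blocks when the pool is empty: natural flow control, and a leaked
    // message shows up quickly as a producer stuck waiting for a Rent.
    public Message Rent() => free.Take();

    public void Return(Message m)
    {
        m.Length = 0;
        free.Add(m);
    }
}

Rent blocking when the pool is empty doubles as flow control: a fast producer stalls instead of flooding memory, and a leak surfaces as a stuck producer rather than as creeping memory growth.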
Jeremy Friesner's comment seems to be the best as this is a new area, although Martin James's points are also good. I know Microsoft is looking into message passing for their future kernels as we gain more cores.
There seems to be a framework that deals with message passing and it claims to have much better performance than current .Net producer/consumer generics. I'm not sure how it will compare to .Net's Dataflow in 4.5
https://github.com/odeheurles/Disruptor-net

Progress bar and multiple threads, decoupling GUI and logic - which design pattern would be the best?

I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled - in this case, new portions of data need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series, one for dataset update and one for graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in the right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data, and B) the number of CPUs: 2 of the steps scaled extremely well with an increasing number of CPUs, 2 steps ran in 2 threads, and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of CPUs. Based on run-time data from my own 4-core machine, each function basically returned the number of milliseconds that step was expected to take on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now given this information, you can effectively know what percentage of the total execution time each step is supposed to take. Now if you say step1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say the for-loop is processing 100,000 items, you could probably do:
for (int i = 0; i < numItems; i++) {
    // Fires percentageOfTotalForThisStep events over the whole loop,
    // i.e. one event per 1% of total predicted runtime.
    if (i % (numItems / percentageOfTotalForThisStep) == 0) emitProgressEvent();
    // .. do the actual processing ..
}
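The percentageOfTotalForThisStep values fall straight out of the estimator functions. A sketch with invented estimates, where the six array entries stand in for the f1..f6 results above:

using System;
using System.Linq;

class Shares
{
    static void Main()
    {
        // Pretend f1(..) .. f6(..) returned these millisecond estimates.
        double[] estimates = { 400, 1200, 300, 2100, 500, 1500 };
        double total = estimates.Sum(); // 6000 ms predicted overall

        // Share of the progress bar each step owns, in whole percent.
        int[] percentageOfTotal = estimates
            .Select(e => (int)Math.Round(100.0 * e / total))
            .ToArray();

        Console.WriteLine(string.Join(", ", percentageOfTotal)); // 7, 20, 5, 35, 8, 25
    }
}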
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratio (which is the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based; if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need some way of notifying its owning Operation of its current progress through whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
Obviously there's a ton of other considerations/work that needs be done in actually implementing this, but I hope this gives you the gist of it.
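A bare-bones sketch of that idea in C# (Operation, ExpectedWork, and Completed are names I'm inventing here, not from any framework):

using System;
using System.Collections.Generic;
using System.Linq;

// Main-thread proxy for one worker thread; the worker notifies its
// Operation of progress via whatever callback mechanism you have.
class Operation
{
    public double ExpectedWork { get; set; } // predicted ms, or item count
    public double Completed { get; set; }    // same units as ExpectedWork
}

static class ProgressAggregator
{
    // Collapse all workers into a single 0..1 value for one progress bar.
    public static double Overall(IReadOnlyList<Operation> ops)
    {
        double expected = ops.Sum(o => o.ExpectedWork);
        if (expected <= 0) return 0.0;
        double done = ops.Sum(o => Math.Min(o.Completed, o.ExpectedWork));
        return done / expected;
    }
}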
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions I first try to look at similar or analogous problems in other application, and how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.
