Why do we need the SATB algorithm in concurrent GC tracing? - garbage-collection

Say that in a concurrent GC, at the beginning of tracing, object A is in the root set, and A refers to B and C, both of which are on the heap. During tracing, the mutator clears A's reference to C, so C is now dead. The SATB algorithm says that C will be saved in the remembered set by the write barrier. But if we did not save C in the remembered set, two cases could happen:
Case 1: the mutator's change happens before the tracing thread reaches C. Then the tracing thread never reaches C, which is already dead, and in this round of tracing the live set includes only B.
Case 2: the mutator's change happens after the tracing thread reaches C (but before the next round of tracing). Then the tracing thread does reach C, and in this round of tracing the live set includes B and C. C will be collected in the next round of tracing.
So in both cases we eventually get the correct live set {B}, either in the first round of tracing or in the next, without SATB.
My question is: in the above case, SATB is not necessary at all to keep the concurrent GC correct. Why do we bother to save the snapshot in the write barrier?

Your case is not really interesting for the discussion of concurrent collectors. C is dead, and whether the collector notices its unreachability during the current or the next cycle makes little difference, since a GC does not guarantee that memory gets freed immediately anyway.
The interesting cases are about live objects, since those are the ones you don't want freed by accident. This can easily happen when a mutator takes an object (existing or newly created) that has not been marked yet and stores it into a field of an already-marked object. Since the holding object is already marked, the collector will not visit it again, so it never discovers the new reference, and the write barrier has to assist in some way. SATB is one of those ways: it records overwritten (old) reference values so that every object reachable in the snapshot taken at the start of marking is eventually marked.
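To make that concrete, here is a minimal sketch of a SATB-style pre-write barrier (illustrative Java, not any particular VM's implementation; real collectors emit this as compiled code and keep one queue per mutator thread):

final class SatbBarrier {
    static final class Holder { Object field; }

    // Overwritten references for the collector to mark later.
    // Single-threaded simplification: a real VM uses per-thread queues.
    static final java.util.ArrayDeque<Object> satbQueue = new java.util.ArrayDeque<>();
    static volatile boolean markingActive = false;

    static void writeField(Holder holder, Object newRef) {
        if (markingActive) {
            Object oldRef = holder.field;
            if (oldRef != null) {
                satbQueue.add(oldRef); // preserve the start-of-cycle snapshot
            }
        }
        holder.field = newRef;
    }
}

With this barrier, deleting A's reference to C enqueues C, so C survives the current cycle as floating garbage; the point is that the same mechanism also preserves any live object whose only remaining reference was moved into an already-marked object.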

Related

(Conceptual) Which technique/pattern to use when a thread changes program parameters during execution? (multithreading)

I need to change a program with:
an I/O thread A, that receives data from a stream, and
worker threads X, Y and Z, that do batch processing depending on global parameters acquired and set by A.
However, A cannot change these parameters while X, Y and Z are processing data, or the data would be corrupted. The changes should only be valid for the next batches.
A number of ideas have passed through my mind:
1. When A receives changes, it has X, Y and Z stop once they are done with their latest batches. It then changes the global parameters and allows X, Y and Z to resume.
2. X, Y and Z keep a local copy of the global parameters and refresh it every time they finish a batch, thus picking up any changes. While A is changing the parameters, X, Y and Z have to wait to access them. (I believe the primitive used for this is called a mutex.)
While thinking about it, it occurred to me that this must be a fairly common situation and that solutions for it must already exist. Is there a pattern that solves this (maybe 1 or 2 with an established name)? If not, what do you think of proposals 1 and 2? A mutex looks sufficient for 2, but how could 1 be implemented?
Thread A maintains a volatile/atomic pointer to an immutable parameter object.
X/Y/Z grab their own copy of this pointer just before processing a batch, and use the parameter object.
When A gets a parameter change, it creates an entirely new parameter object and updates the pointer to it.
If you're using a language with garbage collection, then that's all you have to worry about. If you're using C++, then you can use std::shared_ptr to manage the lifetime of the parameter objects.
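For illustration, a minimal Java sketch of that pattern, with AtomicReference standing in for the volatile/atomic pointer (the parameter fields are invented):

import java.util.concurrent.atomic.AtomicReference;

// Immutable parameter object: a brand-new instance is published per change.
record BatchParams(int batchSize, double threshold) {}

class ParamHolder {
    private final AtomicReference<BatchParams> current =
            new AtomicReference<>(new BatchParams(100, 0.5));

    // Thread A: publish an entirely new parameter object.
    void update(BatchParams fresh) { current.set(fresh); }

    // Threads X/Y/Z: grab a stable snapshot once, just before a batch.
    BatchParams snapshot() { return current.get(); }
}

Each worker calls snapshot() once at the start of a batch and uses only that object, so a concurrent update() can never corrupt an in-flight batch; old parameter objects are reclaimed by the GC once no worker still references them.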

Why does Concurrent-Mark-Sweep (CMS) remark phase need to re-examine the thread-stacks instead of just looking at the mutator's write-queues?

The standard CMS algorithm starts by making the application undergo a STW pause to calculate the GC-root-set. It then resumes mutator threads and both application and collector threads run concurrently until the marking is done. Any pointer store updated by a mutator-thread is protected by a write-barrier that will add that pointer reference to a write-queue.
When the marking phase is done we then proceed to the Remarking phase: it must then look into this write-queue and proceed to mark anything it finds there that was not already marked.
All of this makes sense. What I fail to understand is why would we need to:
Have this remarking phase recalculate the GC-root-set from scratch (including all thread stacks) -- would skipping this result in an incorrect algorithm, in the sense of marking actually live and reachable objects as garbage to be reclaimed?
Have this remarking phase be another STW event (maybe this is because of having to analyse all the thread-stacks?)
When reading one of the original papers on CMS, "A Generational Mostly-concurrent Garbage Collector", one can see:
The original mostly-concurrent algorithm, proposed by Boehm et al. [5], is a concurrent "tricolor" collector [9]. It uses a write barrier to cause updates of fields of heap objects to shade the containing object gray. Its main innovation is that it trades off complete concurrency for better throughput, by allowing root locations (globals, stacks, registers), which are usually updated more frequently than heap locations, to be written without using a barrier to maintain the tricolor invariant.
It makes it look like this is just a trade-off emanating from a conscious decision not to involve what's happening on the stacks in the write barriers?
Thanks
Have this remarking phase recalculate the GC-root-set from scratch (including all thread stacks) -- would skipping this result in an incorrect algorithm, in the sense of marking actually live and reachable objects as garbage to be reclaimed?
No. Tricolor marking only ever marks objects as live; objects still unmarked once the "grey" set is exhausted are the ones deemed unreachable. Remark adds the rediscovered root objects to the "grey" set, together with all references caught by the write barrier, so additional objects can get marked as live, but nothing already marked is ever unmarked.
In summary, after the CMS remark all live objects are marked, though some dead objects may be marked too (they become floating garbage and are reclaimed in a later cycle).
Have this remarking phase be another STW event (maybe this is because of having to analyse all the thread-stacks?)
Yes, the remark is a STW pause in the CMS algorithm in the HotSpot JVM (you can read more about the CMS phases here).
And answering the question from the title:
Why does Concurrent-Mark-Sweep (CMS) remark phase need to re-examine the thread-stacks instead of just looking at the mutator's write-queues?
CMS does not use "mutator write-queues"; it utilizes a card-marking write barrier (shared with the young-generation copying collector).
Generally, all algorithms using write barriers need a STW pause to avoid an "Achilles and the tortoise" situation: the collector chases the mutators, and the mutators keep producing new marking work.
CMS starts with the initial tricolor marking. Once that completes, most live objects are marked, but due to concurrent modifications the marking may have missed certain objects. The write barrier captures all mutations, so the "pre-clean" phase adds all mutated references to the "grey" set and resumes marking to reach the missed objects. For this process to converge, though, a final remark with the mutators stopped is required.
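To picture what that card-marking barrier does, here is a simplified sketch; the 512-byte card size matches HotSpot, everything else is illustrative (real VMs emit this as a couple of machine instructions):

// Sketch of a card-marking post-write barrier. The heap is divided into
// fixed-size "cards"; any reference store dirties the card covering the
// updated field, and pre-clean/remark scan only dirty cards instead of
// the whole heap.
final class CardTable {
    static final int CARD_SHIFT = 9;          // 2^9 = 512-byte cards
    static final byte DIRTY = 0;              // HotSpot uses 0 for dirty
    final byte[] cards = new byte[1 << 20];   // covers a toy 512 MB "heap"

    void postWriteBarrier(long fieldAddress) {
        cards[(int) (fieldAddress >>> CARD_SHIFT)] = DIRTY;
    }
}

Because the barrier is just a one-byte store, mutator overhead stays tiny; the cost is pushed to the collector, which must rescan the dirty cards during pre-clean and the final remark.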

locking between 2 user threads

I have a main thread that creates/destroys objects. Let's name the object 'f'.
Now, every time such an object is created it is added to the tail queue of another object - say 'mi' - and it is removed from that queue when it is deleted.
Now, there is another thread that runs every second and tries to gather statistics for these 'f' objects. It walks through all the possible instances of 'mi' (say 2048), and for each such 'mi' it gathers all the 'f' objects attached to it, sends a command down to the lower layer, which emits some values corresponding to these objects, and then updates the corresponding 'f' objects with those values.
Now the concern is: what if one of these 'f' objects gets deleted by the main thread while this once-per-second walk is happening?
Intuitively, one would think of having a lock at the 'mi' level that is acquired before beginning the walk and released after the walk/update of all the 'f' objects belonging to that particular 'mi', correct?
But the only hitch is that there could be tens of thousands or even millions of 'f' objects tied to one instance of 'mi'.
The other requirement is that the main thread's performance in creating/destroying these 'f' objects should stay high, i.e. at a rate of at least 10,000 objects per second.
So given that, I'm not sure if it's feasible to have this per-'mi' object lock. Or am I overestimating the side effects of lock contention?
Any other ideas?
Now the concern is what if one of these 'f' objects gets deleted by the main thread while this walk is happening every 1s?
If an f object gets deleted while the other thread is trying to use it, undefined behavior will be invoked and you will probably end up spending some hours debugging your program to try to figure out why it is occasionally crashing. :) The trick is to make sure that you never delete any f while the other thread might be using it -- typically that would mean that your main thread needs to lock the mi's mutex before removing the f from its queue -- once the f is no longer in the queue, you can release the mutex before deleting the f if you want to, since at that point the other thread will not be able to access the f anyway.
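For concreteness, here is a minimal Java sketch of that discipline (all names invented; the same shape works with a pthread or std::mutex in C/C++):

import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

class F { void release() {} void updateStats() {} }

class Mi {
    final ReentrantLock lock = new ReentrantLock();
    final ArrayDeque<F> queue = new ArrayDeque<>();

    // Main thread: unlink under the lock, destroy outside it.
    void destroy(F f) {
        lock.lock();
        try {
            queue.remove(f);   // f is now invisible to the stats thread
        } finally {
            lock.unlock();
        }
        f.release();           // safe: nobody else can reach f anymore
    }

    // Stats thread: hold the lock for the whole walk of this mi.
    void gatherStats() {
        lock.lock();
        try {
            for (F f : queue) f.updateStats();
        } finally {
            lock.unlock();
        }
    }
}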
I'm not sure if it's feasible to have this per-'mi' object lock?
It's feasible, as long as you don't mind your main thread occasionally getting held off (i.e. blocked waiting in a mutex::lock() method-call) until your other thread finishes iterating through the mi's queue and releases the mutex. Whether that hold-off time is acceptable or not will depend on the latency requirements of your main thread (e.g. if it's generating a report, then being blocked for some number of milliseconds is no problem; OTOH if it is operating the control surfaces on a rocket in flight, being blocked for any length of time is unacceptable).
Any other ideas ?
My first idea is to get rid of the second thread entirely -- just have the main thread call the statistics-collection function directly once per second instead. Then you don't have to worry about mutexes or mutex contention at all. This does mean that your main thread won't be able to perform its primary function while it is running the statistics-collection function, but at least its "down time" is now predictable rather than being a random function of which mi objects the two threads happen to try to lock/access at any given instant.
If that's no good (i.e. you can't tolerate any significant hold-off time whatsoever), another approach would be to use a message-passing paradigm rather than a shared-data paradigm. That is, instead of allowing both threads direct access to the same set of mi's, use a message queue of some sort, so that the main thread can take a mi out of service and send it over to the second thread for statistics-gathering purposes. The second thread then scans/updates it as usual and, when it's done, passes it back (via a second message queue) to the primary thread, which puts it back into service. You can periodically do this with various mi's to keep statistics updated on each of them, without ever requiring shared access to any of them. (This only works if your main thread can afford to go without access to certain mi's for short periods, though.)
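A rough sketch of that message-passing variant, assuming Java blocking queues (all names illustrative; Mi is as in the sketch above):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Ownership of an Mi is transferred through queues, so the two threads
// never touch the same Mi at the same time and no per-Mi lock is needed.
class StatsPipeline {
    final BlockingQueue<Mi> toStats = new LinkedBlockingQueue<>();
    final BlockingQueue<Mi> backToMain = new LinkedBlockingQueue<>();

    // Main thread: take an mi out of service and hand it over.
    void sendForStats(Mi mi) throws InterruptedException {
        toStats.put(mi);
    }

    // Stats thread: exclusive access while the mi is "checked out".
    void statsLoop() throws InterruptedException {
        while (true) {
            Mi mi = toStats.take();
            mi.gatherStats();      // no locking required here
            backToMain.put(mi);    // return it to service
        }
    }
}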

Is there some algorithm for R/W lock graphs?

Suppose we have resources A, B, C and their dependencies, which are not cyclic:
B->A
C->A
This means B strongly depends on A and C strongly depends on A. For example, B and C are resources precomputed from A. So if A is updated, B and C should be updated too; but if B is updated, nothing changes except B.
And now the problem: given that each node of the graph can be accessed for Read, Write, or Read/Upgrade-to-Write in a multi-threaded manner, how is one supposed to manage locks in such a graph? Is there a generalization of this problem?
Update
Sorry for the unclear question. Here is one more very important detail:
If, for example, A changes and forces B and C to be updated, then the moment B and its dependencies finish updating, B's write lock should be released.
Your question is a blend of transactions, locking, concurrency, and conflict resolution. Therefore models used in relational databases might serve your purpose.
There are many methods defined for concurrency control.
In your case, some might apply depending on how optimistic or pessimistic your algorithm needs to be, how many reads or writes occur, and the amount of data per transaction.
I can think of the two methods that can help in your case:
1. Strict Two-Phase Locking (S2PL)
A transaction begins, locks on A, B, C are obtained and are kept until the end of the transaction. Because multiple locks are held until the end of the transaction, a deadlock condition might be encountered while acquiring them. Locks can be converted (e.g. upgraded) during the transaction.
This approach is serializable, meaning that all events come in order and no other party can make any changes while the transaction holds its locks.
This approach is pessimistic, and locks might be held for a good amount of time, so resources and time will be spent.
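A minimal sketch of S2PL over A, B and C, assuming one read/write lock per resource, acquired in a fixed global order to sidestep the deadlock risk mentioned above:

import java.util.concurrent.locks.ReentrantReadWriteLock;

class Resources {
    // One lock per resource; always acquired in the fixed order A, B, C.
    final ReentrantReadWriteLock a = new ReentrantReadWriteLock();
    final ReentrantReadWriteLock b = new ReentrantReadWriteLock();
    final ReentrantReadWriteLock c = new ReentrantReadWriteLock();

    // "Transaction": update A and its dependents B and C under write
    // locks, holding every lock until the very end (the "strict" part).
    void updateA(Runnable recomputeAll) {
        a.writeLock().lock();
        b.writeLock().lock();
        c.writeLock().lock();
        try {
            recomputeAll.run();
        } finally {
            c.writeLock().unlock();
            b.writeLock().unlock();
            a.writeLock().unlock();
        }
    }
}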
2. Multiversion
Instead of placing locks on A, B, C, maintain version numbers and create a snapshot of each. All changes are made to the snapshots. At the end, the snapshots replace the previous versions; if any version of A, B or C has changed in the interim, an error condition occurs and the changes are discarded.
This approach places no read or write locks, meaning it will be fast. But in case of a conflict - any version changed in the interim - the work is discarded.
This is optimistic, but it might spend many more resources in favor of speed.
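A minimal optimistic sketch, assuming all three resources can be bundled into one immutable snapshot behind an atomic reference (types invented for illustration):

import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Immutable snapshot of A plus its derived resources B and C.
record Snapshot(int a, int b, int c) {}

class VersionedStore {
    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(0, 0, 0));

    // Optimistic update: recompute from a snapshot, then publish only if
    // nothing changed in the interim; false signals a conflict, and the
    // caller decides whether to retry or give up.
    boolean tryUpdate(UnaryOperator<Snapshot> recompute) {
        Snapshot before = current.get();
        Snapshot after = recompute.apply(before);
        return current.compareAndSet(before, after);
    }
}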
Transaction log
In database systems there is also the concept of a "transaction log". This means that any transaction, whether completed or pending, is present in the transaction log. So every operation done in any of the above methods is first written to the transaction log. Operations from the log are materialized into the main store at the right moment. In case of failure the log is analyzed, completed transactions are materialized to the main store, and pending ones are simply discarded.
This is used also in "log shipping" in order to ship the log to other servers for the purpose of replication.
Known Implementations
There are multiple in-memory databases that might save you the hassle of implementing your own solution.
H2 also provides a serializable isolation level that can match your use case.
go-memdb provides multiversion concurrency. It uses an immutable radix tree algorithm, so you can also look into that for details if you want to build your own solution.
Many more are defined here.
I am not aware of a specific pattern here; so my solution would go like this:
First of all, I would reverse the edges in your graph. You don't care that A is a dependency for B; what matters is the other direction, which tells you what you have to lock:
A->B
A->C
Because now you can say: if I want to do X to A, I need the X lock on A and on every object depending on A.
And now you can go: inspect A and the objects depending on A, and so forth, to determine the set of objects you need an X lock on.
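Determining that lock set is then a plain traversal of the reversed edges; a small illustrative sketch:

import java.util.*;

class LockPlanner {
    // Reversed dependency edges: "A" -> [B, C] means that locking A also
    // requires locking B and C.
    final Map<String, List<String>> dependents = Map.of(
            "A", List.of("B", "C"),
            "B", List.of(),
            "C", List.of());

    // Collect the node itself plus everything transitively depending on it.
    Set<String> lockSet(String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(root));
        while (!work.isEmpty()) {
            String node = work.pop();
            if (seen.add(node)) {
                work.addAll(dependents.getOrDefault(node, List.of()));
            }
        }
        return seen;   // lockSet("A") == {A, B, C}, lockSet("B") == {B}
    }
}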
Regarding your comment: "Because X in this case is either Read or Upgraded-Write, and if A needs Write it doesn't clearly mean that B needs it too." For me that translates to: the whole "graph idea" doesn't help. Such a graph is only useful to express direct relations like "if A, then B". If there is an edge between A and B, it means you want to treat them the same way. If you are now saying that your objects might or might not need to be both write-locked, what would be the point of the graph? You would end up with a lot of effectively independent objects, where a write to A sometimes needs a write lock on something else and sometimes does not.

Using threadsafe initialization in a JRuby gem

Wanting to be sure we're using the correct synchronization (and no more than necessary) when writing threadsafe code in JRuby; specifically, in a Puma-instantiated Rails app.
UPDATE: Extensively re-edited this question to be very clear and to use the latest code we are implementing. This code uses the atomic gem written by @headius (Charles Nutter) for JRuby, but we're not sure it is totally necessary, or in which ways it's necessary, for what we're trying to do here.
Here's what we've got. Is this overkill (meaning: are we over-engineering this), or perhaps incorrect?
ourgem.rb:
require 'atomic' # gem from @headius

SUPPORTED_SERVICES = %w(serviceABC anotherSvc andSoOnSvc).freeze

module Foo
  def self.included(cls)
    cls.extend(ClassMethods)
    cls.send :__setup
  end

  module ClassMethods
    def get(service_name, method_name, *args)
      __cached_client(service_name).send(method_name.to_sym, *args)
      # we also capture exceptions here, but leaving those out for brevity
    end

    private

    def __client(service_name)
      # obtain and return a client handle for the given service_name
      # we definitely want to cache the value returned from this method
      # **AND**
      # it is a requirement that this method ONLY be called *once PER service_name*.
    end

    def __cached_client(service_name)
      @@_clients.value[service_name]
    end

    def __setup
      @@_clients = Atomic.new({})
      @@_clients.update do |current_services|
        SUPPORTED_SERVICES.inject(Atomic.new({}).value) do |memo, service_name|
          if current_services[service_name]
            current_services[service_name]
          else
            memo.merge({service_name => __client(service_name)})
          end
        end
      end
    end
  end
end
client.rb:
require 'ourgem'

class GetStuffFromServiceABC
  include Foo

  def self.get_some_stuff
    result = get('serviceABC', 'method_bar', 'arg1', 'arg2', 'arg3')
    puts result
  end
end
Summary of the above: we have @@_clients (a mutable class variable holding a Hash of clients) which we only want to populate ONCE for all available services, keyed on service_name.
Since the hash is in a class variable (and hence threadsafe?), are we guaranteed that the call to __client will not get run more than once per service name (even if Puma is instantiating multiple threads with this class to service all the requests from different users)? If the class variable is threadsafe (in that way), then perhaps the Atomic.new({}) is unnecessary?
Also, should we be using an Atomic.new(ThreadSafe::Hash) instead? Or again, is that not necessary?
If not (meaning: you think we do need at least the Atomic.new, and perhaps also the ThreadSafe::Hash), then why couldn't a second (or third, etc.) thread interrupt between the Atomic.new(nil) and the @@_clients.update do ..., meaning the Atomic.new calls from EACH thread would EACH create two (separate) objects?
Thanks for any thread-safety advice, we don't see any questions on SO that directly address this issue.
Just a friendly piece of advice, before I attempt to tackle the issues you raise here:
This question, and the accompanying code, strongly suggests that you don't (yet) have a solid grasp of the issues involved in writing multi-threaded code. I encourage you to think twice before deciding to write a multi-threaded app for production use. Why do you actually want to use Puma? Is it for performance? Will your app handle many long-running, I/O-bound requests (like uploading/downloading large files) at the same time? Or (like many apps) will it primarily handle short, CPU-bound requests?
If the answer is "short/CPU-bound", then you have little to gain from using Puma. Multiple single-threaded server processes would be better. Memory consumption will be higher, but you will keep your sanity. Writing correct multi-threaded code is devilishly hard, and even experts make mistakes. If your business success, job security, etc. depends on that multi-threaded code working and working right, you are going to cause yourself a lot of unnecessary pain and mental anguish.
That aside, let me try to unravel some of the issues raised in your question. There is so much to say that it's hard to know where to start. You may want to pour yourself a cold or hot beverage of your choice before sitting down to read this treatise:
When you talk about writing "thread-safe" code, you need to be clear about what you mean. In most cases, "thread-safe" code means code which doesn't concurrently modify mutable data in a way which could cause data corruption. (What a mouthful!) That could mean that the code doesn't allow concurrent modification of mutable data at all (using locks), or that it does allow concurrent modification, but makes sure that it doesn't corrupt data (probably using atomic operations and a touch of black magic).
Note that when your threads are only reading data, not modifying it, or when working with shared stateless objects, there is no question of "thread safety".
Another definition of "thread-safe", which probably applies better to your situation, has to do with operations which affect the outside world (basically I/O). You may want some operations to only happen once, or to happen in a specific order. If the code which performs those operations runs on multiple threads, they could happen more times than desired, or in a different order than desired, unless you do something to prevent that.
It appears that your __setup method is only called when ourgem.rb is first loaded. As far as I know, even if multiple threads require the same file at the same time, MRI will only ever let a single thread load the file. I don't know whether JRuby is the same. But in any case, if your source files are being loaded more than once, that is symptomatic of a deeper problem. They should only be loaded once, on a single thread. If your app handles requests on multiple threads, those threads should be started up after the application has loaded, not before. This is the only sane way to do things.
Assuming that everything is sane, ourgem.rb will be loaded using a single thread. That means __setup will only ever be called by a single thread. In that case, there is no question of thread safety at all to worry about (as far as initialization of your "client cache" goes).
Even if __setup was to be called concurrently by multiple threads, your atomic code won't do what you think it does. First of all, you use Atomic.new({}).value. This wraps a Hash in an atomic reference, then unwraps it so you just get back the Hash. It's a no-op. You could just write {} instead.
Second, your Atomic#update call will not prevent the initialization code from running more than once. To understand this, you need to know what Atomic actually does.
Let me pull out the old, tired "increment a shared counter" example. Imagine the following code is running on 2 threads:
i += 1
We all know what can go wrong here. You may end up with the following sequence of events:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A writes its incremented value back to i.
Thread B writes its incremented value back to i.
So we lose an update, right? But what if we store the counter value in an atomic reference, and use Atomic#update? Then it would be like this:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A tries to write its incremented value back to i, and succeeds.
Thread B tries to write its incremented value back to i, and fails, because the value has already changed.
Thread B reads i again and increments it.
Thread B tries to write its incremented value back to i again, and succeeds this time.
Do you get the idea? Atomic never stops 2 threads from running the same code at the same time. What it does do, is force some threads to retry the #update block when necessary, to avoid lost updates.
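Since JRuby's atomic gem wraps the JVM's compare-and-swap primitives, the retry loop is easy to spell out in Java (a sketch of the idea, not the gem's actual source):

import java.util.concurrent.atomic.AtomicInteger;

// Equivalent of Atomic#update: read, compute, compare-and-set, retry.
// The "block" (the loop body) may run MORE than once per call if another
// thread wins the race -- the very opposite of "run this code exactly once".
class Counter {
    private final AtomicInteger i = new AtomicInteger();

    int increment() {
        int current, next;
        do {
            current = i.get();
            next = current + 1;        // recomputed on every retry
        } while (!i.compareAndSet(current, next));
        return next;
    }
}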
If your goal is to ensure that your initialization code will only ever run once, using Atomic is a very inappropriate choice. If anything, it could make it run more times, rather than less (due to retries).
So, that is that. But if you're still with me here, I am actually more concerned about whether your "client" objects are themselves thread-safe. Do they have any mutable state? Since you are caching them, it seems that initializing them must be slow. Be that as it may, if you use locks to make them thread-safe, you may not be gaining anything from caching and sharing them between threads. Your "multi-threaded" server may be reduced to what is effectively an unnecessarily complicated, single-threaded server.
If the client objects have no mutable state, good for you. You can be "free and easy" and share them between threads with no problems. If they do have mutable state, but initializing them is slow, then I would recommend caching one object per thread, so they are never shared. Thread[] is your friend there.
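In Java terms, that per-thread caching is a ThreadLocal; a minimal sketch (ExpensiveClient is an invented stand-in for your client handle):

// One lazily-built client per thread: the (possibly mutable) client is
// never shared, so no locking is needed when using it.
class ClientCache {
    private static final ThreadLocal<ExpensiveClient> CLIENT =
            ThreadLocal.withInitial(ExpensiveClient::new);

    static ExpensiveClient client() { return CLIENT.get(); }
}

class ExpensiveClient { /* slow to construct, not thread-safe */ }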
