Haskell evaluation synchronisation between threads

I'm trying to understand how GHC Haskell synchronises the computation of "basic" values (i.e. not IORef, TVar, etc.) between threads. I have searched for information about this but haven't found anything clear.
Take the following example program:
import Control.Concurrent

expensiveFunction x = sum [1..x] -- Just an example
val = expensiveFunction 12345
thread1 = print val
thread2 = print val
main = do
    forkOS thread1
    forkOS thread2
I understand that the value val will initially be represented by an unevaluated closure. In order to print val, the program must first evaluate it. Once a toplevel binding has been evaluated it should not need to be evaluated again.
Is the representation for "val" even shared by separate threads?
If for some reason thread1 completes evaluation first, can it convey the final computed value to thread2 by swapping out the pointer? How would that be synchronised?
If thread1 is busy evaluating when thread2 wants the value, does thread2 wait for it to finish or do they both race to evaluate it first?

In GHC-compiled programs, values go through three(-ish) phases of evaluation:
1. Thunk. This is where they start.
2. Black hole. When forced, a thunk is converted to a black hole and computation begins. Other threads that request the value of a black hole will instead add themselves to a notification list for when the black hole is updated. (Also, if the thunk itself tries to access the black hole, it will short-circuit to an exception instead of waiting forever.)
3. Evaluated. When the computation finishes, its last task is to update the black hole to a plain value (well, WHNF value, anyway).
The pointer that is getting updated during these phase transitions is shared with other threads and not protected from race conditions. This means that, very rarely, it is possible for two (or more) threads to both see a pointer in phase 1 and for both to execute the 1 -> 2 transition; in that case, both will evaluate the thunk, and the transition 2 -> 3 will also happen twice. Notably, though, the 1 -> 2 transition is typically much faster than the computation it is replacing (essentially just a memory access or two), in part exactly so that the race is difficult to trigger.
Because the language is pure, the racing threads will come to the same answer. So there is no semantic difficulty here. But in some rare cases, a little bit of work may be duplicated. It is very, very rare that the overhead of a lock on every 1 -> 2 transition would be better than this slight duplication. (If you find it is in your case, consider manually protecting the evaluation of whichever expensive thing is being shared!)
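The sharing described above can be observed with a small sketch, assuming GHC and only the base library. Two threads force the same top-level thunk, and Debug.Trace.trace reports each time the thunk body actually runs. The names sharedVal and forceInThread are made up for illustration. Normally the message is printed once; it would appear twice only if the rare race described above were hit.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Debug.Trace (trace)

-- A shared top-level thunk. NOINLINE keeps a single copy of it, and the
-- trace message fires every time the body is actually evaluated.
{-# NOINLINE sharedVal #-}
sharedVal :: Integer
sharedVal = trace "evaluating sharedVal" (sum [1 .. 100000])

-- Fork a thread that forces sharedVal to WHNF and hands the result back.
forceInThread :: IO (MVar Integer)
forceInThread = do
  done <- newEmptyMVar
  _ <- forkIO (evaluate sharedVal >>= putMVar done)
  pure done

main :: IO ()
main = do
  d1 <- forceInThread
  d2 <- forceInThread
  r1 <- takeMVar d1
  r2 <- takeMVar d2
  -- Both threads see the same value, whichever of them evaluated it.
  print (r1, r1 == r2)
```

Whichever thread wins, both results are equal, which is the "no semantic difficulty" point made above.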
Corollary: great care must be taken with the unsafe IO a -> a family of functions; some guarantee synchronization of the evaluation of the resulting a and some don't. If your IO a action is not as pure as you promised it is, and a race causes it to be executed twice, all manner of strange heisenbugs can occur.
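As an illustration of that corollary, here is a minimal sketch of the common "global variable" idiom, assuming only base. As far as I understand the base documentation, unsafePerformIO takes steps to stop two threads from running the same action at once, whereas unsafeDupablePerformIO skips that check, so the former is the one used here. The names globalCounter and nextId are hypothetical.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef)
import System.IO.Unsafe (unsafePerformIO)

-- One process-wide counter. NOINLINE matters: if the definition were
-- inlined, each use site could create its own IORef.
{-# NOINLINE globalCounter #-}
globalCounter :: IORef Int
globalCounter = unsafePerformIO (newIORef 0)

-- Hand out consecutive identifiers; the update itself is atomic.
nextId :: IO Int
nextId = atomicModifyIORef' globalCounter (\n -> (n + 1, n))

main :: IO ()
main = do
  a <- nextId
  b <- nextId
  print (a, b) -- prints (0,1)
```

If newIORef were run twice here because of a duplicated evaluation, two threads could end up with different counters, which is exactly the kind of heisenbug the corollary warns about.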

Related

Weak Reference Finalizer Guaranteed to Run

In The cost of weak pointers and finalizers in GHC, Edward Yang writes (emphasis added):
A weak pointer can also optionally be associated with a finalizer, which is run when the object is garbage collected. Haskell finalizers are not guaranteed to run.
I cannot find any documentation that corroborates this claim. The docs in System.Mem.Weak are not explicit about this. What I need to know is, given some primitive that has identity (MutVar#, MutableArray#, Array#, etc.), if I attach a finalizer to it, will it reliably be called when the value gets GCed?
The reason is that I'm considering doing something like this:
data OffHeapTree = OffHeapTree
    { ref :: IORef ()
    , nodeCount :: Int
    , nodeArray :: Ptr Node
    }
data Node = Node
    { childrenArray :: Ptr Node
    , childrenCount :: Int
    , value :: Int
    }
I want to make sure that I free the array (and everything the array points to) when an OffHeapTree goes out of scope. Otherwise, it would leak memory. So, can this be reliably accomplished with mkWeakIORef or not?
"Haskell finalizers are not guaranteed to run" means that GC may not be performed (e.g. on program exit). But if GC is performed, then finalizers are executed.
Edit: For future readers: the statement above is not exactly correct. RTS spawns a separate thread to execute finalizers after GC. So the program may exit after GC is performed, but finalizers are not yet executed, see this comment.
That is true in theory anyway. In practice a finalizer may not be executed, e.g. when the RTS tries to execute a number of finalizers in a row and one of them throws an exception. So I'd not use finalizers unless it is unavoidable.
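Where the lifetime of the off-heap data can be tied to a scope, an explicit bracket avoids relying on finalizers at all. The following is a minimal sketch using only base, not a drop-in replacement for the OffHeapTree design: withOffHeapInts and sumOffHeap are hypothetical names, and a flat array of Int stands in for the node array from the question.

```haskell
import Control.Exception (bracket)
import Foreign.Marshal.Alloc (free)
import Foreign.Marshal.Array (mallocArray, peekArray, pokeArray)
import Foreign.Ptr (Ptr)

-- Allocate an off-heap array, fill it, run the action, and always free
-- the array afterwards, even if the action throws.
withOffHeapInts :: [Int] -> (Ptr Int -> Int -> IO r) -> IO r
withOffHeapInts xs use =
  bracket (mallocArray n) free $ \ptr -> do
    pokeArray ptr xs
    use ptr n
  where
    n = length xs

-- Example consumer: read the array back and sum it.
sumOffHeap :: [Int] -> IO Int
sumOffHeap xs = withOffHeapInts xs $ \ptr n -> sum <$> peekArray n ptr

main :: IO ()
main = sumOffHeap [1 .. 10] >>= print -- prints 55
```

The trade-off is that the pointer must not escape the callback, which is the price of deterministic cleanup.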

Reasoning about IORef operation reordering in concurrent programs

The docs say:
In a concurrent program, IORef operations may appear out-of-order to
another thread, depending on the memory model of the underlying
processor architecture...The implementation is required to ensure that
reordering of memory operations cannot cause type-correct code to go
wrong. In particular, when inspecting the value read from an IORef,
the memory writes that created that value must have occurred from the
point of view of the current thread.
Which I'm not even entirely sure how to parse. Edward Yang says
In other words, “We give no guarantees about reordering, except that
you will not have any type-safety violations.” ...
the last sentence remarks that an IORef is not allowed to point to
uninitialized memory
So... it won't break the whole haskell; not very helpful. The discussion from which the memory model example arose also left me with questions (even Simon Marlow seemed a bit surprised).
Things that seem clear to me from the documentation
within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered
What isn't clear to me
Will newIORef False >>= \v-> writeIORef v True >> readIORef v always return True?
In the maybePrint case (from the IORef docs) would a readIORef myRef (along with maybe a seq or something) before readIORef yourRef have forced a barrier to reordering?
What's the straightforward mental model I should have? Is it something like:
within and from the point of view of an individual thread, the
ordering of IORef operations will appear sane and sequential; but the
compiler may actually reorder operations in such a way that break
certain assumptions in a concurrent system; however when a thread does
atomicModifyIORef, no threads will observe operations on that
IORef that appeared above the atomicModifyIORef to happen after,
and vice versa.
...? If not, what's the corrected version of the above?
If your response is "don't use IORef in concurrent code; use TVar" please convince me with specific facts and concrete examples of the kind of things you can't reason about with IORef.
I don't know Haskell concurrency, but I know something about memory models.
Processors can reorder instructions the way they like: loads may go ahead of loads, loads may go ahead of stores, loads of dependent stuff may go ahead of loads of stuff they depend on (a[i] may load the value from array first, then the reference to array a!), stores may be reordered with each other. You simply cannot put a finger on it and say "these two things definitely appear in a particular order, because there is no way they can be reordered". But in order for concurrent algorithms to operate, they need to observe the state of other threads. This is where it is important for thread state to proceed in a particular order. This is achieved by placing barriers between instructions, which guarantee the order of instructions to appear the same to all processors.
Typically (one of the simplest models), you want two types of ordered instructions: ordered load that does not go ahead of any other ordered loads or stores, and ordered store that does not go ahead of any instructions at all, and a guarantee that all ordered instructions appear in the same order to all processors. This way you can reason about IRIW kind of problem:
Thread 1: x=1
Thread 2: y=1
Thread 3: r1=x;
          r2=y;
Thread 4: r4=y;
          r3=x;
If all of these operations are ordered loads and ordered stores, then you can conclude the outcome (1,0,0,1)=(r1,r2,r3,r4) is not possible. Indeed, ordered stores in Threads 1 and 2 should appear in some order to all threads, and r1=1,r2=0 is witness that y=1 is executed after x=1. In its turn, this means that Thread 4 can never observe r4=1 without observing r3=1 (which is executed after r4=1) (if the ordered stores happen to be executed that way, observing y==1 implies x==1).
Also, if the loads and stores were not ordered, the processors would usually be allowed to observe the assignments to appear even in different orders: one might see x=1 appear before y=1, the other might see y=1 appear before x=1, so any combination of values r1,r2,r3,r4 is permitted.
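The claim for the fully ordered case can be checked mechanically. The sketch below, using only base, enumerates every interleaving of the four threads that respects each thread's program order (that is, it assumes all loads and stores are ordered and seen in one global order) and confirms that (1,0,0,1) never occurs. The types Var and Op and the function names are invented for this illustration.

```haskell
data Var = X | Y deriving (Eq)

-- Store sets a variable to 1; Load reads a variable into register r1..r4.
data Op = Store Var | Load Var Int

-- All interleavings that preserve the order of operations within a thread.
interleavings :: [[a]] -> [[a]]
interleavings threads
  | all null threads = [[]]
  | otherwise =
      [ op : rest
      | i <- [0 .. length threads - 1]
      , (before, (op : ops) : after) <- [splitAt i threads]
      , rest <- interleavings (before ++ ops : after)
      ]

-- Execute one interleaving; the result is [r1, r2, r3, r4].
run :: [Op] -> [Int]
run = go [] []
  where
    go _ regs [] = [maybe 0 id (lookup r regs) | r <- [1 .. 4]]
    go stored regs (Store v : ops) = go (v : stored) regs ops
    go stored regs (Load v r : ops) =
      go stored ((r, fromEnum (v `elem` stored)) : regs) ops

-- The four threads of the IRIW example above.
program :: [[Op]]
program =
  [ [Store X]
  , [Store Y]
  , [Load X 1, Load Y 2]
  , [Load Y 4, Load X 3]
  ]

outcomes :: [[Int]]
outcomes = map run (interleavings program)

main :: IO ()
main = print ([1, 0, 0, 1] `elem` outcomes) -- prints False
```

This only models the sequentially consistent case; it says nothing about what weaker hardware may allow once the operations are not ordered.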
This is sufficiently implemented like so:
ordered load:
    load x
    load-load   -- barriers stopping other loads from going ahead of preceding loads
    load-store  -- no one is allowed to go ahead of ordered load
ordered store:
    load-store
    store-store -- ordered store must appear after all stores
                -- preceding it in program order - serialize all stores
                -- (flush write buffers)
    store x,v
    store-load  -- ordered loads must not go ahead of ordered store
                -- preceding them in program order
Of these two, I can see IORef implements an ordered store (atomicWriteIORef), but I don't see an ordered load (atomicReadIORef), without which you cannot reason about the IRIW problem above. This is not a problem if your target platform is x86, because all loads will be executed in program order on that platform, and stores never go ahead of loads (in effect, all loads are ordered loads).
An atomic update (atomicModifyIORef) seems to me to be an implementation of a so-called CAS loop (a compare-and-set loop, which does not stop until a value is atomically set to b, if its value is a). You can see the atomic modify operation as a fusion of an ordered load and an ordered store, with all those barriers there, and executed atomically: no processor is allowed to insert a modification instruction between the load and the store of a CAS.
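To see what that atomicity buys in practice, here is a small sketch using only base: several threads increment one IORef through atomicModifyIORef', and no increment is lost. The function name countConcurrently is hypothetical.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Start `threads` workers that each increment a shared counter
-- `perThread` times, wait for all of them, and return the final count.
countConcurrently :: Int -> Int -> IO Int
countConcurrently threads perThread = do
  counter <- newIORef (0 :: Int)
  done <- newEmptyMVar
  forM_ [1 .. threads] $ \_ ->
    forkIO $ do
      -- Each update is one atomic read-modify-write.
      replicateM_ perThread (atomicModifyIORef' counter (\n -> (n + 1, ())))
      putMVar done ()
  -- Every worker signals exactly once when it has finished.
  replicateM_ threads (takeMVar done)
  readIORef counter

main :: IO ()
main = countConcurrently 4 1000 >>= print -- prints 4000
```

Replacing the atomic update with a separate readIORef and writeIORef would allow two threads to read the same old value, so the total could come up short.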
Furthermore, writeIORef is cheaper than atomicWriteIORef, so you want to use writeIORef as much as your inter-thread communication protocol permits. Whereas writeIORef x vx >> writeIORef y vy >> atomicWriteIORef z vz >> readIORef t does not guarantee the order in which the writeIORefs appear to other threads with respect to each other, there is a guarantee that they both will appear before atomicWriteIORef. So, seeing z==vz, you can conclude that at this moment x==vx and y==vy, and you can conclude that IORef t was loaded after the stores to x, y, z can be observed by other threads. This latter point requires readIORef to be an ordered load, which is not provided as far as I can tell, but it will work like an ordered load on x86.
Typically you don't use concrete values of x, y, z, when reasoning about the algorithm. Instead, some algorithm-dependent invariants about the assigned values must hold, and can be proven - for example, like in IRIW case you can guarantee that Thread 4 will never see (0,1)=(r3,r4), if Thread 3 sees (1,0)=(r1,r2), and Thread 3 can take advantage of this: this means something is mutually excluded without acquiring any mutex or lock.
Here is an example (not in Haskell) that will not work if loads are not ordered, or if ordered stores do not flush write buffers (the requirement to make written values visible before the ordered load executes).
Suppose z should show x until y is computed, and y once both have been computed. Don't ask why; it is not very easy to see outside the context (it is a kind of queue). Just enjoy what sort of reasoning is possible.
Thread 1: x=1;
          if (z==0) compareAndSet(z, 0, y == 0? x: y);
Thread 2: y=2;
          if (x != 0) while ((tmp=z) != y && !compareAndSet(z, tmp, y));
So, two threads set x and y, then set z to x or y, depending on whether y or x were computed, too. Assuming initially all are 0. Translating into loads and stores:
Thread 1: store x,1
          load z
          if ==0 then
              load y
              if == 0 then load x -- if loaded y is still 0, load x into tmp
              else load y         -- otherwise, load y into tmp
              CAS z, 0, tmp       -- CAS whatever was loaded in the previous if-statement
                                  -- the CAS may fail, but see explanation
Thread 2: store y,2
          load x
          if !=0 then
          loop: load z                  -- into tmp
                load y
                if !=tmp then           -- compare loaded y to tmp
                    CAS z, tmp, y       -- attempt to CAS z: if it is still tmp, set to y
                    if ! then goto loop -- if CAS did not succeed, go to loop
If Thread 1's load z is not an ordered load, then it will be allowed to go ahead of an ordered store (store x). It means that wherever z is loaded to (a register, cache line, stack,...), the value is one that existed before the value of x could be visible. Looking at that value is useless: you cannot then judge where Thread 2 is up to. For the same reason you've got to have a guarantee that the write buffers were flushed before load z executed; otherwise it will still appear as a load of a value that existed before Thread 2 could see the value of x. This is important, as will become clear below.
If Thread 2 load x or load z are not ordered loads, they may go ahead of store y, and will observe the values that were written before y is visible to other threads.
However, see that if the loads and stores are ordered, then the threads can negotiate who is to set the value of z without contending z. For example, if Thread 2 observes x==0, there is guarantee that Thread 1 will definitely execute x=1 later, and will see z==0 after that - so Thread 2 can leave without attempting to set z.
If Thread 1 observes z==0, then it should try to set z to x or y. So, first it will check if y has been set already. If it wasn't set, it will be set in the future, so try to set to x - CAS may fail, but only if Thread 2 concurrently set z to y, so no need to retry. Similarly there is no need to retry if Thread 1 observed y has been set: if CAS fails, then it has been set by Thread 2 to y. Thus we can see Thread 1 sets z to x or y in accordance with the requirement, and does not contend z too much.
On the other hand, Thread 2 can check if x has been computed already. If not, then it will be Thread 1's job to set z. If Thread 1 has computed x, then need to set z to y. Here we do need a CAS loop, because a single CAS may fail, if Thread 1 is attempting to set z to x or y concurrently.
The important takeaway here is that if "unrelated" loads and stores are not serialized (including flushing write buffers), no such reasoning is possible. However, once loads and stores are ordered, both threads can figure out the path each of them _will_take_in_the_future, and that way eliminate contention in half the cases. Most of the time x and y will be computed at significantly different times, so if y is computed before x, it is likely Thread 2 will not touch z at all. (Typically, "touching z" also possibly means "wake up a thread waiting on a cond_var z", so it is not only a matter of loading something from memory)
within a thread an atomicModifyIORef "is never observed to take place
ahead of any earlier IORef operations, or after any later IORef
operations" i.e. we get a partial ordering of: stuff above the atomic
mod -> atomic mod -> stuff after. Although, the wording "is never
observed" here is suggestive of spooky behavior that I haven't
anticipated.
"is never observed" is standard language when discussing memory reordering issues. For example, a CPU may issue a speculative read of a memory location earlier than necessary, so long as the value doesn't change between when the read is executed (early) and when the read should have been executed (in program order). That's entirely up to the CPU and cache though, it's never exposed to the programmer (hence language like "is never observed").
A readIORef x might be moved before writeIORef y, at least when there
are no data dependencies
True
Logically I don't see how something like readIORef x >>= writeIORef y
could be reordered
Correct, as that sequence has a data dependency. The value to be written depends upon the value returned from the first read.
For the other questions: newIORef False >>= \v-> writeIORef v True >> readIORef v will always return True (there's no opportunity for other threads to access the ref here).
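That single-threaded case is easy to try out. A minimal sketch using only base (writeThenRead is just a name for the expression from the question):

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)

-- The expression from the question: the ref is created, written and read
-- by the same thread, and no other thread ever gets hold of it.
writeThenRead :: IO Bool
writeThenRead = newIORef False >>= \v -> writeIORef v True >> readIORef v

main :: IO ()
main = writeThenRead >>= print -- prints True
```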
In the maybePrint example, there's very little you can do to ensure this works reliably in the face of new optimizations added to future GHCs and across various CPU architectures. If you write:
writeIORef myRef True
x <- readIORef myRef
yourVal <- x `seq` readIORef yourRef
Even though GHC 7.6.3 produces correct cmm (and presumably asm, although I didn't check), there's nothing to stop a CPU with a relaxed memory model from moving the readIORef yourRef to before all of the myRef/seq stuff. The only 100% reliable way to prevent it is with a memory fence, which GHC doesn't provide. (Edward's blog post does go through some of the other things you can do now, as well as why you may not want to rely on them).
I think your mental model is correct, however it's important to know that the possible apparent reorderings introduced by concurrent ops can be really unintuitive.
Edit: at the cmm level, the code snippet above looks like this (simplified, pseudocode):
[StackPtr+offset] := True
x := [StackPtr+offset]
if (notEvaluated x) (evaluate x)
yourVal := [StackPtr+offset2]
So there are a couple things that can happen. GHC as it currently stands is unlikely to move the last line any earlier, but I think it could if doing so seemed more optimal. I'm more concerned that, if you compile via LLVM, the LLVM optimizer might replace the second line with the value that was just written, and then the third line might be constant-folded out of existence, which would make it more likely that the read could be moved earlier. And regardless of what GHC does, most CPU memory models allow the CPU itself to move the read earlier absent a memory barrier.
See http://en.wikipedia.org/wiki/Memory_ordering for non-atomic concurrent reads and writes. (Basically, when you don't use atomics, just look at the memory ordering model for your target CPU.)
Currently GHC can be regarded as not reordering your reads and writes for non-atomic (and imperative) loads and stores. However, GHC Haskell currently doesn't specify any sort of concurrent memory model, so those non-atomic operations will have the ordering semantics of the underlying CPU model, as I link to above.
In other words, GHC currently has no formal concurrency memory model, and because optimization algorithms tend to be defined with respect to some model of equivalence, there's no reordering currently in play there.
That is: the only semantic model you can have right now is "the way it's implemented".
Shoot me an email! I'm working on patching up some atomics for 7.10; let's try to cook up some semantics!
Edit: some folks who understand this problem better than me chimed in on the ghc-users thread here: http://www.haskell.org/pipermail/glasgow-haskell-users/2013-December/024473.html .
Assume that I'm wrong in both this comment and anything I said in the ghc-users thread :)

What are the C++11 memory ordering guarantees in this corner case?

I'm writing some lock-free code, and I came up with an interesting pattern, but I'm not sure if it will behave as expected under relaxed memory ordering.
The simplest way to explain it is using an example:
std::atomic<int> a, b, c;
auto a_local = a.load(std::memory_order_relaxed);
auto b_local = b.load(std::memory_order_relaxed);
if (a_local < b_local) {
    auto c_local = c.fetch_add(1, std::memory_order_relaxed);
}
Note that all operations use std::memory_order_relaxed.
Obviously, on the thread that this is executed on, the loads for a and b must be done before the if condition is evaluated.
Similarly, the read-modify-write (RMW) operation on c must be done after the condition is evaluated (because it's conditional on that... condition).
What I want to know is, does this code guarantee that the value of c_local is at least as up-to-date as the values of a_local and b_local? If so, how is this possible given the relaxed memory ordering? Is the control dependency together with the RMW operation acting as some sort of acquire fence? (Note that there's not even a corresponding release anywhere.)
If the above holds true, I believe this example should also work (assuming no overflow) -- am I right?
std::atomic<int> a(0), b(0);
// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    if (a_local >= 0) { // Always true at runtime
        b.fetch_add(1, std::memory_order_relaxed);
    }
}
// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
if (b_local < 777) {
    // Note that fetch_add returns the pre-incrementation value
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    assert(b_local <= a_local); // Is this guaranteed?
}
On thread 1, there is a control dependency which I suspect guarantees that a is always incremented before b is incremented (but they each keep being incremented neck-and-neck). On thread 2, there is another control dependency which I suspect guarantees that b is loaded into b_local before a is incremented. I also think that the value returned from fetch_add will be at least as recent as any observed value in b_local, and the assert should therefore hold. But I'm not sure, since this departs significantly from the usual memory-ordering examples, and my understanding of the C++11 memory model is not perfect (I have trouble reasoning about these memory ordering effects with any degree of certainty). Any insights would be appreciated!
Update: As bames53 has helpfully pointed out in the comments, given a sufficiently smart compiler, it's possible that an if could be optimised out entirely under the right circumstances, in which case the relaxed loads could be reordered to occur after the RMW, causing their values to be more up-to-date than the fetch_add return value (the assert could fire in my second example). However, what if instead of an if, an atomic_signal_fence (not atomic_thread_fence) is inserted? That certainly can't be ignored by the compiler no matter what optimizations are done, but does it ensure that the code behaves as expected? Is the CPU allowed to do any re-ordering in such a case?
The second example then becomes:
std::atomic<int> a(0), b(0);
// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acq_rel);
    b.fetch_add(1, std::memory_order_relaxed);
}
// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_acq_rel);
// Note that fetch_add returns the pre-incrementation value
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
assert(b_local <= a_local); // Is this guaranteed?
Another update: After reading all the responses so far and combing through the standard myself, I don't think it can be shown that the code is correct using only the standard. So, can anyone come up with a counter-example of a theoretical system that complies with the standard and also fires the assert?
Signal fences don't provide the necessary guarantees (well, not unless 'thread 2' is a signal handler that actually runs on 'thread 1').
To guarantee correct behavior we need synchronization between threads, and the fence that does that is std::atomic_thread_fence.
Let's label the statements so we can diagram various executions (with thread fences replacing signal fences, as required):
// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed); // A
    std::atomic_thread_fence(std::memory_order_acq_rel); // B
    b.fetch_add(1, std::memory_order_relaxed); // C
}
// Thread 2
auto b_local = b.load(std::memory_order_relaxed); // X
std::atomic_thread_fence(std::memory_order_acq_rel); // Y
auto a_local = a.fetch_add(1, std::memory_order_relaxed); // Z
So first let's assume that X loads a value written by C. The following paragraph specifies that in that case the fences synchronize and a happens-before relationship is established.
29.8/2:
A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
And here's a possible execution order where the arrows are happens-before relations.
Thread 1: A₁ → B₁ → C₁ → A₂ → B₂ → C₂ → ...
                 ↘
Thread 2:      X → Y → Z
If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M. — [C++11 1.10/18]
So the load at Z must take its value from A₁ or from a subsequent modification. Therefore the assert holds because the value written at A₁ and at all later modifications is greater than or equal to the value written at C₁ (and read by X).
Now let's look at the case where the fences do not synchronize. This happens when the load of b does not load a value written by thread 1, but instead reads the value that b is initialized with. There's still synchronization where the thread starts, though:
30.3.1.2/5
Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.
This is specifying the behavior of std::thread's constructor. So (assuming the thread creation is correctly sequenced after the initialization of a) the value read by Z must take its value from the initialization of a or from one of the subsequent modifications on thread 1, which means that the assertion still holds.
This example gets at a variation of reads-from-thin-air-like behavior. The relevant discussion in the spec is in section 29.3p9-11. Since the current version of the C/C++11 standard doesn't guarantee that dependences are respected, the memory model should allow the assertion to fire. The most likely situation is that the compiler optimizes away the check that a_local >= 0. But even if you replace that check with a signal fence, CPUs would be free to reorder those instructions.
You can test such code examples under the C/C++11 memory models using the open source CDSChecker tool.
The interesting issue with your example is that for an execution to violate the assertion, there has to be a cycle of dependences. More concretely:
The b.fetch_add in thread 1 depends on the a.fetch_add in the same loop iteration due to the if condition. The a.fetch_add in thread 2 depends on b.load. For an assertion violation, T2's b.load has to read from a b.fetch_add in a later loop iteration than T2's a.fetch_add. Now consider the b.fetch_add that the b.load reads from and call it # for future reference. We know that b.load depends on # as it takes its value from #.
We know that # must depend on T2's a.fetch_add, as T2's a.fetch_add atomically reads and updates a prior a.fetch_add from T1 in the same loop iteration as #. So we know that # depends on the a.fetch_add in thread 2. That gives us a cycle in dependences, which is plain weird but allowed by the C/C++ memory model. The most likely way of actually producing that cycle is that the compiler figures out that a_local is always greater than or equal to 0, breaking the dependence. It can then do loop unrolling and reorder T1's fetch_add however it wants.
After reading all the responses so far and combing through the
standard myself, I don't think it can be shown that the code is
correct using only the standard.
And unless you admit that non-atomic operations are magically safer and more ordered than relaxed atomic operations (which is silly), and that there is one semantics of C++ without atomics (and try_lock and shared_ptr::count) and another semantics for those features that don't execute sequentially, you also have to admit that no program at all can be proven correct, as the non-atomic operations don't have an "ordering" and they are needed to construct and destroy variables.
Or, you stop taking the standard text as the only word on the language and use some common sense, which is always recommended.

Why is threading dangerous?

I've always been told to put locks around variables that multiple threads will access. I've always assumed that this was because you want to make sure that the value you are working with doesn't change before you write it back,
i.e.
mutex.lock()
int a = sharedVar
a = someComplexOperation(a)
sharedVar = a
mutex.unlock()
And that makes sense that you would lock that. But in other cases I don't understand why I can't get away with not using Mutexes.
Thread A:
sharedVar = someFunction()
Thread B:
localVar = sharedVar
What could possibly go wrong in this instance? Especially if I don't care that Thread B reads any particular value that Thread A assigns.
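For comparison, the locked read-modify-write from the first snippet can be sketched in Haskell, where an MVar plays the role of both the mutex and the shared variable. This uses only base; runWorkers is a hypothetical name and someComplexOperation is a stand-in that just adds one.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
  (modifyMVar_, newEmptyMVar, newMVar, putMVar, readMVar, takeMVar)
import Control.Monad (forM_, replicateM_)

-- Stand-in for the expensive update from the question.
someComplexOperation :: Int -> Int
someComplexOperation = (+ 1)

-- Each worker repeatedly takes the value out of the MVar (lock), updates
-- it, and puts it back (unlock); modifyMVar_ does all three steps.
runWorkers :: Int -> Int -> IO Int
runWorkers workers steps = do
  sharedVar <- newMVar 0
  done <- newEmptyMVar
  forM_ [1 .. workers] $ \_ ->
    forkIO $ do
      replicateM_ steps $
        modifyMVar_ sharedVar (\a -> pure $! someComplexOperation a)
      putMVar done ()
  replicateM_ workers (takeMVar done)
  readMVar sharedVar

main :: IO ()
main = runWorkers 4 1000 >>= print -- prints 4000
```

While one thread holds the value, the MVar is empty, so no other thread can read a stale copy and overwrite the update.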
It depends a lot on the type of sharedVar, the language you're using, any framework, and the platform. In many cases, it's possible that assigning a single value to sharedVar may take more than one instruction, in which case you may read a "half-set" copy of the value.
Even when that's not the case, and the assignment is atomic, you may not see the latest value without a memory barrier in place.
MSDN Magazine has a good explanation of different problems you may encounter in multithreaded code:
Forgotten Synchronization
Incorrect Granularity
Read and Write Tearing
Lock-Free Reordering
Lock Convoys
Two-Step Dance
Priority Inversion
The code in your question is particularly vulnerable to Read/Write Tearing. But your code, having neither locks nor memory barriers, is also subject to Lock-Free Reordering (which may include speculative writes in which thread B reads a value that thread A never stored) in which side-effects become visible to a second thread in a different order from how they appeared in your source code.
It goes on to describe some known design patterns which avoid these problems:
Immutability
Purity
Isolation
The article is available here
The main problem is that the assignment operator (operator= in C++) is not always guaranteed to be atomic (not even for primitive, built-in types). In plain English, that means an assignment may be carried out in several steps rather than as one indivisible operation. If the thread gets interrupted in the middle of those steps, another thread may see a corrupted value of the variable.
Let me build off of your example:
Lets say sharedVar is some object with operator= defined as this:
object& operator=(const object& other) {
    ready = false;
    doStuff(other);
    if (other.value == true) {
        value = true;
        doOtherStuff();
    } else {
        value = false;
    }
    ready = true;
    return *this;
}
If thread A from your example is interrupted in the middle of this function, ready will still be false when thread B starts to run. This could mean that the object is only partially copied over, or is in some intermediate, invalid state when thread B attempts to copy it into a local variable.
For a particularly nasty example of this, think of a data structure with a removed node being deleted, then interrupted before it could be set to NULL.
(For some more information regarding structures that don't need a lock (aka, are atomic), here is another question that talks a bit more about that.)
This could go wrong, because threads can be suspended and resumed by the thread scheduler, so you can't be sure about the order these instructions are executed. It might just as well be in this order:
Thread B:
localVar = sharedVar
Thread A:
sharedVar = someFunction()
In which case localVar will be null or 0 (or some completely unexpected value in an unsafe language), probably not what you intended.
Mutexes actually won't fix this particular issue, by the way. The example you supply does not lend itself well to parallelization.

Is the random number generator in Haskell thread-safe?

Is the same "global random number generator" shared across all threads, or does each thread get its own?
If one is shared, how can I ensure thread-safety? The approach using getStdGen and setStdGen described in the "Monads" chapter of Real World Haskell doesn't look safe.
If each thread has an independent generator, will the generators for two threads started in rapid succession have different seeds? (They won't, for example, if the seed is a time in seconds, but milliseconds might be OK. I don't see how to get a time with millisecond resolution from Data.Time.)
There is a function named newStdGen, which gives one a new std. gen every time it's called. Its implementation uses atomicModifyIORef and thus is thread-safe.
newStdGen is better than get/setStdGen not only in terms of thread-safety, but it also guards you from potential single-threaded bugs like this: let rnd = (fst . randomR (1,5)) <$> getStdGen in (==) <$> rnd <*> rnd.
In addition, if you think about the semantics of newStdGen vs getStdGen/setStdGen, the first ones can be very simple: you just get a new std. gen in a random state, chosen non-deterministically. On the other hand, with the get/set pair you can't abstract away the global program state, which is bad for multiple reasons.
I would suggest using getStdGen only once (in the main thread) and then using the split function to generate new generators. I would do it like this:
Make an MVar that contains the generator. Whenever a thread needs a new generator, it takes the current value out of the MVar, calls split and puts the new generator back. Due to the functionality of an MVar, this should be threadsafe.
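A minimal sketch of that MVar approach, assuming the random package (System.Random) is installed; freshGen and rollDie are hypothetical names. Depending on the version of random, split may carry a deprecation warning.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar, newMVar)
import System.Random (StdGen, getStdGen, randomR, split)

-- Atomically split the stored generator: one half stays in the MVar,
-- the other half is handed to the caller.
freshGen :: MVar StdGen -> IO StdGen
freshGen genVar = modifyMVar genVar (pure . split)

-- Example use of a generator obtained this way.
rollDie :: StdGen -> Int
rollDie = fst . randomR (1, 6)

main :: IO ()
main = do
  genVar <- getStdGen >>= newMVar -- getStdGen is called only once
  g1 <- freshGen genVar
  g2 <- freshGen genVar
  print (rollDie g1, rollDie g2)
```

Because modifyMVar takes the generator out of the MVar while splitting, two threads calling freshGen at the same time cannot receive the same generator.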
By themselves, getStdGen and setStdGen are not thread safe in a certain sense. Suppose the two threads both perform this action:
do ...
   g <- getStdGen
   (v, g') <- someRandOperation g
   setStdGen g'
It is possible for the threads to both run the g <- getStdGen line before the other thread reaches setStdGen, therefore they both could get the exact same generator. (Am I wrong?)
If they both grab the same version of the generator, and use it in the same function, they will get the same "random" result. So you do need to be a little more careful when dealing with random number generation and multithreading. There are many solutions; one that comes to mind is to have a single dedicated random number generator thread that produces a stream of random numbers which other threads could consume in a thread-safe way. Putting the generator in an MVar, as FUZxxl suggests, is probably the simplest and most straightforward solution.
Of course I would encourage you to inspect your code and make sure it is necessary to generate random numbers in more than one thread.
You can use split as in FUZxxl's answer. However, instead of using an MVar, whenever you call forkIO, just have your IO action for the forked thread close over one of the resulting generators, and leave the other one with the original thread. This way each thread has its own generator.
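That variant might look roughly as follows, again assuming the random package is available; forkWithGen is a hypothetical helper, not part of any library.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import System.Random (StdGen, getStdGen, randomRs, split)

-- Split the generator, give one half to the forked thread, and return
-- the other half so the parent keeps a generator of its own.
forkWithGen :: StdGen -> (StdGen -> IO ()) -> IO StdGen
forkWithGen gen worker = do
  let (childGen, parentGen) = split gen
  _ <- forkIO (worker childGen)
  pure parentGen

main :: IO ()
main = do
  out <- newEmptyMVar
  let report g = putMVar out (take 3 (randomRs (1, 6 :: Int) g))
  gen0 <- getStdGen
  gen1 <- forkWithGen gen0 report
  _ <- forkWithGen gen1 report
  a <- takeMVar out
  b <- takeMVar out
  print (a, b)
```

No shared state is needed after the split, so the threads never contend for a generator.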
As Dan Burton said, do inspect your code and see if you really need RNG in multiple threads.
