Understanding sequential consistency

Understanding sequential consistency - multithreading

Assume there are 2 threads performing operations on a shared queue q. The lines of code for each thread are numbered, and initially the queue is empty.
Thread A:
A1) q.enq(x)
A2) q.deq()
Thread B:
B1) q.enq(y)
Assume that the order of execution is as follows:
A1) q.enq(x)
B1) q.enq(y)
A2) q.deq()
and as a result we get y (i.e. q.deq() returns y)
This execution is based on a well-known book and is said to be sequentially consistent. Notice that the method calls don’t overlap. How is that even possible? I believe that Thread A executed A1 without actually updating the queue until it proceeded to line A2 but that's just my guess. I'm even more confused if I look at this explanation from The Java Language Specification:
Sequential consistency is a very strong guarantee that is made about visibility and ordering in an execution of a program. Within a sequentially consistent execution, there is a total order over all individual actions (such as reads and writes) which is consistent with the order of the program, and each individual action is atomic and is immediately visible to every thread.
If that was the case, we would have dequeue x.
I'm sure I'm somehow wrong. Could somebody throw a light on this?

Note that the definition of sequential consistency says "consistent with program order", not "consistent with the order in which the program happens to be executed".
It goes on to say:
If a program has no data races, then all executions of the program will appear to be sequentially consistent.
(my emphasis of "appear").
Java's memory model does not enforce sequential consistency. As the JLS says:
If we were to use sequential consistency as our memory model, many of the compiler and processor optimizations that we have discussed would be illegal. For example, in the trace in Table 17.3, as soon as the write of 3 to p.x occurred, subsequent reads of that location would be required to see that value.
So Java's memory model doesn't actually support sequential consistency. Just the appearance of sequential consistency. And that only requires that there is some sequentially consistent order of actions that's consistent with program order.
Clearly there is some execution of threads A and B that could result in A2 returning y, specifically:
B1) q.enq(y)
A1) q.enq(x)
A2) q.deq()
So, even if the program happens to be executed in the order you specified, there is an order in which it could have been executed that is "consistent with program order" for which A2 returns y. Therefore, a program that returns y in that situation still gives the appearance of being sequentially consistent.
Note that this shouldn't be interpreted as saying that it would be illegal for A2 to return x, because there is a sequentially consistent sequence of operations that is consistent with program order that could give that result.
Note also that this appearance of sequential consistency only applies to correctly synchronized programs. If your program is not correctly synchronized (i.e. has data races) then all bets are off.

Related

What is thread synchronization and how does it differ form atomicity?

Atomicity can be achieved with machine level instructions such as compare and swap (CS).
It could also be achieved with the use of a mutex/lock for a large blocks of code with the OS providing help on it.
On the other hand we also have the concept of memory model. Some machines could have a relaxed model like Arm which could re-order load/stores on a single thread, and some have a more strict model like x86.
I want to confirm my understanding of the term synchronization. Is it pretty much the
promise of both atomicity and the memory model? i.e only using atomic ops on a thread doesn't necessary make it synchronized with other threads?

Something atomic is indivisible. Things that are synchronized are happening together in time.
Atomicity
I like to think of this like having a data structure representing a 2-dimensional point with x, y coordinates. For my purposes, in order for my data to be considered "valid" it must always be a point along the x = y line. x and y must always be the same.
Suppose that initially I have a point { x = 10, y = 10 } and I want to update my data structure so that it represents the point {x = 20, y = 20}. And suppose that the implementation of the update operation is basically these two separate steps:
x = 20
y = 20
If my implementation writes x and y separately like that, then some other thread could potentially observe my point data structure data after step 1 but before step 2. If it is allowed to read the value of the point after I change x but before I change y then that other observer might observe the value {x = 20, y = 10}.
In fact there are three values that could be observed
{x = 10, y = 10} (the original value) [VALID]
{x = 20, y = 10} (x is modified but y is not yet modified) [INVALID x != y]
{x = 20, y = 20} (both x and y are modified) [VALID]
I need a way of updating the two values together so that it is impossible for an outside observer observe {x = 20, y = 10}.
I don't really care when the other observer looks at the value of my point. It is fine it it observes { x = 10, y = 10 } and it is also fine if it observes { x = 20, y = 20 }. Both have the property of x == y, which makes them valid in my scenario.
Simplest atomic operation
The most simple atomic operation is a test and set of a single bit. This operation atomically reads a value of a bit and overwrites it with a 1, returning the state of the bit we overwrote. But we are offered the guarantee that if our operation has concluded then we have the value that we overwrote and any other observer will observe a 1. If many agents attempt this operation simultaneously, only one agent will return 0, and the others will all return 1. Even if it's two CPU's writing on the exact same clock tick, something in the electronics will guarantee that the operation is concluded logically atomically according to our rules.
That's it to logical atomicity. That's all atomic means. It means you have the capability of performing an uninterrupted update with valid data before and after the update and the data cannot be observed by another observer in any intermediate state it may take on during the update. It may be a single bit or it may be an entire database.
x86 Example
A good example of something that can be done on x86 atomically is the 32-bit interlocked increment.
Here a 32-bit (4-byte) value must be incremented by 1. This could potentially need to modify all 4 bytes for this to work correctly. If the value is to be modified from 0x000000FF to 0x00000100, it's important that the 0x00 becomes a 0x00 and the 0xFF becomes a 0x00 atomically. Otherwise I risk observing the value 0x00000000 (if the LSB is modified first) or 0x000001FF (if the MSB is modified first).
The hardware guarantees that we can test and modify 4 bytes at a time to achieve this. The CPU and memory provide a mechanism by which this operation can be performed even if there are other CPUs sharing the same memory. The CPU can assert a lock condition that prevents other CPUs from interfering with this interlocked operation.
Synchronization
Synchronization just talks about how things happen together in time. In the context you propose, it's about the order in which various sections of our program get executed and the order in which various components of our system change state. Without synchronization, we risk corruption (entering an invalid, semantically meaningless or incorrect state of execution of our program or its data)
Let's say we want to have an interlocked increment of a 64-bit number. Let's suppose that the hardware does not offer a way to atomically change 64-bits at a time. We will have to accomplish what we want with more complex data structure that means that even when just reading we can't simply read the most-significant 32 bits and the least-significant 32 bits of our 64-bit number separately. We'd risk observing one part of our 64-bit value changing separately from the other half. It means that we must adhere to some kind of protocol when reading (or writing) this 64-bit value.
To implement this, we need an atomic test and set bit operation and a clear bit operation. (FYI, technically, what we need are two operations commonly referred to as P and V in computer science, but let's keep it simple.) Before reading or writing our data, we perform an atomic test-and-set operation on a single (shared) bit (commonly referred to as a "lock"). If we read a zero, then we know we are the only one that saw a zero and everyone else must have seen a 1. If we see a 1, then we assume someone else is using our shared data, and therefore we have no choice but to just try again. So we loop and keep testing and setting the bit until we observe it as a 0. (This is called a spin lock, and is the best we can do without getting help from the operating system's scheduler.)
When we eventually see a 0, then we can safely read both 32-bit parts of our 64-bit value individually. Or, if we're writing, we can safely write both 32-bit parts of our 64-bit value individually. Once both halves have been read or written, we clear the bit back to 0, permitting access by someone else.
Any such combination of cleverness and use of atomic operations to avoid corruption in this manner constitutes synchronization because we are governing the order in which certain sections of our program can run. And we can achieve synchronization of any complexity and of any amount of data so long as we have access to some kind of atomic data.
Once we have created a program that uses a lock to share a data structure in a conflict-free way, we could also refer to that data structure as being logically atomic. C++ provides a std::atomic to achieve this, for example.
Remember that synchronization at this level (with a lock) is achieved by adhering to a protocol (protecting your data with a lock). Other forms of synchronization, such as what happens when two CPUs try to access the same memory on the same clock tick, are resolved in hardware by the CPUs and the motherboard, memory, controllers, etc. But fundamentally something similar is happening, just at the motherboard level.

Joining threads recursively in Prolog

I would like to create a variable number of threads in Prolog and make the main thread wait for all of them.
I have tried to make a join for each one of them in the predicate but it seems like they are waiting one for the other in a sequential order.
I have also tried storing the ids of the threads in a list and join each one after but it still isn't working.
In the code sample, I have also tried passing the S parameters in thread_join in the recursive call.
thr1(0):-!.
thr1(N):-
thread_create(someFunction(N),Id, []),
thread_join(Id, S),
N1 is N-1,
thr1(N1).
I expect the N predicates to overlap results when doing some print, but they are running in a sequential order.

Most likely the calls to your someFunction/1 predicate succeed faster than the time it takes to create the next thread, which is a relatively heavy process as SWI-Prolog threads are mapped to POSIX threads. Thus, to actually get overlapping results, the computation time of the thread goals must exceed thread creation time. For a toy example of accomplishing that, see:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/threads/sync

How does thread context-switching work with global variable?

I have been confused at this question:
I have C++ function:
void withdraw(int x) {
balance = balance - x;
}
balance is a global integer variable, which equals to 100 at the start.
We run the above function with two different thread: thread A and thread B. Thread A run withdraw(50) and thread B run withdraw(30).
Assuming we don't protect balance, what is the final result of balance after running those threads in following sequences?
A1->A2->A3->B1->B2->B3
B1->B2->B3->A1->A2->A3
A1->A2->B1->B2->B3->A3
B1->B2->A1->A2->A3->B3
Explanation:
A1 means OS execute the first line of function withdraw in thread A, A2 means OS execute the second line of function withdraw in thread A, B3 means OS execute the third line of function withdraw in thread B, and so on.
The sequence is how OS schedule thread A & B presumably.
My answer is
20
20
50 (Before context switch, OS saves balance. After context switch, OS restore balance to 50)
70 (Similar to above)
But my friend disagrees, he said that balance was a global variable. Thus it is not saved in stack, so it does not affected by context switching. He claimed that all 4 sequences result in 20.
So who is right? I can't find fault in his logic.
(We assume we have one processor that can only execute one thread at a time)

Consider this line:
balance = balance - x;
Thread A reads balance. It is 100. Now, thread A subtracts 50 and ... oops
Thread B reads balance. It is 100. Now, thread B subtracts 30 and updates the variable, which is now 70.
...thread A continues now updates the variable, which is now 50. You've just lost the work that Thread B.
Threads don't execute "lines of code" -- they execute machine instructions. It does not matter if a global variable is affected by context switching. What matters is when the variable is read, and when it is written, by each thread, because the value is "taken off the shelf" and modified, then "put back". Once the first thread has read the global variable and is working with the value "somewhere in space", the second thread must not read the global variable until the first thread has written the updated value.

Unless the threading standard you are using specifies, then there's no way to know. Most typical threading standards don't, so typically there's no way to know.
Your answer sounds like nonsense though. The OS has no idea what balance is nor any way to do anything to it around a context switch. Also, threads can run at the same time without context switches.
Your friend's answer also sounds like nonsense. How does he know that it won't be cached in a register by the compiler and thus some of the modifications will stomp on previous ones?
But the point is, both of you are just guessing about what might happen to happen. If you want to answer this usefully, you have to talk about what is guaranteed to happen.

Clearly homework, but saved by doing actual work before asking.
First, forget about context switching. Context switching is totally irrelevant to the problem. Assume that you have multiple processors, each executing one thread, and each progressing at an unknown speed, stopping and starting at unpredictable times. Even better, assume that this stopping and storing is controlled by an enemy, who will try to break your program.
And because context switching is irrelevant, the OS will not save or restore anything. It won't touch the variable balance. Only your two threads will.
Your friend is absolutely, totally wrong. It's quite the opposite. Since balance is a global variable, both threads can read and write it. But you don't only have the problem that they might read and write it in unknown order, as you examined, it is worse. They could access it at the same time, and if one thread modifies data while another reads it, you have a race condition and anything at all could happen. Not only could you get any result, your program could also crash.
If balance was a local variable saved on the stack, then both threads would have each its own variable, and nothing bad would happen.

Consider this line:
balance = balance - x;
Thread A reads balance. It is 100. Now, thread A subtracts 50 and ... oops
Thread B reads balance. It is 100. Now, thread B subtracts 30 and updates the variable, which is now 70.
...thread A continues and updates the variable, which is now 50. You've just completely lost the work of Thread B.
Threads don't execute "lines of code" -- they execute machine instructions. It does not matter if a global variable is affected by context switching. What matters is when the variable is read, and when it is written, by each thread, because the value is "taken off the shelf" and modified, then "put back". Once the first thread has read the global variable and is working with the value "somewhere in space", the second thread cannot read the global variable until the first thread has written the updated value.

Simple and short answer for c++: Unsynchronized access to a shared variable is undefined behavior, so anything can happen. The value can e.g. be 100,70,50,20,42 or -458995. The program could crash or not. And in theory its even allowed to order pizza.
The actual machine code that is executed is usually far away from what your program looks like and in the case of undefined behavior, you are no longer guaranteed, that the actual behavior has anything to do with the c++ code you have written.

If one thread writes to a location and another thread is reading, can the second thread see the new value then the old?

Start with x = 0. Note there are no memory barriers in any of the code below.
volatile int x = 0
Thread 1:
while (x == 0) {}
print "Saw non-zer0"
while (x != 0) {}
print "Saw zero again!"
Thread 2:
x = 1
Is it ever possible to see the second message, "Saw zero again!", on any (real) CPU? What about on x86_64?
Similarly, in this code:
volatile int x = 0.
Thread 1:
while (x == 0) {}
x = 2
Thread 2:
x = 1
Is the final value of x guaranteed to be 2, or could the CPU caches update main memory in some arbitrary order, so that although x = 1 gets into a CPU's cache where thread 1 can see it, then thread 1 gets moved to a different cpu where it writes x = 2 to that cpu's cache, and the x = 2 gets written back to main memory before x = 1.

Yes, it's entirely possible. The compiler could, for example, have just written x to memory but still have the value in a register. One while loop could check memory while the other checks the register.
It doesn't happen due to CPU caches because cache coherency hardware logic makes the caches invisible on all CPUs you are likely to actually use.
Theoretically, the write race you talk about could happen due to posted write buffering and read prefetching. Miraculous tricks were used to make this impossible on x86 CPUs to avoid breaking legacy code. But you shouldn't expect future processors to do this.

Leaving aside for a second tricks done by the compiler (even ones allowed by language standards), I believe you're asking how the micro-architecture could behave in such scenario. Keep in mind that the code would most likely expand into a busy wait loop of cmp [x] + jz or something similar, which hides a load inside it. This means that [x] is likely to live in the cache of the core running thread 1.
At some point, thread 2 would come and perform the store. If it resides on a different core, the line would first be invalidated completely from the first core. If these are 2 threads running on the same physical core - the store would immediately affect all chronologically younger loads.
Now, the most likely thing to happen on a modern out-of-order machine is that all the loads in the pipeline at this point would be different iterations of the same first loop (since any branch predictor facing so many repetitive "taken" resolution is likely to assume the branch will continue being taken, until proven wrong), so what would happen is that the first load to encounter the new value modified by the other thread will cause the matching branch to simply flush the entire pipe from all younger operations, without the 2nd loop ever having a chance to execute.
However, it's possible that for some reason you did get to the 2nd loop (let's say the predictor issue a not-taken prediction just at the right moment when the loop condition check saw the new value) - in this case, the question boils down to this scenario:
Time -->
----------------------------------------------------------------
thread 1
cmp [x],0 execute
je ... execute (not taken)
...
cmp [x],0 execute
jne ... execute (not taken)
Can_We_Get_Here:
...
thread2
store [x],1 execute
In other words, given that most modern CPUs may execute instructions out of order, can a younger load be evaluated before an older one to the same address, allowing the store (from another thread) to change the value so it may be observed inconsistently by the loads.
My guess is that the above timeline is quite possible given the nature of out-of-order execution engines today, as they simply arbitrate and perform whatever operation is ready. However, on most x86 implementations there are safeguards to protect against such a scenario, since the memory ordering rules strictly say -
8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations
Such mechanisms may detect this scenario and flush the machine to prevent the stale/wrong values becoming visible. So The answer is - no, it should not be possible, unless of course the software or the compiler change the nature of the code to prevent the hardware from noticing the relation. Then again, memory ordering rules are sometimes flaky, and i'm not sure all x86 manufacturers adhere to the exact same wording, but this is a pretty fundamental example of consistency, so i'd be very surprised if one of them missed it.

The answer seems to be, "this is exactly the job of the CPU cache coherency." x86 processors implement the MESI protocol, which guarantee that the second thread can't see the new value then the old.

Reasoning about IORef operation reordering in concurrent programs

The docs say:
In a concurrent program, IORef operations may appear out-of-order to
another thread, depending on the memory model of the underlying
processor architecture...The implementation is required to ensure that
reordering of memory operations cannot cause type-correct code to go
wrong. In particular, when inspecting the value read from an IORef,
the memory writes that created that value must have occurred from the
point of view of the current thread.
Which I'm not even entirely sure how to parse. Edward Yang says
In other words, “We give no guarantees about reordering, except that
you will not have any type-safety violations.” ...
the last sentence remarks that an IORef is not allowed to point to
uninitialized memory
So... it won't break the whole haskell; not very helpful. The discussion from which the memory model example arose also left me with questions (even Simon Marlow seemed a bit surprised).
Things that seem clear to me from the documentation
within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered
What isn't clear to me
Will newIORef False >>= \v-> writeIORef v True >> readIORef v always return True?
In the maybePrint case (from the IORef docs) would a readIORef myRef (along with maybe a seq or something) before readIORef yourRef have forced a barrier to reordering?
What's the straightforward mental model I should have? Is it something like:
within and from the point of view of an individual thread, the
ordering of IORef operations will appear sane and sequential; but the
compiler may actually reorder operations in such a way that break
certain assumptions in a concurrent system; however when a thread does
atomicModifyIORef, no threads will observe operations on that
IORef that appeared above the atomicModifyIORef to happen after,
and vice versa.
...? If not, what's the corrected version of the above?
If your response is "don't use IORef in concurrent code; use TVar" please convince me with specific facts and concrete examples of the kind of things you can't reason about with IORef.

I don't know Haskell concurrency, but I know something about memory models.
Processors can reorder instructions the way they like: loads may go ahead of loads, loads may go ahead of stores, loads of dependent stuff may go ahead of loads of stuff they depend on (a[i] may load the value from array first, then the reference to array a!), stores may be reordered with each other. You simply cannot put a finger on it and say "these two things definitely appear in a particular order, because there is no way they can be reordered". But in order for concurrent algorithms to operate, they need to observe the state of other threads. This is where it is important for thread state to proceed in a particular order. This is achieved by placing barriers between instructions, which guarantee the order of instructions to appear the same to all processors.
Typically (one of the simplest models), you want two types of ordered instructions: ordered load that does not go ahead of any other ordered loads or stores, and ordered store that does not go ahead of any instructions at all, and a guarantee that all ordered instructions appear in the same order to all processors. This way you can reason about IRIW kind of problem:
Thread 1: x=1
Thread 2: y=1
Thread 3: r1=x;
r2=y;
Thread 4: r4=y;
r3=x;
If all of these operations are ordered loads and ordered stores, then you can conclude the outcome (1,0,0,1)=(r1,r2,r3,r4) is not possible. Indeed, ordered stores in Threads 1 and 2 should appear in some order to all threads, and r1=1,r2=0 is witness that y=1 is executed after x=1. In its turn, this means that Thread 4 can never observe r4=1 without observing r3=1 (which is executed after r4=1) (if the ordered stores happen to be executed that way, observing y==1 implies x==1).
Also, if the loads and stores were not ordered, the processors would usually be allowed to observe the assignments to appear even in different orders: one might see x=1 appear before y=1, the other might see y=1 appear before x=1, so any combination of values r1,r2,r3,r4 is permitted.
This is sufficiently implemented like so:
ordered load:
load x
load-load -- barriers stopping other loads to go ahead of preceding loads
load-store -- no one is allowed to go ahead of ordered load
ordered store:
load-store
store-store -- ordered store must appear after all stores
-- preceding it in program order - serialize all stores
-- (flush write buffers)
store x,v
store-load -- ordered loads must not go ahead of ordered store
-- preceding them in program order
Of these two, I can see IORef implements a ordered store (atomicWriteIORef), but I don't see a ordered load (atomicReadIORef), without which you cannot reason about IRIW problem above. This is not a problem, if your target platform is x86, because all loads will be executed in program order on that platform, and stores never go ahead of loads (in effect, all loads are ordered loads).
A atomic update (atomicModifyIORef) seems to me a implementation of a so-called CAS loop (compare-and-set loop, which does not stop until a value is atomically set to b, if its value is a). You can see the atomic modify operation as a fusion of a ordered load and ordered store, with all those barriers there, and executed atomically - no processor is allowed to insert a modification instruction between load and store of a CAS.
Furthermore, writeIORef is cheaper than atomicWriteIORef, so you want to use writeIORef as much as your inter-thread communication protocol permits. Whereas writeIORef x vx >> writeIORef y vy >> atomicWriteIORef z vz >> readIORef t does not guarantee the order in which writeIORefs appear to other threads with respect to each other, there is a guarantee that they both will appear before atomicWriteIORef - so, seeing z==vz, you can conclude at this moment x==vx and y==vy, and you can conclude IORef t was loaded after stores to x, y, z can be observed by other threads. This latter point requires readIORef to be a ordered load, which is not provided as far as I can tell, but it will work like a ordered load on x86.
Typically you don't use concrete values of x, y, z, when reasoning about the algorithm. Instead, some algorithm-dependent invariants about the assigned values must hold, and can be proven - for example, like in IRIW case you can guarantee that Thread 4 will never see (0,1)=(r3,r4), if Thread 3 sees (1,0)=(r1,r2), and Thread 3 can take advantage of this: this means something is mutually excluded without acquiring any mutex or lock.
An example (not in Haskell) that will not work if loads are not ordered, or ordered stores do not flush write buffers (the requirement to make written values visible before the ordered load executes).
Suppose, z will show either x until y is computed, or y, if x has been computed, too. Don't ask why, it is not very easy to see outside the context - it is a kind of a queue - just enjoy what sort of reasoning is possible.
Thread 1: x=1;
if (z==0) compareAndSet(z, 0, y == 0? x: y);
Thread 2: y=2;
if (x != 0) while ((tmp=z) != y && !compareAndSet(z, tmp, y));
So, two threads set x and y, then set z to x or y, depending on whether y or x were computed, too. Assuming initially all are 0. Translating into loads and stores:
Thread 1: store x,1
load z
if ==0 then
load y
if == 0 then load x -- if loaded y is still 0, load x into tmp
else load y -- otherwise, load y into tmp
CAS z, 0, tmp -- CAS whatever was loaded in the previous if-statement
-- the CAS may fail, but see explanation
Thread 2: store y,2
load x
if !=0 then
loop: load z -- into tmp
load y
if !=tmp then -- compare loaded y to tmp
CAS z, tmp, y -- attempt to CAS z: if it is still tmp, set to y
if ! then goto loop -- if CAS did not succeed, go to loop
If Thread 1 load z is not a ordered load, then it will be allowed to go ahead of a ordered store (store x). It means wherever z is loaded to (a register, cache line, stack,...), the value is such that existed before the value of x can be visible. Looking at that value is useless - you cannot then judge where Thread 2 is up to. For the same reason you've got to have a guarantee that the write buffers were flushed before load z executed - otherwise it will still appear as a load of a value that existed before Thread 2 could see the value of x. This is important as will become clear below.
If Thread 2 load x or load z are not ordered loads, they may go ahead of store y, and will observe the values that were written before y is visible to other threads.
However, see that if the loads and stores are ordered, then the threads can negotiate who is to set the value of z without contending z. For example, if Thread 2 observes x==0, there is guarantee that Thread 1 will definitely execute x=1 later, and will see z==0 after that - so Thread 2 can leave without attempting to set z.
If Thread 1 observes z==0, then it should try to set z to x or y. So, first it will check if y has been set already. If it wasn't set, it will be set in the future, so try to set to x - CAS may fail, but only if Thread 2 concurrently set z to y, so no need to retry. Similarly there is no need to retry if Thread 1 observed y has been set: if CAS fails, then it has been set by Thread 2 to y. Thus we can see Thread 1 sets z to x or y in accordance with the requirement, and does not contend z too much.
On the other hand, Thread 2 can check if x has been computed already. If not, then it will be Thread 1's job to set z. If Thread 1 has computed x, then need to set z to y. Here we do need a CAS loop, because a single CAS may fail, if Thread 1 is attempting to set z to x or y concurrently.
The important takeaway here is that if "unrelated" loads and stores are not serialized (including flushing write buffers), no such reasoning is possible. However, once loads and stores are ordered, both threads can figure out the path each of them _will_take_in_the_future, and that way eliminate contention in half the cases. Most of the time x and y will be computed at significantly different times, so if y is computed before x, it is likely Thread 2 will not touch z at all. (Typically, "touching z" also possibly means "wake up a thread waiting on a cond_var z", so it is not only a matter of loading something from memory)

within a thread an atomicModifyIORef "is never observed to take place
ahead of any earlier IORef operations, or after any later IORef
operations" i.e. we get a partial ordering of: stuff above the atomic
mod -> atomic mod -> stuff after. Although, the wording "is never
observed" here is suggestive of spooky behavior that I haven't
anticipated.
"is never observed" is standard language when discussing memory reordering issues. For example, a CPU may issue a speculative read of a memory location earlier than necessary, so long as the value doesn't change between when the read is executed (early) and when the read should have been executed (in program order). That's entirely up to the CPU and cache though, it's never exposed to the programmer (hence language like "is never observed").
A readIORef x might be moved before writeIORef y, at least when there
are no data dependencies
True
Logically I don't see how something like readIORef x >>= writeIORef y
could be reordered
Correct, as that sequence has a data dependency. The value to be written depends upon the value returned from the first read.
For the other questions: newIORef False >>= \v-> writeIORef v True >> readIORef v will always return True (there's no opportunity for other threads to access the ref here).
In the myprint example, there's very little you can do to ensure this works reliably in the face of new optimizations added to future GHCs and across various CPU architectures. If you write:
writeIORef myRef True
x <- readIORef myRef
yourVal <- x `seq` readIORef yourRef
Even though GHC 7.6.3 produces correct cmm (and presumably asm, although I didn't check), there's nothing to stop a CPU with a relaxed memory model from moving the readIORef yourRef to before all of the myref/seq stuff. The only 100% reliable way to prevent it is with a memory fence, which GHC doesn't provide. (Edward's blog post does go through some of the other things you can do now, as well as why you may not want to rely on them).
I think your mental model is correct, however it's important to know that the possible apparent reorderings introduced by concurrent ops can be really unintuitive.
Edit: at the cmm level, the code snippet above looks like this (simplified, pseudocode):
[StackPtr+offset] := True
x := [StackPtr+offset]
if (notEvaluated x) (evaluate x)
yourVal := [StackPtr+offset2]
So there are a couple things that can happen. GHC as it currently stands is unlikely to move the last line any earlier, but I think it could if doing so seemed more optimal. I'm more concerned that, if you compile via LLVM, the LLVM optimizer might replace the second line with the value that was just written, and then the third line might be constant-folded out of existence, which would make it more likely that the read could be moved earlier. And regardless of what GHC does, most CPU memory models allow the CPU itself to move the read earlier absent a memory barrier.

http://en.wikipedia.org/wiki/Memory_ordering for non atomic concurrent reads and writes. (basically when you dont use atomics, just look at the memory ordering model for your target CPU)
Currently ghc can be regarded as not reordering your reads and writes for non atomic (and imperative) loads and stores. However, GHC Haskell currently doesn't specify any sort of concurrent memory model, so those non atomic operations will have the ordering semantics of the underlying CPU model, as I link to above.
In other words, Currently GHC has no formal concurrency memory model, and because any optimization algorithms tend to be wrt some model of equivalence, theres no reordering currently in play there.
that is: the only semantic model you can have right now is "the way its implemented"
shoot me an email! I'm working on some patching up atomics for 7.10, lets try to cook up some semantics!
Edit: some folks who understand this problem better than me chimed in on ghc-users thread here http://www.haskell.org/pipermail/glasgow-haskell-users/2013-December/024473.html .
Assume that i'm wrong in both this comment and anything i said in the ghc-users thread :)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string