PlusCal: Why does fair algorithm still stutter? - tla+

I've used PlusCal to model a trivial state machine that accepts strings matching the regular expression (X*)(Y).
(*--fair algorithm stateMachine
variables
state = "start";
begin
Loop:
while state /= "end" do
either
\* we got an X, keep going
state := "reading"
or
\* we got a Y, terminate
state := "end"
end either;
end while;
end algorithm;*)
Even though I've marked the algorithm fair, the following temporal condition fails because of stuttering ... the model checker allows for the case where the state machine absorbs an X and then simply stops at Loop without doing anything else ever again.
MachineTerminates == <>[](state = "end")
What does fairness actually mean in the context of a simple process like this one? Is there a way to tell the model checker that the state machine will never run out of inputs and thus will always terminate?

Okay, so this is very weird and should pass, but not for the the reason you'd think. To understand why, we first have to talk about the "usual" definition of fairness, then the technical one.
Weak fairness says that if an action is never disabled, it will eventually happen. What you probably expected is the following:
Loop could pick "end" but doesn't.
Loop could pick "end" but doesn't. This happens some arbitrary number of times.
Since Loop is fair*, it is forced to pick "end".
We're done.
But that's not the case. Fairness, in pluscal, is at the label level. It doesn't say anything about what happens in the label, just that label itself must happen. So this is a perfectly valid behavior:
Because loop is never disabled, it eventually happens. It picks "reading".
Because loop is never disabled, it eventually happens. It picks "reading".
Because loop is never disabled, it eventually happens. It picks "reading".
This keeps going forever.
This corresponds to you inputting a string of infinite length, consisting only of X's. If you only want finite strings, you have to explicitly express that, either as part of the spec or as one of the preconditions to your desired temporal property. For example, you could make your temporal property state = "end" ~> Termination, you could model the input string, you could add a count for "how many x's before the y", etc. This gets out of what's actually wrong and into what's good spec design, though.
That's the normal case. But this is a very, very specific exception to the rule. That's because of how "fairness is defined". Formally, WF_vars(A) == <>[](Enabled <<A>>_v) => []<><<A>>_v. We usually translate that as
If the system is always able to do A, then it will keep on doing A.
That's the interpretation I was using up until now. But it is wrong in one very big way. <<A>>_vars == A /\ (vars' /= vars), or "A happens and vars changes. So our definition actually is
If the system is always able to do A in a way that changes its state, then it will keep on doing A in a way that changes its state.
Once you have pc = "Loop" /\ state = "reading", doing state := "reading" does not change the state of the system, so it doesn't count as <<Next>>_vars. So <<Next>>_vars didn't actually happen, but by weak fairness it must eventually happen. The only way for <<Next>>_vars to happen is if the loop sets state := "reading, which allows the while loop to terminate.
This is a pretty unstable situation. Any slight change we make to the spec is likely to push us back into more familiar territory. For example, the following spec will not terminate, as expected:
variables
state = "start";
foo = TRUE;
begin
Loop:
while state /= "end" do
foo := ~foo;
either
\* we got an X, keep going
state := "reading"
or
\* we got a Y, terminate
state := "end"
end either;
end while;
end algorithm;*)
Even though foo doesn't affect the rest of the code, it allows us to have vars' /= vars without having to update state. Similarly, replacing WF_vars(Next) with WF_pc(Next) makes the spec fail, as we can reach a state where <<Next>>_pc is disabled (aka any state where state = "reading").
*Loop isn't fair, the total algorithm is. This can make a big difference in some cases, which is why we spec. But it's easier in this case to talk about Loop, as it's the only action.

Related

Clojure atomic derive and reset of an atom

I wrote a function to get the old value of an atom while putting a new value, all in one atomic operation:
(defn get-and-reset! [at newval]
"Resets atom to newval and returns the old value. Atomic."
(let [tmp (atom [])]
(swap! at #(do (reset! tmp %) newval))
#tmp))
The documentation says the swap! function shouldn't have side effects because it can be called multiple times. That alone doesn't seem like a problem since tmp never leaves the function and it's the last value that it gets reset! to that matters. The function seems to work but I haven't tested it thoroughly with multiple threads, etc. Is this local side-effect a safe exception to the documentation, or am I missing some other subtle problem?
Yes, that will work with the current implementation of atoms in Clojure, and is (almost) guaranteed to work by contract.
The key here is that atoms are sychronous. Therefore, the inner swap! is guaranteed to complete before the outer swap!. Since tmp is only used locally, from a single thread, the inner swap! is also guaranteed not to conflict with a swap! (of tmp) on another thread.
While the outer swap! (i.e., swap! of at) could conflict with other threads, this swap! will retry when a conflict is detected. Since swap! is synchronous, these retries will occur serially w.r.t. the thread the swap! is invoked on. I suppose it's conceivable this last condition does not necessarily hold. E.g., it would be possible for an implementation of atoms to perform the swap! on a different thread, and issue retries as soon as a conflict is detected (without waiting for previous tries to finish). However, that's not the way atoms are currently implemented, and (in my opinion) doesn't seem like a very likely way to implement atoms.
If this weakness bothers you, you can use compare-and-set! instead:
(defn get-and-reset! [at newval]
"Resets atom to newval and returns the old value. Atomic."
(loop [oldval #at]
(if (compare-and-set! at oldval newval)
;; then (no conflict => return oldval)
oldval
;; else (conflict => retry)
(recur #at))))
atom's cannot do what you are trying to do.
Atoms are only defined for uncoordinated syncronous updates of a single identity. for instance the functions used to update atoms may run many times so whatever you do with that value may happen many time for each value that makes it into the atom.
Agents are often a better choice for this sort of thing because If you send an action to an agent it will run at most once:
"At any point in time, at most one action for each Agent is being executed.
Actions dispatched to an agent from another single agent or thread will occur in the order they were sent"
Another option is to add a watch to the agent or atom and have that watch react to each change after it happens. If you can convince your self that neither of these cases work for you then you may have found one of the cases where coordinated change are actually required and then refs would be the better tool, though this is rare. Usually agents or atoms with watches cover most situations.

Reasoning about IORef operation reordering in concurrent programs

The docs say:
In a concurrent program, IORef operations may appear out-of-order to
another thread, depending on the memory model of the underlying
processor architecture...The implementation is required to ensure that
reordering of memory operations cannot cause type-correct code to go
wrong. In particular, when inspecting the value read from an IORef,
the memory writes that created that value must have occurred from the
point of view of the current thread.
Which I'm not even entirely sure how to parse. Edward Yang says
In other words, “We give no guarantees about reordering, except that
you will not have any type-safety violations.” ...
the last sentence remarks that an IORef is not allowed to point to
uninitialized memory
So... it won't break the whole haskell; not very helpful. The discussion from which the memory model example arose also left me with questions (even Simon Marlow seemed a bit surprised).
Things that seem clear to me from the documentation
within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered
What isn't clear to me
Will newIORef False >>= \v-> writeIORef v True >> readIORef v always return True?
In the maybePrint case (from the IORef docs) would a readIORef myRef (along with maybe a seq or something) before readIORef yourRef have forced a barrier to reordering?
What's the straightforward mental model I should have? Is it something like:
within and from the point of view of an individual thread, the
ordering of IORef operations will appear sane and sequential; but the
compiler may actually reorder operations in such a way that break
certain assumptions in a concurrent system; however when a thread does
atomicModifyIORef, no threads will observe operations on that
IORef that appeared above the atomicModifyIORef to happen after,
and vice versa.
...? If not, what's the corrected version of the above?
If your response is "don't use IORef in concurrent code; use TVar" please convince me with specific facts and concrete examples of the kind of things you can't reason about with IORef.
I don't know Haskell concurrency, but I know something about memory models.
Processors can reorder instructions the way they like: loads may go ahead of loads, loads may go ahead of stores, loads of dependent stuff may go ahead of loads of stuff they depend on (a[i] may load the value from array first, then the reference to array a!), stores may be reordered with each other. You simply cannot put a finger on it and say "these two things definitely appear in a particular order, because there is no way they can be reordered". But in order for concurrent algorithms to operate, they need to observe the state of other threads. This is where it is important for thread state to proceed in a particular order. This is achieved by placing barriers between instructions, which guarantee the order of instructions to appear the same to all processors.
Typically (one of the simplest models), you want two types of ordered instructions: ordered load that does not go ahead of any other ordered loads or stores, and ordered store that does not go ahead of any instructions at all, and a guarantee that all ordered instructions appear in the same order to all processors. This way you can reason about IRIW kind of problem:
Thread 1: x=1
Thread 2: y=1
Thread 3: r1=x;
r2=y;
Thread 4: r4=y;
r3=x;
If all of these operations are ordered loads and ordered stores, then you can conclude the outcome (1,0,0,1)=(r1,r2,r3,r4) is not possible. Indeed, ordered stores in Threads 1 and 2 should appear in some order to all threads, and r1=1,r2=0 is witness that y=1 is executed after x=1. In its turn, this means that Thread 4 can never observe r4=1 without observing r3=1 (which is executed after r4=1) (if the ordered stores happen to be executed that way, observing y==1 implies x==1).
Also, if the loads and stores were not ordered, the processors would usually be allowed to observe the assignments to appear even in different orders: one might see x=1 appear before y=1, the other might see y=1 appear before x=1, so any combination of values r1,r2,r3,r4 is permitted.
This is sufficiently implemented like so:
ordered load:
load x
load-load -- barriers stopping other loads to go ahead of preceding loads
load-store -- no one is allowed to go ahead of ordered load
ordered store:
load-store
store-store -- ordered store must appear after all stores
-- preceding it in program order - serialize all stores
-- (flush write buffers)
store x,v
store-load -- ordered loads must not go ahead of ordered store
-- preceding them in program order
Of these two, I can see IORef implements a ordered store (atomicWriteIORef), but I don't see a ordered load (atomicReadIORef), without which you cannot reason about IRIW problem above. This is not a problem, if your target platform is x86, because all loads will be executed in program order on that platform, and stores never go ahead of loads (in effect, all loads are ordered loads).
A atomic update (atomicModifyIORef) seems to me a implementation of a so-called CAS loop (compare-and-set loop, which does not stop until a value is atomically set to b, if its value is a). You can see the atomic modify operation as a fusion of a ordered load and ordered store, with all those barriers there, and executed atomically - no processor is allowed to insert a modification instruction between load and store of a CAS.
Furthermore, writeIORef is cheaper than atomicWriteIORef, so you want to use writeIORef as much as your inter-thread communication protocol permits. Whereas writeIORef x vx >> writeIORef y vy >> atomicWriteIORef z vz >> readIORef t does not guarantee the order in which writeIORefs appear to other threads with respect to each other, there is a guarantee that they both will appear before atomicWriteIORef - so, seeing z==vz, you can conclude at this moment x==vx and y==vy, and you can conclude IORef t was loaded after stores to x, y, z can be observed by other threads. This latter point requires readIORef to be a ordered load, which is not provided as far as I can tell, but it will work like a ordered load on x86.
Typically you don't use concrete values of x, y, z, when reasoning about the algorithm. Instead, some algorithm-dependent invariants about the assigned values must hold, and can be proven - for example, like in IRIW case you can guarantee that Thread 4 will never see (0,1)=(r3,r4), if Thread 3 sees (1,0)=(r1,r2), and Thread 3 can take advantage of this: this means something is mutually excluded without acquiring any mutex or lock.
An example (not in Haskell) that will not work if loads are not ordered, or ordered stores do not flush write buffers (the requirement to make written values visible before the ordered load executes).
Suppose, z will show either x until y is computed, or y, if x has been computed, too. Don't ask why, it is not very easy to see outside the context - it is a kind of a queue - just enjoy what sort of reasoning is possible.
Thread 1: x=1;
if (z==0) compareAndSet(z, 0, y == 0? x: y);
Thread 2: y=2;
if (x != 0) while ((tmp=z) != y && !compareAndSet(z, tmp, y));
So, two threads set x and y, then set z to x or y, depending on whether y or x were computed, too. Assuming initially all are 0. Translating into loads and stores:
Thread 1: store x,1
load z
if ==0 then
load y
if == 0 then load x -- if loaded y is still 0, load x into tmp
else load y -- otherwise, load y into tmp
CAS z, 0, tmp -- CAS whatever was loaded in the previous if-statement
-- the CAS may fail, but see explanation
Thread 2: store y,2
load x
if !=0 then
loop: load z -- into tmp
load y
if !=tmp then -- compare loaded y to tmp
CAS z, tmp, y -- attempt to CAS z: if it is still tmp, set to y
if ! then goto loop -- if CAS did not succeed, go to loop
If Thread 1 load z is not a ordered load, then it will be allowed to go ahead of a ordered store (store x). It means wherever z is loaded to (a register, cache line, stack,...), the value is such that existed before the value of x can be visible. Looking at that value is useless - you cannot then judge where Thread 2 is up to. For the same reason you've got to have a guarantee that the write buffers were flushed before load z executed - otherwise it will still appear as a load of a value that existed before Thread 2 could see the value of x. This is important as will become clear below.
If Thread 2 load x or load z are not ordered loads, they may go ahead of store y, and will observe the values that were written before y is visible to other threads.
However, see that if the loads and stores are ordered, then the threads can negotiate who is to set the value of z without contending z. For example, if Thread 2 observes x==0, there is guarantee that Thread 1 will definitely execute x=1 later, and will see z==0 after that - so Thread 2 can leave without attempting to set z.
If Thread 1 observes z==0, then it should try to set z to x or y. So, first it will check if y has been set already. If it wasn't set, it will be set in the future, so try to set to x - CAS may fail, but only if Thread 2 concurrently set z to y, so no need to retry. Similarly there is no need to retry if Thread 1 observed y has been set: if CAS fails, then it has been set by Thread 2 to y. Thus we can see Thread 1 sets z to x or y in accordance with the requirement, and does not contend z too much.
On the other hand, Thread 2 can check if x has been computed already. If not, then it will be Thread 1's job to set z. If Thread 1 has computed x, then need to set z to y. Here we do need a CAS loop, because a single CAS may fail, if Thread 1 is attempting to set z to x or y concurrently.
The important takeaway here is that if "unrelated" loads and stores are not serialized (including flushing write buffers), no such reasoning is possible. However, once loads and stores are ordered, both threads can figure out the path each of them _will_take_in_the_future, and that way eliminate contention in half the cases. Most of the time x and y will be computed at significantly different times, so if y is computed before x, it is likely Thread 2 will not touch z at all. (Typically, "touching z" also possibly means "wake up a thread waiting on a cond_var z", so it is not only a matter of loading something from memory)
within a thread an atomicModifyIORef "is never observed to take place
ahead of any earlier IORef operations, or after any later IORef
operations" i.e. we get a partial ordering of: stuff above the atomic
mod -> atomic mod -> stuff after. Although, the wording "is never
observed" here is suggestive of spooky behavior that I haven't
anticipated.
"is never observed" is standard language when discussing memory reordering issues. For example, a CPU may issue a speculative read of a memory location earlier than necessary, so long as the value doesn't change between when the read is executed (early) and when the read should have been executed (in program order). That's entirely up to the CPU and cache though, it's never exposed to the programmer (hence language like "is never observed").
A readIORef x might be moved before writeIORef y, at least when there
are no data dependencies
True
Logically I don't see how something like readIORef x >>= writeIORef y
could be reordered
Correct, as that sequence has a data dependency. The value to be written depends upon the value returned from the first read.
For the other questions: newIORef False >>= \v-> writeIORef v True >> readIORef v will always return True (there's no opportunity for other threads to access the ref here).
In the myprint example, there's very little you can do to ensure this works reliably in the face of new optimizations added to future GHCs and across various CPU architectures. If you write:
writeIORef myRef True
x <- readIORef myRef
yourVal <- x `seq` readIORef yourRef
Even though GHC 7.6.3 produces correct cmm (and presumably asm, although I didn't check), there's nothing to stop a CPU with a relaxed memory model from moving the readIORef yourRef to before all of the myref/seq stuff. The only 100% reliable way to prevent it is with a memory fence, which GHC doesn't provide. (Edward's blog post does go through some of the other things you can do now, as well as why you may not want to rely on them).
I think your mental model is correct, however it's important to know that the possible apparent reorderings introduced by concurrent ops can be really unintuitive.
Edit: at the cmm level, the code snippet above looks like this (simplified, pseudocode):
[StackPtr+offset] := True
x := [StackPtr+offset]
if (notEvaluated x) (evaluate x)
yourVal := [StackPtr+offset2]
So there are a couple things that can happen. GHC as it currently stands is unlikely to move the last line any earlier, but I think it could if doing so seemed more optimal. I'm more concerned that, if you compile via LLVM, the LLVM optimizer might replace the second line with the value that was just written, and then the third line might be constant-folded out of existence, which would make it more likely that the read could be moved earlier. And regardless of what GHC does, most CPU memory models allow the CPU itself to move the read earlier absent a memory barrier.
http://en.wikipedia.org/wiki/Memory_ordering for non atomic concurrent reads and writes. (basically when you dont use atomics, just look at the memory ordering model for your target CPU)
Currently ghc can be regarded as not reordering your reads and writes for non atomic (and imperative) loads and stores. However, GHC Haskell currently doesn't specify any sort of concurrent memory model, so those non atomic operations will have the ordering semantics of the underlying CPU model, as I link to above.
In other words, Currently GHC has no formal concurrency memory model, and because any optimization algorithms tend to be wrt some model of equivalence, theres no reordering currently in play there.
that is: the only semantic model you can have right now is "the way its implemented"
shoot me an email! I'm working on some patching up atomics for 7.10, lets try to cook up some semantics!
Edit: some folks who understand this problem better than me chimed in on ghc-users thread here http://www.haskell.org/pipermail/glasgow-haskell-users/2013-December/024473.html .
Assume that i'm wrong in both this comment and anything i said in the ghc-users thread :)

Are Data Races bad?

I like to settle a theoretical computing argument.
Assume everything initial 0
Thread0 Thread1
x=1 | y=x
Here we have a data race. As far as I understand (assuming that x fits in the architecture's word-size and is aligned on the word boundary, which it normally would be), the result is either x=1 ^ y=0 or x=1 ^ y=1.
Now my second example uses explicit locking (assume that lock() gets some global lock), and as far as I understand this is not a data race condition anymore.
Thread0 Thread1
lock() | lock()
x=1 | y=x
unlock() | unlock()
However I would argue that both programs are identical, they produce identical output, have identical race issues. Somehow however people are trying to convince me that data race condition is bad, and I don't see why my first program would be worse than my second.
Edit. The full quote from Wikipedia is:
C++11 introduced formal support for multithreading, and defined a data race strictly as a race condition between non-atomic variables. While race conditions in general will continue to exist, a "data race" must be avoided by the programmer, who must assure that only one thread at a time may access any variable if the access is for writing.
Now, assuming this is correct (it's wikipedia, which tends to be reasonably good on programming but can often be very wrong indeed), it's defining "data race" in this context purely as one of the clearly bad cases; those which can cause shearing of values. Such cases obviously must be avoided, so clearly data-races—defined as they are here—must be avoided.
And by this definition, neither program in your question has a data race.
I leave my original answer on race conditions generally:
The second example has a data-race too. Indeed, it has the exact same data-race as the first one.
Is this bad? That depends. Note before any of the rest. Not only are many cases bad, as I'll describe more below, but those cases that are bad tend to be particularly hard to find and fix, which in itself should lean one towards assuming the worse.
An obvious case where a data race is bad is where it corrupts data. Let's say we change your example so that x and y were larger than the architecture's word size and we're setting x = -1. We'll also assume two's-complement. Now the possible values for y are not just -1 and 0, but also -4294967296 and 4294967295.
In this case, the locking you suggest wouldn't remove the data-race completely, but would remove that part of it that could cause shearing: The only possible values of y would again be -1 and 0.
Another question is serialisation. It's often necessary to be able to consider a sequence of concurrent events as having been one of a limited set of sequential events.
For example, consider we start with X = 0 and then have:
Thread 0 Thread 1
++x x = -50
Now, there's still the risk of sheering here that could result in a possible bogus value.
Assuming that x is word-size or smaller, we still might have an issue. There are two possible values if the operations were not concurrent. Either x could be equal to -50 (increment, then assign -50) or x could be equal to -49 (assign -50 then increment). However, concurrently it's possible for us to end up with x having a value of 1 because thread 0 reads 0, thread 1 assigns -50and then thread 0 increments and assigns 1.
Now, it's quite possible that this is perfectly okay. It's very likely though that it isn't.
As programmers we've got four possibilities:
Identify the data-race. Determine that it is harmless (or relatively harmless*), and let it be.
Identify the data-race. Determine that it can cause problems, and fix it.
Identify the data-race. Just fix it because that way we can't make a mistake in determining it is harmless when it actually isn't.
Identify the data-race. Determine that it can cause problems. Change the code so the race doesn't cause problems.
The importance of case number 2 is obvious - we turn code that has a bug into code that isn't.
The importance of case number 3 comes down to time and provability. We might well be making code less efficient (many methods for stopping data-races have at least some overhead), but it often takes less developer time to remove a race than prove it harmless, and the cost of a wrong example is marginally slower code whereas the cost of being wrong in the other direction is a hard to fix bug.
The importance of number 1 is more complicated, it can be important in some very low-level concurrent code to avoid locking, so there are cases where we want to tolerate races. Number 4 is a way to turn something from number 2 into number 1, and comes up when either the data-race is inherent to the problem (we can't remove it) or we're doing the sort of low-level concurrency that number 1 involves.
Here's an interesting example in C#:
public static SomeResource GetTheResource()
{
get
{
if(_theResource == null)
_theResource = CreateTheResource();
return _theResource
}
}
The data-race should be obvious; until theResource is set and all CPU's caches see the update, we might assign to it several times from different threads. Is this a bug? Many people would say it is, but actually it depends. It's possible that it's safe to have a brief period where different versions of theResource are used, and all we really lose is some efficiency in the beginning from the multiple calls to CreateTheResource(). In code with a high requirement for performance we might decide to tolerate this initial lower efficiency for the long-term efficiency gain of no locking. Or it might be vital that we lock. Or we might just lock because we don't have that pressing a need to avoid it, and it's simpler just to assume that the might be a problem.
Important Point 1: If you do decide to tolerate a race like this, you should add a comment to that effect and why. Otherwise every time someone comes across this code they'll have to check again that it's safe, rather than at most check your stated reasoning.
Important Point 2: While the principle here is language-agnostic, the details in each case often are not. In this case tolerating the race depends not just on the temporary multiple copies being safe, but also on garbage collection cleaning those excess copies up. If we were instead assigning a pointer to the heap in C++ the above would at the very best be leaky, even if otherwise safe.
A more complicated case is something like this (again a C# example, but applicable to other languages):
internal sealed class LockFreeQueue<T>
{
private sealed class Node
{
public readonly T Item;
public Node Next;
public Node(T item)
{
Item = item;
}
}
private volatile Node _head;
private volatile Node _tail;
public LockFreeQueue()
{
_head = _tail = new Node(default(T));
}
#pragma warning disable 420 // volatile semantics not lost as only by-ref calls are interlocked
public void Enqueue(T item)
{
Node newNode = new Node(item);
for(;;)
{
Node curTail = _tail;
if (Interlocked.CompareExchange(ref curTail.Next, newNode, null) == null) //append to the tail if it is indeed the tail.
{
Interlocked.CompareExchange(ref _tail, newNode, curTail); //CAS in case we were assisted by an obstructed thread.
return;
}
else
{
Interlocked.CompareExchange(ref _tail, curTail.Next, curTail); //assist obstructing thread.
}
}
}
public bool TryDequeue(out T item)
{
for(;;)
{
Node curHead = _head;
Node curTail = _tail;
Node curHeadNext = curHead.Next;
if (curHead == curTail)
{
if (curHeadNext == null)
{
item = default(T);
return false;
}
else
Interlocked.CompareExchange(ref _tail, curHeadNext, curTail); // assist obstructing thread
}
else
{
item = curHeadNext.Item;
if (Interlocked.CompareExchange(ref _head, curHeadNext, curHead) == curHead)
{
return true;
}
}
}
}
#pragma warning restore 420
}
This code doesn't prevent data-races, but rather it reacts to them. If an operation is affected by another thread, then rather than error or return an incorrect result, the thread deals with the race and returns something else (and indeed even helps the other thread in some cases).
So in summary, data-races are not in and of themselves bad things. They are though complicating things, and those complications can cause problems. When you have a data-race you have a choice between proving it's not a problem, changing your code to tolerate the race so that it's no longer a problem, or changing your code to remove the race. Of these, just removing the race is often the easiest choice.
*I don't mean "relatively harmless" in a vague way here, but relative to the alternative. E.g. if we decide to leave the race in the C# example given, it's because we've decided that the cost of redundant object creation is less harmful than the relative cost of preventing it.
I thank everybody for their answers, although valuable they did not actually answer the question I was hoping I asked. The answers did allow me to reason better about what I was actually asking, and in the end find something of an answer online:
http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
So I guess my question should have been:
The C(++)11 standard defines my first example as a data race (if I don't use the "atomic" keyword), and the second one not. The first one therefore has undefined behaviour (even though there don't seem to be compiler implementations that would result in anything but x==1 && y==0|1, according to the standard any resulting value for x and y is correct compiler behaviour). I was wondering why this is. I think the Intel document answers that question pretty elaborately.
If x and y fit into a machine register then assignment is atomic by default so locks won't change the outcome. It's equally possible to get y = 0 or y = 1 in the second case as well.

What are the C++11 memory ordering guarantees in this corner case?

I'm writing some lock-free code, and I came up with an interesting pattern, but I'm not sure if it will behave as expected under relaxed memory ordering.
The simplest way to explain it is using an example:
std::atomic<int> a, b, c;
auto a_local = a.load(std::memory_order_relaxed);
auto b_local = b.load(std::memory_order_relaxed);
if (a_local < b_local) {
auto c_local = c.fetch_add(1, std::memory_order_relaxed);
}
Note that all operations use std::memory_order_relaxed.
Obviously, on the thread that this is executed on, the loads for a and b must be done before the if condition is evaluated.
Similarly, the read-modify-write (RMW) operation on c must be done after the condition is evaluated (because it's conditional on that... condition).
What I want to know is, does this code guarantee that the value of c_local is at least as up-to-date as the values of a_local and b_local? If so, how is this possible given the relaxed memory ordering? Is the control dependency together with the RWM operation acting as some sort of acquire fence? (Note that there's not even a corresponding release anywhere.)
If the above holds true, I believe this example should also work (assuming no overflow) -- am I right?
std::atomic<int> a(0), b(0);
// Thread 1
while (true) {
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
if (a_local >= 0) { // Always true at runtime
b.fetch_add(1, std::memory_order_relaxed);
}
}
// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
if (b_local < 777) {
// Note that fetch_add returns the pre-incrementation value
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
assert(b_local <= a_local); // Is this guaranteed?
}
On thread 1, there is a control dependency which I suspect guarantees that a is always incremented before b is incremented (but they each keep being incremented neck-and-neck). On thread 2, there is another control dependency which I suspect guarantees that b is loaded into b_local before a is incremented. I also think that the value returned from fetch_add will be at least as recent as any observed value in b_local, and the assert should therefore hold. But I'm not sure, since this departs significantly from the usual memory-ordering examples, and my understanding of the C++11 memory model is not perfect (I have trouble reasoning about these memory ordering effects with any degree of certainty). Any insights would be appreciated!
Update: As bames53 has helpfully pointed out in the comments, given a sufficiently smart compiler, it's possible that an if could be optimised out entirely under the right circumstances, in which case the relaxed loads could be reordered to occur after the RMW, causing their values to be more up-to-date than the fetch_add return value (the assert could fire in my second example). However, what if instead of an if, an atomic_signal_fence (not atomic_thread_fence) is inserted? That certainly can't be ignored by the compiler no matter what optimizations are done, but does it ensure that the code behaves as expected? Is the CPU allowed to do any re-ordering in such a case?
The second example then becomes:
std::atomic<int> a(0), b(0);
// Thread 1
while (true) {
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_acq_rel);
b.fetch_add(1, std::memory_order_relaxed);
}
// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_acq_rel);
// Note that fetch_add returns the pre-incrementation value
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
assert(b_local <= a_local); // Is this guaranteed?
Another update: After reading all the responses so far and combing through the standard myself, I don't think it can be shown that the code is correct using only the standard. So, can anyone come up with a counter-example of a theoretical system that complies with the standard and also fires the assert?
Signal fences don't provide the necessary guarantees (well, not unless 'thread 2' is a signal hander that actually runs on 'thread 1').
To guarantee correct behavior we need synchronization between threads, and the fence that does that is std::atomic_thread_fence.
Let's label the statements so we can diagram various executions (with thread fences replacing signal fences, as required):
while (true) {
auto a_local = a.fetch_add(1, std::memory_order_relaxed); // A
std::atomic_thread_fence(std::memory_order_acq_rel); // B
b.fetch_add(1, std::memory_order_relaxed); // C
}
auto b_local = b.load(std::memory_order_relaxed); // X
std::atomic_thread_fence(std::memory_order_acq_rel); // Y
auto a_local = a.fetch_add(1, std::memory_order_relaxed); // Z
So first let's assume that X loads a value written by C. The following paragraph specifies that in that case the fences synchronize and a happens-before relationship is established.
29.8/2:
A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
And here's a possible execution order where the arrows are happens-before relations.
Thread 1: A₁ → B₁ → C₁ → A₂ → B₂ → C₂ → ...
↘
Thread 2: X → Y → Z
If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M. — [C++11 1.10/18]
So the load at Z must take its value from A₁ or from a subsequent modification. Therefore the assert holds because the value written at A₁ and at all later modifications is greater than or equal to the value written at C₁ (and read by X).
Now let's look at the case where the fences do not synchronize. This happens when the load of b does not load a value written by thread 1, but instead reads the value that b is initialized with. There's still synchronization where the threads starts though:
30.3.1.2/5
Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.
This is specifying the behavior of std::thread's constructor. So (assuming the thread creation is correctly sequenced after the initialization of a) the value read by Z must take its value from the initialization of a or from one of the subsequent modifications on thread 1, which means that the assertions still holds.
This example gets at a variation of reads-from-thin-air like behavior. The relevant discussion in the spec is in section 29.3p9-11. Since the current version of the C11 standard doesn't guarantee dependences be respected, the memory model should allow the assertion to be fired. The most likely situation is that the compiler optimizes away the check that a_local>=0. But even if you replace that check with a signal fence, CPUs would be free to reorder those instructions.
You can test such code examples under the C/C++11 memory models using the open source CDSChecker tool.
The interesting issue with your example is that for an execution to violate the assertion, there has to be a cycle of dependences. More concretely:
The b.fetch_add in thread one depends on the a.fetch_add in the same loop iteration due to the if condition. The a.fetch_add in thread 2 depends on b.load. For an assertion violation, we have to have T2's b.load read from a b.fetch_add in a later loop iteration than T2's a.fetch_add. Now consider the b.fetch_add the b.load reads from and call it # for future reference. We know that b.load depends on # as it takes it value from #.
We know that # must depend on T2's a.fetch_add as T2's a.fetch_add atomic reads and updates a prior a.fetch_add from T1 in the same loop iteration as #. So we know that # depends on the a.fetch_add in thread 2. That gives us a cycle in dependences and is plain weird, but allowed by the C/C++ memory model. The most likely way of actually producing that cycle is (1) compiler figures out that a.local is always greater than 0, breaking the dependence. It can then do loop unrolling and reorder T1's fetch_add however it wants.
After reading all the responses so far and combing through the
standard myself, I don't think it can be shown that the code is
correct using only the standard.
And unless you admit that non atomic operations are magically safer and more ordered then relaxed atomic operations (which is silly) and that there is one semantic of C++ without atomics (and try_lock and shared_ptr::count) and another semantic for those features that don't execute sequentially, you also have to admit that no program at all can be proven correct, as the non atomic operations don't have an "ordering" and they are needed to construct and destroy variables.
Or, you stop taking the standard text as the only word on the language and use some common sense, which is always recommended.

Is it ok to have multiple threads writing the same values to the same variables?

I understand about race conditions and how with multiple threads accessing the same variable, updates made by one can be ignored and overwritten by others, but what if each thread is writing the same value (not different values) to the same variable; can even this cause problems? Could this code:
GlobalVar.property = 11;
(assuming that property will never be assigned anything other than 11), cause problems if multiple threads execute it at the same time?
The problem comes when you read that state back, and do something about it. Writing is a red herring - it is true that as long as this is a single word most environments guarantee the write will be atomic, but that doesn't mean that a larger piece of code that includes this fragment is thread-safe. Firstly, presumably your global variable contained a different value to begin with - otherwise if you know it's always the same, why is it a variable? Second, presumably you eventually read this value back again?
The issue is that presumably, you are writing to this bit of shared state for a reason - to signal that something has occurred? This is where it falls down: when you have no locking constructs, there is no implied order of memory accesses at all. It's hard to point to what's wrong here because your example doesn't actually contain the use of the variable, so here's a trivialish example in neutral C-like syntax:
int x = 0, y = 0;
//thread A does:
x = 1;
y = 2;
if (y == 2)
print(x);
//thread B does, at the same time:
if (y == 2)
print(x);
Thread A will always print 1, but it's completely valid for thread B to print 0. The order of operations in thread A is only required to be observable from code executing in thread A - thread B is allowed to see any combination of the state. The writes to x and y may not actually happen in order.
This can happen even on single-processor systems, where most people do not expect this kind of reordering - your compiler may reorder it for you. On SMP even if the compiler doesn't reorder things, the memory writes may be reordered between the caches of the separate processors.
If that doesn't seem to answer it for you, include more detail of your example in the question. Without the use of the variable it's impossible to definitively say whether such a usage is safe or not.
It depends on the work actually done by that statement. There can still be some cases where Something Bad happens - for example, if a C++ class has overloaded the = operator, and does anything nontrivial within that statement.
I have accidentally written code that did something like this with POD types (builtin primitive types), and it worked fine -- however, it's definitely not good practice, and I'm not confident that it's dependable.
Why not just lock the memory around this variable when you use it? In fact, if you somehow "know" this is the only write statement that can occur at some point in your code, why not just use the value 11 directly, instead of writing it to a shared variable?
(edit: I guess it's better to use a constant name instead of the magic number 11 directly in the code, btw.)
If you're using this to figure out when at least one thread has reached this statement, you could use a semaphore that starts at 1, and is decremented by the first thread that hits it.
I would expect the result to be undetermined. As in it would vary from compiler to complier, langauge to language and OS to OS etc. So no, it is not safe
WHy would you want to do this though - adding in a line to obtain a mutex lock is only one or two lines of code (in most languages), and would remove any possibility of problem. If this is going to be two expensive then you need to find an alternate way of solving the problem
In General, this is not considered a safe thing to do unless your system provides for atomic operation (operations that are guaranteed to be executed in a single cycle).
The reason is that while the "C" statement looks simple, often there are a number of underlying assembly operations taking place.
Depending on your OS, there are a few things you could do:
Take a mutual exclusion semaphore (mutex) to protect access
in some OS, you can temporarily disable preemption, which guarantees your thread will not swap out.
Some OS provide a writer or reader semaphore which is more performant than a plain old mutex.
Here's my take on the question.
You have two or more threads running that write to a variable...like a status flag or something, where you only want to know if one or more of them was true. Then in another part of the code (after the threads complete) you want to check and see if at least on thread set that status... for example
bool flag = false
threadContainer tc
threadInputs inputs
check(input)
{
...do stuff to input
if(success)
flag = true
}
start multiple threads
foreach(i in inputs)
t = startthread(check, i)
tc.add(t) // Keep track of all the threads started
foreach(t in tc)
t.join( ) // Wait until each thread is done
if(flag)
print "One of the threads were successful"
else
print "None of the threads were successful"
I believe the above code would be OK, assuming you're fine with not knowing which thread set the status to true, and you can wait for all the multi-threaded stuff to finish before reading that flag. I could be wrong though.
If the operation is atomic, you should be able to get by just fine. But I wouldn't do that in practice. It is better just to acquire a lock on the object and write the value.
Assuming that property will never be assigned anything other than 11, then I don't see a reason for assigment in the first place. Just make it a constant then.
Assigment only makes sense when you intend to change the value unless the act of assigment itself has other side effects - like volatile writes have memory visibility side-effects in Java. And if you change state shared between multiple threads, then you need to synchronize or otherwise "handle" the problem of concurrency.
When you assign a value, without proper synchronization, to some state shared between multiple threads, then there's no guarantees for when the other threads will see that change. And no visibility guarantees means that it it possible that the other threads will never see the assignt.
Compilers, JITs, CPU caches. They're all trying to make your code run as fast as possible, and if you don't make any explicit requirements for memory visibility, then they will take advantage of that. If not on your machine, then somebody elses.

Resources