At what point does the compiler need to reorder execution for optimization? - multithreading

I was reading about mutexes, semaphores, spin locks, memory barriers, etc. and came across this execution-reordering thing. I read something about it on the wiki, but it really doesn't make sense to me: reordering execution for the sake of optimization? Doesn't that break the code? What are the restrictions? The wiki page on the Java Memory Model says:
On modern platforms, code is frequently not executed in the order it was written. It is reordered by the compiler, the processor and the memory subsystem to achieve maximum performance. On multiprocessor architectures, individual processors may have their own local caches that are out of sync with main memory.
So specifically in a multi-threaded context it brings performance, but it also makes your program unstable or inconsistent, and you have to be extremely careful. Isn't this excessive complexity for the sake of performance? Code reordering looks scary.

As a good scenario for reordering, consider the following:
A = B / C; => read B, C + compute + write to A
D = E + F; => read E, F + compute + write to D
which translates as:
read B, C
compute B/C
write result to A
read E, F
compute E+F
write result to D
can be reordered as:
read B, C, E, F ==> force reads as soon as possible
compute B / C => slow operation (could do something useful such as read E, F in the meanwhile)
compute E + F
write A, write D => defer writes as late as possible
This does not break the single threaded ordering guarantee and achieves better execution throughput.
Also, the concepts are described pretty well here.

Doesn't that break the code?
Optimizations are restricted to changes that do not violate the guarantees that the language makes. The key point is that many languages do not fully specify behavior. When two threads execute x = 1 and x = 2 the result is not fully specified.
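For concreteness, here is a minimal C++ sketch (names invented, not from the answer) of the two flavours of "not fully specified": with a plain int the concurrent writes are a data race, so the behaviour is undefined; with std::atomic<int> the program is well-defined, but which of the two stores wins is still unspecified.

#include <atomic>

int plain_x = 0;                 // two threads writing this concurrently: data race, undefined behaviour
std::atomic<int> atomic_x{0};    // two threads writing this: final value is 1 or 2, but which is unspecified

void thread_a() { plain_x = 1; atomic_x.store(1); }
void thread_b() { plain_x = 2; atomic_x.store(2); }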
So isn't this excessive complexity for the sake of performance? Code reordering looks scary.
Whether or not it is overly complicated depends on your goals. Some languages are extremely safe, others are very loose. All languages give you tools to write safe programs; the tools vary, as does the difficulty of getting things right. Performance varies as well.

Here is an example of code that is likely to be reordered:
extern bool b;
extern int x;
extern int y;
extern int z;

void foo() {
    z = x;
    if (b)
        y = x;
}
The alleged idea is the following: first set the global z to something, and if b is set, set y as well. It is assumed that there is another thread which is somehow querying z, and once it becomes what that thread wants, it sets b - thus allowing foo() to assign y.
This code, of course, has multiple issues, but I will focus on reordering. The code produces the following assembly:
foo():
cmpb $0, b(%rip)
movl x(%rip), %eax
movl %eax, z(%rip)
je .L1
movl %eax, y(%rip)
.L1:
rep ret
As you see, b is checked first! Then x is copied to z, and then the jump is performed. Execution has been reordered: the boolean flag is checked before the assignment is done, and the whole idea is broken. (It would be broken anyway, so this example is just a reordering illustration.)
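For what it's worth, a hedged sketch (assuming the globals may be redeclared) of how this particular reordering is normally suppressed: making z and b std::atomic with the default sequentially consistent ordering forbids the load of b from being hoisted above the store to z. Release/acquire alone would not be enough here, because the pattern needs store-then-load (StoreLoad) ordering. This addresses only the reordering; it does not claim to fix the design's other issues.

#include <atomic>

extern std::atomic<int>  z;
extern std::atomic<bool> b;
extern int x;
extern int y;

void foo() {
    z.store(x);        // seq_cst store: publish z first
    if (b.load())      // seq_cst load: cannot be moved above the store to z
        y = x;
}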

Related

Peterson's solution with just one variable

For Pi:

do {
    turn = i;           // prepare to enter section
    while (turn == j);
    // critical section
    turn = j;           // exit section
} while (true);

For Pj:

do {
    turn = j;           // prepare to enter section
    while (turn == i);
    // critical section
    turn = i;           // exit section
} while (true);
In this simplified algorithm, if process i wants to enter its critical section, it sets turn = i (different from Peterson's solution, which sets turn = j). This algorithm does not seem to cause deadlock or starvation, so why isn't Peterson's algorithm simplified like this?
Another question: as far as I know, mutual exclusion mechanisms such as semaphore P/V operations require atomicity (P should test sem.value and decrement it as one atomic step). But why does the algorithm above, which uses just the one variable turn, not seem to require atomicity (turn = i and the test turn == j are not atomic)?
Before you ask whether the algorithm avoids deadlock and starvation, you first have to verify that it still locks. With your version, even assuming sequential consistency, the operations could be sequenced like this:
Pi                          Pj
turn = i;
while (turn == j);          // exits immediately
                            turn = j;
                            while (turn == i);  // exits immediately
// critical section         // critical section
and you have a lock violation.
To your second question: it depends on what you mean by "atomicity". You do need it to be the case that when one thread stores turn = i; then the other thread loading turn will only read i or j and not anything else. On some machines, depending on the type of turn and the values of i and j, you could get tearing and load an entirely different value. So whatever language you are using may require you to declare turn as "atomic" in some fashion to avoid this. In C++ in particular, if turn isn't declared std::atomic, then any concurrent read/write access is a data race, and the behavior of the entire program becomes undefined (that's bad).
Besides the need to avoid tearing and data races, Peterson's algorithm also requires strict memory ordering (sequential consistency), which on many systems / languages is not guaranteed unless specially requested, again perhaps by declaring the variable as atomic in some fashion.
It is true that unlike more typical lock algorithms, Peterson doesn't require an atomic read-modify-write, only atomic sequentially consistent loads and stores. That's precisely what makes it an interesting and clever algorithm. But there's a substantial tradeoff in complexity and performance, especially if you want more than two threads, and most real-life systems do have reasonably efficient atomic RMW instructions, so Peterson is rarely used in practice.
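For reference, a hedged C++ sketch of the full two-thread Peterson algorithm being contrasted here (my own illustration, not from the question): a per-thread flag array plus the shared turn variable, with default (sequentially consistent) atomics supplying the ordering the algorithm needs.

#include <atomic>

std::atomic<bool> flag[2] = {{false}, {false}};  // flag[i]: thread i wants to enter
std::atomic<int>  turn{0};

void lock(int i) {                 // i is 0 or 1
    int j = 1 - i;
    flag[i].store(true);           // announce intent (seq_cst by default)
    turn.store(j);                 // politely give the other thread priority
    while (flag[j].load() && turn.load() == j) {
        // busy-wait while the other thread wants in and it is its turn
    }
}

void unlock(int i) {
    flag[i].store(false);
}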

According to this function, can y = 8? [duplicate]

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
    int num = 0;
    num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5, which corresponds to num++, is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in the section about uniprocessor machines, as well as this answer, the conversation in its comments and others explain, it can (although not with C or C++).
This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
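A hedged sketch of that classic example (names invented): the payload is plain data and only the flag is atomic; the release store keeps the payload write from sinking below it, and the acquire load on the other side keeps the payload read from rising above it.

#include <atomic>

int payload;                        // plain, non-atomic data
std::atomic<int> flag{0};

void producer() {
    payload = 42;                              // fill the buffer first
    flag.store(1, std::memory_order_release);  // then publish the flag
}

int consumer() {
    while (flag.load(std::memory_order_acquire) == 0) {
        // spin until the release store becomes visible
    }
    return payload;                            // guaranteed to observe 42
}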
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously, similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int and std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value), so volatile atomic wouldn't change the code-gen.)
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
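As an illustration (a sketch of the kind of transformation the as-if rule permits, not the output of any particular compiler): because a data race on a plain bool is UB, the compiler may assume no other thread writes it and hoist the load out of the loop.

bool stop;                        // plain global, no synchronization

void worker_as_written() {
    while (!stop) {
        // spin, hoping another thread eventually sets stop
    }
}

void worker_as_optimized() {      // a legal rewrite: the load of stop is hoisted out of the loop
    if (!stop) {
        for (;;) {
            // never re-reads stop
        }
    }
}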
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
    num.fetch_add(1, std::memory_order_relaxed);
}

int load_num() { return num; }            // Even seq_cst loads are free on x86
void store_num(int val) { num = val; }
void store_num_release(int val) {
    num.store(val, std::memory_order_release);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how an MFENCE (a full barrier) is needed after the sequential-consistency store. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
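For illustration, a hedged sketch (invented names) of the StoreLoad litmus test that the store buffer allows: each thread's own store can still be sitting in its store buffer when the other thread's load executes, so both loads can return the old value.

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_release);     // a plain mov on x86
    r1 = Y.load(std::memory_order_acquire);
}

void thread2() {
    Y.store(1, std::memory_order_release);
    r2 = X.load(std::memory_order_acquire);
}

// r1 == 0 && r2 == 0 is an allowed outcome with release/acquire;
// making all four operations seq_cst (MFENCE or a lock-prefixed op after each store on x86) forbids it.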
Re: discussion in comments on @Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if an ordering in which nothing becomes globally visible between these operations is compatible with the ordering requirements of the source (according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in P0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for the entire lifespan of the temporary).
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
    num.fetch_add( 1, std::memory_order_relaxed);
    num.fetch_add(-1, std::memory_order_relaxed);
    num.fetch_add( 6, std::memory_order_relaxed);
    num.fetch_add(-5, std::memory_order_relaxed);
    //num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret
Without many complications, an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It performs three operations: load the operand from memory, increment it, store the operand back to memory.
During these operations the CPU acquires and releases the bus twice; in between, any other agent can acquire it too, and this violates atomicity.
AGENT 1              AGENT 2
load X
inc C
                     load X
                     inc C
                     store X
store X
X is incremented only once.
...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
    num = 0;
    num++;
    --num;
    num += 6;
    num -= 5;
    --num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
Another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:
- observable in another thread, and
- respectful of similar modifications happening in other threads.
Atomicity is not just at the instruction level; it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations on updates of std::atomics.
The C++ standard has the 'as-if' rule, by which the compiler is permitted to reorder, and even rewrite, code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code as written.
The as-if rule is conservative, particularly involving atomics.
Consider:

void incdec(int& num) {
    ++num;
    --num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
    // nada
}
This is because in the C++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num were volatile (it might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
    ++num;
    --num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <thread>
#include <atomic>
#include <iostream> // needed for std::cout

int main()
{
    for (int iter = 0; iter < 20; ++iter)
    {
        std::atomic<int> num = { 0 };
        std::thread t1([&] {
            for (int i = 0; i < 10000000; ++i)
            {
                ++num;
                --num;
            }
        });
        std::thread t2([&] {
            for (int i = 0; i < 10000000; ++i)
            {
                num = 100;
            }
        });
        t2.join();
        t1.join();
        std::cout << num << std::endl;
    }
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99
The add instruction is not atomic. It references memory, and two processor cores may have different local caches of that memory.
IIRC the atomic variant of the add instruction is called lock xadd
Since line 5, which corresponds to num++, is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown that variable away or loaded 1 directly into it without invoking operator++. Because the generated assembly may change significantly based on optimization flags, target CPU, etc., your conclusion is based on sand.
Your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.
Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
#include <memory>
#include <thread>
#include <vector>

int main()
{
    std::unique_ptr<std::vector<int>> vec;
    int ready = 0;
    std::thread t{[&]
    {
        while (!ready);
        // use "vec" here
    }};
    vec.reset(new std::vector<int>());
    ++ready;
    t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I heard from two compiler developers that they gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as the language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
#include <atomic>
#include <memory>
#include <thread>
#include <vector>

int main()
{
    std::unique_ptr<std::vector<int>> vec;
    std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
    std::thread t{[&]
    {
        while (!ready);
        // use "vec" here
    }};
    vec.reset(new std::vector<int>());
    ++ready;
    t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these two operations. This is because ++ready and the check are sequentially consistent, another term defined in the C++ memory model, and that ordering forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not, for example, postpone the write to vec until after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available, e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.
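For illustration only, a hedged sketch of the "lesser guarantee" just mentioned: the same vec/ready example written with explicit release/acquire ordering, which is still sufficient here because the acquire load that observes 1 synchronizes-with the release store.

#include <atomic>
#include <memory>
#include <thread>
#include <vector>

int main()
{
    std::unique_ptr<std::vector<int>> vec;
    std::atomic<int> ready{0};
    std::thread t{[&]
    {
        while (ready.load(std::memory_order_acquire) == 0)
            ;                          // spin until the release store is visible
        vec->push_back(42);            // safe: the publication of vec happened-before this
    }};
    vec.reset(new std::vector<int>());
    ready.store(1, std::memory_order_release);
    t.join();
}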
On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU1. An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.
Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write, and if the memory was not also used as a DMA buffer, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn't addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.
No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
    progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
    local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
    local++;
while (local--)
    progress++;
:-/
The fact that a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick & dirty example), seems to imply that incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when accessing num from another thread), and it is wrong anyway, because add is not atomic on x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
The following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int passed by reference, the "regular" way:

void inc(int& x)
{
    ++x;
}

This compiles into:
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way :
#include <atomic>

void inc(std::atomic<int>& x)
{
    ++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq
Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently, happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^
Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is that on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the C++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from race conditions even on an x86 architecture, because x86 doesn't synchronize loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they might end up incrementing cached (outdated) values.
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.
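To tie the thread together, here is a hedged, self-contained demo (my own, with invented numbers) of the lost updates described earlier: two threads each increment both a plain int and a std::atomic<int>; the atomic counter always ends at 200000, while the plain counter typically comes up short (and the race on it is, strictly, undefined behaviour).

#include <atomic>
#include <iostream>
#include <thread>

int plain_counter = 0;                    // data race: undefined behaviour, shown only to illustrate lost updates
std::atomic<int> atomic_counter{0};

void bump() {
    for (int i = 0; i < 100000; ++i) {
        ++plain_counter;                  // may compile to separate load/inc/store
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::thread t1(bump), t2(bump);
    t1.join();
    t2.join();
    std::cout << plain_counter << ' ' << atomic_counter << '\n';
}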

What occurs when 3 "stores" happen sequentially and only one is atomic

I tried to boil this down to a simple example for the sake of clarity. I have an atomic flag of sorts that is used to indicate that one thing just completed and another has not yet started. Both of those things involve storing data in a buffer. I'm trying to figure out specifically how Rust's Release ordering works in order to understand how to do this. Consider this very oversimplified example:
use std::sync::atomic::{AtomicU32, Ordering};

fn main() {
    let mut a = 0;
    let mut b = AtomicU32::new(0);
    let mut c = 0;
    // stuff happens
    a = 10;
    b.store(11, Ordering::Release);
    c = 11;
}
In particular, it is imperative to maintain a type invariant that the atomic store to variable b happens after a and before c, but neither of those variables or their store operations can be atomic in reality (yes, in the example they can be, but this is for simplification/visualization). I would like to avoid a mutex if I can (I don't want to detract from the question with why).
When I read up on Release ordering, it indicates strongly that the assignment to variable "a" would have to occur before the store to b:
When coupled with a store, all previous operations become ordered before any load of this value with Acquire (or stronger) ordering. In particular, all previous writes become visible to all threads that perform an Acquire (or stronger) load of this value. Notice that using this ordering for an operation that combines loads and stores leads to a Relaxed load operation! This ordering is only applicable for operations that can perform a store. Corresponds to memory_order_release in C++20.
However, it makes no guarantee that the assignment to variable c could not be moved before the store to variable b. Almost everything I read always says that stores/loads before the atomic operation are guaranteed to happen before but makes no guarantees about moving operations in the other direction across the boundary.
Am I correct in worrying that the assignment to variable c could be moved before the store to b if Release ordering is used?
I looked at other questions such as Which std::sync::atomic::Ordering to use? and other similar stack overflow questions, but they don't cover whether or not c can be moved before b using release as far as I can see.
In answer to my own question: yes, I should be worried that the assignment to c could be reordered before b, as the Release ordering only prevents a from being moved past b. By placing a fence with Release ordering between the assignments to b and c, I can further prevent c from being reordered before b (since it prevents b from moving after c, which is the same thing).
That all applies to the CPU store/load ordering. Whether or not the atomic store and the fence prevent the compiler from also moving those operations depends on the compiler and its documentation should be consulted.

Reasoning about IORef operation reordering in concurrent programs

The docs say:
In a concurrent program, IORef operations may appear out-of-order to another thread, depending on the memory model of the underlying processor architecture... The implementation is required to ensure that reordering of memory operations cannot cause type-correct code to go wrong. In particular, when inspecting the value read from an IORef, the memory writes that created that value must have occurred from the point of view of the current thread.
Which I'm not even entirely sure how to parse. Edward Yang says
In other words, “We give no guarantees about reordering, except that you will not have any type-safety violations.” ... the last sentence remarks that an IORef is not allowed to point to uninitialized memory
So... it won't break all of Haskell; not very helpful. The discussion from which the memory model example arose also left me with questions (even Simon Marlow seemed a bit surprised).
Things that seem clear to me from the documentation
within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered
What isn't clear to me
Will newIORef False >>= \v-> writeIORef v True >> readIORef v always return True?
In the maybePrint case (from the IORef docs) would a readIORef myRef (along with maybe a seq or something) before readIORef yourRef have forced a barrier to reordering?
What's the straightforward mental model I should have? Is it something like:
within and from the point of view of an individual thread, the ordering of IORef operations will appear sane and sequential; but the compiler may actually reorder operations in such a way that breaks certain assumptions in a concurrent system; however, when a thread does atomicModifyIORef, no threads will observe operations on that IORef that appeared above the atomicModifyIORef to happen after, and vice versa.
...? If not, what's the corrected version of the above?
If your response is "don't use IORef in concurrent code; use TVar" please convince me with specific facts and concrete examples of the kind of things you can't reason about with IORef.
I don't know Haskell concurrency, but I know something about memory models.
Processors can reorder instructions the way they like: loads may go ahead of loads, loads may go ahead of stores, loads of dependent stuff may go ahead of loads of stuff they depend on (a[i] may load the value from array first, then the reference to array a!), stores may be reordered with each other. You simply cannot put a finger on it and say "these two things definitely appear in a particular order, because there is no way they can be reordered". But in order for concurrent algorithms to operate, they need to observe the state of other threads. This is where it is important for thread state to proceed in a particular order. This is achieved by placing barriers between instructions, which guarantee the order of instructions to appear the same to all processors.
Typically (in one of the simplest models), you want two types of ordered instructions: an ordered load that does not go ahead of any other ordered loads or stores, and an ordered store that does not go ahead of any instructions at all, plus a guarantee that all ordered instructions appear in the same order to all processors. This way you can reason about the IRIW kind of problem:
Thread 1: x=1
Thread 2: y=1
Thread 3: r1=x;
          r2=y;
Thread 4: r4=y;
          r3=x;
If all of these operations are ordered loads and ordered stores, then you can conclude that the outcome (1,0,0,1)=(r1,r2,r3,r4) is not possible. Indeed, the ordered stores in Threads 1 and 2 should appear in some order to all threads, and r1=1,r2=0 is a witness that y=1 is executed after x=1. In turn, this means that Thread 4 can never observe r4=1 without observing r3=1 (which is executed after r4=1) (if the ordered stores happen to be executed that way, observing y==1 implies x==1).
Also, if the loads and stores were not ordered, the processors would usually be allowed to observe the assignments to appear even in different orders: one might see x=1 appear before y=1, the other might see y=1 appear before x=1, so any combination of values r1,r2,r3,r4 is permitted.
This is sufficiently implemented like so:
ordered load:
load x
load-load -- barriers stopping other loads to go ahead of preceding loads
load-store -- no one is allowed to go ahead of ordered load
ordered store:
load-store
store-store -- ordered store must appear after all stores
-- preceding it in program order - serialize all stores
-- (flush write buffers)
store x,v
store-load -- ordered loads must not go ahead of ordered store
-- preceding them in program order
Of these two, I can see that IORef implements an ordered store (atomicWriteIORef), but I don't see an ordered load (atomicReadIORef), without which you cannot reason about the IRIW problem above. This is not a problem if your target platform is x86, because all loads are executed in program order on that platform, and stores never go ahead of loads (in effect, all loads are ordered loads).
An atomic update (atomicModifyIORef) seems to me an implementation of a so-called CAS loop (a compare-and-set loop, which does not stop until a value is atomically set to b, if its value is a). You can see the atomic modify operation as a fusion of an ordered load and an ordered store, with all those barriers there, executed atomically - no processor is allowed to insert a modification instruction between the load and store of a CAS.
Furthermore, writeIORef is cheaper than atomicWriteIORef, so you want to use writeIORef as much as your inter-thread communication protocol permits. Whereas writeIORef x vx >> writeIORef y vy >> atomicWriteIORef z vz >> readIORef t does not guarantee the order in which the writeIORefs appear to other threads with respect to each other, there is a guarantee that they both will appear before the atomicWriteIORef - so, seeing z==vz, you can conclude that at this moment x==vx and y==vy, and that IORef t was loaded after the stores to x, y, z became observable by other threads. This latter point requires readIORef to be an ordered load, which is not provided as far as I can tell, but it will behave like an ordered load on x86.
Typically you don't use concrete values of x, y, z, when reasoning about the algorithm. Instead, some algorithm-dependent invariants about the assigned values must hold, and can be proven - for example, like in IRIW case you can guarantee that Thread 4 will never see (0,1)=(r3,r4), if Thread 3 sees (1,0)=(r1,r2), and Thread 3 can take advantage of this: this means something is mutually excluded without acquiring any mutex or lock.
An example (not in Haskell) that will not work if loads are not ordered, or ordered stores do not flush write buffers (the requirement to make written values visible before the ordered load executes).
Suppose z will show either x until y is computed, or y, if x has been computed too. Don't ask why; it is not very easy to see outside the context - it is a kind of a queue - just enjoy what sort of reasoning is possible.
Thread 1: x=1;
          if (z==0) compareAndSet(z, 0, y == 0 ? x : y);

Thread 2: y=2;
          if (x != 0) while ((tmp=z) != y && !compareAndSet(z, tmp, y));
So, two threads set x and y, then set z to x or y, depending on whether y or x were computed, too. Assuming initially all are 0. Translating into loads and stores:
Thread 1: store x,1
          load z
          if ==0 then
              load y
              if == 0 then load x  -- if loaded y is still 0, load x into tmp
              else load y          -- otherwise, load y into tmp
              CAS z, 0, tmp        -- CAS whatever was loaded in the previous if-statement
                                   -- the CAS may fail, but see explanation

Thread 2: store y,2
          load x
          if !=0 then
          loop: load z             -- into tmp
                load y
                if !=tmp then      -- compare loaded y to tmp
                    CAS z, tmp, y       -- attempt to CAS z: if it is still tmp, set to y
                    if ! then goto loop -- if CAS did not succeed, go to loop
If Thread 1's load z is not an ordered load, then it will be allowed to go ahead of the ordered store (store x). That means that wherever z is loaded to (a register, cache line, stack, ...), the value is one that existed before the value of x could become visible. Looking at that value is useless - you cannot then judge where Thread 2 is up to. For the same reason you've got to have a guarantee that the write buffers were flushed before load z executed - otherwise it will still appear as a load of a value that existed before Thread 2 could see the value of x. This is important, as will become clear below.
If Thread 2's load x or load z is not an ordered load, it may go ahead of store y, and will observe values that were written before y is visible to other threads.
However, see that if the loads and stores are ordered, then the threads can negotiate who is to set the value of z without contending for z. For example, if Thread 2 observes x==0, there is a guarantee that Thread 1 will definitely execute x=1 later, and will see z==0 after that - so Thread 2 can leave without attempting to set z.
If Thread 1 observes z==0, then it should try to set z to x or y. So, first it will check if y has been set already. If it wasn't set, it will be set in the future, so try to set to x - CAS may fail, but only if Thread 2 concurrently set z to y, so no need to retry. Similarly there is no need to retry if Thread 1 observed y has been set: if CAS fails, then it has been set by Thread 2 to y. Thus we can see Thread 1 sets z to x or y in accordance with the requirement, and does not contend z too much.
On the other hand, Thread 2 can check if x has been computed already. If not, then it will be Thread 1's job to set z. If Thread 1 has computed x, then need to set z to y. Here we do need a CAS loop, because a single CAS may fail, if Thread 1 is attempting to set z to x or y concurrently.
The important takeaway here is that if "unrelated" loads and stores are not serialized (including flushing write buffers), no such reasoning is possible. However, once loads and stores are ordered, both threads can figure out the path each of them will take in the future, and that way eliminate contention in half the cases. Most of the time x and y will be computed at significantly different times, so if y is computed before x, it is likely Thread 2 will not touch z at all. (Typically, "touching z" also possibly means "wake up a thread waiting on a cond_var z", so it is not only a matter of loading something from memory)
within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations" i.e. we get a partial ordering of: stuff above the atomic mod -> atomic mod -> stuff after. Although, the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
"is never observed" is standard language when discussing memory reordering issues. For example, a CPU may issue a speculative read of a memory location earlier than necessary, so long as the value doesn't change between when the read is executed (early) and when the read should have been executed (in program order). That's entirely up to the CPU and cache though, it's never exposed to the programmer (hence language like "is never observed").
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
True
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered
Correct, as that sequence has a data dependency. The value to be written depends upon the value returned from the first read.
For the other questions: newIORef False >>= \v-> writeIORef v True >> readIORef v will always return True (there's no opportunity for other threads to access the ref here).
In the maybePrint example, there's very little you can do to ensure this works reliably in the face of new optimizations added to future GHCs and across various CPU architectures. If you write:
writeIORef myRef True
x <- readIORef myRef
yourVal <- x `seq` readIORef yourRef
Even though GHC 7.6.3 produces correct cmm (and presumably asm, although I didn't check), there's nothing to stop a CPU with a relaxed memory model from moving the readIORef yourRef to before all of the myRef/seq stuff. The only 100% reliable way to prevent it is with a memory fence, which GHC doesn't provide. (Edward's blog post does go through some of the other things you can do now, as well as why you may not want to rely on them.)
I think your mental model is correct, however it's important to know that the possible apparent reorderings introduced by concurrent ops can be really unintuitive.
Edit: at the cmm level, the code snippet above looks like this (simplified, pseudocode):
[StackPtr+offset] := True
x := [StackPtr+offset]
if (notEvaluated x) (evaluate x)
yourVal := [StackPtr+offset2]
So there are a couple things that can happen. GHC as it currently stands is unlikely to move the last line any earlier, but I think it could if doing so seemed more optimal. I'm more concerned that, if you compile via LLVM, the LLVM optimizer might replace the second line with the value that was just written, and then the third line might be constant-folded out of existence, which would make it more likely that the read could be moved earlier. And regardless of what GHC does, most CPU memory models allow the CPU itself to move the read earlier absent a memory barrier.
See http://en.wikipedia.org/wiki/Memory_ordering for non-atomic concurrent reads and writes. (Basically, when you don't use atomics, just look at the memory ordering model for your target CPU.)
Currently GHC can be regarded as not reordering your reads and writes for non-atomic (and imperative) loads and stores. However, GHC Haskell currently doesn't specify any sort of concurrent memory model, so those non-atomic operations will have the ordering semantics of the underlying CPU model, as I link to above.
In other words, currently GHC has no formal concurrency memory model, and because optimization algorithms tend to be defined with respect to some model of equivalence, there's no reordering currently in play there.
That is: the only semantic model you can have right now is "the way it's implemented".
Shoot me an email! I'm working on patching up atomics for 7.10; let's try to cook up some semantics!
Edit: some folks who understand this problem better than I do chimed in on the ghc-users thread here: http://www.haskell.org/pipermail/glasgow-haskell-users/2013-December/024473.html.
Assume that I'm wrong in both this comment and anything I said in the ghc-users thread :)

Is there a way I can make two reads atomic?

I'm running into a situation where I need the atomic sum of two values in memory. The code I inherited goes like this:
int a = *MemoryLocationOne;
memory_fence();
int b = *MemoryLocationTwo;
return (a + b) == 0;
The individual reads of a and b are atomic, and all writes elsewhere in the code to these two memory locations are also lockless atomic. However the problem is that the values of the two locations can and do change between the two reads.
So how do I make this operation atomic? I know all about CAS, but it tends to only involve making read-modify-write operations atomic and that's not quite what I want to do here.
Is there a way to do it, or is the best option to refactor the code so that I only need to check one value?
Edit: Thanks, I didn't mention that I wanted to do this locklessly in the first revision, but some people picked up on it after my second revision. I know no one believes people when they say things like this, but I can't use locks practically. I'd have to emulate a mutex with atomics and that'd be more work than refactoring the code to keep track of one value instead of two.
For now my method of investigation involves taking advantage of the fact that the values are consecutive and grabbing them atomically with a 64 bit read, which I'm assured are atomic on my target platforms. If anyone has new ideas, please contribute! Thanks.
If you truly need to ensure that a and b don't change while you are doing this test, then you need to use the same synchronization for all access to a and b. That's your only choice. Each read and each write to either of these values needs to use the same memory fence, synchronizer, semaphore, timeslice lock, or whatever mechanism is used.
With this, you can ensure that if you:
memory_fence_start();
int a = *MemoryLocationOne;
int b = *MemoryLocationTwo;
int test = (a + b) == 0;
memory_fence_stop();
return test;
then a will not change while you are reading b. But again, you have to use the same synchronization mechanism for all access to a and to b.
To reflect a later edit to your question that you are looking for a lock-free method, well, it depends entirely on the processor you are using and on how long a and b are and on whether or not these memory locations are consecutive and aligned properly.
Assuming these are consecutive in memory and 32 bits each and that your processor has an atomic 64-bit read, then you can issue an atomic 64-bit read to read the two values in, parse the two values out of the 64-bit value, do the math and return what you want to return. Assuming you never need an atomic update to "a and b at the same time" but only atomic updates to "a" or to "b" in isolation, then this will do what you want without locks.
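A hedged C++ sketch of that packed-64-bit idea (names invented; it assumes the two values are 32 bits each and that std::atomic<uint64_t> is lock-free on the target): a single 64-bit load returns a consistent pair, and each writer updates its half with a CAS loop.

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> packed{0};    // low 32 bits: a, high 32 bits: b

void write_a(std::uint32_t a) {          // update one half atomically
    std::uint64_t old = packed.load();
    while (!packed.compare_exchange_weak(
               old, (old & 0xFFFFFFFF00000000ull) | a)) {
        // old is refreshed on failure; retry with the new value
    }
}

bool sum_is_zero() {                     // one atomic 64-bit read gives a consistent (a, b) pair
    std::uint64_t v = packed.load();
    std::int32_t a = static_cast<std::int32_t>(v & 0xFFFFFFFFu);
    std::int32_t b = static_cast<std::int32_t>(v >> 32);
    return a + b == 0;
}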
You would have to ensure that everywhere either of the two values were read or written, they were surrounded by a memory barrier (lock or critical section).
// all reads...
lock(lockProtectingAllAccessToMemoryOneAndTwo)
{
    a = *MemoryLocationOne;
    b = *MemoryLocationTwo;
}
...
// all writes...
lock(lockProtectingAllAccessToMemoryOneAndTwo)
{
    *MemoryLocationOne = someValue;
    *MemoryLocationTwo = someOtherValue;
}
If you are targeting x86, you can use the 64-bit compare/exchange support and pack both int's into a single 64-bit word.
On Windows, you would do this:
#include <windows.h> // for LONGLONG and InterlockedCompareExchange64

// Skipping ensuring padding.
union Data
{
    struct
    {
        int a;
        int b;
    } members;
    LONGLONG _64bitData;
};

Data* data;
Data captured;
int result;
do
{
    captured = *data;
    result = captured.members.a + captured.members.b;
} while (InterlockedCompareExchange64(&data->_64bitData,
                                      captured._64bitData,
                                      captured._64bitData) != captured._64bitData);
Really ugly. I'd suggest using a lock - much more maintainable.
EDIT:
To update and read the individual parts:
data->members.a = 0;
fence();
data->members.b = 0;
fence();
int capturedA = data->members.a;
int capturedB = data->members.b;
There really is no way to do this without a lock. No processors have a double atomic read, as far as I know.

Resources