Does `isync` prevent Store-Load reordering on PowerPC CPUs? - multithreading

As is known, PowerPC has a weak memory model that permits any reordering: Store-Store, Load-Store, Store-Load, Load-Load.
There are at least three fences:
hwsync (or sync) - full memory barrier, prevents any reordering
lwsync - memory barrier that prevents Load-Load, Store-Store and Load-Store reordering
isync - instruction barrier: https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.alangref/idalangref_isync_ics_instrs.htm
For example, can the Store (stwcx.) and the Load (lwz) be reordered in this code? https://godbolt.org/g/84t5jM
lwarx 9,0,10
addi 9,9,2
stwcx. 9,0,10
bne- 0,.L2
isync
lwz 9,8(1)
As is known, isync prevents reordering of lwarx, bne <--> any following instructions.
But does isync prevent reordering of stwcx., bne <--> any following instructions?
I.e., can the Store (stwcx.) begin earlier than the following Load (lwz) but finish, i.e. be performed, later than that Load?
I.e., can the Store (stwcx.) place its value into the store buffer before the following Load (lwz) has begun, while the actual store to the cache that is visible to all CPU cores occurs after the Load has finished?
As we see from the following documents, articles and books:
isync is not a memory fence, it is only an instruction fence.
isync does not force all external accesses to complete with respect to other processors and mechanisms that access memory.
isync does not wait for all other processors to detect storage accesses.
isync is a very low-overhead and very weak fence (weaker than lwsync and hwsync).
isync does not guarantee that previous and future stores will be perceived by other processors in the locally issued order - that requires one of the sync instructions.
isync is an acquire barrier, but as we know, acquire applies only to Load operations, not to Stores (stwcx.).
isync does not affect data accesses and does not wait for all stores to be performed.
The main question. Initially: a=0, b=0.
If CPU-Core-0 does: stwcx. [a]=1; bne-; isync; lwz [b]
and CPU-Core-1 does: hwsync; stw [b]=1; hwsync; lwz [a]; hwsync,
can Core-0 then see [b]==1 while Core-1 sees [a]==0?
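For reference, here is a rough C++ rendering of this litmus test (my own sketch, not part of the original question). The lwarx/stwcx. pair is simplified to a plain relaxed store, isync is approximated by an acquire fence, and Core-1's hwsync pairs by seq_cst operations, so this is an analogy rather than an exact mapping:

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> a{0}, b{0};
int r0, r1;

void core0() {
    a.store(1, std::memory_order_relaxed);                // stwcx. [a]=1 (LL/SC loop simplified away)
    std::atomic_thread_fence(std::memory_order_acquire);  // bne-; isync (loose analogue)
    r0 = b.load(std::memory_order_relaxed);               // lwz [b]
}

void core1() {
    b.store(1, std::memory_order_seq_cst);                // hwsync; stw [b]=1; hwsync
    r1 = a.load(std::memory_order_seq_cst);               // lwz [a]; hwsync
}

int main() {
    std::thread t0(core0), t1(core1);
    t0.join(); t1.join();
    // The outcome the question asks about: Core-0 saw [b]==1 while Core-1 saw [a]==0,
    // i.e. Core-0's store became visible only after its own later load.
    std::printf("r0=%d r1=%d%s\n", r0, r1,
                (r0 == 1 && r1 == 0) ? "  <- Store-Load reordering observed" : "");
}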
Also:
https://www.ibm.com/developerworks/systems/articles/powerpc.html
The isync prevents speculative execution from accessing the data block
before the flag has been set. And in conjunction with the preceding
load, compare, and conditional branch instructions, the isync
guarantees that the load on which the branch depends (the load of the
flag) is performed prior to any loads that occur subsequent to the
isync (loads from the shared block).
isync is not a memory barrier instruction, but the
load-compare-conditional branch-isync sequence can provide this
ordering property.
http://www.nxp.com/assets/documents/data/en/application-notes/AN2540.pdf
Unlike isync, sync forces all external accesses to complete with
respect to other processors and mechanisms that access memory.
"Storage in the PowerPC", Janice M. Stone, Robert P. Fitzgerald, 1995: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.4033&rep=rep1&type=pdf
Unlike sync, isync does not wait for all other processors to detect
storage accesses. isync is a less conservative fence than sync because
it does not delay until all processors detect previous loads and
stores.
http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
bc;isync: this is a very low-overhead and very weak form of memory
fence. A specific set of preceding loads on which the bc (branch
conditional) instruction depends are guaranteed to have completed
before any subsequent instruction begins execution. However,
store-buffer and cache-state effects can nevertheless make it appear
that subsequent loads occur before the preceding loads upon which the
twi instruction depends. That said, the PowerPC architecture does not
permit stores to be executed speculatively, so any store following the
twi;isync instruction is guaranteed to happen after any of the loads
on which the bc depends.
https://books.google.ru/books?id=TKOfDQAAQBAJ&pg=PA264&lpg=PA264&dq=isync+store+load&source=bl&ots=-4FyWvxTwg&sig=r1fitaG-Q3GHOxvSMTgLJMBVGUU&hl=ru&sa=X&ved=0ahUKEwiKjYK97urTAhUJ_iwKHbfMA58Q6AEIOjAC#v=onepage&q=isync%20store%20load&f=false
https://books.google.ru/books?id=gZZgAQAAQBAJ&pg=PA71&lpg=PA71&dq=isync+store+load&source=bl&ots=bo6nTLdzEZ&sig=vCjoDmUWhn0buN_uMf8XgbDzCf4&hl=ru&sa=X&ved=0ahUKEwiKjYK97urTAhUJ_iwKHbfMA58Q6AEIcTAJ#v=onepage&q=isync%20store%20load&f=false
https://books.google.ru/books?id=G2fmCgAAQBAJ&pg=PA321&lpg=PA321&dq=isync+store+load&source=bl&ots=YS4mE-4f_F&sig=OVwaJYE-SNnor-KtKrjlkOd6AOs&hl=ru&sa=X&ved=0ahUKEwiKjYK97urTAhUJ_iwKHbfMA58Q6AEIYjAH#v=onepage&q&f=false
http://www.nxp.com/assets/documents/data/en/application-notes/AN3441.pdf
Note that isync does not affect data accesses and does not wait for
all stores to be performed.
Page 77: https://www.setphaserstostun.org/power8/POWER8_UM_v1.3_16MAR2016_pub.pdf
3.5.7.2 Instruction Cache Block Invalidate (icbi)
As a result of this and other implementation-specific design
optimizations, instead of requiring the instruction sequence specified
by the Power ISA to be executed on a per cache-line basis, software
must only execute a single sequence of three instructions to make any
previous code modifications become visible: sync, icbi (to any
address), isync.
ANSWER:
So, isync does not guarantee Store-Load ordering: because "isync is not a memory barrier instruction", it does not guarantee that previous stores become visible to other CPU cores (in a sequentially consistent way) before the next instruction finishes. The instruction-synchronization instruction isync guarantees only the order in which instructions start, not the order in which they complete, i.e. not the order in which their effects become visible to other CPU cores. Thus, isync allows the visible effects of the Store and the Load to be reordered in the code stwcx. [a]=1; bne-; isync; lwz [b].
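To make this concrete, here is a hedged C++ sketch of the same distinction in terms of standard fences (my own illustration, not part of the original answer). On POWER, compilers typically emit hwsync for a seq_cst fence and lwsync or a cmp; bc; isync sequence for an acquire fence; the exact instruction choice is an implementation detail, and the acquire fence is only a loose analogue of the bc;isync idiom:

#include <atomic>

std::atomic<int> x{0}, y{0};

int not_ordered() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // ~ isync: does not order the Store against the Load
    return y.load(std::memory_order_relaxed);             // may be satisfied before the store is globally visible
}

int ordered() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // ~ hwsync: full barrier, Store-Load forbidden
    return y.load(std::memory_order_relaxed);             // cannot appear to happen before the store
}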

As you have guessed and most of your excellent sources imply, there are two properties of a memory access involved here:
Visibility
Whether other processors can observe the memory access.
The use of processor-specific buffers or caches can make a store complete on a processor yet make it not visible to other ones.
Ordering
When the memory access is executed with respect to other instructions on the same processor.
Ordering is an intra-processor aspect of a memory access, it controls the out-of-order capability of a processor.
Ordering cannot be done with respect to other processors' instructions.
Visibility is an inter-processor aspect, it ensures that the side effects of a memory access are visible to other processors (or in general, to other agents).
A store's primary side effect is changing a memory location.
By controlling both aspects it is possible to enforce an inter-processor ordering, that is, the order in which other processors see a sequence of memory accesses.
It usually goes unsaid that the word "ordering" refers to this second meaning unless it is used in a context where no other agents are present.
It is admittedly a confusing terminology.
Beware that I'm not confident with the PowerPC architecture, I'm just applying the theory with the help of a few official documents found online and the quotes you provided.
isync, just like sc and rfi, is a Context-Synchronizing instruction; the main purpose of these instructions is to guarantee that subsequent instructions execute in the context established by the previous ones.
For example, executing a system call changes the context and we don't want the privileged code to execute in an unprivileged context and vice versa.
These instructions wait for all previously dispatched instructions to be completed, but not to be visible:
All previously issued instructions have completed, at least to a point where they can no longer
cause an exception.
However, memory accesses that these instructions cause need not have
completed with respect to other processors and mechanisms.
So, depending on what you mean by reordering, isync does or does not prevent Load-Load, Load-Store etc. reordering.
It does prevent any such reordering from the perspective of the processor it is executed on (intra-processor reordering) - all previous loads and stores are completed before isync completes, but they are not necessarily visible.
It does not prevent reordering from the perspective of other processors (inter-processor reordering), as it doesn't ensure the visibility of previous instructions.
But does isync prevent reordering stwcx.,bne <--> any following instructions?
Only intra-process reordering.
I.e., can the Store (stwcx.) begin earlier than the following Load (lwz) but finish later than that Load?
Not from the point of view of the processor executing them: stwcx. is completed by the time lwz begins, but, using Intel terminology, it is only completed locally - other processors may not see it as completed by the time lwz begins.
I.e., can the Store (stwcx.) place its value into the store buffer before the following Load (lwz) has begun, while the actual store to the cache that is visible to all CPU cores occurs after the Load has finished?
Yes, exactly.

Related

Does memory fencing block threads in multi-core CPUs?

I was reading the Intel 64-ia-32 instruction set guide
to get an idea of memory fences. My question is: with SFENCE, for example, in order to make sure that all store operations are globally visible, does the multi-core CPU park all the threads running on other cores until cache coherence is achieved?
Barriers don't make other threads/cores wait. They make some operations in the current thread wait, depending on what kind of barrier it is. Out-of-order execution of non-memory instructions isn't necessarily blocked.
Barriers don't even make your loads/stores visible to other threads any faster; CPU cores already commit retired stores from the store buffer to L1d cache as fast as possible. (After all the necessary MESI coherency rules have been followed, and x86's strong memory model only allows stores to commit in program order even without barriers).
Barriers don't necessarily order instruction execution, they order global visibility, i.e. what comes out the far end of the store buffer.
mfence (or a locked operation like lock add or xchg [mem], reg) makes all later loads/stores in the current thread wait until all previous loads and stores are completed and globally visible (i.e. the store buffer is flushed).
mfence on Skylake is implemented in a way that stalls the whole core until the store buffer drains. See my answer on
Are loads and stores the only instructions that gets reordered? for details; this extra slowdown was to fix an erratum. But locked operations and xchg aren't like that on Skylake; they're full memory barriers but they still allow out-of-order execution of imul eax, edx, so we have proof that they don't stall the whole core.
With hyperthreading, I think this stalling happens per logical thread, not the whole core.
But note that the mfence manual entry doesn't say anything about stalling the core, so future x86 implementations are free to make it more efficient (like a lock or dword [rsp], 0), and only prevent later loads from reading L1d cache without blocking later non-load instructions.
sfence only does anything if there are any NT stores in flight. It doesn't order loads at all, so it doesn't have to stop later instructions from executing. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.
It just places a barrier in the store buffer that stops NT stores from reordering across it, and forces earlier NT stores to be globally visible before the sfence barrier can leave the store buffer (i.e. write-combining buffers have to flush). But the sfence itself can already have retired from the out-of-order execution part of the core (the ROB, or ReOrder Buffer) before it reaches the end of the store buffer.
See also Does a memory barrier ensure that the cache coherence has been completed?
lfence as a memory barrier is nearly useless: it only prevents movntdqa loads from WC memory from reordering with later loads/stores. You almost never need that.
The actual use-cases for lfence mostly involve its Intel (but not AMD) behaviour that it doesn't allow later instructions to execute until it itself has retired. (so lfence; rdtsc on Intel CPUs lets you avoid having rdtsc read the clock too soon, as a cheaper alternative to cpuid; rdtsc)
Another important recent use-case for lfence is to block speculative execution (e.g. before a conditional or indirect branch), for Spectre mitigation. This is completely based on its Intel-guaranteed side effect of being partially serializing, and has nothing to do with its LoadLoad + LoadStore barrier effect.
lfence does not have to wait for the store buffer to drain before it can retire from the ROB, so no combination of LFENCE + SFENCE is as strong as MFENCE. Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?
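As a small illustration of the lfence; rdtsc use-case mentioned above (my own sketch; the function name is illustrative), using the GCC/Clang intrinsics from <x86intrin.h>; note that the serializing behaviour relied on here is the Intel-documented one:

#include <x86intrin.h>
#include <cstdint>
#include <cstdio>

static inline uint64_t timestamp_serialized() {
    _mm_lfence();     // keep rdtsc from executing before earlier instructions have finished
    return __rdtsc(); // read the time-stamp counter
}

int main() {
    uint64_t t0 = timestamp_serialized();
    // ... code being timed ...
    uint64_t t1 = timestamp_serialized();
    std::printf("elapsed reference TSC ticks: %llu\n", (unsigned long long)(t1 - t0));
}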
Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (when writing in C++ instead of asm).
Note that the C++ intrinsics like _mm_sfence also block compile-time memory ordering. This is often necessary even when the asm instruction itself isn't, because C++ compile-time reordering happens based on C++'s very weak memory model, not the strong x86 memory model which applies to the compiler-generated asm.
So _mm_sfence may make your code work, but unless you're using NT stores it's overkill. A more efficient option would be std::atomic_thread_fence(std::memory_order_release) (which turns into zero instructions, just a compiler barrier.) See http://preshing.com/20120625/memory-ordering-at-compile-time/.
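A hedged sketch of the one situation where sfence genuinely matters: publishing data written with non-temporal (NT) stores. The buffer/flag names are illustrative; the intrinsics are the standard SSE2 ones:

#include <emmintrin.h>   // _mm_stream_si32, _mm_sfence
#include <atomic>

int buffer[1024];
std::atomic<bool> ready{false};

void producer() {
    for (int i = 0; i < 1024; ++i)
        _mm_stream_si32(&buffer[i], i);           // NT stores: bypass cache, weakly ordered
    _mm_sfence();                                  // drain write-combining buffers first
    ready.store(true, std::memory_order_release);  // now the release-store alone is enough
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }
    // buffer[] contents are guaranteed to be visible here
}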

What is the opposite of a "full memory barrier"?

I sometimes see the term "full memory barrier" used in tutorials about memory ordering, which I think means the following:
If we have the following instructions:
instruction 1
full_memory_barrier
instruction 2
Then instruction 1 is not allowed to be reordered to below full_memory_barrier, and instruction 2 is not allowed to be reordered to above full_memory_barrier.
But what is the opposite of a full memory barrier, I mean is there something like a "semi memory barrier" that only prevent the CPU from reordering instructions in one direction?
If there is such a memory barrier, I don't see its point, I mean if we have the following instructions:
instruction 1
memory_barrier_below_to_above
instruction 2
Assume that memory_barrier_below_to_above is a memory barrier that prevents instruction 2 from being reordered to above memory_barrier_below_to_above, so the following will not be allowed:
instruction 2
instruction 1
memory_barrier_below_to_above
But the following will be allowed (which makes this type of memory barrier pointless):
memory_barrier_below_to_above
instruction 2
instruction 1
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ explains different kinds of barriers, like LoadLoad or StoreStore. A StoreStore barrier only prevents stores from reordering across the barrier, but loads can still execute out of order.
On real CPUs, any barriers that include StoreLoad block everything else, too, and thus are called "full barriers". StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache.
Barrier examples:
          strong    weak
x86       mfence    none needed unless you're using NT stores
ARM       dmb sy    isb, dmb st, dmb ish, etc.
POWER     hwsync    lwsync, isync, ...
ARM has "inner" and "outer shareable domains". I don't really know what that means, haven't had to deal with it, but this page documents the different forms of Data Memory Barrier available. dmb st only waits for earlier stores to complete, so I think it's only a StoreStore barrier, and thus too weak for a C++11 release-store which also needs to order earlier loads against LoadStore reordering. See also C/C++11 mappings to processors: note that seq-cst can be achieved with full-barriers around every store, or with barriers before loads as well as before stores. Making loads cheap is usually best, though.
ARM ISB flushes the instruction caches. (ARM doesn't have coherent i-cache, so after writing code to memory, you need an ISB before you can reliably jump there and execute those bytes as instructions.)
POWER has a large selection of barriers available, including Light-Weight (non-full barrier) and Heavy-Weight Sync (full barrier) mentioned in Jeff Preshing's article linked above.
A one-directional barrier is what you get from a release-store or an acquire-load. A release-store at the end of a critical section (e.g. to unlock a spinlock) has to make sure loads/stores inside the critical section don't appear later, but it doesn't have to delay later loads until after the lock=0 becomes globally visible.
Jeff Preshing has an article about this, too: Acquire and Release semantics
The "full" vs. "partial" barrier terminology is not usually used for the one-way reordering restriction of a release-store or acquire-load. An actual release fence (in C++11, std::atomic_thread_fence(std::memory_order_release)) does block reordering of stores in both directions, unlike a release-store on a specific object.
This subtle distinction has caused confusion in the past (even among experts!). Jeff Preshing has yet another excellent article explaining it: Acquire and Release Fences Don't Work the Way You'd Expect.
You're right that a one-way barrier that wasn't tied to a store or a load wouldn't be very useful; that's why such a thing doesn't exist. :P It could reorder an unbounded distance in one direction and leave all the operations to reorder with each other.
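As an illustration of those one-way barriers (my own sketch, names illustrative), a minimal spinlock shows both halves: the acquire side keeps critical-section accesses from floating up above the lock acquisition, and the release side keeps them from sinking below the unlock, while later operations are still free to move up into the critical section:

#include <atomic>

std::atomic<int> lock_word{0};

void lock() {
    // acquire: later loads/stores may not move above this RMW
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) { /* spin */ }
}

void unlock() {
    // release: earlier loads/stores may not move below this store,
    // but operations after it are free to move up before it
    lock_word.store(0, std::memory_order_release);
}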
What exactly does atomic_thread_fence(memory_order_release) do?
C11 (n1570 Section 7.17.4 Fences) only defines it in terms of creating a synchronizes-with relationship with an acquire-load or acquire fence, when the release-fence is used before an atomic store (relaxed or otherwise) to the same object the load accesses. (C++11 has basically the same definition, but discussion with #EOF in comments brought up the C11 version.)
This definition in terms of the net effect, not the mechanism for achieving it, doesn't directly tell us what it does or doesn't allow. For example, subsection 3 says
3) A release fence A synchronizes with an atomic operation B that performs an acquire
operation on an atomic object M if there exists an atomic operation X such that A is
sequenced before X, X modifies M, and B reads the value written by X or a value written
by any side effect in the hypothetical release sequence X would head if it were a release
operation
So in the writing thread, it's talking about code like this:
stuff // including any non-atomic loads/stores
atomic_thread_fence(mo_release) // A
M=X // X
// threads that see load(M, acquire) == X also see stuff
The syncs-with means that threads which see the value from M=X (directly or indirectly through a release-sequence) also see all the stuff and read non-atomic variables without Data Race UB.
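Here is a runnable C++ counterpart of that pseudocode (my own sketch; the names stuff and M are carried over from above, everything else is illustrative):

#include <atomic>
#include <thread>
#include <cassert>

int stuff = 0;            // non-atomic payload
std::atomic<int> M{0};

void writer() {
    stuff = 42;                                            // "stuff"
    std::atomic_thread_fence(std::memory_order_release);   // A
    M.store(1, std::memory_order_relaxed);                 // X (relaxed is enough)
}

void reader() {
    while (M.load(std::memory_order_acquire) != 1) { }     // B, the acquire-load
    assert(stuff == 42);   // guaranteed: B synchronizes-with the release fence A
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}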
This lets us say something about what is / isn't allowed:
It's a 2-way barrier for atomic stores. They can't cross it in either direction, so the barrier's location in this thread's memory order is bounded by atomic stores before and after. Any earlier store can be part of stuff for some M, any later store can be the M that an acquire-load (or load + acquire-fence) synchronizes with.
It's a one-way barrier for atomic loads: earlier ones need to stay before the barrier, but later ones can move above the barrier. M=X can only be a store (or the store part of a RMW).
It's a one-way barrier for non-atomic loads/stores: non-atomic stores can be part of the stuff, but can't be X because they're not atomic. It's ok to allow later loads / stores in this thread to appear to other threads before the M=X. (If a non-atomic variable is modified before and after the barrier, then nothing could safely read it even after a syncs-with this barrier, unless there's also a way for a reader to stop this thread from continuing on and creating Data Race UB. So a compiler can and should reorder foo=1; fence(release); foo=2; into foo=2; fence(release);, eliminating the dead foo=1 store. But sinking foo=1 to after the barrier is only legal on the technicality that nothing could tell the difference without UB.)
As an implementation detail, a C11 release fence may be stronger than this (e.g. a 2-way barrier for more kinds of compile-time reordering), but not weaker. On some architectures (like ARM), the only option that's strong enough might be a full barrier asm instruction. And for compile-time reordering restrictions, a compiler might not allow these 1-way reorderings just to keep the implementation simple.
Mostly this combined 2-way / 1-way nature only matters for compile-time reordering. CPUs don't make the distinction between atomic vs. non-atomic stores. Non-atomic is always the same asm instruction as relaxed atomic (for objects that fit in a single register).
CPU barrier instructions that make a core wait until earlier operations are globally visible are typically 2-way barriers; they're specified in terms of operations becoming globally visible in a coherent view of memory shared by all cores, rather than the C/C++11 style of creating syncs-with relations. (Beware that operations can potentially become visible to some other threads before they become globally visible to all threads: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?.
But with just barriers against reordering within a physical core, sequential consistency can be recovered.)
A C++11 release-fence needs LoadStore + StoreStore barriers, but not LoadLoad. A CPU that lets you get just those 2 but not all 3 of the "cheap" barriers would let loads reorder in one direction across the barrier instruction while blocking stores in both directions.
Weakly-ordered SPARC is in fact like this, and uses the LoadStore and so on terminology (that's where Jeff Preshing took the terminology for his articles). http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/ shows how they're used. (More recent SPARCs use a TSO (Total Store Order) memory model. I think this is like x86, where the hardware gives the illusion of memory ops happening in program order except for StoreLoad reordering.)

Does an x86 CPU reorder instructions?

I have read that some CPUs reorder instructions, but this is not a problem for single-threaded programs (the instructions would still be reordered in single-threaded programs, but it would appear as if the instructions were executed in order); it is only a problem for multithreaded programs.
To solve the problem of instructions reordering, we can insert memory barriers in the appropriate places in the code.
But does an x86 CPU reorder instructions? If it does not, then there is no need to use memory barriers, right?
Reordering
Yes, all modern x86 chips from Intel and AMD aggressively reorder instructions across a window which is around 200 instructions deep on recent CPUs from both manufacturers (i.e. a new instruction may execute while an older instruction more than 200 instructions "in the past" is still waiting). This is generally all invisible to a single thread since the CPU still maintains the illusion of serial execution1 by the current thread by respecting dependencies, so from the point of view of the current thread of execution it is as-if the instructions were executed serially.
Memory Barriers
That should answer the titular question, but then your second question is about memory barriers. It contains, however, an incorrect assumption that instruction reordering necessarily causes (and is the only cause of) visible memory reordering. In fact, instruction reordering is neither sufficient nor necessary for cross-thread memory re-ordering.
Now it is definitely true that out-of-order execution is a primary driver of out-of-order memory access capabilities, or perhaps it is the quest for MLP (Memory Level Parallelism) that drives the increasingly powerful out-of-order abilities of modern CPUs. In fact, both are probably true at once: increasing out-of-order capabilities benefit a lot from strong memory reordering capabilities, and at the same time aggressive memory reordering and overlapping isn't possible without good out-of-order capabilities, so they help each other in a kind of self-reinforcing, sum-greater-than-its-parts loop.
So yes, out-of-order execution and memory reordering certainly have a relationship; however, you can easily get re-ordering without out-of-order execution! For example, a core-local store buffer often causes apparent reordering: at the point of execution the store isn't written directly to the cache (and hence isn't visible at the coherency point), which delays local stores with respect to local loads which need to read their values at the point of execution.
As Peter points out in the comment thread, you can also get a type of load-load reordering when loads are allowed to overlap in an in-order design: load 1 may start, but in the absence of an instruction consuming its result a pipelined in-order design may proceed to the following instructions, which might include another load 2. If load 2 is a cache hit and load 1 was a cache miss, load 2 might be satisfied earlier in time than load 1 and hence the apparent order may be swapped, i.e. re-ordered.
So we see that not all cross-thread memory re-ordering is caused by instruction re-ordering, but does certain instruction re-ordering also imply out-of-order memory access? Not so fast! There are two different contexts here: what happens at the hardware level (i.e., whether memory access instructions can, as a practical matter, execute out-of-order), and what is guaranteed by the ISA and platform documentation (often called the memory model applicable to the hardware).
x86 re-ordering
In the case of x86, for example, modern chips will freely re-order more or less any stream of loads and stores with respect to each other: if a load or store is ready to execute, the CPU will usually attempt it, despite the existence of earlier uncompleted load and store operations.
At the same time, x86 defines quite a strict memory model, which bans most possible reorderings, roughly summarized as follows:
Stores have a single global order of visibility, observed consistently by all CPUs, subject to one loosening of this rule below.
Local load operations are never reordered with respect to other local load operations.
Local store operations are never reordered with respect to other local store operations (i.e., a store that appears earlier in the instruction stream always appears earlier in the global order).
Local load operations may be reordered with respect to earlier local store operations, such that the load appears to execute earlier wrt the global store order than the local store, but the reverse (earlier load, older store) is not true.
So actually most memory re-orderings are not allowed: loads with respect to each other, stores with respect to each other, and loads with respect to later stores. Yet I said above that x86 pretty much freely executes all memory access instructions out of order - how can you reconcile these two facts?
Well, x86 does a bunch of extra work to track exactly the original order of loads and stores, and makes sure that no memory re-ordering that breaks the rules is ever visible. For example, let's say load 2 executes before load 1 (load 1 appears earlier in program order), but both involved cache lines were in the "exclusively owned" state during the period that load 1 and load 2 executed: there has been reordering, but the local core knows that it cannot be observed because no other core was able to peek into this local operation.
In concert with the above optimizations, CPUs also use speculative execution: execute everything out of order, even if it is possible that at some later point some core can observe the difference, but don't actually commit the instructions until such an observation is impossible. If such an observation does occur, you roll back the CPU to an earlier state and try again. This is the cause of the "memory ordering machine clear" event on Intel.
So it is possible to define an ISA that doesn't allow any re-ordering at all, but under the covers still do re-ordering while carefully checking that it isn't observed. PA-RISC is an example of such a sequentially consistent architecture. Intel has a strong memory model that allows one type of reordering but disallows many others, yet each chip may internally do more (or less) re-ordering as long as it can guarantee to play by the rules in an observable sense (in this sense, it is somewhat related to the "as-if" rule that compilers play by when it comes to optimizations).
The upshot of all that is that yes, x86 requires memory barriers to prevent specifically the so-called StoreLoad re-ordering (for algorithms that require this guarantee). You don't find many standalone memory barriers in practice in x86, because most concurrent algorithms also need atomic operations, such as atomic add, test-and-set or compare-and-exchange, and on x86 those all come with full barriers for free. So the use of explicit memory barrier instructions like mfence is limited to cases where you aren't also doing an atomic read-modify-write operation.
Jeff Preshing's Memory Reordering Caught in the Act has one example that does show memory reordering on real x86 CPUs, and that mfence prevents it.
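As a hedged illustration of where that StoreLoad barrier shows up in compiler output (my sketch, not from the answer above): with typical compilers a seq_cst store on x86 becomes xchg (or mov plus mfence), i.e. a full barrier, while a release store stays a plain mov; the exact codegen varies by compiler and version:

#include <atomic>

std::atomic<int> flag{0};
std::atomic<int> other{0};

int seq_cst_version() {
    flag.store(1, std::memory_order_seq_cst);    // full barrier: the later load cannot pass the store
    return other.load(std::memory_order_seq_cst);
}

int release_version() {
    flag.store(1, std::memory_order_release);    // plain mov: the later load may execute first (StoreLoad)
    return other.load(std::memory_order_acquire);
}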
1 Of course if you try hard enough, such reordering is visible! A high-impact recent example of that would be the Spectre and Meltdown exploits, which exploited speculative out-of-order execution and a cache side channel to violate memory protection security boundaries.

How does the cache coherency protocol enforce atomicity?

I understand atomicity can be guaranteed on operations like xsub(), without using the LOCK prefix, by relying on the cache coherency protocol (MESI/MESIF).
1) How can the cache coherency protocol do this???
Its making me wonder if the cache coherency protocol can enforce atomicity, why do we need special atomic types/instructions etc?
2) If MOSI implements atomic instructions across multi-core systems then what is the purpose of LOCK? Legacy?
3) If MOSI implements atomic instructions and MOSI is used for all instructions- then why do atomic instructions cost so much? Surely the performance should be same as normal instructions.
Atomicity and Memory Ordering
For an operation to be atomic it must appear to be one undivided operation to any observer. That observer can be anything that can see the effect of the operation, whether it's the thread that performs the operation, a different thread on the same processor, a thread on a different processor, or some component or device in the system. Observers that can't see the effect of the operation, whether the same thread, a different thread, or a device, don't affect whether the operation is atomic or not.
(Note that by processor I mean what Intel's documentation would call a logical processor. A system with two CPU sockets, each populated with a quad-core CPU with two logical processors per core would have a total of 16 processors.)
A related but different concept is memory ordering. Memory accesses are only sequentially consistent if they appear to an observer as happening in the order they occur in the program. This guarantee always applies when the observer is the same thread that performed the operations. Other more limited guarantees of memory ordering are possible. A strong but not sequentially consistent ordering might guarantee many sorts of operations are ordered with respect to each other, but not all. A weak memory ordering provides no guarantees about how accesses appear to other threads.
Compilers and Atomicity
When you're writing a program in C or some other higher-level language it may appear that certain operations are atomic and sequentially ordered, but the compiler only generally guarantees this when viewed from the same thread that performed those operations. However, from the compiler's perspective any code that runs when a thread is asynchronously interrupted happens in a different thread of execution, even if that code runs in the same OS thread. That means the code running in a signal handler or in a structured exception handler isn't guaranteed to see operations performed outside the handler in the same thread as being atomic or sequentially consistent.
Because of this limited general guarantee the compiler is free to do things like implement what look to be atomic operations using multiple assembler instructions, making them appear non-atomic to other observers. The compiler can also reorder memory accesses, or even remove apparently redundant accesses entirely. It can do whatever optimizations it wants so long as, in the single uninterrupted-thread case, the program still behaves as if it were doing all those operations in program order.
In the multi-threaded case, or where signal or exception handlers are present, it's necessary to take special steps to inform the compiler where you need it to provide broader guarantees of atomicity and memory ordering. That's the purpose of the special atomic types and functions; a sketch follows below. Even if the CPU guarantees every instruction is atomic and every memory access is sequentially consistent to all other threads, the compiler doesn't.
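A small sketch of that point (mine, names illustrative): even on hardware where aligned loads and stores are atomic, the compiler may cache a plain variable in a register or reorder/remove its accesses, whereas an atomic type forces it to keep and order the accesses:

#include <atomic>

bool plain_flag = false;
std::atomic<bool> ready_flag{false};

void wait_plain() {
    // The compiler may legally hoist the load out of the loop and spin forever:
    // it assumes no other thread modifies plain_flag (doing so is a data race).
    while (!plain_flag) { }
}

void wait_atomic() {
    // Each iteration must re-load; a store from another thread will be observed.
    while (!ready_flag.load(std::memory_order_acquire)) { }
}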
Intel CPUs and Atomicity
Intel CPUs make it fairly easy for the compiler to provide these guarantees. Except for some odd cases, instructions are uninterruptible. Any event that causes the execution of an instruction to be interrupted either happens after the instruction is fully completed or allows the instruction to be resumed as if it were never executed. This means that at the machine-code level every operation is atomic and every memory operation is sequentially consistent as it appears to code running on the same processor. In the single-processor case nothing needs to be done to provide these guarantees except when they need to be visible to devices other than the processor. In that case the LOCK prefix combined with uncached memory regions must be used to guarantee read/modify/write instructions are atomic and memory accesses appear sequentially consistent to other devices.
In the multi-processor case, when accessing cached memory, the cache coherency protocol provides guarantees of atomicity with most instructions and a strong memory ordering, but not a sequentially consistent ordering. The exact mechanism by which it does this doesn't matter much, just the guarantees it gives. Any instruction that only accesses a single memory location will appear atomic to other processors. The ordering guarantees are too long to go into here (Intel uses 16 bullet points to describe them), but apparently they're a superset of the guarantees that C and C++ provide with the acquire and release memory orders. When that level of memory ordering is specified, the C/C++ atomic operations can use ordinary unlocked instructions.
The need for the LOCK prefix, and for those instructions where the LOCK prefix is implicit, comes when you need stronger guarantees than the cache coherency protocol provides. If you need your read/modify/write instructions to be atomic, you need to use the LOCK prefix. If you need sequentially consistent ordering, you need to use the LOCK prefix.
The LOCK prefix is where the high cost of atomic operations comes from. It causes the processor to wait for all previous load and store operations to complete. Even though, when accessing cached memory, the LOCK prefix is handled entirely within the cache without asserting LOCK#, the processor still needs to wait to ensure the operation appears sequentially consistent to other processors.
Summary
So in summary the answers to your questions are:
The cache coherency protocol can only enforce atomicity of certain machine-code instructions when viewed from other processors. It can't ensure that the compiler generates a single instruction for an operation you want to be atomic. It also can't guarantee that the instruction appears to be atomic to non-processor devices on the system.
The LOCK prefix is used on machine code instructions that
perform multiple memory accesses and need to appear atomic to other processors
need to be sequentially consistent to other processors
need to be atomic and/or sequentially consistent to other non-processor devices.
When it's possible to get the necessary atomicity and memory ordering guarantees without using the LOCK prefix, the instructions used are the same as ordinary instructions and so cost the same. Where the LOCK prefix is needed to provide the necessary guarantees, the cost of the instruction becomes much higher than that of a normal instruction.
There is no xsub instruction in x86, but there is an xadd ;)
You should read the section about the LOCK prefix in the Instruction Set Reference, and the section 8.1 LOCKED ATOMIC OPERATIONS in the Software Developer's Manual Volume 3A: System Programming Guide, Part 1.
The single CPU refers to a single core nowadays, with its own cache. When you have multiple caches for multiple cores (physically in the same or separate cpu chips) they use some cache coherency protocol. In case of MESI, the core executing the atomic instruction will first ensure it has ownership of the cache line containing the operand and marks it modified, additionally locking it. If another core needs the cache line, it will do a read operation which the owner core will snoop and delay the answer until the atomic operation completes.
On single-cpu single-core systems, most instructions are atomic with respect to threading except for string instructions using a REP prefix because scheduling interrupts and thus context switches only happen on instruction boundaries. A hardware device could however observe non-atomic behaviour.

How is atomicity implemented by the CPU?

I have been told/read online the cache coherency protocol MESI/MESIF:
http://en.wikipedia.org/wiki/MESI_protocol
also enforces atomicity- for example for a lock. However, this really really doesn't make sense to me for the following reasons:
1) MESI manages cache access for all instructions. If MESI also enforces atomicity, how do we get race conditions? Surely all instructions would be atomic and we'd never get race conditions?
2) If MESI guarantees atomicity, what's the point of the LOCK prefix?
3) Why do people say atomic instructions carry overhead- if they are implemented using the same cache coherency model as all other x86 instructions?
Generally-speaking could somebody please explain how the CPU implements locks at a low-level?
The LOCK prefix has one purpose: it takes a lock on that address and instructs MESI to flush that cache line on all other processors, so that reading or writing that address by all other processors (or hardware devices!) blocks until the lock is released (which it is at the end of the instruction).
The LOCK prefix is slow (several hundred cycles) because it has to synchronize the bus for the duration and the bus speed and latency is much lower than CPU speed.
General operation of LOCK instruction
1. validate
2. establish address lock on cache line
3. wait for all processors to flush (MESI kicks in here)
4. perform operation within cache line
5. flush cache line to RAM (which releases the lock)
Disclaimer: Much of this comes from the documentation of the Pentium F00F bug (where the validate part was erroneously done after establish lock) and so might be out of date.
As @voo said, you are confusing coherency with atomicity.
Cache coherency covers many scenarios, but the basic example is when 2 different agents (cores on a multicore chip, processors on a multi-socket one, etc.) access the same line, they may both have it cached locally. MESI guarantees that when one of them writes a new value, all other stale copies are first invalidated, to prevent usage of the old value. As a by-product, this in fact guarantees atomicity of a single read or write access to memory, on a cache-line granularity, which is part of the CPU charter on x86 (and many other architectures as well). It does more than that - it's a crucial part of the memory ordering and consistency guarantees that the CPU provides you.
It does not, however, provide any larger scale of atomicity, which is crucial for handling concepts like thread-safety and critical sections. What you are referring to with the locked operations is a read-modify-write flow, which is not guaranteed to be atomic by default (at least not on common CPUs), since it consists of 2 distinct accesses to memory. Without a lock in place, the CPU may receive a snoop in between, and must respond according to the MESI protocol. The following scenario is perfectly legal, for example:
core 0           | core 1
-----------------+----------------
y = read [x]     |
increment y      | store [x] <- z
                 |
store [x] <- y   |
Meaning that your memory increment operation on core 0 didn't work as expected. If [x] holds a mutex, for example, you may think it was free and that you managed to grab it, while core 1 had already taken it.
Having the read-modify-write operation on core 0 locked (and x86 provides many possible options: locked add/inc, locked compare-exchange, etc.) would stall the other cores until the operation is done, so it essentially enhances the inter-core protocol to allow rejecting snoops.
It should be noted that a simple MESI protocol, if used correctly with alternative guarantees (like fences), can provide lock-free methods to perform atomic operations.
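A hedged C++ sketch of the scenario in the table above (mine, counts illustrative): a plain ++ is a separate load, add and store, so two threads can interleave and lose updates (it is also formally a data race), while fetch_add maps to a locked RMW such as lock xadd on x86 and cannot lose updates:

#include <atomic>
#include <thread>
#include <cstdio>

int plain_counter = 0;
std::atomic<int> atomic_counter{0};

void work() {
    for (int i = 0; i < 100000; ++i) {
        ++plain_counter;                                         // unlocked read-modify-write
        atomic_counter.fetch_add(1, std::memory_order_relaxed);  // locked RMW
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    // plain_counter is typically below 200000 (lost updates); atomic_counter is exactly 200000.
    std::printf("plain=%d atomic=%d\n", plain_counter, atomic_counter.load());
}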
I think the point is that while the cache is involved in ordinary memory operations, it is required to do more for atomic operations than for your run of the mill ones.
Added later...
For ordinary operations:
when writing to memory, your typical core/cpu will maintain a write
queue, so that once the write has been dispatched, the core/cpu
continues processing instructions, while some other mechanics deals
with emptying the queue of pending writes -- negotiating with the
cache as required. On some processors the pending writes need not be
written away in the order they were put into the queue.
when reading from memory, if the required value is not immediately
available, the core/cpu may continue processing instructions, while
some other mechanics perform the required reads -- negotiating with
the cache as required.
all of which is designed to allow the core/cpu to keep going, decoupled as far as possible from the truly ghastly business of accessing real memory, via layers of cache, which is all horribly slow.
Now, for your atomic operations, the state of the core/cpu has to be synchronised with the state of the cache/memory.
So, for a "release" store: (a) everything in the write queue must be completed, before (b) the "release" write itself is completed, before (c) normal processing can continue. So all the benefits of the asynchronous writing of cache/memory may have to be foregone, until the atomic write completes. Similarly, for an "acquire" load: any reads which come after the "acquire" read must be delayed.
As it happens, the x86 is remarkably "well behaved". It does not reorder writes, so a "release" store does not need any extra work to ensure that it comes after any earlier stores. On the read side it also does not need to do anything special for an "acquire". If two or more cores/cpus are reading and writing the same piece of memory, then there will be more invalidating and reloading of cache lines, with the attendant overhead. When doing a "sequentially consistent" store, it has to be followed by an explicit mfence operation, which will stall the cpu/core until all writes have been flushed from the write queue. It is true that "sequentially consistent" is easier to think about... but for code where access to shared data is protected by locks, "acquire"/"release" is sufficient.
For your atomic "read-modify-write" and conditional versions thereof, the interaction with the cache/memory is even stronger. The cpu/core executing the operation must not only synchronise itself with the state of cache/memory, it must also arrange for other cpus/cores which access the object of the atomic operation to stall until it is complete and has been written away (committed to cache/memory). The impact of this will depend on whether there is any actual contention with other cpu(s)/core(s) at that moment.
