I have read that some CPUs reorder instructions, but this is not a problem for single-threaded programs (the instructions would still be reordered in single-threaded programs, but it would appear as if they were executed in order); it is only a problem for multithreaded programs.
To solve the problem of instruction reordering, we can insert memory barriers in the appropriate places in the code.
But does an x86 CPU reorder instructions? If it does not, then there is no need to use memory barriers, right?
Reordering
Yes, all modern x86 chips from Intel and AMD aggressively reorder instructions across a window which is around 200 instructions deep on recent CPUs from both manufacturers (i.e. a new instruction may execute while an older instruction more than 200 instructions "in the past" is still waiting). This is generally all invisible to a single thread since the CPU still maintains the illusion of serial execution1 by the current thread by respecting dependencies, so from the point of view of the current thread of execution it is as-if the instructions were executed serially.
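To illustrate the as-if-serial point, here is a small fragment (my own sketch; x and y are hypothetical arrays already in scope) showing what the CPU can overlap while still respecting dependencies:

int a = x[0] + 1;   // suppose this load misses in cache
int b = y[0] + 2;   // independent of a: may execute while the load of x[0] is still in flight
int c = a + b;      // depends on both: must wait, so the final result is as-if everything ran in order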
Memory Barriers
That should answer the titular question, but then your second question is about memory barriers. It contains, however, an incorrect assumption that instruction reordering necessarily causes (and is the only cause of) visible memory reordering. In fact, instruction reordering is neither sufficient nor necessary for cross-thread memory re-ordering.
Now it is definitely true that out-of-order execution is a primary driver of out-of-order memory access capabilities, or perhaps it is the quest for MLP (Memory Level Parallelism) that drives the increasingly powerful out-of-order abilities of modern CPUs. In fact, both are probably true at once: increasing out-of-order capabilities benefit a lot from strong memory reordering capabilities, and at the same time aggressive memory reordering and overlapping isn't possible without good out-of-order capabilities, so they reinforce each other in a self-reinforcing, sum-greater-than-its-parts loop.
So yes, out-of-order execution and memory reordering certainly have a relationship; however, you can easily get re-ordering without out-of-order execution! For example, a core-local store buffer often causes apparent reordering: at the point of execution the store isn't written directly to the cache (and hence isn't visible at the coherency point), which delays local stores with respect to local loads which need to read their values at the point of execution.
As Peter also points out in the comment thread, you can also get a type of load-load reordering when loads are allowed to overlap in an in-order design: load 1 may start, but in the absence of an instruction consuming its result a pipelined in-order design may proceed to the following instructions, which might include another load 2. If load 2 is a cache hit and load 1 was a cache miss, load 2 might be satisfied earlier in time than load 1 and hence the apparent order of the loads may be swapped.
So we see that not all cross-thread memory re-ordering is caused by instruction re-ordering, but certain instruction re-ordering also implies out-of-order memory access, right? Not so fast! There are two different contexts here: what happens at the hardware level (i.e., whether memory access instructions can, as a practical matter, execute out-of-order), and what is guaranteed by the ISA and platform documentation (often called the memory model applicable to the hardware).
x86 re-ordering
In the case of x86, for example, modern chips will freely re-order more or less any stream of loads and stores with respect to each other: if a load or store is ready to execute, the CPU will usually attempt it, despite the existence of earlier uncompleted load and store operations.
At the same time, x86 defines quite a strict memory model, which bans most possible reorderings, roughly summarized as follows:
Stores have a single global order of visibility, observed consistently by all CPUs, subject to one loosening of this rule below.
Local load operations are never reordered with respect to other local load operations.
Local store operations are never reordered with respect to other local store operations (i.e., a store that appears earlier in the instruction stream always appears earlier in the global order).
Local load operations may be reordered with respect to earlier local store operations, such that the load appears to execute earlier wrt the global store order than the local store, but the reverse (an earlier load being reordered with a later store) is not true.
So actually most memory re-orderings are not allowed: loads with respect to each other, stores with respect to each other, and loads with respect to later stores. Yet I said above that x86 pretty much freely executes all memory access instructions out of order - how can you reconcile these two facts?
Well, x86 does a bunch of extra work to track exactly the original order of loads and stores, and makes sure no memory re-ordering that breaks the rules is ever visible. For example, let's say load 2 executes before load 1 (load 1 appears earlier in program order), but that both involved cache lines were in the "exclusively owned" state during the period that load 1 and load 2 executed: there has been reordering, but the local core knows that it cannot be observed because no other core was able to peek into this local operation.
In concert with the above optimizations, CPUs also use speculative execution: execute everything out of order, even if it is possible that at some later point some core can observe the difference, but don't actually commit the instructions until such an observation is impossible. If such an observation does occur, you roll back the CPU to an earlier state and try again. This is the cause of the "memory ordering machine clear" on Intel.
So it is possible to define an ISA that doesn't allow any re-ordering at all, but whose implementations do re-ordering under the covers while carefully checking that it isn't observed. PA-RISC is an example of such a sequentially consistent architecture. Intel has a strong memory model that allows one type of reordering but disallows many others, and each chip may internally do more (or less) re-ordering as long as it can guarantee to play by the rules in an observable sense (in this sense, it is somewhat related to the "as-if" rule that compilers play by when it comes to optimizations).
The upshot of all that is that yes, x86 requires memory barriers to prevent specifically the so-called StoreLoad re-ordering (for algorithms that require this guarantee). You don't find many standalone memory barriers in practice in x86, because most concurrent algorithms also need atomic operations, such as atomic add, test-and-set or compare-and-exchange, and on x86 those all come with full barriers for free. So the use of explicit memory barrier instructions like mfence is limited to cases where you aren't also doing an atomic read-modify-write operation.
Jeff Preshing's Memory Reordering Caught in the Act has one example that does show memory reordering on real x86 CPUs, and that mfence prevents it.
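To make that concrete, here is a minimal C++11 sketch of my own (not Preshing's code; the variable and function names are mine, and the comments about what x86 compilers emit describe typical practice, not a guarantee):

#include <atomic>

std::atomic<int> x{0}, y{0};

int thread1() {                                // thread2 is the mirror image with x and y swapped
    x.store(1, std::memory_order_release);     // compiles to a plain mov on x86...
    return y.load(std::memory_order_acquire);  // ...so this load may be satisfied before the store
}                                              // leaves the store buffer: both threads can read 0

int thread1_fixed() {
    x.store(1, std::memory_order_seq_cst);     // typically xchg, or mov + mfence: drains the store
    return y.load(std::memory_order_seq_cst);  // buffer first, so the StoreLoad reordering is blocked
}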
1 Of course if you try hard enough, such reordering is visible! A high-impact recent example of that would be the Spectre and Meltdown exploits, which exploited speculative out-of-order execution and a cache side channel to violate memory protection security boundaries.
Related
I've been studying the memory model and saw this (quote from https://research.swtch.com/hwmm):
Litmus Test: Write Queue (also called Store Buffer)
Can this program see r1 = 0, r2 = 0?
| Thread 1 | Thread 2 |
|----------|----------|
| x = 1    | y = 1    |
| r1 = y   | r2 = x   |
On sequentially consistent hardware: no.
On x86 (or other TSO): yes!
Fact 1: This is the store buffer litmus test mentioned in many articles. They all say that both r1 and r2 being zero could happen on TSO because of the existence of the store buffer. They seem to assume that all the stores and loads are executed in order, and yet the result is both r1 and r2 being zero. They then conclude that "store/load reordering could happen", as a "consequence of the store buffer's existence".
Fact 2: However we know that OoO execution could also reorder the store and the load in both threads. In this sense, regardless of the store buffer, this reordering could result in both r1 and r2 being zero, as long as all four instructions retire without seeing each other's invalidation of x or y. And this suggests to me that "store/load reordering could happen", just because "they are executed out of order". (I might be very wrong about this since this is the best I know of speculation and OoO execution.)
I wonder how these two facts converge (assuming I happen to be right about both): Is store buffer or OoO execution the reason for "store/load reordering", or both are?
Alternatively speaking: Say I somehow observed this litmus test on an x86 machine, was it because of the store buffer, or OoO execution? Or is it even possible to know which?
EDIT: Actually my major confusion is the unclear causality among the following points from various pieces of the literature:
OoO execution can cause the memory reordering;
Store/load reordering is caused by the store buffer and demonstrated by a litmus test (and thus named as "store buffer");
Some program having the exact same instructions as the store buffer litmus test is used as an observable OoO execution example, just as this article https://preshing.com/20120515/memory-reordering-caught-in-the-act does.
1 + 2 seems to imply that the store buffer is the cause, and OoO execution is the consequence. 3 + 1 seems to imply that OoO execution is the cause, and memory reordering is the consequence. I can no longer tell which causes which. And it is that litmus test sitting in the middle of this mystery.
It makes some sense to call StoreLoad reordering an effect of the store buffer because the way to prevent it is with mfence or a locked instruction that drains the store buffer before later loads are allowed to read from cache. Merely serializing execution (with lfence) would not be sufficient, because the store buffer still exists. Note that even sfence ; lfence isn't sufficient.
Also I assume P5 Pentium (in-order dual-issue) has a store buffer, so SMP systems based on it could have this effect, in which case it would definitely be due to the store buffer. IDK how thoroughly the x86 memory model was documented in the early days before PPro even existed, but any naming of litmus tests done before that might well reflect in-order assumptions. (And naming after might include still-existing in-order systems.)
You can't tell which effect caused StoreLoad reordering. It's possible on a real x86 CPU (with a store buffer) for a later load to execute before the store has even written its address and data to the store buffer.
And yes, executing a store just means writing to the store buffer; it can't commit from the SB to L1d cache and become visible to other cores until after the store retires from the ROB (and thus is known to be non-speculative).
(Retirement happens in-order to support "precise exceptions". Otherwise, chaos ensues and discovering a mis-predict might mean rolling back the state of other cores, i.e. a design that's not sane. Can a speculatively executed CPU branch contain opcodes that access RAM? explains why a store buffer is necessary for OoO exec in general.)
I can't think of any detectable side-effect of the load uop executing before the store-data and/or store-address uops, or before the store retires, rather than after the store retires but before it commits to L1d cache.
You could force the latter case by putting an lfence between the store and the load, so the reordering is definitely caused by the store buffer. (A stronger barrier like mfence, a locked instruction, or a serializing instruction like cpuid, will all block the reordering entirely by draining the store buffer before the later load can execute. As an implementation detail, before it can even issue.)
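A sketch of that experiment, using the SSE2 _mm_lfence intrinsic (my own code; the globals are hypothetical, and I'm assuming the intrinsic also acts as a compiler barrier, which mainstream compilers treat it as):

#include <atomic>
#include <emmintrin.h>   // _mm_lfence

std::atomic<int> x{0}, y{0};
int r1;

void writer() {
    x.store(1, std::memory_order_relaxed);
    _mm_lfence();        // on Intel CPUs, later instructions don't start executing until earlier
                         // ones complete locally - but the store can still sit in the store buffer,
    r1 = y.load(std::memory_order_relaxed);   // so if the other thread still sees x == 0, the
}                                             // reordering was caused by the store buffer, not OoO exec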
A normal out of order exec treats all instructions as speculative, only becoming non-speculative when they retire from the ROB, which is done in program order to support precise exceptions. (See Out-of-order execution vs. speculative execution for a more in-depth exploration of that idea, in the context of Intel's Meltdown vulnerability.)
A hypothetical design with OoO exec but no store buffer would be possible. It would perform terribly, with each store having to wait for all previous instructions to be definitively known to not fault or otherwise be mispredicted / mis-speculated before the store can be allowed to execute.
This is not quite the same thing as saying that they need to have already executed, though (e.g. just executing the store-address uop of an earlier store would be enough to know it's non-faulting, or for a load, doing the TLB/page-table checks will tell you it's non-faulting even if the data hasn't arrived yet). However, every branch instruction would need to be already executed (and known-correct), as would every ALU instruction like div that can fault.
Such a CPU also doesn't need to stop later loads from running before stores. A speculative load has no architectural effect / visibility, so it's ok if other cores see a share-request for a cache line which was the result of a mis-speculation. (On a memory region whose semantics allow that, such as normal WB write-back cacheable memory). That's why HW prefetching and speculative execution work in normal CPUs.
The memory model even allows StoreLoad ordering, so we're not speculating on memory ordering, only on the store (and other intervening instructions) not faulting. Which again is fine; speculative loads are always fine, it's speculative stores that we must not let other cores see. (So we can't do them at all if we don't have a store buffer or some other mechanism.)
(Fun fact: real x86 CPUs do speculate on memory ordering by doing loads out of order with each other, depending on addresses being ready or not, and on cache hit/miss. This can lead to memory order mis-speculation "machine clears" aka pipeline nukes (machine_clears.memory_ordering perf event) if another core wrote to a cache line between when it was actually read and the earliest the memory model said we could. Or even if we guess wrong about whether a load is going to reload something stored recently or not; memory disambiguation when addresses aren't ready yet involves dynamic prediction so you can provoke machine_clears.memory_ordering with single-threaded code.)
Out-of-order exec in P6 didn't introduce any new kinds of memory re-ordering because that could have broken existing multi-threaded binaries. (At that time mostly just OS kernels, I'd guess!) That's why early loads have to be speculative if done at all. x86's main reason for existence is backwards compat; back then it wasn't the performance king.
Re: why this litmus test exists at all, if that's what you mean?
Obviously to highlight something that can happen on x86.
Is StoreLoad reordering important? Usually it's not a problem; acquire / release synchronization is sufficient for most inter-thread communication about a buffer being ready to read, or more generally a lock-free queue. Or to implement mutexes. ISO C++ only guarantees that mutexes lock / unlock are acquire and release operations, not seq_cst.
It's pretty rare that an algorithm depends on draining the store buffer before a later load.
Say I somehow observed this litmus test on an x86 machine,
Fully working program that verifies that this reordering is possible in real life on real x86 CPUs: https://preshing.com/20120515/memory-reordering-caught-in-the-act/. (The rest of Preshing's articles on memory ordering are also excellent. Great for getting a conceptual understanding of inter-thread communication via lockless operations.)
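For reference, a stripped-down sketch of such a test (my own simplification, not Preshing's code; a real test needs many iterations and some luck with timing, and spawning threads per iteration is slow but keeps the example short):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0}, go{0};
int r1, r2;

int main() {
    int reordered = 0;
    for (int i = 0; i < 100000; ++i) {
        x = 0; y = 0; go = 0;
        std::thread t1([] {
            while (!go.load(std::memory_order_acquire)) {}   // spin until both threads are ready
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);          // may be satisfied before the store commits
        });
        std::thread t2([] {
            while (!go.load(std::memory_order_acquire)) {}
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        go.store(1, std::memory_order_release);
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0) ++reordered;                 // the StoreLoad litmus test "fired"
    }
    std::printf("%d reorderings observed\n", reordered);
}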
CPUs such as ARM have a weak memory model. Assume we have two threads T1 and T2.
| T1 | T2 |
|---------|---------|
| Instr A | Instr C |
| Instr B | Instr D |
In a weak ordering any instruction can run at any time, which means "D -> A -> B -> C" is possible.
My first question is why is this beneficial? And my second question is how is the selection (optimization) done? Is the CPU randomly picking them, or are there algorithms behind it? Is the CPU doing the picking, or is there another chip doing the work (a memory chip or something)?
There is no global arbiter that would do any such thing. If there was, it would be as efficient to always do things in order.
The only data available immediately is local. Each core takes decisions based on the information that is rapidly available to it.
There is no pressure to execute anything in reverse order rather than in written order. Reverse order is not a priori better. But data for B might be available before data for A, and then B might be executed first, as waiting for A to complete would leave computing resources unused.
So it's all a matter of having all data available when needed, and of the delays of communication between processors. You could view that as a team effort by people who can only exchange information by very slow means of communication: they would get as much work done as possible based on their locally available information. No central power would ever have an accurate picture of the state of the latest completed work.
Why do weak memory models exist?
For performance reasons. Weak memory models allow compiler and hardware optimizations that improve system performance. The cost of enforcing a strong memory model (sequential-consistency model) in compilation and hardware implementation is severe performance degradation.
What are the allowed instruction reorderings (how is the selection done)?
It is specific to each memory model. There are several weak memory models, and the instruction reordering rules are part of their specifications.
Instruction reordering is ubiquitously used in compiler and hardware optimizations to achieve higher performance. The basic premise for these optimizations is that the instructions can be reordered as long as the functional correctness of the program is preserved.
In a sequential (single-threaded) program, functional correctness can be guaranteed by simply ensuring that "two operations are executed in program order if they are accessing the same memory location and one of them is a write or if there is a data or control dependence between them."
For multithreaded programs, functional correctness also depends on the relative order of loads and stores to different memory locations in the same thread. It is the memory model specification that specifies the conditions under which two memory instructions can be reordered without affecting the functional correctness.
In addition to the above answers:
If there are no fences, the only ordering that needs to be preserved is the data dependency order. So on a single CPU a load of X should see the most recent store to X before it. But if instructions do not have any data dependency, they can be executed in any order.
Modern CPUs use out-of-order execution to maximize the amount of parallelism in the instruction stream. This way independent instructions can run in parallel, and it prevents the CPU from stalling on memory access.
CPUs make use of other techniques like store buffers, load buffers, write coalescing, etc., which can all lead to loads and stores being executed out of order. This is fine, because it isn't visible to the core that executes these loads and stores. The problem is when the core is sharing memory with other cores; then these reorderings can become visible.
For Sequential Consistency (SC) no reordering is allowed; so all 4 fences need to be preserved -> [LoadLoad][LoadStore][StoreLoad][StoreStore].
On x86, the store buffers can cause older stores to be reordered with newer loads to a different address; so [StoreLoad] is dropped and only [LoadLoad][LoadStore][StoreStore] are preserved. This memory model is called TSO (Total Store Order).
TSO can be relaxed by allowing writes from the same core to be reordered (e.g. write coalescing or store buffers that don't retire in order). This results in PSO (Partial Store Order).
The problem with SC/TSO/PSO is that certain reorderings aren't allowed, and this can lead to reduced performance; imagine there are 2 independent loads on the same CPU: they can't be reordered because of [LoadLoad]. In practice this can be resolved by executing instructions speculatively and, if an observable out-of-order load is detected, flushing the pipeline and starting again. This makes CPUs more complex and less performant.
Models like SC, TSO and PSO are strong consistency models because every load and every store has certain ordering semantics. But in a weakly ordered consistency model, there is a separation between a plain load/store (no ordering semantics) and synchronization actions, e.g. an acquire load and a release store, that do provide ordering semantics. The weak memory model with acquire loads and release stores is called release consistency.
The big advantage of these weak models is that they allow for a much higher degree of parallelism and simpler CPU design. It shifts the burden to the software.
In practice you normally program using a programming language/API that provides a certain memory model, and it needs to make sure the compiler isn't violating the model and that sufficient ordering is imposed on the hardware, e.g. in the form of fences. If you have a look at Java or C11, and you are using it correctly, then the same code will run fine on a CPU with a strong memory model like x86 and on a CPU with a weak memory model like ARM.
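A small example of that (my own sketch; the instruction mappings in the comments are what compilers typically emit, not a promise): the same C++11 message-passing code is correct on both x86 and ARM, the difference is only in how much hardware ordering the compiler has to ask for.

#include <atomic>

int payload;                                        // plain data with no ordering semantics of its own
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);   // x86: plain mov (TSO already orders it);
}                                                   // ARM: stlr, or dmb ish + str

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // x86: plain mov; ARM: ldar, or ldr + dmb
    return payload;                                     // guaranteed to read 42 on both
}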
I sometimes see the term "full memory barrier" used in tutorials about memory ordering, which I think means the following:
If we have the following instructions:
instruction 1
full_memory_barrier
instruction 2
Then instruction 1 is not allowed to be reordered to below full_memory_barrier, and instruction 2 is not allowed to be reordered to above full_memory_barrier.
But what is the opposite of a full memory barrier? I mean, is there something like a "semi memory barrier" that only prevents the CPU from reordering instructions in one direction?
If there is such a memory barrier, I don't see its point, I mean if we have the following instructions:
instruction 1
memory_barrier_below_to_above
instruction 2
Assume that memory_barrier_below_to_above is a memory barrier that prevents instruction 2 from being reordered to above memory_barrier_below_to_above, so the following will not be allowed:
instruction 2
instruction 1
memory_barrier_below_to_above
But the following will be allowed (which makes this type of memory barrier pointless):
memory_barrier_below_to_above
instruction 2
instruction 1
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ explains different kinds of barriers, like LoadLoad or StoreStore. A StoreStore barrier only prevents stores from reordering across the barrier, but loads can still execute out of order.
On real CPUs, any barriers that include StoreLoad block everything else, too, and thus are called "full barriers". StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache.
Barrier examples:
|       | strong | weak                                      |
|-------|--------|-------------------------------------------|
| x86   | mfence | none needed unless you're using NT stores |
| ARM   | dmb sy | isb, dmb st, dmb ish, etc.                |
| POWER | hwsync | lwsync, isync, ...                        |
ARM has "inner" and "outer shareable domains". I don't really know what that means, haven't had to deal with it, but this page documents the different forms of Data Memory Barrier available. dmb st only waits for earlier stores to complete, so I think it's only a StoreStore barrier, and thus too weak for a C++11 release-store which also needs to order earlier loads against LoadStore reordering. See also C/C++11 mappings to processors: note that seq-cst can be achieved with full-barriers around every store, or with barriers before loads as well as before stores. Making loads cheap is usually best, though.
ARM ISB is an instruction synchronization barrier: it flushes the pipeline rather than ordering data accesses. (ARM doesn't have a coherent i-cache, so after writing code to memory you need the appropriate cache maintenance and an ISB before you can reliably jump there and execute those bytes as instructions.)
POWER has a large selection of barriers available, including Light-Weight (non-full barrier) and Heavy-Weight Sync (full barrier) mentioned in Jeff Preshing's article linked above.
A one-directional barrier is what you get from a release-store or an acquire-load. A release-store at the end of a critical section (e.g. to unlock a spinlock) has to make sure loads/stores inside the critical section don't appear later, but it doesn't have to delay later loads until after the lock=0 becomes globally visible.
Jeff Preshing has an article about this, too: Acquire and Release semantics
The "full" vs. "partial" barrier terminology is not usually used for the one-way reordering restriction of a release-store or acquire-load. An actual release fence (in C++11, std::atomic_thread_fence(std::memory_order_release)) does block reordering of stores in both directions, unlike a release-store on a specific object.
This subtle distinction has caused confusion in the past (even among experts!). Jeff Preshing has yet another excellent article explaining it: Acquire and Release Fences Don't Work the Way You'd Expect.
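In code, the difference looks roughly like this (a sketch of my own; foo, bar and flag are hypothetical variables):

#include <atomic>

std::atomic<int> flag{0};
int foo = 0, bar = 0;

void with_release_store() {
    foo = 1;                                      // ordered before the release store
    flag.store(1, std::memory_order_release);     // one-way: later operations, like bar = 1,
    bar = 1;                                      // may still be reordered *above* this store
}

void with_release_fence() {
    foo = 1;
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);     // the fence orders foo = 1 before *every*
    bar = 1;                                      // later atomic store, not just this one
}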
You're right that a one-way barrier that wasn't tied to a store or a load wouldn't be very useful; that's why such a thing doesn't exist. :P Operations could reorder an unbounded distance past it in one direction, leaving them all free to reorder with each other.
What exactly does atomic_thread_fence(memory_order_release) do?
C11 (n1570 Section 7.17.4 Fences) only defines it in terms of creating a synchronizes-with relationship with an acquire-load or acquire fence, when the release-fence is used before an atomic store (relaxed or otherwise) to the same object the load accesses. (C++11 has basically the same definition, but discussion with @EOF in the comments brought up the C11 version.)
This definition in terms of the net effect, not the mechanism for achieving it, doesn't directly tell us what it does or doesn't allow. For example, subsection 3 says
3) A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
So in the writing thread, it's talking about code like this:
stuff // including any non-atomic loads/stores
atomic_thread_fence(mo_release) // A
M=X // X
// threads that see load(M, acquire) == X also see stuff
The syncs-with means that threads which see the value from M=X (directly or indirectly through a release-sequence) also see all the stuff and read non-atomic variables without Data Race UB.
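The matching reader side looks like this (a sketch continuing the example above, in the same notation):

// the reading thread
tmp = load(M, mo_acquire)        // B: pairs with the release fence A via the store X
if (tmp == X)
    read stuff                   // everything in "stuff" is visible; no Data Race UB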
This lets us say something about what is / isn't allowed:
It's a 2-way barrier for atomic stores. They can't cross it in either direction, so the barrier's location in this thread's memory order is bounded by atomic stores before and after. Any earlier store can be part of stuff for some M, any later store can be the M that an acquire-load (or load + acquire-fence) synchronizes with.
It's a one-way barrier for atomic loads: earlier ones need to stay before the barrier, but later ones can move above the barrier. M=X can only be a store (or the store part of a RMW).
It's a one-way barrier for non-atomic loads/stores: non-atomic stores can be part of the stuff, but can't be X because they're not atomic. It's ok to allow later loads / stores in this thread to appear to other threads before the M=X. (If a non-atomic variable is modified before and after the barrier, then nothing could safely read it even after a syncs-with this barrier, unless there's also a way for a reader to stop this thread from continuing on and creating Data Race UB. So a compiler can and should reorder foo=1; fence(release); foo=2; into foo=2; fence(release);, eliminating the dead foo=1 store. But sinking foo=1 to after the barrier is only legal on the technicality that nothing could tell the difference without UB.)
As an implementation detail, a C11 release fence may be stronger than this (e.g. a 2-way barrier for more kinds of compile-time reordering), but not weaker. On some architectures (like ARM), the only option that's strong enough might be a full barrier asm instruction. And for compile-time reordering restrictions, a compiler might not allow these 1-way reorderings just to keep the implementation simple.
Mostly this combined 2-way / 1-way nature only matters for compile-time reordering. CPUs don't make the distinction between atomic vs. non-atomic stores. Non-atomic is always the same asm instruction as relaxed atomic (for objects that fit in a single register).
CPU barrier instructions that make a core wait until earlier operations are globally visible are typically 2-way barriers; they're specified in terms of operations becoming globally visible in a coherent view of memory shared by all cores, rather than the C/C++11 style of creating syncs-with relations. (Beware that operations can potentially become visible to some other threads before they become globally visible to all threads: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
But with just barriers against reordering within a physical core, sequential consistency can be recovered.)
A C++11 release-fence needs LoadStore + StoreStore barriers, but not LoadLoad. A CPU that lets you get just those 2 but not all 3 of the "cheap" barriers would let loads reorder in one direction across the barrier instruction while blocking stores in both directions.
Weakly-ordered SPARC is in fact like this, and uses the LoadStore and so on terminology (that's where Jeff Preshing took the terminology for his articles). http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/ shows how they're used. (More recent SPARCs use a TSO (Total Store Order) memory model. I think this is like x86, where the hardware gives the illusion of memory ops happening in program order except for StoreLoad reordering.)
I understand atomicity can be guaranteed on operations like xsub(), without using the LOCK prefix, by relying on the cache coherency protocol (MESI/MESIF).
1) How can the cache coherency protocol do this???
It's making me wonder: if the cache coherency protocol can enforce atomicity, why do we need special atomic types/instructions etc.?
2) If MESI implements atomic instructions across multi-core systems then what is the purpose of LOCK? Legacy?
3) If MESI implements atomic instructions and MESI is used for all instructions, then why do atomic instructions cost so much? Surely the performance should be the same as normal instructions.
Atomicity and Memory Ordering
For an operation to be atomic it must appear to be one undivided operation to any observer. That observer can be anything that can see the effect of the operation, whether it's the thread that performs the operation, a different thread on the same processor, a thread on a different processor, or some component or device in the system. Observers that can't see the effect of the operation, whether the same thread, a different thread, or a device, don't affect whether the operation is atomic or not.
(Note that by processor I mean what Intel's documentation would call a logical processor. A system with two CPU sockets, each populated with a quad-core CPU with two logical processors per core would have a total of 16 processors.)
A related but different concept is memory ordering. Memory accesses are only sequentially consistent if they appear to an observer as happening in the order they occur in the program. This guarantee always applies when the observer is the same thread that performed the operations. Other more limited guarantees of memory ordering are possible. A strong but not sequentially consistent ordering might guarantee that many sorts of operations are ordered with respect to each other, but not all. A weak memory ordering provides no guarantees about how accesses appear to other threads.
Compilers and Atomicity
When you're writing a program in C or some other higher-level language it may appear that certain operations are atomic and sequentially ordered, but the compiler generally only guarantees this when viewed from the same thread that performed those operations. However, from the compiler's perspective any code that runs when a thread is asynchronously interrupted happens in a different thread of execution, even if that code runs in the same OS thread. That means the code running in a signal handler or in a structured exception handler isn't guaranteed to see operations performed outside the handler in the same thread as being atomic or sequentially consistent.
Because of this limited general guarantee the compiler is free to do things like implement what look to be atomic operations using multiple assembler instructions that make them appear non-atomic to other observers. The compiler can also reorder memory accesses, and even remove apparently redundant accesses entirely. It can do whatever optimizations it wants so long as, in the single uninterrupted-thread case, the program still behaves as if it were doing all those operations in program order.
In the multi-threaded case, or where signal or exception handlers are present, it's necessary to take special steps to inform the compiler where you need it to provide broader guarantees of atomicity and memory ordering. That's the purpose of the special atomic types and functions. Even if the CPU guarantees every instruction is atomic and every memory access is sequentially consistent to all other threads, the compiler doesn't.
Intel CPUs and Atomicity
Intel CPUs make it fairly easy for the compiler to provide these guarantees. Except for some odd cases, instructions are uninterruptable. Any event that causes the execution of an instruction to be interrupted either happens after the instruction is fully completed or allows the instruction to be resumed as if it were never executed. This means that at the machine code level every operation is atomic and every memory operation is sequentially consistent as it appears to code running on the same processor. In the single-processor case nothing needs to be done to provide these guarantees except when they need to be visible to devices other than the processor. In that case the LOCK prefix combined with uncached memory regions must be used to guarantee that read/modify/write instructions are atomic and memory accesses appear sequentially consistent to other devices.
In the multi-processor case, when accessing cached memory the cache coherency protocol provides guarantees of atomicity with most instructions and a strong memory ordering, but not a sequentially consistent ordering. The exact mechanism by which it does this doesn't matter much, just the guarantees it gives. Any instruction that only accesses a single memory location will appear atomic to other processors. The ordering guarantees are too long to go into here (Intel uses 16 bullet points to describe them), but they are apparently a superset of the guarantees that C and C++ provide with the acquire and release memory orders. When that level of memory ordering is specified, the C/C++ atomic operations can use ordinary unlocked instructions.
The need for the LOCK prefix, and for those instructions where the LOCK prefix is implicit, comes when you need stronger guarantees than the cache coherency protocol provides. If you need your read/modify/write instructions to be atomic you need to use the LOCK prefix. If you need sequentially consistent ordering you need to use the LOCK prefix.
The LOCK prefix is where the high cost of atomic operations comes from. It causes the processor to wait for all previous load and store operations to complete. Even though when accessing cached memory the LOCK prefix is handled entirely within the cache without asserting LOCK#, the processor still needs to wait to ensure the operation appears sequentially consistent to other processors.
Summary
So in summary the answers to your questions are:
The cache coherency protocol can only enforce atomicity of certain machine code instructions when viewed from other processors. It can't ensure that the compiler generates a single instruction for an operation you want to be atomic. It also can't guarantee that the instruction appears to be atomic to non-processor devices on the system.
The LOCK prefix is used on machine code instructions that
perform multiple memory accesses and need to appear atomic to other processors
need to be sequentially consistent to other processors
need to be atomic and/or sequentially consistent to other non-processor devices.
When it's possible to get the necessary atomicity and memory ordering guarantees without using the LOCK prefix, the instructions used are the same as ordinary instructions and so cost the same. Where the LOCK prefix is needed to provide the necessary guarantees, the cost of the instruction becomes much higher than that of a normal instruction.
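As a concrete illustration (my own sketch; the instructions named in the comments are typical gcc/clang x86-64 output, not a guarantee):

#include <atomic>

std::atomic<int> counter{0};

void examples() {
    int v = counter.load(std::memory_order_seq_cst);   // plain mov: loads never need LOCK
    counter.store(v, std::memory_order_release);        // plain mov: TSO already orders plain stores
    counter.store(v, std::memory_order_seq_cst);        // xchg (implicitly LOCKed) or mov + mfence
    counter.fetch_add(1, std::memory_order_relaxed);    // lock add (or lock xadd if the result is used):
}                                                        // a read/modify/write needs LOCK to be atomic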
There is no xsub instruction in x86, but there is an xadd ;)
You should read the section about the LOCK prefix in the Instruction Set Reference, and the section 8.1 LOCKED ATOMIC OPERATIONS in the Software Developer's Manual Volume 3A: System Programming Guide, Part 1.
The single CPU refers to a single core nowadays, with its own cache. When you have multiple caches for multiple cores (physically in the same or separate cpu chips) they use some cache coherency protocol. In case of MESI, the core executing the atomic instruction will first ensure it has ownership of the cache line containing the operand and marks it modified, additionally locking it. If another core needs the cache line, it will do a read operation which the owner core will snoop and delay the answer until the atomic operation completes.
On single-cpu single-core systems, most instructions are atomic with respect to threading except for string instructions using a REP prefix because scheduling interrupts and thus context switches only happen on instruction boundaries. A hardware device could however observe non-atomic behaviour.
I have been told/read online the cache coherency protocol MESI/MESIF:
http://en.wikipedia.org/wiki/MESI_protocol
also enforces atomicity- for example for a lock. However, this really really doesn't make sense to me for the following reasons:
1) MESI manages cache access for all instructions. If MESI also enforces atomicity, how do we get race conditions? Surely all instructions would be atomic and we'd never get race conditions?
2) If MESI guarantees atomicity, what's the point of the LOCK prefix?
3) Why do people say atomic instructions carry overhead- if they are implemented using the same cache coherency model as all other x86 instructions?
Generally speaking, could somebody please explain how the CPU implements locks at a low level?
The LOCK prefix has one purpose: taking a lock on that address and instructing MESI to flush that cache line on all other processors, so that reading or writing that address by any other processor (or hardware device!) blocks until the lock is released (which it is at the end of the instruction).
The LOCK prefix is slow (several hundred cycles) because it has to synchronize the bus for the duration, and bus speed and latency are much worse than CPU speed.
General operation of LOCK instruction
1. validate
2. establish address lock on cache line
3. wait for all processors to flush (MESI kicks in here)
4. perform operation within cache line
5. flush cache line to RAM (which releases the lock)
Disclaimer: Much of this comes from the documentation of the Pentium F00F bug (where the validate part was erroneously done after establish lock) and so might be out of date.
As @voo said, you are confusing coherency with atomicity.
Cache coherency covers many scenarios, but the basic example is when 2 different agents (cores on a multicore chip, processors on a multi-socket one, etc..), access the same line, they may both have it cached locally. MESI guarantees that when one of them writes a new value, all other stale copies are first invalidated, to prevent usage of the old value. As a by-product, this in fact guarantees atomicity of a single read or write access to memory, on a cacheline granularity, which is part of the CPU charter on x86 (and many other architectures as well). It does more than that - it's a crucial part of memory ordering and consistency guarantees that the CPU provides you.
It does not, however, provide any larger scale of atomicity, which is crucial for handling concepts like thread-safety and critical sections. What you are referring to with the locked operations is a read-modify-write flow, which is not guaranteed to be atomic by default (at least not on common CPUs), since it consists of 2 distinct accesses to memory. Without a lock in place, the CPU may receive a snoop in between, and must respond according to the MESI protocol. The following scenario, for example, is perfectly legal:
| core 0          | core 1          |
|-----------------|-----------------|
| y = read [x]    |                 |
| increment y     | store [x] <- z  |
| store [x] <- y  |                 |
Meaning that your memory increment operation on core 0 didn't work as expected. If [x] holds a mutex, for example, you may think it was free and that you managed to grab it, while core 1 already took it.
Having the read-modify-write operation on core 0 locked (and x86 provides many possible options, locked add/inc, locked compare-exchange, etc..), would stall the other cores until the operation is done, so it essentially enhances the inter-core protocol to allow rejecting snoops.
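In C++ terms (a sketch of my own; the names are hypothetical), that difference between the unlocked and the locked read-modify-write is exactly the difference between a plain increment and an atomic one:

#include <atomic>

int x_plain = 0;                    // ++x_plain is load, add, store: another core's write
std::atomic<int> x_atomic{0};       // can slip in between, just like core 1 in the table above

void increment_both() {
    ++x_plain;                                          // not atomic: concurrent increments can be lost
    x_atomic.fetch_add(1, std::memory_order_seq_cst);   // lock add / lock xadd: the whole
}                                                       // read-modify-write is one indivisible step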
It should be noted that a simple MESI protocol, if used correctly with alternative guarantees (like fences), can provide lock-free methods to perform atomic operations.
I think the point is that while the cache is involved in ordinary memory operations, it is required to do more for atomic operations than for your run of the mill ones.
Added later...
For ordinary operations:
when writing to memory, your typical core/cpu will maintain a write queue, so that once the write has been dispatched, the core/cpu continues processing instructions, while some other mechanics deals with emptying the queue of pending writes -- negotiating with the cache as required. On some processors the pending writes need not be written away in the order they were put into the queue.
when reading from memory, if the required value is not immediately available, the core/cpu may continue processing instructions, while some other mechanics perform the required reads -- negotiating with the cache as required.
all of which is designed to allow the core/cpu to keep going, decoupled as far as possible from the truly ghastly business of accessing real memory, via layers of cache, which is all horribly slow.
Now, for your atomic operations, the state of the core/cpu has to be synchronised with the state of the cache/memory.
So, for a "release" store: (a) everything in the write queue must be completed, before (b) the "release" write itself is completed, before (c) normal processing can continue. So all the benefits of the asynchronous writing of cache/memory may have to be foregone, until the atomic write completes. Similarly, for an "acquire" load: any reads which come after the "acquire" read must be delayed.
As it happens, the x86 is remarkably "well behaved". It does not reorder writes, so a "release" store does not need any extra work to ensure that it comes after any earlier stores. On the read side it also does not need to do anything special for an "acquire". If two or more cores/cpus are reading and writing the same piece of memory, then there will be more invalidating and reloading of cache lines, with the attendant overhead. When doing a "sequentially consistent" store, it has to be followed by an explicit mfence operation, which will stall the cpu/core until all writes have been flushed from the write queue. It is true that "sequentially consistent" is easier to think about... but for code where access to shared data is protected by locks, "acquire"/"release" is sufficient.
For your atomic "read-modify-write" and conditional versions thereof, the interaction with the cache/memory is even stronger. The cpu/core executing the operation must not only synchronise itself with the state of cache/memory, it must also arrange for other cpus/cores which access the object of the atomic operation to stall until it is complete and has been written away (committed to cache/memory). The impact of this will depend on whether there is any actual contention with other cpu(s)/core(s) at that moment.