How is atomicity implemented by the CPU? - multithreading

I have been told (and read online) that the cache coherency protocol MESI/MESIF:
http://en.wikipedia.org/wiki/MESI_protocol
also enforces atomicity, for example for a lock. However, this really doesn't make sense to me, for the following reasons:
1) MESI manages cache access for all instructions. If MESI also enforces atomicity, how do we get race conditions? Surely all instructions would be atomic and we'd never get race conditions?
2) If MESI guarantees atomicity, what's the point of the LOCK prefix?
3) Why do people say atomic instructions carry overhead, if they are implemented using the same cache coherency model as all other x86 instructions?
Generally speaking, could somebody please explain how the CPU implements locks at a low level?

The LOCK prefix has one purpose: it takes a lock on that address and instructs MESI to flush that cache line on all other processors, so that reading or writing that address by any other processor (or hardware device!) blocks until the lock is released (which it is at the end of the instruction).
The LOCK prefix is slow (several hundred cycles) because it has to synchronize the bus for the duration, and the bus is much slower and higher-latency than the CPU.
General operation of LOCK instruction
1. validate
2. establish address lock on cache line
3. wait for all processors to flush (MESI kicks in here)
4. perform operation within cache line
5. flush cache line to RAM (which releases the lock)
Disclaimer: Much of this comes from the documentation of the Pentium F00F bug (where the validate part was erroneously done after establishing the lock) and so might be out of date.
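To make this concrete, here is a small, hedged C++ sketch (not part of the original answer): on x86-64, mainstream compilers typically compile a std::atomic fetch_add into a single LOCK-prefixed instruction (lock add, or lock xadd when the old value is needed), which goes through roughly the sequence above.

#include <atomic>

// A shared counter incremented atomically.
std::atomic<int> counter{0};

void increment() {
    // Typically emitted as: lock add dword ptr [counter], 1
    counter.fetch_add(1, std::memory_order_seq_cst);
}

int fetch_then_increment() {
    // The old value is used, so compilers typically emit: lock xadd
    return counter.fetch_add(1, std::memory_order_seq_cst);
}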

As #voo said, you are confusing coherency with atomicity.
Cache coherency covers many scenarios, but the basic example is when 2 different agents (cores on a multicore chip, processors on a multi-socket one, etc.) access the same line, they may both have it cached locally. MESI guarantees that when one of them writes a new value, all other stale copies are first invalidated, to prevent usage of the old value. As a by-product, this in fact guarantees atomicity of a single read or write access to memory, at cache-line granularity, which is part of the CPU charter on x86 (and many other architectures as well). It does more than that - it's a crucial part of the memory ordering and consistency guarantees that the CPU provides you.
It does not, however, provide any larger scale of atomicity, which is crucial for handling concepts like thread-safety and critical sections. What you are referring to with the locked operations is a read-modify-write flow, which is not guaranteed to be atomic by default (at least not on common CPUs), since it consists of 2 distinct accesses to memory. Without a lock in place, the CPU may receive a snoop in between and must respond according to the MESI protocol. The following scenario is perfectly legal, for example:
     core 0      |      core 1
-----------------+-----------------
y = read [x]     |
increment y      |  store [x] <- z
                 |
store [x] <- y   |
Meaning that your memory increment operation on core 0 didn't work as expected. If [x] holds a mutex, for example, you may think it was free and that you managed to grab it, while core 1 already took it.
Having the read-modify-write operation on core 0 locked (and x86 provides many options: locked add/inc, locked compare-exchange, etc.) would stall the other cores until the operation is done, so it essentially enhances the inter-core protocol to allow rejecting snoops.
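As an illustrative sketch of that difference (the variable names are mine, not from the answer), consider two threads incrementing the same counter: the plain ++ compiles to a separate load, add and store, so a snoop can land in between exactly as in the core 0 / core 1 interleaving above and updates get lost; the locked (atomic) increment cannot lose updates.

#include <atomic>
#include <cstdio>
#include <thread>

int plain_counter = 0;               // unsynchronized read-modify-write (technically a data race,
                                     // i.e. UB in C++; shown only to illustrate the hardware behaviour)
std::atomic<int> atomic_counter{0};  // locked read-modify-write on x86

void worker() {
    for (int i = 0; i < 1000000; ++i) {
        ++plain_counter;             // load, increment, store: another core can write in between
        atomic_counter.fetch_add(1); // lock add/inc: snoops are held off until it completes
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // plain_counter is very likely below 2000000 (lost updates); atomic_counter is exactly 2000000.
    std::printf("plain=%d atomic=%d\n", plain_counter, atomic_counter.load());
}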
It should be noted that a simple MESI protocol, if used correctly with alternative guarantees (like fences), can provide lock-free methods to perform atomic operations.

I think the point is that while the cache is involved in ordinary memory operations, it is required to do more for atomic operations than for your run of the mill ones.
Added later...
For ordinary operations:
when writing to memory, your typical core/cpu will maintain a write queue, so that once the write has been dispatched, the core/cpu continues processing instructions, while some other mechanism deals with emptying the queue of pending writes, negotiating with the cache as required. On some processors the pending writes need not be written away in the order they were put into the queue.

when reading from memory, if the required value is not immediately available, the core/cpu may continue processing instructions, while some other mechanism performs the required reads, negotiating with the cache as required.
all of which is designed to allow the core/cpu to keep going, decoupled as far as possible from the truly ghastly business of accessing real memory, via layers of cache, which is all horribly slow.
Now, for your atomic operations, the state of the core/cpu has to be synchronised with the state of the cache/memory.
So, for a "release" store: (a) everything in the write queue must be completed, before (b) the "release" write itself is completed, before (c) normal processing can continue. So all the benefits of the asynchronous writing of cache/memory may have to be foregone, until the atomic write completes. Similarly, for an "acquire" load: any reads which come after the "acquire" read must be delayed.
As it happens, the x86 is remarkably "well behaved". It does not reorder writes, so a "release" store does not need any extra work to ensure that it comes after any earlier stores. On the read side it also does not need to do anything special for an "acquire". If two or more cores/cpus are reading and writing the same piece of memory, then there will be more invalidating and reloading of cache lines, with the attendant overhead. When doing a "sequentially consistent" store, it has to be followed by an explicit mfence operation, which will stall the cpu/core until all writes have been flushed from the write queue. It is true that "sequentially consistent" is easier to think about... but for code where access to shared data is protected by locks, "acquire"/"release" is sufficient.
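A minimal C++ sketch of the acquire/release pattern described above (the names are illustrative): on x86, the release store and the acquire load compile to plain mov instructions, while a sequentially consistent store is typically an xchg, or a mov followed by mfence, which drains the write queue.

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                    // ordinary store
    ready.store(true, std::memory_order_release);    // x86: plain mov, no extra fence needed
    // ready.store(true, std::memory_order_seq_cst); // x86: xchg, or mov + mfence (drains the write queue)
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { // x86: plain mov
        // spin
    }
    // payload is guaranteed to be 42 here
}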
For your atomic "read-modify-write" and conditional versions thereof, the interaction with the cache/memory is even stronger. The cpu/core executing the operation must not only synchronise itself with the state of cache/memory, it must also arrange for other cpus/cores which access the object of the atomic operation to stall until it is complete and has been written away (committed to cache/memory). The impact of this will depend on whether there is any actual contention with other cpu(s)/core(s) at that moment.

Related

Is synchronization for variable change cheaper than for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo   12345678   (before)
      89ABCDEF   (new data)
      •••••      (writing thread progress)
foo   89ABC678   (memory content: a mix of new and old)
Since I never saw those things happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a real topic and not something I made up.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait because they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes by threads, processes, the OS, etc. It is not a matter of synchronization; it is simply required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read the lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory accesses. An atomic access is typically a read-modify-write cycle on a datum in memory that cannot be interrupted and that forbids access to this location until completion. So not all accesses are atomic, only specific read-modify-write operations, and they are realized thanks to specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic accesses are mostly used to synchronize threads, to ensure that only one thread is in a critical section.
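As a hedged sketch of such a lock (not from the original answer), here the fetch and the "set to 1" from the pseudocode above are collapsed into one atomic read-modify-write, so two threads can never both observe the lock as free. In C++, std::atomic_flag::test_and_set typically compiles to a LOCK-prefixed xchg or bts on x86.

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void acquire() {
    // Atomically reads the old value and writes 1; only the thread that saw 0 gets the lock.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin: someone else holds the lock
    }
}

void release() {
    lock_flag.clear(std::memory_order_release);
}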
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction execution order may be different from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: compilers reorder instructions, processors execute them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behaviour.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before it (and 1 will probably be removed by an optimizing compiler). But if 4 is executed before 3, the behaviour will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention: threads wake up and fight for the lock, only one gets it, and the rest go back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by using synchronization at a much finer granularity, as in a CAS (compare and swap) operation by the CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
    // a DB call
}
This block of code will take several seconds to execute, as it is doing I/O, and therefore runs a high chance of creating contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.
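As a rough sketch of the idea (illustrative, not production code), a Treiber-style stack pushes a node with a single compare-and-swap instead of holding a lock. Note that a matching pop needs extra care (the ABA problem), which is part of why these structures are harder to get right than locked ones.

#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, nullptr};
    n->next = head.load(std::memory_order_relaxed);
    // Retry until no other thread changed head between our read and our CAS;
    // on failure, compare_exchange_weak reloads n->next with the current head.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}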
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
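A minimal sketch of that usual pattern (the names are mine): the mutex provides the mutual exclusion, and its acquire/release semantics provide the cache-consistency part, so another thread can never observe the record half-updated.

#include <mutex>

struct Account {
    long balance = 0;
    long version = 0;
};

std::mutex m;
Account account;

void update(long delta) {
    std::lock_guard<std::mutex> guard(m); // acquire: other threads wait here
    account.balance += delta;             // record is temporarily inconsistent...
    account.version += 1;                 // ...but only visible to the lock holder
}                                         // release: all updates become visible to the next holder of m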

Does an x86 CPU reorder instructions?

I have read that some CPUs reorder instructions, but that this is not a problem for single-threaded programs (the instructions would still be reordered, but it would appear as if they were executed in order); it is only a problem for multithreaded programs.
To solve the problem of instructions reordering, we can insert memory barriers in the appropriate places in the code.
But does an x86 CPU reorder instructions? If it does not, then there is no need to use memory barriers, right?
Reordering
Yes, all modern x86 chips from Intel and AMD aggressively reorder instructions across a window which is around 200 instructions deep on recent CPUs from both manufacturers (i.e. a new instruction may execute while an older instruction more than 200 instructions "in the past" is still waiting). This is generally all invisible to a single thread since the CPU still maintains the illusion of serial execution1 by the current thread by respecting dependencies, so from the point of view of the current thread of execution it is as-if the instructions were executed serially.
Memory Barriers
That should answer the titular question, but then your second question is about memory barriers. It contains, however, an incorrect assumption that instruction reordering necessarily causes (and is the only cause of) visible memory reordering. In fact, instruction reordering is neither sufficient nor necessary for cross-thread memory re-ordering.
Now it is definitely true that out-of-order execution is a primary driver of out-of-order memory access capabilities, or perhaps it is the quest for MLP (Memory Level Parallelism) that drives the increasingly powerful out-of-order abilities for modern CPUs. In fact, both are probably true at once: increasing out-of-order capabilities benefit a lot from strong memory reordering capabilities, and at the same time aggressive memory reordering and overlapping isn't possible without good out-of-order capabilities, so they help each other in kind of a self-reinforcing sum-greater-than-parts kind of loop.
So yes, out-of-order execution and memory reordering certainly have a relationship; however, you can easily get re-ordering without out-of-order execution! For example, a core-local store buffer often causes apparent reordering: at the point of execution the store isn't written directly to the cache (and hence isn't visible at the coherency point), which delays local stores with respect to local loads which need to read their values at the point of execution.
As Peter also points out in the comment thread, you can also get a type of load-load reordering when loads are allowed to overlap in an in-order design: load 1 may start, but in the absence of an instruction consuming its result a pipelined in-order design may proceed to the following instructions, which might include another load 2. If load 2 is a cache hit and load 1 was a cache miss, load 2 might be satisfied earlier in time than load 1 and hence the apparent order may be swapped.
So we see that not all cross-thread memory re-ordering is caused by instruction re-ordering - but surely instruction re-ordering implies out-of-order memory access, right? Not so fast! There are two different contexts here: what happens at the hardware level (i.e., whether memory access instructions can, as a practical matter, execute out-of-order), and what is guaranteed by the ISA and platform documentation (often called the memory model applicable to the hardware).
x86 re-ordering
In the case of x86, for example, modern chips will freely re-order more or less any stream of loads and stores with respect to each other: if a load or store is ready to execute, the CPU will usually attempt it, despite the existence of earlier uncompleted load and store operations.
At the same time, x86 defines quite a strict memory model, which bans most possible reorderings, roughly summarized as follows:
Stores have a single global order of visibility, observed consistently by all CPUs, subject to one loosening of this rule below.
Local load operations are never reordered with respect to other local load operations.
Local store operations are never reordered with respect to other local store operations (i.e., a store that appears earlier in the instruction stream always appears earlier in the global order).
Local load operations may be reordered with respect to earlier local store operations, such that the load appears to execute earlier, with respect to the global store order, than the local store; the reverse (a later store appearing to execute before an earlier load) is not allowed.
So actually most memory re-orderings are not allowed: loads with respect to each other, stores with respect to each other, and loads with respect to later stores. Yet I said above that x86 pretty much freely executes out-of-order all memory access instructions - how can you reconcile these two facts?
Well, x86 does a bunch of extra work to track exactly the original order of loads and stores, and makes sure no memory re-ordering that breaks the rules is ever visible. For example, let's say load 2 executes before load 1 (load 1 appears earlier in program order), but both involved cache lines remained in the "exclusively owned" state during the period that load 1 and load 2 executed: there has been reordering, but the local core knows that it cannot be observed, because no other core was able to peek into this local operation.
In concert with the above optimizations, CPUs also use speculative execution: execute everything out of order, even if it is possible that at some later point some core can observe the difference, but don't actually commit the instructions until such an observation is impossible. If such an observation does occur, you roll back the CPU to an earlier state and try again. This is the cause of the "memory ordering machine clear" event on Intel.
So it is possible to define an ISA that doesn't allow any re-ordering at all, but which still re-orders under the covers while carefully checking that it isn't observed. PA-RISC is an example of such a sequentially consistent architecture. Intel has a strong memory model that allows one type of reordering but disallows many others, yet each chip internally may do more (or less) re-ordering as long as it can guarantee to play by the rules in an observable sense (in this sense, it is somewhat related to the "as-if" rule that compilers play by when it comes to optimizations).
The upshot of all that is that yes, x86 requires memory barriers to prevent specifically the so-called StoreLoad re-ordering (for algorithms that require this guarantee). You don't find many standalone memory barriers in practice in x86, because most concurrent algorithms also need atomic operations, such as atomic add, test-and-set or compare-and-exchange, and on x86 those all come with full barriers for free. So the use of explicit memory barrier instructions like mfence is limited to cases where you aren't also doing an atomic read-modify-write operation.
Jeff Preshing's Memory Reordering Caught in the Act has one example that shows memory reordering on real x86 CPUs, and that mfence prevents it.
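A sketch of that kind of litmus test (the variable names are illustrative): with relaxed ordering the stores and loads compile to plain movs, so each thread's load can overtake the other thread's still-buffered store, and both r1 and r2 can come out 0 on real x86 hardware; switching both operations to std::memory_order_seq_cst (which makes the compiler emit an mfence or a locked instruction) forbids that outcome.

#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1 = 0, r2 = 0;

void thread1() {
    X.store(1, std::memory_order_relaxed);  // plain mov; may sit in the store buffer for a while
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // With relaxed ordering, r1 == 0 && r2 == 0 is a possible, observable outcome (StoreLoad reordering).
    // With seq_cst on all four accesses, that outcome is impossible.
}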
1 Of course, if you try hard enough, such reordering is visible! A high-impact recent example would be the Spectre and Meltdown exploits, which exploited speculative out-of-order execution and a cache side channel to violate memory protection security boundaries.

Will atomic operations block other threads?

I am trying to get the "atomic vs non-atomic" concept settled in my mind. My first problem is that I could not find a "real-life analogy" for it, like a customer/restaurant relationship for atomic operations or something similar.
I would also like to learn how atomic operations fit into thread-safe programming.
In this blog post; http://preshing.com/20130618/atomic-vs-non-atomic-operations/
it is mentioned as:
An operation acting on shared memory is atomic if it completes in a
single step relative to other threads. When an atomic store is
performed on a shared variable, no other thread can observe the
modification half-complete. When an atomic load is performed on a
shared variable, it reads the entire value as it appeared at a single
moment in time. Non-atomic loads and stores do not make those
guarantees.
What is the meaning of "no other thread can observe the modification half-complete"?
Does that mean the thread will wait until the atomic operation is done? How does that thread know that the operation is atomic? For example, in .NET I can understand that if you lock the object, you set a flag to block other threads. But what about atomic? How do other threads know the difference between atomic and non-atomic operations?
Also, if the above statement is true, are all atomic operations thread-safe?
Let's clarify a bit what is atomic and what blocks. Atomicity means that an operation either executes fully, with all its side effects visible, or it does not execute at all. So all other threads can see either the state before the operation or the state after it. A block of code guarded by a mutex is atomic too, we just don't call it an operation. Atomic operations are special CPU instructions which are conceptually similar to a usual operation guarded by a mutex (you know what a mutex is, so I'll use it, despite the fact that it is itself implemented using atomic operations). The CPU has a limited set of operations which it can execute atomically, but thanks to hardware support they are very fast.
When we discuss threads blocking, we usually involve mutexes in the conversation, because code guarded by them can take quite some time to execute. So we say that a thread waits on a mutex. For atomic operations the situation is the same, but they are fast and we usually don't care about delays here, so it is not that likely to hear the words "block" and "atomic operation" together.
Does that mean the thread will wait until the atomic operation is done?
Yes, it will wait. The CPU will restrict access to the block of memory where the variable is located, and other CPU cores will wait. Note that for performance reasons this blocking is in place only for the duration of the atomic operation itself. CPU cores are allowed to cache variables for reading.
How does that thread know that the operation is atomic?
Special CPU instructions are used. It is simply written in your program that a particular operation should be performed in an atomic manner.
Additional information:
There are more tricky parts to atomic operations. For example, on modern CPUs all reads and writes of primitive types are usually atomic. But the CPU and the compiler are allowed to reorder them. So it is possible that you change some struct and set a flag telling that it has changed, but the CPU reorders the writes and sets the flag before the struct is actually committed to memory. When you use atomic operations, some additional effort is usually made to prevent undesired reordering. If you want to know more, you should read about memory barriers.
Simple atomic loads and stores are not that useful. To make maximal use of atomic operations you need something more complex. The most common is CAS - compare and swap: you compare a variable with a value and change it only if the comparison was successful.
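A small sketch of that CAS pattern (illustrative names): raise a shared maximum only if our candidate is still larger, retrying if another thread changed the value in between.

#include <atomic>

std::atomic<int> shared_max{0};

void update_max(int candidate) {
    int observed = shared_max.load();
    // On failure, compare_exchange_weak reloads `observed` with the current value and we re-check.
    while (observed < candidate &&
           !shared_max.compare_exchange_weak(observed, candidate)) {
    }
}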
On typical modern CPUs, atomic operations are made atomic this way:
When an instruction is issued that accesses memory, the core's logic attempts to put the core's cache in the correct state to access that memory. Typically, this state will be achieved before the memory access has to happen, so there is no delay.
While another core is performing an atomic operation on a chunk of memory, it locks that memory in its own cache. This prevents any other core from acquiring the right to access that memory until the atomic operation completes.
Unless two cores happen to be performing accesses to many of the same areas of memory and many of those accesses are writes, this typically won't involve any delays at all. That's because the atomic operation is very fast and typically the core knows in advance what memory it will need access to.
So, say a chunk of memory was last accessed on core 1 and now core 2 wants to do an atomic increment. When the core's prefetch logic sees the modification to that memory in the instruction stream, it will direct the cache to acquire that memory. The cache will use the intercore bus to take ownership of that region of memory from core 1's cache and it will lock that region in its own cache.
At this point, if another core tries to read or modify that region of memory, it will be unable to acquire that region in its cache until the lock is released. This communication takes place on the bus that connects the caches and precisely where it takes place depends on which cache(s) the memory was in. (If not in cache at all, then it has to go to main memory.)
A cache lock is not normally described as blocking a thread both because it is so fast and because the core is usually able to do other things while it's trying to acquire the memory region that is locked in the other cache. From the point of view of the higher-level code, the implementation of atomics is typically considered an implementation detail.
All atomic operations provide the guarantee that an intermediate result will not be seen. That's what makes them atomic.
The atomic operations you describe are instructions within the processor and the hardware will make sure that a read cannot happen on a memory location until the atomic write is complete. This guarantees that a thread either reads the value before write or the value after the write operation, but nothing in-between - there's no chance of reading half of the bytes of the value from before the write and the other half from after the write.
Code running against the processor is not even aware of this block but it's really no different from using a lock statement to make sure that a more complex operation (made up of many low-level instructions) is atomic.
A single atomic operation is always thread-safe - the hardware guarantees that the effect of the operation is atomic - it'll never get interrupted in the middle.
A set of atomic operations is not atomic in the vast majority of cases (I'm not an expert, so I don't want to make a definitive statement, but I can't think of a case where this would be different). This is why locking is needed for complex operations: the entire operation may be made up of multiple atomic instructions, but the whole of the operation may still be interrupted between any two of those instructions, creating the possibility of another thread seeing half-baked results. Locking ensures that code operating on the shared data cannot access that data until the other operation completes (possibly over several thread switches).
Some examples are shown in this question / answer, but you can find many more by searching.
Being "atomic" is an attribute that applies to an operation which is enforced by the implementation (either the hardware or the compiler, generally speaking). For a real-life analogy, look to systems requiring transactions, such as bank accounts. A transfer from one account to another involves a withdrawal from one account and a deposit to another, but generally these should be performed atomically - there is no time when the money has been withdrawn but not yet deposited, or vice versa.
So, continuing the analogy for your question:
What is the meaning of "no other thread can observe the modification half-complete"?
This means that no thread could observe the two accounts in a state where the withdrawal had been made from one account but it had not been deposited in another.
In machine terms, it means that an atomic read of a value in one thread will not see a value with some bits from before an atomic write by another thread, and some bits from after the same write operation. Various operations more complex than just a single read or write can also be atomic: for instance, "compare and swap" is a commonly implemented atomic operation that checks the value of a variable, compares it to a second value, and replaces it with another value if the compared values were equal, atomically - so for instance, if the comparison succeeds, it is not possible for another thread to write a different value in between the compare and the swap parts of the operation. Any write by another thread will either be performed wholly before or wholly after the atomic compare-and-swap.
The title to your question is:
Will atomic operations block other threads?
In the usual meaning of "block", the answer is no; an atomic operation in one thread won't by itself cause execution to stop in another thread, although it may cause a livelock situation or otherwise prevent progress.
Does that mean the thread will wait until the atomic operation is done?
Conceptually, it means that they will never need to wait. The operation is either done, or not done; it is never halfway done. In practice, atomic operations can be implemented using mutexes, at a significant performance cost. Many (if not most) modern processors support various atomic primitives at the hardware level.
Also, if the above statement is true, are all atomic operations thread-safe?
If you compose atomic operations, they are no longer atomic. That is, I can do one atomic compare-and-swap operation followed by another, and the two compare-and-swaps will individually be atomic, but they are divisible. Thus you can still have concurrency errors.
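A short sketch of that pitfall (the names are mine): each operation is individually atomic, but the check-then-act pair is divisible, so two threads can both pass the check.

#include <atomic>

std::atomic<int> tickets{1};

// BROKEN: the load and the fetch_sub are each atomic, but together they are not.
// Two threads can both see tickets == 1 and both decrement, driving it to -1.
bool try_take_ticket_racy() {
    if (tickets.load() > 0) {  // atomic operation #1
        tickets.fetch_sub(1);  // atomic operation #2
        return true;
    }
    return false;
}

// Correct: collapse check-then-act into a single atomic read-modify-write (CAS loop).
bool try_take_ticket() {
    int n = tickets.load();
    while (n > 0 && !tickets.compare_exchange_weak(n, n - 1)) {
        // n has been reloaded with the freshly observed value; retry
    }
    return n > 0;
}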
An atomic operation means the system performs an operation in its entirety or not at all. Reading or writing an int64 is atomic (on a 64-bit system and 64-bit CLR) because the system reads/writes the 8 bytes in one single operation; readers do not see half of the new value being stored and half of the old value. But be careful:
long n = 0;   // writing 'n' is atomic, 64-bit OS & 64-bit CLR
long m = n;   // reading 'n' is atomic
// ... some code
long o = n++; // not atomic: n = n + 1 does a read then a write in 2 separate operations
To make the n++ atomic you can use the Interlocked API:
long o = Interlocked.Increment(ref n); // other threads are blocked while the atomic operation is running

When can the CPU ignore the LOCK prefix and use cache coherency?

I originally thought cache coherency protocols such as MESI could provide pseudo-atomicity, but only for individual memory load/store instructions. If I was performing a fetch, modify, write combination of instructions, MESI alone wouldn't be able to enforce atomicity from the first instruction to the last.
However, section 8 of the Intel reference manual Vol 3a says:
8.1.4 Effects of a LOCK Operation on Internal Processor Caches
For the P6 and more recent processor families, if the area of memory
being locked during a LOCK operation is cached in the processor that
is performing the LOCK operation as write-back memory and is
completely contained in a cache line, the processor may not assert the
LOCK# signal on the bus. Instead, it will modify the memory location
internally and allow its cache coherency mechanism to ensure that the
operation is carried out atomically. This operation is called “cache
locking.” The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from
simultaneously modifying data in that area.
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
This seems to contradict my understanding by implying that the LOCK prefix doesn't need to be used, since cache coherency can be used instead?
There's a difference between locking as a concept, and the actual bus LOCK# signal - the latter is one of the means of implementing the former. Cache locking is another one that is much simpler and more efficient.
The MESI protocol guarantees that if a line is held exclusively by a certain core (either modified or not), no one else has it. In this case you can perform multiple operations atomically by adding a simple flag in the cache that blocks external snoops until the operations are done. This would have the same effect as the lock concept dictates, since no one else may change or even observe the intermediate values.
In more complicated cases, the line is not held by a single cache (for example, it may be shared between several caches, or the access may be split between two cache lines and only one is in your cache - the list of scenarios is usually implementation specific and probably not disclosed by the CPU manufacturer) - in such cases you may have to resort to "heavier" cannons like the bus lock, which usually guarantees that no one can do anything on the shared bus. Obviously this has a huge impact on performance, so it is probably only used when you have no other choice. In most cases a simple cache-level lock should be enough. Note that new schemes like Intel TSX seem to work in a similar manner, offering optimizations when you're working from within the cache.
By the way - your assumption about pseudo-atomicity for individual instructions is also wrong - it would be correct if you referred to a single memory operation (load or store), since an instruction may include multiple ones (inc [addr], for example, would not be atomic without a lock). Another restriction, which also appears in your quote, is that the access needs to be contained in a cache line - split lines don't guarantee atomicity even within a single load or store (since they're usually implemented as 2 memory operations that are later merged).
Reading the excerpt you give, I don't find it contradictory to using LOCK-ed instructions. For example, consider the INC instruction. Without the LOCK, it can read the original value while its cache line is in the SHARED state, which does not prevent other cores that also hold the line from concurrently reading the same value before the incremented result is stored = a data race.
I interpret the quote as saying that data integrity is guaranteed at cache-line granularity; the additional care may not be necessary when the data fits in one cache line. But if the data crosses the boundary of two cache lines, it is necessary to assert that modifications to both of them will be treated atomically.

How does the cache coherency protocol enforce atomicity?

I understand atomicity can be guaranteed on operations like xsub(), without using the LOCK prefix, by relying on the cache coherency protocol (MESI/MESIF).
1) How can the cache coherency protocol do this???
It's making me wonder: if the cache coherency protocol can enforce atomicity, why do we need special atomic types/instructions etc?
2) If MOSI implements atomic instructions across multi-core systems then what is the purpose of LOCK? Legacy?
3) If MOSI implements atomic instructions and MOSI is used for all instructions- then why do atomic instructions cost so much? Surely the performance should be same as normal instructions.
Atomicity and Memory Ordering
For an operation to be atomic it must appear to be one undivided operation to any observer. That observer can be anything that can see the effect of the operation, whether it's the thread that does the operation, a different thread on the same processor, a thread on a different processor, or some component or device in the system. Observers that can't see the effect of the operation, whether the same thread, a different thread, or a device, don't affect whether the operation is atomic or not.
(Note that by processor I mean what Intel's documentation would call a logical processor. A system with two CPU sockets, each populated with a quad-core CPU with two logical processors per core would have a total of 16 processors.)
A related but different concept is memory ordering. Memory accesses are only sequentially consistent if they appear to an observer as happening in the order they occur in the program. This guarantee always applies when the observer is the same thread that performed the operations. Other, more limited guarantees of memory ordering are possible. A strong but not sequentially consistent ordering might guarantee that many sorts of operations are ordered with respect to each other, but not all. A weak memory ordering provides no guarantees about how accesses appear to other threads.
Compilers and Atomicity
When you're writing a program in C or some other higher-level language, it may appear that certain operations are atomic and sequentially ordered, but the compiler only generally guarantees this when viewed from the same thread that performed those operations. However, from the compiler's perspective, any code that runs when a thread is asynchronously interrupted happens in a different thread of execution, even if that code runs in the same OS thread. That means the code running in a signal handler or in a structured exception handler isn't guaranteed to see operations performed outside the handler, in the same thread, as being atomic or sequentially consistent.
Because of this limited general guarantee, the compiler is free to do things like implement what look to be atomic operations using multiple assembler instructions, making them appear non-atomic to other observers. The compiler can also reorder memory accesses, or even remove apparently redundant accesses entirely. It can do whatever optimizations it wants, so long as in the single uninterrupted thread case the program still behaves as if it were doing all those operations in program order.
In the multi-threaded case, or where signal or exception handlers are present, it's necessary to take special steps to inform the compiler where you need it to provide broader guarantees of atomicity and memory ordering. That's the purpose of special atomic types and functions. Even if the CPU guarantees every instruction is atomic and every memory access is sequentially consistent to all other threads, the compiler doesn't.
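A sketch of why the compiler needs to be told (assumed names, in C++): with a plain variable the compiler may legally keep the value in a register and never re-read it, or move surrounding accesses across it; with std::atomic it must emit every load and store and respect the requested ordering, independently of what the CPU itself guarantees.

#include <atomic>

bool plain_done = false;        // ordinary variable
std::atomic<bool> done{false};  // atomic type

void wait_plain() {
    // The compiler may read plain_done once, cache it in a register, and turn this
    // into an infinite loop (and concurrent access is a data race, i.e. undefined behaviour).
    while (!plain_done) { }
}

void wait_atomic() {
    // The compiler must re-load the value on every iteration and may not reorder
    // later reads of shared data ahead of this acquire load.
    while (!done.load(std::memory_order_acquire)) { }
}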
Intel CPUs and Atomicity
Intel CPUs make it fairly easy for the compiler to provide these guarantees. Except for some odd cases, instructions are uninterruptible. Any event that causes the execution of an instruction to be interrupted either happens after the instruction is fully completed or allows the instruction to be resumed as if it were never executed. This means that at the machine code level every operation is atomic and every memory operation is sequentially consistent as it appears to code running on the same processor. In the single-processor case nothing needs to be done to provide these guarantees, except when they need to be visible to devices other than the processor. In that case the LOCK prefix combined with uncached memory regions must be used to guarantee that read/modify/write instructions are atomic and that memory accesses appear sequentially consistent to other devices.
In the multi-processor case, when accessing cached memory, the cache coherency protocol provides guarantees of atomicity for most instructions and a strong memory ordering, but not a sequentially consistent ordering. The exact mechanism by which it does this doesn't matter much, just the guarantees it gives. Any instruction that only accesses a single memory location will appear atomic to other processors. The ordering guarantees are too long to go into here (Intel uses 16 bullet points to describe them), but they are apparently a superset of the guarantees that C and C++ provide with the acquire and release memory orders. When that level of memory ordering is specified, the C/C++ atomic operations can use ordinary unlocked instructions.
The need for the LOCK prefix, and for those instructions where the LOCK prefix is implicit, comes when you need stronger guarantees than the cache coherency protocol provides. If you need your read/modify/write instructions to be atomic, you need to use the LOCK prefix. If you need sequentially consistent ordering, you need to use the LOCK prefix.
The LOCK prefix is where the high cost of atomic operations comes from. It causes the processor to wait for all previous load and store operations to complete. Even though, when accessing cached memory, the LOCK prefix is handled entirely within the cache without asserting LOCK#, the processor still needs to wait to ensure the operation appears sequentially consistent to other processors.
Summary
So in summary the answers to your questions are:
The cache coherency protocol can only enforce atomicity of certain machine code instructions when viewed from other processors. It can't ensure that the compiler generates a single instruction for an operation you want to be atomic. It also can't guarantee that the instruction appears to be atomic to non-processor devices on the system.
The LOCK prefix is used on machine code instructions that
perform multiple memory accesses and need to appear atomic to other processors
need to be sequentially consistent to other processors
need to be atomic and/or sequentially consistent to other non-processor devices.
When it's possible to get the necessary atomicity and memory ordering guarantees without using the LOCK prefix, the instructions used are the same as ordinary instructions, and so cost the same. Where the LOCK prefix is needed to provide the necessary guarantees, the cost of the instruction becomes much higher than that of a normal instruction.
There is no xsub instruction in x86, but there is an xadd ;)
You should read the section about the LOCK prefix in the Instruction Set Reference, and the section 8.1 LOCKED ATOMIC OPERATIONS in the Software Developer's Manual Volume 3A: System Programming Guide, Part 1.
A single CPU refers to a single core nowadays, with its own cache. When you have multiple caches for multiple cores (physically in the same or separate CPU chips) they use some cache coherency protocol. In the case of MESI, the core executing the atomic instruction will first ensure it has ownership of the cache line containing the operand, marks it Modified, and additionally locks it. If another core needs the cache line, it will do a read operation which the owner core will snoop, delaying the answer until the atomic operation completes.
On single-CPU, single-core systems, most instructions are atomic with respect to threading, except for string instructions using a REP prefix, because scheduling interrupts and thus context switches only happen on instruction boundaries. A hardware device could, however, observe non-atomic behaviour.

Resources