Do spinlocks really need DMB?

Do spinlocks really need DMB? - linux

I'm working with a dual Cortex-A9 system and I've been trying to
understand exactly why spinlock functions need to use DMB. It seems
that as long as the merging store buffer is flushed the lock value
should end up in the L1 on the unlocking core and the SCU should
either invalidate or update the value in the L1 of the other core.
This is enough to maintain coherency and safe locking right? And
doesn't STREX skip the merging store buffer anyway, meaning we don't
even need the flush?
DMB appears to be something of a blunt hammer, especially since it
defaults to the system domain, which likely means a write all the way
to main memory, which can be expensive.
Are the DMBs in the locks there as a workaround for drivers that don't
use smp_mb properly?
I'm currently seeing, based on the performance counters, about 5% of
my system cycles disappearing in stalls caused by DMB.

I found these articles may answer your question:
Locks, SWPs and two Smoking Barriers
Locks, SWPs and two Smoking Barriers (Part 2)
In particular:
You will note the Data Memory Barrier (DMB) instruction that is issued once the lock has been acquired. The DMB guarantees that all memory accesses before the memory barrier will be observed by all of the other CPUs in the system before all memory accesses made after the memory barrier. This makes more sense if you consider that once a lock has been acquired, a program will then access the data structure(s) locked by the lock. The DMB in the lock function above ensures that accesses to the locked data structure are observed after accesses to the lock.

The DMB is needed in the SMP case because the other processor may see the memory accesses happening in a different order without it, i.e. accesses from inside the critical section may happen before the lock is taken from the point-of-view of the second core.
So the second core could see itself holding the lock and also see updates from inside the cricital section running on the other core, breaking consistency.

Related

Does a mutex lock guarantee that a thread will always store updated values into the main memory?

a. Does accessing a memory location with a mutex lock mean that whatever the critical code is doing to the mutexed-variables will end up into the main memory, and not only updated inside the thread's cache or registers without a fresh copy of values in the main memory?
b. If that's the case, aren't we effectively running the critical core as if we didn't have a cache (at least no cache locations for mutex-lock variables)?
c. And if that is the case then isn't the critical code a heavy weight code, and needs to be as small as possible, considering the continued need to read from and write into the main memory at least at the beginning and end of the mutex-locking session?

a. Does accessing a memory location with a mutex lock mean that whatever the critical code is doing to the mutexed-variables will end up into the main memory, and not only updated inside the thread's cache or registers without a fresh copy of values in the main memory?
A correctly implemented mutex guarantees that previous writes are visible to other agents (e.g. other CPUs) when the mutex is released. On systems with cache coherency (e.g. 80x86) modifications are visible when they're in a cache and it doesn't matter if modifications have reached main memory.
Essentially (over-simplified), for cache coherency, when the other CPU wants the modified data it broadcasts a request (like "Hey, I want the data at address 123456"), and if its in another CPUs' cache the other CPU responds with "Here's the data you wanted", and if the data isn't in any cache the memory controller responds with "Here's the data you wanted"; and the CPU gets the most recent version of the data regardless of where the data was or what responds to the request. In practice it's a lot more complex - I'd recommend reading about the MESI cache control protocol if you're interested ( https://en.wikipedia.org/wiki/MESI_protocol ).
b. If that's the case, aren't we effectively running the critical core as if we didn't have a cache (at least no cache locations for mutex-lock variables)?
If it is the case (e.g. if there's no cache coherency); something (the code to release a mutex) would have to ensure that modified data is written back to RAM before the mutex can be acquired by something else. This doesn't prevent the cache from being used inside the critical section (e.g. critical section could write to cache, and then the modified data can be sent from cache to RAM after).
The cost would depend on various factors (CPU speed, cache speed and memory speed, and whether the cache is "write back" or "write through", and how much data is modified). For some cases (relatively slow CPU with write-through caches) the cost may be almost nothing.
c. And if that is the case then isn't the critical code a heavy weight code, and needs to be as small as possible, considering the continued need to read from and write into the main memory at least at the beginning and end of the mutex-locking session?
It's not as heavy as not using caches.
Synchronizing access (regardless of how its done) is always going to be more expensive than not synchronizing access (and crashing because all your data got messed up). ;-)
One of the challenges of multi-threaded code is finding a good compromise between the cost of synchronization and parallelism - a small number of locks (or a single global lock) reduces the cost of synchronization but limits parallelism (threads getting nothing done waiting to acquire a lock); and a large number of locks increases the cost of synchronization (e.g. acquiring more locks is more expensive than acquiring one) but allows more parallelism.
Of course parallelism is also limited by the number of CPUs you have; which means that a good compromise for one system (with few CPUs) may not be a good compromise on another system (with lots of CPUs).

Is synchronization for variable change cheaper then for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I never saw those things happen I assume that there is some system level synchronization when writing variables. I assume, that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a topic and not totally fictious from me.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?

Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes by threads processes, OS, etc. It is not a matter of synchronization, just required to insure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 et locked when at 1.
The basic method to do that is
got_the_lock=0
while(!got_the_lock)
fetch lock value from memory
set lock value in memory to 1
got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one need atomic memory access. An atomic access is typically a read-modify-write cycle to a data in memory that cannot interrupted and that forbids access to this information until completion. So not all accesses are atomic, only specific read-modify-write operation and it is realized thanks tp specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be a regular access. Atomic access is mostly use to synchronize threads to insure that only one thread is in a critical section.
So why are atomic access expensive ? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction order may be different from instruction program order, provided the semantic of the program is respected. This is heavily exploited to improve performances : compiler reorder instructions, processor execute them out-of-order, write-back caches write data in memory in any order, and memory write buffer do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffer that requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory access are not atomic. Only specific synchronization instructions are expensive.

Synchronization always has a cost involved. And the cost increases with contention due to threads waking up, fighting for lock and only one gets it, and the rest go to sleep resulting in lot of context switches.
However, such contention can be kept at a minimum by using synchronization at a much granular level as in a CAS (compare and swap) operation by CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute as it is doing a IO and therefore run high chance of creating a contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason the non-blocking algorithms like Treiber Stack Michael Scott exist. They do a their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.

isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, None of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.

Memory barriers force cache coherency?

I was reading this question about using a bool for thread control and got intrigued by this answer by #eran:
Using volatile is enough only on single cores, where all threads use the same cache. On multi-cores, if stop() is called on one core and run() is executing on another, it might take some time for the CPU caches to synchronize, which means two cores might see two different views of isRunning_.
If you use synchronization mechanisms, they will ensure all caches get the same values, in the price of stalling the program for a while. Whether performance or correctness is more important to you depends on your actual needs.
I have spent over an hour searching for some statement that says synchronization primitives force cache coherency but have failed. The closest I have come is Wikipedia:
The keyword volatile does not guarantee a memory barrier to enforce cache-consistency.
Which suggests that memory barriers do force cache consistency, and since some synchronization primitives are implemented using memory barriers (again from Wikipedia) this is some "evidence".
But I don't know enough to be certain whether to believe this or not, and be sure that I'm not misinterpreting it.
Can someone please clarify this?

Short Answer : Cache coherency works most of the time but not always. You can still read stale data. If you don't want to take chances, then just use a memory barrier
Long Answer : CPU core is no longer directly connected to the main memory. All loads and stores have to go through the cache. The fact that each CPU has its own private cache causes new problems. If more than one CPU is accessing the same memory it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet to main memory) and a second processor tries to read the same memory location, the read operation cannot just go out to the main memory. . Instead the content of the first processor’s cacheline is needed. The question now is when does this cache line transfer have to happen? This question is pretty easy to answer: when one processor needs a cache line which is dirty in another processor’s cache for reading or writing. But how can a processor determine whether a cache line is dirty in another processor’s cache? Assuming it just because a cache line is loaded by another processor would be suboptimal (at best). Usually the majority of memory accesses are read accesses and the resulting cache lines are not dirty. Here comes cache coherency protocols. CPU's maintain data consistency across their caches via MESI or some other cache coherence protocol.
With cache coherency in place, should we not see that latest value always for the cacheline even if it was modified by another CPU? After all that is whole purpose of the cache coherency protocols. Usually when a cacheline is modified, the corresponding CPU sends an "invalidate cacheline" request to all other CPU's. It turns out that CPU’s can send acknowledgement to the invalidate requests immediately but defer the actual invalidation of the cacheline to a later point in time. This is done via invalidation queues. Now if we get un-lucky enough to read the cacheline within this short window (between the CPU acknowledging an invalidation request and actually invalidating the cacheline) then we can read a stale value. Now why would a CPU do such a horrible thing. The simple answer is PERFORMANCE. So lets look into different scenarios where invalidation queues can improve performance
Scenario 1 : CPU1 receives an invalidation request from CPU2. CPU1 also has a lot of stores and loads queued up for the cache. This means that the invalidation of the requested cacheline takes times and CPU2 gets stalled waiting for the acknowledgment
Scenario 2 : CPU1 receives a lot of invalidation requests in a short amount of time. Now it takes time for CPU1 to invalidate all the cachelines.
Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. So invalidation queues are the reason why we may not see the latest value even when doing a simple read of a single variable.
Now the keen reader might be thinking, when the CPU wants to read a cacheline, it could scan the invalidation queue first before reading from the cache. This should avoid the problem. However the CPU and invalidation queue are physically placed on opposite sides of the cache and this limits the CPU from directly accessing the invalidation queue. (Invalidation queues of one CPU’s cache are populated by cache coherency messages from other CPU’s via the system bus. So it kind of makes sense for the invalidation queues to be placed between the cache and the system bus). So in order to actually see the latest value of any shared variable, we should empty the invalidation queue. Usually a read memory barrier does that.
I just talked about invalidation queues and read memory barriers. [1] is a good reference for understanding the need for read and write memory barriers and details of MESI cache coherency protocol
[1] http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf

As I understand, synchronization primitives won't affect cache coherency at all. Cache is French for hidden, it's not supposed to be visible to the user. A cache coherency protocol should work without the programmer's involvement.
Synchronization primitives will affect the memory ordering, which is well defined and visible to the user through the processor's ISA.
A good source with detailed information is A Primer on Memory Consistency and Cache Coherence from the Synthesis Lectures on Computer Architecture collection.
EDIT: To clarify your doubt
The Wikipedia statement is slightly wrong. I think the confusion might come from the terms memory consistency and cache coherency. They don't mean the same thing.
The volatile keyword in C means that the variable is always read from memory (as opposed to a register) and that the compiler won't reorder loads/stores around it. It doesn't mean the hardware won't reorder the loads/stores. This is a memory consistency problem. When using weaker consistency models the programmer is required to use synchronization primitives to enforce a specific ordering. This is not the same as cache coherency. For example, if thread 1 modifies location A, then after this event thread 2 loads location A, it will receive an updated (consistent) value. This should happen automatically if cache coherency is used. Memory ordering is a different problem. You can check out the famous paper Shared Memory Consistency Models: A Tutorial for more information. One of the better known examples is Dekker's Algorithm which requires sequential consistency or synchronization primitives.
EDIT2: I would like to clarify one thing. While my cache coherency example is correct, there is a situation where memory consistency might seem to overlap with it. This when stores are executed in the processor but delayed going to the cache (they are in a store queue/buffer). Since the processor's cache hasn't received an updated value, the other caches won't either. This may seem like a cache coherency problem but in reality it is not and is actually part of the memory consistency model of the ISA. In this case synchronization primitives can be used to flush the store queue to the cache. With this in mind, the Wikipedia text that you highlighted in bold is correct but this other one is still slightly wrong: The keyword volatile does not guarantee a memory barrier to enforce cache-consistency. It should say: The keyword volatile does not guarantee a memory barrier to enforce memory consistency.

What wikipedia tells you is that volatile does not mean that a memory barrier will be inserted to enforce cache-consistency. A proper memory barrier will however enforce that memory access between multiple CPU cores is consistent, you may find reading the std::memory_order documentation helpful.

How does the cache coherency protocol enforce atomicity?

I understand atomicity can be guaranteed on operations like xsub(), without using the LOCK prefix, by relying on the cache coherency protocol (MESI/MESIF).
1) How can the cache coherency protocol do this???
Its making me wonder if the cache coherency protocol can enforce atomicity, why do we need special atomic types/instructions etc?
2) If MOSI implements atomic instructions across multi-core systems then what is the purpose of LOCK? Legacy?
3) If MOSI implements atomic instructions and MOSI is used for all instructions- then why do atomic instructions cost so much? Surely the performance should be same as normal instructions.

Atomicity and Memory Ordering
For an operation to be atomic it must appear to be one undivided operation to any observer. That observer can be anything that can see the effect of the operation, whether its the thread does the operation, a different thread on the same processor a thread on different processor, or some component or device in the system. Observers that can't see the effect of the operation, whether the same thread, a different thread, or a device, don't affect whether the operation is atomic or not.
(Note that by processor I mean what Intel's documentation would call a logical processor. A system with two CPU sockets, each populated with a quad-core CPU with two logical processors per core would have a total of 16 processors.)
A related but different concept is memory ordering. Memory accesses are only sequentially consistent if they appear to an observer as happening in the order they occur in the program. This guarantee always applies then when the observer is the same thread as performed the operations. Other more limited guarantees of memory ordering are possible. A strong but not sequentially consistent ordering might guarantee many sorts of operations are ordered with respect to each other, but not all. A weak memory ordering provides no guarantees about how accesses appear to other threads.
Compilers and Atomicity
When you're writing a program in C or some other higher level language it may appear that certain operations are atomic and sequentially ordered, but the compiler only generally guarantees this when viewed from the same thread that performed those operations. However, from the compiler's perspective any code that runs when a thread is asynchronously interrupted happens in different thread of execution even if that code runs in the same OS thread. That means the code running in a signal handler or in a structured exception handler isn't guaranteed to see operations performed outside the the handler in the same thread as being atomic or sequentially consistent.
Because of the limited general guarantee the compiler is free to do things like implement what look to be atomic operations using multiple assembler instructions make them appear non-atomic to other observers. The compiler can also reorder memory accesses, even remove apparently redundant accesses entirely. It can do whatever optimizations it wants so long in the single uninterrupted thread case the program still behaves as if it were doing all those operations in program order.
In the multi-threaded case, or where signal or exception handlers a present, it's necessary to take special steps to inform the compiler where you need it to provide broader guarantees of atomicity and memory ordering. That's the purpose special atomic types and functions. Even if the CPU guarantees every instruction is atomic and every memory access is sequentially consistent to all other threads, the compiler doesn't.
Intel CPUs and Atomicity
Intel CPUs make it fairly easy for the compiler to provide these guarantees. Except for some odd cases, instructions are uninterruptable. Any event that causes the execution of an instruction to be interrupted either happens after the instruction is fully completed or allows the instruction to resumed as if it were never executed. The means that at the machine code level every operation is atomic and every memory operation is sequentially consistent as it appears to code running on the same processor. In the single processor case nothing needs to be done provide these guarantees except when they need to be to visible to devices other than the processor. In that case the LOCK prefix combined with uncached memory regions must be used to guarantee read/modify/write instructions are atomic and memory accesses appear sequentially consistent to other devices.
In the multi-processor case when accessing cached memory the cache coherency protocol provides guarantees of atomicity with most instructions and a strong memory ordering but not a sequentially consistent ordering. The exact mechanism by which is does this doesn't matter much, just the guarantees is gives. Any instruction that only accesses a single memory location will appear atomic to other processors. The ordering guarantees are too long to go into here, Intel uses 16 bullet points to describe them, but they apparently its a superset the guarantees that C and C++ provide with the acquire and release memory order. When that level of memory ordering is specified, the C/C++ atomic operations can use ordinary unlocked instructions.
The need for the LOCK prefix, and those instructions where the LOCK prefix is implicit, comes when you need stronger guarantees than the cache coherency protocol provides. If you need your read/modifiy/write instructions to be atomic you need to use the LOCK prefix. If you need sequentially consistent ordering you need to use the LOCK prefix.
The LOCK prefix is where the high cost of atomic operations comes from. It causes the processor to wait for all previous load and store operations to complete. Even though when accessing cached memory the LOCK prefix handled entirely within the cache without asserting LOCK#, the processor still needs to wait to ensure the operation appears sequentially consistent to other processors.
Summary
So in summary the answers to your questions are:
The cache coherency protocol can only enforce atomicity of certain machine code instruction when viewed from other processors. It can't ensure that the compiler generates a single instruction for an operation you want to be atomic. It also can't guarantee that the instruction appears to be atomic to non-processor devices on the system.
The LOCK prefix is used on machine code instructions that
perform multiple memory accesses and need appear to be atomic to other processors
need to be sequentially consistent to other processors
need to be atomic and/or sequentially consistent to other non-processor devices.
When its possible to get the necessary atomicity and memory ordering guarantees without using the LOCK prefix, the instructions used are the same as ordinary instructions and so cost the same. Where LOCK prefix is needed to provide the necessary guarantees the cost of the instruction becomes much higher than a normal instruction.

There is no xsub instruction in x86, but there is an xadd ;)
You should read the section about the LOCK prefix in the Instruction Set Reference, and the section 8.1 LOCKED ATOMIC OPERATIONS in the Software Developer's Manual Volume 3A: System Programming Guide, Part 1.
The single CPU refers to a single core nowadays, with its own cache. When you have multiple caches for multiple cores (physically in the same or separate cpu chips) they use some cache coherency protocol. In case of MESI, the core executing the atomic instruction will first ensure it has ownership of the cache line containing the operand and marks it modified, additionally locking it. If another core needs the cache line, it will do a read operation which the owner core will snoop and delay the answer until the atomic operation completes.
On single-cpu single-core systems, most instructions are atomic with respect to threading except for string instructions using a REP prefix because scheduling interrupts and thus context switches only happen on instruction boundaries. A hardware device could however observe non-atomic behaviour.

How is atomicity implemented by the CPU?

I have been told/read online the cache coherency protocol MESI/MESIF:
http://en.wikipedia.org/wiki/MESI_protocol
also enforces atomicity- for example for a lock. However, this really really doesn't make sense to me for the following reasons:
1) MESI manages cache access for all instructions. If MESI also enforces atomicity, how do we get race conditions? Surely all instructions would be atomic and we'd never get race conditions?
2) If MESI gurarantees atomicity, whats the point of the LOCK prefix?
3) Why do people say atomic instructions carry overhead- if they are implemented using the same cache coherency model as all other x86 instructions?
Generally-speaking could somebody please explain how the CPU implements locks at a low-level?

The LOCK prefix has one purpose, that is taking a lock on that address followed by instructing MESI to flush that cache line on all other processors followed so that reading or writing that address by all other processors (or hardware devices!) blocks until the lock is released (which it is at the end of the instruction).
The LOCK prefix is slow (several hundred cycles) because it has to synchronize the bus for the duration and the bus speed and latency is much lower than CPU speed.
General operation of LOCK instruction
1. validate
2. establish address lock on cache line
3. wait for all processors to flush (MESI kicks in here)
4. perform operation within cache line
5. flush cache line to RAM (which releases the lock)
Disclaimer: Much of this comes from the documentation of the Pentium F00F bug (where the validate part was erroneously done after establish lock) and so might be out of date.

As #voo said, you are confusing coherency with atomicity.
Cache coherency covers many scenarios, but the basic example is when 2 different agents (cores on a multicore chip, processors on a multi-socket one, etc..), access the same line, they may both have it cached locally. MESI guarantees that when one of them writes a new value, all other stale copies are first invalidated, to prevent usage of the old value. As a by-product, this in fact guarantees atomicity of a single read or write access to memory, on a cacheline granularity, which is part of the CPU charter on x86 (and many other architectures as well). It does more than that - it's a crucial part of memory ordering and consistency guarantees that the CPU provides you.
It does not, however, provide any larger scale of atomicity, which is crucial for handling concepts like thread-safety and critical sections. What you are referring to with the locked operations is a read-modify-write flow, which is not guaranteed to be atomic by default (at least not on common CPUs), since it consists of 2 distinct accesses to memory. without a lock in place, the CPU may receive a snoop in between, and must respond according to the MESI protocol. The following scenario is perfectly legal for e.g.:
core 0 | core 1
---------------------------------
y = read [x] |
increment y | store [x] <- z
|
store [x] <- y |
Meaning that your memory increment operation on core 0 didn't work as expected. If [x] holds a mutex for e.g, you may think it was free and that you managed to grab it, while core 1 already took it.
Having the read-modify-write operation on core 0 locked (and x86 provides many possible options, locked add/inc, locked compare-exchange, etc..), would stall the other cores until the operation is done, so it essentially enhances the inter-core protocol to allow rejecting snoops.
It should be noted that a simple MESI protocol, if used correctly with alternative guarantees (like fences), can provide lock-free methods to perform atomic operations.

I think the point is that while the cache is involved in ordinary memory operations, it is required to do more for atomic operations than for your run of the mill ones.
Added later...
For ordinary operations:
when writing to memory, your typical core/cpu will maintain a write
queue, so that once the write has been dispatched, the core/cpu
continues processing instructions, while some other mechanics deals
with emptying the queue of pending writes -- negotiating with the
cache as required. On some processors the pending writes need not be
written away in the order they were put into the queue.
when reading from memory, if the required value is not immediately
available, the core/cpu may continue processing instructions, while
some other mechanics perform the required reads -- negotiating with
the cache as required.
all of which is designed to allow the core/cpu to keep going, decoupled as far as possible from the truely ghastly business of accessing real memory, via layers of cache, which is all horribly slow.
Now, for your atomic operations, the state of the core/cpu has to be synchronised with the state of the cache/memory.
So, for a "release" store: (a) everything in the write queue must be completed, before (b) the "release" write itself is completed, before (c) normal processing can continue. So all the benefits of the asynchronous writing of cache/memory may have to be foregone, until the atomic write completes. Similarly, for an "acquire" load: any reads which come after the "acquire" read must be delayed.
As it happens, the x86 is remarkably "well behaved". It does not reorder writes, so a "release" store does not need any extra work to ensure that it comes after any earlier stores. On the read side it also does not need to do anything special for an "acquire". If two or more cores/cpus are reading and writing the same piece of memory, then there will be more invalidating and reloading of cache lines, with the attendant overhead. When doing a "sequentially consistent" store, it has to be followed by an explicit mfence operation, which will stall the cpu/core until all writes have been flushed from the write queue. It is true that "sequentially consistent" is easier to think about... but for code where access to shared data is protected by locks, "acquire"/"release" is sufficient.
For your atomic "read-modify-write" and conditional versions thereof, the interaction with the cache/memory is even stronger. The cpu/core executing the operation must not only synchronise itself with the state of cache/memory, it must also arrange for other cpus/cores which access the object of the atomic operation to stall until it is complete and has been written away (committed to cache/memory). The impact of this will depend on whether there is any actual contention with other cpu(s)/core(s) at that moment.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string