The following two citations seem to contradict each other:
https://www.kernel.org/doc/Documentation/atomic_ops.txt
int atomic_cmpxchg(atomic_t *v, int old, int new);
This performs an atomic compare exchange operation on the atomic value
v, with the given old and new values. Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as
all other accesses of *v are performed through atomic_xxx operations.
atomic_cmpxchg requires explicit memory barriers around the operation.
vs
https://www.kernel.org/doc/Documentation/memory-barriers.txt
Any atomic operation that modifies some state in memory and returns
information about the state (old or new) implies an SMP-conditional
general memory barrier (smp_mb()) on each side of the actual operation
(with the exception of explicit lock operations, described later).
These include:
<...>
atomic_xchg();
atomic_cmpxchg();
<...>
These are used for such things as implementing LOCK-class and UNLOCK-class operations and adjusting reference
counters towards object destruction, and as such the implicit memory
barrier effects are necessary.
So should one put memory barriers around atomic_xchg() manually?
I'm not yet familiar with Linux kernel programming specifics, so here is a partial (general) answer.
On x86, this operation carries a full memory fence with it; there is no need for mfence/lfence/sfence around the cmpxchg instruction.
On other architectures with a relaxed memory model, it can be coupled with other memory semantics, e.g. "release", depending on how atomic_cmpxchg() is translated into opcodes.
That covers the processor side of things. However, the compiler can also reorder operations, so if a compiler barrier is not implied by atomic_cmpxchg() (e.g. via __asm__ __volatile__), you would need one.
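For illustration, here is a minimal sketch of a compiler-only barrier in GCC-style C/C++ (the Linux kernel's barrier() macro is essentially this; the publish() function and variable names are hypothetical):

/* Compiler barrier: forbids the compiler from reordering memory accesses
 * across this point, but emits no CPU fence instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

int shared_data;
int ready;

void publish(void)
{
    shared_data = 42;
    compiler_barrier();  /* keep the compiler from sinking the store below */
    ready = 1;           /* on a weakly ordered CPU a hardware barrier would
                            still be needed between the two stores */
}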
Related
The std::sync::atomic module contains a number of atomic variants of primitive types, with the stated purpose that these types are now thread-safe. However, all the primitives that correspond to the atomic types already implement Send and Sync, and should therefore already be thread-safe. What's the reasoning behind the Atomic types?
Generally, non-atomic integers are safe to share across threads because they're immutable. If you attempt to modify the value, you implicitly create a new one in most cases because they're Copy. However, it isn't safe to share a mutable reference to a u32 across threads (or have both mutable and immutable references to the same value), which practically means that you won't be able to modify the variable and have another thread see the results. An atomic type has some additional behavior which makes it safe.
In the more general case, using non-atomic operations doesn't guarantee that a change made in one thread will be visible in another. Many architectures, especially RISC architectures, do not guarantee that behavior without additional instructions.
In addition, compilers often reorder accesses to memory in functions and in some cases, across functions, and an atomic type with an appropriate barrier is required to indicate to the compiler that such behavior is not wanted.
Finally, atomic operations are often required to logically update the contents of a variable. For example, I may want to atomically add 1 to a variable. On a load-store architecture such as ARM, I cannot modify the contents of memory with an add instruction; I can only perform arithmetic on registers. Consequently, an atomic add is multiple instructions, usually consisting of a load-linked, which loads a memory location, the add operation on the register, and then a store-conditional, which stores the value if the memory location has not changed. There's also a loop to retry if it has.
These are why atomic operations are needed and generally useful across languages. So while one can use non-atomic operations in non-Rust languages, they don't generally produce useful results, and since one typically wants one's code to function correctly, atomic operations are desirable for correctness. Rust's atomic types guarantee this behavior by generating suitable instructions and therefore can be safely shared across threads.
I sometimes see the term "full memory barrier" used in tutorials about memory ordering, which I think means the following:
If we have the following instructions:
instruction 1
full_memory_barrier
instruction 2
Then instruction 1 is not allowed to be reordered to below full_memory_barrier, and instruction 2 is not allowed to be reordered to above full_memory_barrier.
But what is the opposite of a full memory barrier? I mean, is there something like a "semi memory barrier" that only prevents the CPU from reordering instructions in one direction?
If there is such a memory barrier, I don't see its point. I mean, if we have the following instructions:
instruction 1
memory_barrier_below_to_above
instruction 2
Assume that memory_barrier_below_to_above is a memory barrier that prevents instruction 2 from being reordered to above memory_barrier_below_to_above, so the following will not be allowed:
instruction 2
instruction 1
memory_barrier_below_to_above
But the following will be allowed (which makes this type of memory barrier pointless):
memory_barrier_below_to_above
instruction 2
instruction 1
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ explains different kinds of barriers, like LoadLoad or StoreStore. A StoreStore barrier only prevents stores from reordering across the barrier, but loads can still execute out of order.
On real CPUs, any barriers that include StoreLoad block everything else, too, and thus are called "full barriers". StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache.
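As a sketch of why StoreLoad matters (C++11, with hypothetical thread functions): in the classic store-buffering pattern below, only a full barrier between each thread's store and its subsequent load rules out both threads reading the stale value 0.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full (StoreLoad) barrier
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}
// Without the fences, r1 == 0 && r2 == 0 is a possible outcome: each store
// can sit in its core's store buffer while the later load reads from cache.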
Barrier examples:

          strong    weak
x86       mfence    none needed unless you're using NT stores
ARM       dmb sy    isb, dmb st, dmb ish, etc.
POWER     hwsync    lwsync, isync, ...
ARM has "inner" and "outer shareable domains". I don't really know what that means, haven't had to deal with it, but this page documents the different forms of Data Memory Barrier available. dmb st only waits for earlier stores to complete, so I think it's only a StoreStore barrier, and thus too weak for a C++11 release-store which also needs to order earlier loads against LoadStore reordering. See also C/C++11 mappings to processors: note that seq-cst can be achieved with full-barriers around every store, or with barriers before loads as well as before stores. Making loads cheap is usually best, though.
ARM isb is an Instruction Synchronization Barrier: it flushes the pipeline so that later instructions are re-fetched. (ARM doesn't have a coherent i-cache, so after writing code to memory you need i-cache maintenance plus an isb before you can reliably jump there and execute those bytes as instructions.)
POWER has a large selection of barriers available, including Light-Weight (non-full barrier) and Heavy-Weight Sync (full barrier) mentioned in Jeff Preshing's article linked above.
A one-directional barrier is what you get from a release-store or an acquire-load. A release-store at the end of a critical section (e.g. to unlock a spinlock) has to make sure loads/stores inside the critical section don't appear later, but it doesn't have to delay later loads until after the lock=0 becomes globally visible.
Jeff Preshing has an article about this, too: Acquire and Release semantics
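A minimal C++11 sketch of that one-way behaviour (the names payload and locked are made up for illustration):

#include <atomic>

int payload;                 // data written inside the critical section
std::atomic<int> locked{1};  // assume this thread currently holds the lock

void unlock_it() {
    payload = 42;                                // store inside the critical section
    locked.store(0, std::memory_order_release);  // "unlock": one-way barrier
    // Later memory operations of this thread may be reordered before the
    // release-store, but nothing above it may be reordered after it.
}

void another_thread() {
    while (locked.load(std::memory_order_acquire) != 0) { }  // acquire-load
    int v = payload;  // guaranteed to see 42 once the load observed 0
    (void)v;
}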
The "full" vs. "partial" barrier terminology is not usually used for the one-way reordering restriction of a release-store or acquire-load. An actual release fence (in C++11, std::atomic_thread_fence(std::memory_order_release)) does block reordering of stores in both directions, unlike a release-store on a specific object.
This subtle distinction has caused confusion in the past (even among experts!). Jeff Preshing has yet another excellent article explaining it: Acquire and Release Fences Don't Work the Way You'd Expect.
You're right that a one-way barrier that wasn't tied to a store or a load wouldn't be very useful; that's why such a thing doesn't exist. :P Operations could reorder an unbounded distance past it in one direction, leaving them all free to reorder with each other.
What exactly does atomic_thread_fence(memory_order_release) do?
C11 (n1570 Section 7.17.4 Fences) only defines it in terms of creating a synchronizes-with relationship with an acquire-load or acquire fence, when the release-fence is used before an atomic store (relaxed or otherwise) to the same object the load accesses. (C++11 has basically the same definition, but discussion with #EOF in comments brought up the C11 version.)
This definition in terms of the net effect, not the mechanism for achieving it, doesn't directly tell us what it does or doesn't allow. For example, subsection 3 says
3) A release fence A synchronizes with an atomic operation B that performs an acquire
operation on an atomic object M if there exists an atomic operation X such that A is
sequenced before X, X modifies M, and B reads the value written by X or a value written
by any side effect in the hypothetical release sequence X would head if it were a release
operation
So in the writing thread, it's talking about code like this:
stuff // including any non-atomic loads/stores
atomic_thread_fence(mo_release) // A
M=X // X
// threads that see load(M, acquire) == X also see stuff
The syncs-with means that threads which see the value from M=X (directly or indirectly through a release-sequence) also see all the stuff and read non-atomic variables without Data Race UB.
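In concrete C++11 syntax the pattern above might look like this (stuff and M are hypothetical names taken from the pseudocode):

#include <atomic>

int stuff;               // non-atomic data
std::atomic<int> M{0};

void writer() {
    stuff = 123;                                          // "stuff"
    std::atomic_thread_fence(std::memory_order_release);  // A
    M.store(1, std::memory_order_relaxed);                // X
}

void reader() {
    if (M.load(std::memory_order_acquire) == 1) {         // B reads X's value
        // A synchronizes-with B, so this non-atomic read is data-race-free
        // and is guaranteed to see 123.
        int v = stuff;
        (void)v;
    }
}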
This lets us say something about what is / isn't allowed:
It's a 2-way barrier for atomic stores. They can't cross it in either direction, so the barrier's location in this thread's memory order is bounded by atomic stores before and after. Any earlier store can be part of stuff for some M, any later store can be the M that an acquire-load (or load + acquire-fence) synchronizes with.
It's a one-way barrier for atomic loads: earlier ones need to stay before the barrier, but later ones can move above the barrier. M=X can only be a store (or the store part of a RMW).
It's a one-way barrier for non-atomic loads/stores: non-atomic stores can be part of the stuff, but can't be X because they're not atomic. It's ok to allow later loads / stores in this thread to appear to other threads before the M=X. (If a non-atomic variable is modified before and after the barrier, then nothing could safely read it even after a syncs-with this barrier, unless there's also a way for a reader to stop this thread from continuing on and creating Data Race UB. So a compiler can and should reorder foo=1; fence(release); foo=2; into foo=2; fence(release);, eliminating the dead foo=1 store. But sinking foo=1 to after the barrier is only legal on the technicality that nothing could tell the difference without UB.)
As an implementation detail, a C11 release fence may be stronger than this (e.g. a 2-way barrier for more kinds of compile-time reordering), but not weaker. On some architectures (like ARM), the only option that's strong enough might be a full barrier asm instruction. And for compile-time reordering restrictions, a compiler might not allow these 1-way reorderings just to keep the implementation simple.
Mostly this combined 2-way / 1-way nature only matters for compile-time reordering. CPUs don't make the distinction between atomic vs. non-atomic stores. Non-atomic is always the same asm instruction as relaxed atomic (for objects that fit in a single register).
CPU barrier instructions that make a core wait until earlier operations are globally visible are typically 2-way barriers; they're specified in terms of operations becoming globally visible in a coherent view of memory shared by all cores, rather than the C/C++11 style of creating syncs-with relations. (Beware that operations can potentially become visible to some other threads before they become globally visible to all threads: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? But with just barriers against reordering within a physical core, sequential consistency can be recovered.)
A C++11 release-fence needs LoadStore + StoreStore barriers, but not LoadLoad. A CPU that lets you get just those 2 but not all 3 of the "cheap" barriers would let loads reorder in one direction across the barrier instruction while blocking stores in both directions.
Weakly-ordered SPARC is in fact like this, and uses the LoadStore and so on terminology (that's where Jeff Preshing took the terminology for his articles). http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/ shows how they're used. (More recent SPARCs use a TSO (Total Store Order) memory model. I think this is like x86, where the hardware gives the illusion of memory ops happening in program order except for StoreLoad reordering.)
According to Wikipedia, CAS does something like this:
function cas(p : pointer to int, old : int, new : int) returns bool {
    if *p ≠ old {
        return false
    }
    *p ← new
    return true
}
Well, it seems to me that if several processors try to execute the CAS instruction with the same arguments, there can be several write attempts at the same time, so it's not safe to do it anyway.
Where am I wrong?
Atomic read-compare-write instructions from multiple cores at the same time (on the same cache line) do contend with each other, but it's up to hardware to sort that out. Hardware arbitration of atomic RMW instructions is a real thing in modern CPUs, and provides some degree of fairness so that one thread spinning on lock cmpxchg can't totally block other threads doing the same thing.
(Although that's a bad design unless your retry could succeed without waiting for another thread to modify anything, e.g. a retry loop that implements fetch_or or similar can try again with the updated value of expected. But if waiting for a lock or flag to change, after the initial CAS fails, it's better to spin on an acquire or relaxed load and only do the CAS if it might succeed.)
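As a sketch of the first kind of retry loop (C++11; emulating fetch_or with a CAS, as one would on hardware without a native atomic OR):

#include <atomic>

// Returns the old value, like fetch_or. On CAS failure, compare_exchange_weak
// reloads 'expected' with the current value, so the retry can succeed without
// waiting for the other thread to do anything further.
unsigned fetch_or_via_cas(std::atomic<unsigned>& v, unsigned mask)
{
    unsigned expected = v.load(std::memory_order_relaxed);
    while (!v.compare_exchange_weak(expected, expected | mask,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // 'expected' now holds the freshly observed value; just retry.
    }
    return expected;
}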
There's no guarantee what order they happen in, which is why you need to carefully design your algorithm so that correctness only depends on that compare-and-exchange being atomic. (The ABA problem is a common pitfall).
BTW, that entire block of pseudocode happens as a single atomic operation. Making a read-compare-write or read-modify-write happen as a single atomic operation is much harder for the hardware than just stores, which MESIF/MOESI handle just fine.
Are you sure? I thought it was unsafe to do that because, for example, x86 doesn't guarantee atomicity of writes for non-aligned DWORDs.
lock cmpxchg makes the operation atomic regardless of alignment. It's potentially a lot slower for unaligned, especially on cache-line splits where atomically modifying a single cache line isn't enough.
See also Atomicity on x86 where I explain what it means for an operation to be atomic.
If you read the Wikipedia article, it says that CAS "is an atomic version of the following pseudocode", referring to the code you posted. Atomic means that the code will execute without interruption from other threads. So even if several threads try to execute this code at the same time with the same arguments (as you suggest), only one of them will return true, because in practice they will not execute simultaneously: atomicity requires that they run in isolation.
And since you mention "x86 doesn't guarantee atomicity of writes for non-aligned DWORDs", this is not an issue here either, because of the atomic property of the cas function.
Say, for example, I have an exclusive atomic-ops-based spinlock implementation as below:
bool TryLock(volatile TInt32 * pFlag)
{
    return !(AtomicOps::Exchange32(pFlag, 1) == 1);
}

void Lock (volatile TInt32 * pFlag)
{
    while (AtomicOps::Exchange32(pFlag, 1) == 1) {
        AtomicOps::ThreadYield();
    }
}

void Unlock (volatile TInt32 * pFlag)
{
    *pFlag = 0; // is this ok? or is atomicity needed here as well for the load and store?
}
Where AtomicOps::Exchange32 is implemented on windows using InterlockedExchange and on linux using __atomic_exchange_n.
In most cases, for releasing the resource, just resetting the lock to zero (as you do) is almost OK (e.g. on an Intel Core processor), but you also need to make sure that the compiler will not reorder instructions (see below; see also g-v's post). If you want to be rigorous (and portable), there are two things that need to be considered:
What the compiler does: It may reorder instructions when optimizing the code, and thus introduce subtle bugs if it is not "aware" of the multithreaded nature of the code. To avoid that, it is possible to insert a compiler barrier.
What the processor does: Some processors (like Intel Itanium, used in professional servers, or ARM processors used in smartphones) have a so-called "relaxed memory model". In practice, it means that the processor may decide to change the order of operations. Again, this can be avoided by using special instructions (load barrier and store barrier). For instance, on an ARM processor, the DMB instruction ensures that all store operations are completed before the next instruction (and it needs to be inserted in the function that releases a lock).
Conclusion: It is very tricky to make this kind of code correct. If you have compiler / OS support for these functionalities (e.g., stdatomic.h, or std::atomic in C++11), it is much better to rely on them than to write your own (but sometimes you have no choice). In the specific case of a standard Intel Core processor, I think that what you do is correct, provided you insert a compiler barrier in the release operation (see g-v's post).
On compile-time versus run-time memory ordering, see: https://en.wikipedia.org/wiki/Memory_ordering
My code for some atomic / spinlocks implemented on different architectures:
http://alice.loria.fr/software/geogram/doc/html/atomics_8h.html
(but I'm unsure it's 100% correct)
You need two memory barriers in spinlock implementation:
"acquire barrier" or "import barrier" in TryLock() and Lock(). It forces operations issued while spinlock is acquired to be visible only after pFlag value is updated.
"release barrier" or "export barrier" in Unlock(). It forces operations issued until spinlock was released to be visible before pFlag value is updated.
You also need two compiler barriers for the same reasons.
See this article for details.
This approach is for the generic case. On x86/64:
there are no separate acquire/release barriers, only a single full barrier (memory fence);
there is no need for memory barriers here at all, since this architecture is strongly ordered;
you still need compiler barriers.
More details are provided here.
Below is an example implementation using GCC atomic builtins. It will work for all architectures supported by GCC:
it will insert acquire/release memory barriers on architectures where they are required (or a full barrier if acquire/release barriers are not supported but the architecture is weakly ordered);
it will insert compiler barriers on all architectures.
Code:
bool TryLock(volatile bool* pFlag)
{
    // acquire memory barrier and compiler barrier
    return !__atomic_test_and_set(pFlag, __ATOMIC_ACQUIRE);
}

void Lock(volatile bool* pFlag)
{
    for (;;) {
        // acquire memory barrier and compiler barrier
        if (!__atomic_test_and_set(pFlag, __ATOMIC_ACQUIRE)) {
            return;
        }

        // relaxed waiting, usually no memory barriers (optional)
        while (__atomic_load_n(pFlag, __ATOMIC_RELAXED)) {
            CPU_RELAX();
        }
    }
}

void Unlock(volatile bool* pFlag)
{
    // release memory barrier and compiler barrier
    __atomic_clear(pFlag, __ATOMIC_RELEASE);
}
For "relaxed waiting" loop, see this and this questions.
See also Linux kernel memory barriers as a good reference.
In your implementation:
Lock() calls AtomicOps::Exchange32(), which already includes a compiler barrier and perhaps an acquire or full memory barrier (we don't know, because you didn't provide the actual arguments to __atomic_exchange_n()).
Unlock() misses both the memory and compiler barriers, so it's broken.
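A sketch of a fixed Unlock() using the same GCC builtins (assuming the rest of the original code stays as posted):

void Unlock (volatile TInt32 * pFlag)
{
    // release memory barrier and compiler barrier
    __atomic_store_n(pFlag, 0, __ATOMIC_RELEASE);
}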
Also consider using pthread_spin_lock() if it is an option.
I understand that an atomic read serializes the read operations performed by multiple threads.
What I don't understand is what is the use case?
More interestingly, I've found an implementation of atomic read, which is:
static inline int32_t ASMAtomicRead32(volatile int32_t *pi32)
{
    return *pi32;
}
Where the only distinction from a regular read is volatile. Does it mean that atomic read is the same as volatile read?
I understand that an atomic read serializes the read operations performed by multiple threads.
That's rather wrong. How can you ensure the order of reads if there is no write that stores a different value? Even when you have both a read and a write, they are not necessarily serialized unless the correct memory semantics are used in conjunction with both the read and write operations, e.g. 'store-with-release' and 'load-with-acquire'. In your particular example, the memory semantics are relaxed, though on x86 one can assume acquire semantics for each load and release semantics for each store (unless non-temporal stores are used).
What I don't understand is what is the use case?
An atomic read must ensure that the data is read in one shot and that no other thread can store a part of the data in between. Thus it usually requires alignment of the atomic variable (since a read of an aligned machine word is atomic) or works around non-aligned cases using heavier instructions. And finally, it ensures that the read is neither optimized out by the compiler nor reordered across other operations in this thread (according to the memory semantics).
Does it mean that atomic read is the same as volatile read?
In a few words, volatile was not intended for this use case, but it can sometimes be abused for it when the other requirements are met as well. For your example, my analysis is the following:
int32_t is likely a machine word or less - ok.
usually, everything is aligned at least on a 4-byte boundary, though there is no guarantee in your example
volatile ensures the read is not optimized out
there is no guarantee it will not be reordered either by the processor (OK for x86) or by the compiler (bad)
Please refer to Arch's blog and Concurrency: Atomic and volatile in C++11 memory model for the details.
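For comparison, a sketch of what a genuinely atomic (and ordered) read would look like using the GCC builtins seen earlier in this thread (the acquire ordering here is one reasonable choice, not the only one):

#include <stdint.h>

static inline int32_t AtomicRead32(volatile int32_t *pi32)
{
    // Atomic load with acquire semantics: not torn, not optimized out,
    // and not reordered after later operations in this thread.
    return __atomic_load_n(pi32, __ATOMIC_ACQUIRE);
}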