While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
In practice, these are implemented with atomic CPU primitives such as CAS (compare-and-swap) or LL/SC (load-linked/store-conditional).
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, Wikipedia gives you an example which trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, a CAS is not needed for crafting a spinlock - but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step.
(To clarify a bit more: the lock prefix asserts the #LOCK signal that ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted half-way. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not quite be the same - it could be preempted just between the two movs.)
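For concreteness, here is a minimal spinlock sketch in C11 (my own example, not from the text above; names are mine). atomic_flag_test_and_set atomically writes 1 and returns the previous value - exactly the "write a register to memory and return the previous contents in a single step" operation described, and on x86 it typically compiles to an (implicitly locked) xchg:

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    /* Keep retrying until the previous value was 0, i.e. we were the one
       who flipped it to 1. */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;   /* busy-wait */
}

static void spin_unlock(void)
{
    /* A plain store of 0 (with release ordering) hands the lock back. */
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}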
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock:
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff (a sketch follows this list),
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning - so that the core you are running on can do something useful in the meantime
you should really give up on spinning and yield control to other threads after a while
etc...
this is still user mode - if you are writing a kernel, you might have some other tools that you can use as well (since you are the one that schedules threads and handles/enables/disables interrupts).
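As promised above, here is a user-mode TTAS sketch in C (my own example; it assumes GCC/Clang on x86 for __builtin_ia32_pause and a POSIX sched_yield, and the retry threshold is arbitrary; a real implementation would also add exponential backoff between retries):

#include <stdatomic.h>
#include <sched.h>        /* sched_yield(), POSIX */

static atomic_int lock_word;   /* 0 = free, 1 = held */

static void ttas_lock(void)
{
    for (;;) {
        /* Read-only spin first: only do an atomic RMW (which costs bus/cache
           traffic) once the lock actually looks free - the "test" before
           "test-and-set". */
        int spins = 0;
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();      /* x86 spin-wait hint */
            if (++spins > 1000)          /* arbitrary threshold */
                sched_yield();           /* give up the time slice */
        }
        if (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) == 0)
            return;                      /* old value was 0: we own the lock */
    }
}

static void ttas_unlock(void)
{
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}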
The x86 architecture has long had an instruction called xchg which will exchange the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that can be applied to certain single instructions to make them atomic. Before there were multiprocessor systems, all this really did was prevent an interrupt from being delivered in the middle of a locked instruction. (xchg with a memory operand was implicitly locked.)
This article has some sample code using xchg to implement a spinlock
http://en.wikipedia.org/wiki/Spinlock
When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize all of the memory subsystems, including the L1 cache on all of the processors. Around this time, new research into locking and lock-free algorithms showed that an atomic compare-and-set was a more flexible primitive to have, so more modern CPUs have that as an instruction.
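To illustrate why compare-and-set is the more flexible primitive, here is a small C11 sketch (names are mine): unlike a plain exchange, the lock word can carry information - here the owner's thread id - and the acquisition only succeeds if the word is currently 0.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint owner;   /* 0 = unlocked, otherwise the owner's thread id */

static bool try_lock(unsigned my_id)          /* my_id must be non-zero */
{
    unsigned expected = 0;
    /* Atomically: if owner == 0, set owner = my_id and return true;
       otherwise leave it unchanged and return false. */
    return atomic_compare_exchange_strong_explicit(
        &owner, &expected, my_id,
        memory_order_acquire, memory_order_relaxed);
}

static void unlock(void)
{
    atomic_store_explicit(&owner, 0, memory_order_release);
}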
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
I like to think of thread synchronization as bottom-up, where the processor and operating system provide primitive constructs on top of which more sophisticated ones are built.
At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation ... you also have other processor constructs that let you disable and enable interrupts (however they are considered dangerous ... under certain circumstances you have no other option but to use them).
The operating system provides the ability to context switch between tasks, which can happen every time a thread has used up its time slice ... or it can happen for other reasons (I will come to that).
Then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think spinning mutex) ... they continuously wait for the condition to become true and check for that condition atomically.
These spinning mutexes can then use the functionality provided by the OS (context switching and system calls like yield, which relinquishes control to another thread), and that gives us mutexes.
These constructs are further utilized by higher-level constructs like condition variables (which can keep track of how many threads are waiting for the mutex and which thread to allow first when the mutex becomes available).
These constructs can then be used to provide more sophisticated synchronization constructs ... for example: semaphores, etc.
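As a sketch of that last step (my own example, using POSIX threads; names are mine), a counting semaphore can be built from exactly the lower-level constructs described above - a mutex plus a condition variable:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
    unsigned        count;
} my_sem;

static my_sem s = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

static void my_sem_wait(my_sem *sem)          /* "down" / P operation */
{
    pthread_mutex_lock(&sem->lock);
    while (sem->count == 0)                   /* re-check after every wakeup */
        pthread_cond_wait(&sem->nonzero, &sem->lock);
    sem->count--;
    pthread_mutex_unlock(&sem->lock);
}

static void my_sem_post(my_sem *sem)          /* "up" / V operation */
{
    pthread_mutex_lock(&sem->lock);
    sem->count++;
    pthread_cond_signal(&sem->nonzero);       /* wake one waiter, if any */
    pthread_mutex_unlock(&sem->lock);
}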
Related
Consider the following case:
Thread A (evil code I do not control):
# Repeat some string operation for an arbitrarily long time
lock rep stosq ...
Thread B (my code, which should have lock-free behaviour):
# some simple atomic operation involving a `lock` prefix
# on the beginning of the data processed by the repeating instruction
lock cmpxchg ...
Can I assume that the lock held by rep stosq will be for each individual element, and not for the instruction's execution as a whole? Otherwise, doesn't that mean that any code which should have real-time semantics (no loops, no syscalls, total functions, every operation terminates in a finite time, etc.) can still be broken just by having some "evil" code in another thread doing such a thing, which would block the cmpxchg on the other thread for an arbitrarily long time?
The threat model I'm worried about is a denial-of-service attack against other "users" (including kernel IRQ handlers) on a real-time OS, where the "service" guarantees include very low latency interrupt handling.
If not lock rep stos, is there anything else I should worry about?
The lock rep stosq (and lock rep movsd, lock rep cmpsd, ...) aren't legal instructions.
If they were legal, they'd be more like rep (lock stosq) - taking the lock for each individual stosq, not for the whole rep loop.
If not lock rep stos, is there anything else I should worry about?
You might worry about very old CPUs. Specifically, the original Pentium CPUs had a flaw called the "F00F bug" (see https://en.wikipedia.org/wiki/Pentium_F00F_bug ) and old Cyrix CPUs had a flaw called the "Coma bug" (see https://en.wikipedia.org/wiki/Cyrix_coma_bug ). For both of these (if the OS doesn't provide a viable work-around), unprivileged software is able to trick the CPU into locking up forever.
I was reading the Intel 64 and IA-32 instruction set guide
to get an idea of memory fences. My question is: for an example with SFENCE, in order to make sure that all store operations are globally visible, does the multi-core CPU park all the threads running on other cores until cache coherence is achieved?
Barriers don't make other threads/cores wait. They make some operations in the current thread wait, depending on what kind of barrier it is. Out-of-order execution of non-memory instructions isn't necessarily blocked.
Barriers don't even make your loads/stores visible to other threads any faster; CPU cores already commit (retired) stores from the store buffer to L1d cache as fast as possible. (After all the necessary MESI coherency rules have been followed, and x86's strong memory model only allows stores to commit in program order even without barriers).
Barriers don't necessarily order instruction execution, they order global visibility, i.e. what comes out the far end of the store buffer.
mfence (or a locked operation like lock add or xchg [mem], reg) makes all later loads/stores in the current thread wait until all previous loads and stores are completed and globally visible (i.e. the store buffer is flushed).
mfence on Skylake is implemented in a way that stalls the whole core until the store buffer drains. See my answer on
Are loads and stores the only instructions that gets reordered? for details; this extra slowdown was to fix an erratum. But locked operations and xchg aren't like that on Skylake; they're full memory barriers but they still allow out-of-order execution of imul eax, edx, so we have proof that they don't stall the whole core.
With hyperthreading, I think this stalling happens per logical thread, not the whole core.
But note that the mfence manual entry doesn't say anything about stalling the core, so future x86 implementations are free to make it more efficient (like a lock or dword [rsp], 0), and only prevent later loads from reading L1d cache without blocking later non-load instructions.
sfence only does anything if there are any NT stores in flight. It doesn't order loads at all, so it doesn't have to stop later instructions from executing. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.
It just places a barrier in the store buffer that stops NT stores from reordering across it, and forces earlier NT stores to be globally visible before the sfence barrier can leave the store buffer (i.e. write-combining buffers have to flush). But it can already have retired from the out-of-order execution part of the core (the ROB, or ReOrder Buffer) before it reaches the end of the store buffer.
See also Does a memory barrier ensure that the cache coherence has been completed?
lfence as a memory barrier is nearly useless: it only prevents movntdqa loads from WC memory from reordering with later loads/stores. You almost never need that.
The actual use-cases for lfence mostly involve its Intel (but not AMD) behaviour that it doesn't allow later instructions to execute until it itself has retired. (so lfence; rdtsc on Intel CPUs lets you avoid having rdtsc read the clock too soon, as a cheaper alternative to cpuid; rdtsc)
Another important recent use-case for lfence is to block speculative execution (e.g. before a conditional or indirect branch), for Spectre mitigation. This is completely based on its Intel-guaranteed side effect of being partially serializing, and has nothing to do with its LoadLoad + LoadStore barrier effect.
lfence does not have to wait for the store buffer to drain before it can retire from the ROB, so no combination of LFENCE + SFENCE is as strong as MFENCE. Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?
Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (when writing in C++ instead of asm).
Note that the C++ intrinsics like _mm_sfence also block compile-time memory ordering. This is often necessary even when the asm instruction itself isn't, because C++ compile-time reordering happens based on C++'s very weak memory model, not the strong x86 memory model which applies to the compiler-generated asm.
So _mm_sfence may make your code work, but unless you're using NT stores it's overkill. A more efficient option would be std::atomic_thread_fence(std::memory_order_release) (which turns into zero instructions, just a compiler barrier.) See http://preshing.com/20120625/memory-ordering-at-compile-time/.
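As a hedged illustration of that distinction (my own sketch; buffer/flag names are mine, and it assumes x86 with the SSE2 intrinsics from <emmintrin.h>): after NT stores you need _mm_sfence before publishing a flag, whereas for ordinary stores a release operation is enough and costs no extra x86 instructions.

#include <emmintrin.h>    /* _mm_stream_si32, _mm_sfence */
#include <stdatomic.h>

int buffer[1024];
_Atomic int ready;

void publish_nt(void)
{
    for (int i = 0; i < 1024; i++)
        _mm_stream_si32(&buffer[i], i);            /* non-temporal stores */
    _mm_sfence();                                  /* NT stores must drain before the flag */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

void publish_plain(void)
{
    for (int i = 0; i < 1024; i++)
        buffer[i] = i;                             /* ordinary stores */
    /* Only a compiler barrier on x86; the strong hardware memory model
       already keeps ordinary stores in program order. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}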
I sometimes see the term "full memory barrier" used in tutorials about memory ordering, which I think means the following:
If we have the following instructions:
instruction 1
full_memory_barrier
instruction 2
Then instruction 1 is not allowed to be reordered to below full_memory_barrier, and instruction 2 is not allowed to be reordered to above full_memory_barrier.
But what is the opposite of a full memory barrier? I mean, is there something like a "semi memory barrier" that only prevents the CPU from reordering instructions in one direction?
If there is such a memory barrier, I don't see its point. I mean, if we have the following instructions:
instruction 1
memory_barrier_below_to_above
instruction 2
Assume that memory_barrier_below_to_above is a memory barrier that prevents instruction 2 from being reordered to above memory_barrier_below_to_above, so the following will not be allowed:
instruction 2
instruction 1
memory_barrier_below_to_above
But the following will be allowed (which makes this type of memory barrier pointless):
memory_barrier_below_to_above
instruction 2
instruction 1
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ explains different kinds of barriers, like LoadLoad or StoreStore. A StoreStore barrier only prevents stores from reordering across the barrier, but loads can still execute out of order.
On real CPUs, any barriers that include StoreLoad block everything else, too, and thus are called "full barriers". StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache.
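A minimal sketch of why StoreLoad matters (my own example, not from the linked article): each thread stores to one variable and then loads the other. If a thread's store can still be sitting in the store buffer while its load reads L1d, both threads can observe 0. With C11 seq_cst operations (which on x86 the compiler implements with mfence or a locked instruction), the outcome r1 == 0 && r2 == 0 is forbidden.

#include <stdatomic.h>

_Atomic int x, y;
int r1, r2;

void thread_1(void) { atomic_store(&x, 1); r1 = atomic_load(&y); }  /* seq_cst by default */
void thread_2(void) { atomic_store(&y, 1); r2 = atomic_load(&x); }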
Barrier examples:
         strong     weak
x86      mfence     none needed unless you're using NT stores
ARM      dmb sy     isb, dmb st, dmb ish, etc.
POWER    hwsync     lwsync, isync, ...
ARM has "inner" and "outer shareable domains". I don't really know what that means, haven't had to deal with it, but this page documents the different forms of Data Memory Barrier available. dmb st only waits for earlier stores to complete, so I think it's only a StoreStore barrier, and thus too weak for a C++11 release-store which also needs to order earlier loads against LoadStore reordering. See also C/C++11 mappings to processors: note that seq-cst can be achieved with full-barriers around every store, or with barriers before loads as well as before stores. Making loads cheap is usually best, though.
ARM ISB flushes the instruction caches. (ARM doesn't have coherent i-cache, so after writing code to memory, you need an ISB before you can reliably jump there and execute those bytes as instructions.)
POWER has a large selection of barriers available, including Light-Weight (non-full barrier) and Heavy-Weight Sync (full barrier) mentioned in Jeff Preshing's article linked above.
A one-directional barrier is what you get from a release-store or an acquire-load. A release-store at the end of a critical section (e.g. to unlock a spinlock) has to make sure loads/stores inside the critical section don't appear later, but it doesn't have to delay later loads until after the lock=0 becomes globally visible.
Jeff Preshing has an article about this, too: Acquire and Release semantics
The "full" vs. "partial" barrier terminology is not usually used for the one-way reordering restriction of a release-store or acquire-load. An actual release fence (in C++11, std::atomic_thread_fence(std::memory_order_release)) does block reordering of stores in both directions, unlike a release-store on a specific object.
This subtle distinction has caused confusion in the past (even among experts!). Jeff Preshing has yet another excellent article explaining it: Acquire and Release Fences Don't Work the Way You'd Expect.
You're right that a one-way barrier that wasn't tied to a store or a load wouldn't be very useful; that's why such a thing doesn't exist. :P Everything could reorder an unbounded distance past it in one direction, leaving all the operations free to reorder with each other.
What exactly does atomic_thread_fence(memory_order_release) do?
C11 (n1570 Section 7.17.4 Fences) only defines it in terms of creating a synchronizes-with relationship with an acquire-load or acquire fence, when the release-fence is used before an atomic store (relaxed or otherwise) to the same object the load accesses. (C++11 has basically the same definition, but discussion with #EOF in comments brought up the C11 version.)
This definition in terms of the net effect, not the mechanism for achieving it, doesn't directly tell us what it does or doesn't allow. For example, subsection 3 says
3) A release fence A synchronizes with an atomic operation B that performs an acquire
operation on an atomic object M if there exists an atomic operation X such that A is
sequenced before X, X modifies M, and B reads the value written by X or a value written
by any side effect in the hypothetical release sequence X would head if it were a release
operation
So in the writing thread, it's talking about code like this:
stuff // including any non-atomic loads/stores
atomic_thread_fence(mo_release) // A
M=X // X
// threads that see load(M, acquire) == X also see stuff
The syncs-with means that threads which see the value from M=X (directly or indirectly through a release-sequence) also see all the stuff and read non-atomic variables without Data Race UB.
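Here is a concrete C11 sketch of that pattern (variable names are mine): the release fence plus a relaxed store of the flag synchronizes with an acquire load that sees the stored value, so the reader also sees "stuff" without a data race.

#include <stdatomic.h>

int stuff;                 /* plain, non-atomic data */
_Atomic int M;

void writer(void)
{
    stuff = 42;                                          /* stuff */
    atomic_thread_fence(memory_order_release);           /* A */
    atomic_store_explicit(&M, 1, memory_order_relaxed);  /* X */
}

void reader(void)
{
    if (atomic_load_explicit(&M, memory_order_acquire) == 1)  /* B */
        (void)stuff;   /* guaranteed to read 42 */
}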
This lets us say something about what is / isn't allowed:
It's a 2-way barrier for atomic stores. They can't cross it in either direction, so the barrier's location in this thread's memory order is bounded by atomic stores before and after. Any earlier store can be part of stuff for some M, any later store can be the M that an acquire-load (or load + acquire-fence) synchronizes with.
It's a one-way barrier for atomic loads: earlier ones need to stay before the barrier, but later ones can move above the barrier. M=X can only be a store (or the store part of a RMW).
It's a one-way barrier for non-atomic loads/stores: non-atomic stores can be part of the stuff, but can't be X because they're not atomic. It's ok to allow later loads / stores in this thread to appear to other threads before the M=X. (If a non-atomic variable is modified before and after the barrier, then nothing could safely read it even after a syncs-with this barrier, unless there's also a way for a reader to stop this thread from continuing on and creating Data Race UB. So a compiler can and should reorder foo=1; fence(release); foo=2; into foo=2; fence(release);, eliminating the dead foo=1 store. But sinking foo=1 to after the barrier is only legal on the technicality that nothing could tell the difference without UB.)
As an implementation detail, a C11 release fence may be stronger than this (e.g. a 2-way barrier for more kinds of compile-time reordering), but not weaker. On some architectures (like ARM), the only option that's strong enough might be a full barrier asm instruction. And for compile-time reordering restrictions, a compiler might not allow these 1-way reorderings just to keep the implementation simple.
Mostly this combined 2-way / 1-way nature only matters for compile-time reordering. CPUs don't make the distinction between atomic vs. non-atomic stores. Non-atomic is always the same asm instruction as relaxed atomic (for objects that fit in a single register).
CPU barrier instructions that make a core wait until earlier operations are globally visible are typically 2-way barriers; they're specified in terms of operations becoming globally visible in a coherent view of memory shared by all cores, rather than the C/C++11 style of creating syncs-with relations. (Beware that operations can potentially become visible to some other threads before they become globally visible to all threads: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?.
But with just barriers against reordering within a physical core, sequential consistency can be recovered.)
A C++11 release-fence needs LoadStore + StoreStore barriers, but not LoadLoad. A CPU that lets you get just those 2 but not all 3 of the "cheap" barriers would let loads reorder in one direction across the barrier instruction while blocking stores in both directions.
Weakly-ordered SPARC is in fact like this, and uses the LoadStore and so on terminology (that's where Jeff Preshing took the terminology for his articles). http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/ shows how they're used. (More recent SPARCs use a TSO (Total Store Order) memory model. I think this is like x86, where the hardware gives the illusion of memory ops happening in program order except for StoreLoad reordering.)
This piece of code comes from Pintos source:
https://www.cs.usfca.edu/~benson/cs326/pintos/pintos/src/threads/synch.c
void
sema_down (struct semaphore *sema)
{
  enum intr_level old_level;

  ASSERT (sema != NULL);
  ASSERT (!intr_context ());

  old_level = intr_disable ();
  while (sema->value == 0)
    {
      list_push_back (&sema->waiters, &thread_current ()->elem);
      thread_block ();
    }
  sema->value--;
  intr_set_level (old_level);
}
Taking the semaphore happens at sema->value--;. For this to work, it must be an atomic operation.
How can we know that it is in fact atomic? I know that modern CPUs guarantee that aligned memory operations (on a word/doubleword/quadword, depending on the CPU) are atomic. But, here, I am not convinced why it is atomic.
TL:DR: Anything is atomic if you do it with interrupts disabled on a UP system, as long as you don't count system devices observing memory with DMA.
Note the intr_disable (); / intr_set_level (old_level); around the operation.
modern CPU guarantees that aligned memory operation are atomic
For multi-threaded observers, that only applies to separate loads or stores, not read-modify-write operations.
For something to be atomic, we have to consider what potential observers we care about. What matters is that nothing can observe the operation as having partially happened. The most straightforward way to achieve that is for the operation to be physically / electrically instantaneous, and affect all the bits simultaneously (e.g. a load or store on a parallel bus goes from not-started to completed at the boundary of a clock cycle, so it's atomic "for free" up to the width of the parallel bus). That's not possible for a read-modify-write, where the best we can do is stop observers from looking between the load and the store.
My answer on Atomicity on x86 explained the same thing a different way, about what it means to be atomic.
In a uniprocessor (UP) system, the only asynchronous observers are other system devices (e.g. DMA) and interrupt handlers. If we can exclude non-CPU observers from writing to our semaphore, then it's just atomicity with respect to interrupts that we care about.
This code takes the easy way out and disables interrupts. That's not necessary (or at least it wouldn't be if we were writing in asm).
An interrupt is handled between two instructions, never in the middle of an instruction. The architectural state of the machine either includes the memory-decrement or it doesn't, because dec [mem] either ran or it didn't. We don't actually need lock dec [mem] for this.
BTW, this is the use-case for cmpxchg without a lock prefix. I always used to wonder why they didn't just make lock implicit in cmpxchg, and the reason is that UP systems often don't need lock prefixes.
The exceptions to this rule are interruptible instructions that can record partial progress, like rep movsb or vpgather / vpscatter. See Interrupting instruction in the middle of execution for details. These won't be atomic wrt. interrupts even when the only observer is other code on the same core. Only a single iteration of rep whatever, or a single element of a gather or scatter, will have happened or not.
Most SIMD instructions can't record partial progress, so for example vmovdqu ymm0, [rdi] either fully happens or not at all from the PoV of the core it runs on. (But not of course guaranteed atomic wrt. other observers in the system, like DMA or MMIO, or other cores. That's when the normal load/store atomicity guarantees matter.)
There's no reliable way to make sure the compiler emits dec [value] instead of something like this:
mov eax, [value]
;; interrupt here = bad
dec eax
;; interrupt here = bad
mov [value], eax
ISO C11 / C++11 doesn't provide a way to request atomicity with respect to signal handlers / interrupts, but not other threads. They do provide atomic_signal_fence as a compiler barrier (vs. thread_fence as a barrier wrt. other threads/cores), but barriers can't create atomicity, only control ordering wrt. other operations.
C11/C++11 volatile sig_atomic_t does have this idea in mind, but it only provides atomicity for separate loads/stores, not RMW. It's a typedef for int on x86 Linux. See that question for some quotes from the standard.
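A sketch of the separate-load/store case that volatile sig_atomic_t does cover (my own example; the handler and flag names are mine): the handler only does a plain store and the main loop only does plain loads, so no read-modify-write atomicity is needed.

#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
    (void)sig;
    got_signal = 1;          /* a single store: atomic wrt. the interrupted code */
}

int main(void)
{
    signal(SIGINT, handler);
    while (!got_signal) {
        /* do work; an increment like got_signal++ here would NOT be safe,
           because a read-modify-write can be interrupted between the load
           and the store. */
    }
    return 0;
}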
On specific implementations, gcc -Wa,-momit-lock-prefix=yes will omit all lock prefixes. (GAS 2.28 docs) This is safe for single-threaded code, or a uniprocessor machine, if your code doesn't include device-driver hardware access that needs to do an atomic RMW on a MMIO location, or that uses a dummy lock add as a faster mfence.
But this is unusable in a multi-threaded program that needs to run on SMP machines, if you have some atomic RMWs between threads as well as some between a thread and a signal handler.
I'm interested in how a mutex works. I understand their purpose as every website I have found explains what they do but I haven't been able to understand what happens in this case:
There are two threads running concurrently and they try to lock the mutex at the same time.
This would not be a problem on a single core as this situation could never happen, but in a multi-core system I see this as a problem. I can't see any way to prevent concurrency problems like this but they obviously exist.
Thanks for any help
It's not possible for 2 threads to lock a system-wide mutex; one will lock it, and the other will be blocked.
The semantics of mutex/lock! ensure that only one thread can execute
beyond the lock call at any one time. The first thread that reaches
the call acquires the lock on the mutex. Any later threads will block
at the call to mutex/lock! until the thread that owns the lock
releases the lock with mutex/unlock!.
In terms of how it's possible to implement this, take a look at test-and-set.
In computer science, the test-and-set instruction is an instruction
used to write to a memory location and return its old value as a
single atomic (i.e., non-interruptible) operation. If multiple
processes may access the same memory, and if a process is currently
performing a test-and-set, no other process may begin another
test-and-set until the first process is done. CPUs may use
test-and-set instructions offered by other electronic components, such
as dual-port RAM; CPUs may also offer a test-and-set instruction
themselves.
The calling process obtains the lock if the old value was 0. It spins
writing 1 to the variable until this occurs.
A mutex, when properly implemented, can never be locked concurrently. For this, you need some atomic operations (operations that are guaranteed to be the only thing happening to an object at one moment) that have useful properties.
One such operation is xchg (exchange) in the x86 architecture. For instance, xchg eax, [ebp] will read the value at the address in ebp, write the value in eax to that address, and then set eax to the read value, while guaranteeing that these actions won't be interleaved with concurrent reads and writes to that address.
Now you can implement a mutex. To lock, load 1 into eax, exchange eax with the value of the mutex and look at eax. If it's 1, it was already locked, so you might want to sleep and try again later. If it's 0, you just locked the mutex. To unlock, simply write a value of 0 to the mutex.
Please note that I'm glossing over important details here. For instance, x86's xchg with a memory operand is implicitly locked, so it is atomic even when memory is shared between multiple processors (e.g. in a multi-core system); you don't need an explicit lock xchg eax, [ebp]. For other read-modify-write instructions (add, dec, cmpxchg, ...), however, you do need the lock prefix on a multiprocessor system, which ensures that only one processor can access that memory while the instruction is executed.
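Here is the same recipe written in C using compiler atomics (my own sketch; names and the 1 ms back-off are mine, and it assumes GCC/Clang plus a POSIX nanosleep): exchange 1 into the mutex word and look at the old value; if it was already 1, sleep and try again later; unlocking is just storing 0. On x86 the exchange compiles to the (implicitly locked) xchg discussed above.

#include <stdatomic.h>
#include <time.h>      /* nanosleep, POSIX */

static atomic_int mtx;    /* 0 = unlocked, 1 = locked */

void my_lock(void)
{
    while (atomic_exchange_explicit(&mtx, 1, memory_order_acquire) == 1) {
        struct timespec ts = { 0, 1000000 };   /* 1 ms, arbitrary back-off */
        nanosleep(&ts, NULL);                  /* "sleep and try again later" */
    }
}

void my_unlock(void)
{
    atomic_store_explicit(&mtx, 0, memory_order_release);
}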