What happens if two threads lock a mutex concurrently? - multithreading

I'm interested in how a mutex works. I understand their purpose, since every website I have found explains what they do, but I haven't been able to understand what happens in this case:
There are two threads running concurrently and they try to lock the mutex at the same time.
This would not be a problem on a single core, as the situation could never happen there, but in a multi-core system I see it as a problem. I can't see any way to prevent concurrency problems like this, yet mutexes obviously exist.
Thanks for any help

It's not possible for two threads to both lock a system-wide mutex; one will lock it and the other will be blocked.
The semantics of mutex/lock! ensure that only one thread can execute
beyond the lock call at any one time. The first thread that reaches
the call acquires the lock on the mutex. Any later threads will block
at the call to mutex/lock! until the thread that owns the lock
releases the lock with mutex/unlock!.
In terms of how it's possible to implement this, take a look at test-and-set.
In computer science, the test-and-set instruction is an instruction
used to write to a memory location and return its old value as a
single atomic (i.e., non-interruptible) operation. If multiple
processes may access the same memory, and if a process is currently
performing a test-and-set, no other process may begin another
test-and-set until the first process is done. CPUs may use
test-and-set instructions offered by other electronic components, such
as dual-port RAM; CPUs may also offer a test-and-set instruction
themselves.
The calling process obtains the lock if the old value was 0. It spins
writing 1 to the variable until this occurs.
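As a rough sketch of that recipe, here is a minimal test-and-set spinlock written with the C11 atomic_flag type; the function names ts_lock/ts_unlock are invented for this example, and atomic_flag_test_and_set maps to an atomic exchange or test-and-set style instruction on most targets:

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void ts_lock(void)
{
    /* Atomically set the flag to 1 and get its old value.
       If the old value was already set, someone else holds the lock: spin. */
    while (atomic_flag_test_and_set(&lock_flag))
        ;  /* busy-wait */
}

void ts_unlock(void)
{
    /* Write "not held" back so another thread's test-and-set can succeed. */
    atomic_flag_clear(&lock_flag);
}

atomic_flag_test_and_set atomically sets the flag and returns whether it was already set, which is exactly the "write 1 and look at the old value" behaviour described above.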

A mutex, when properly implemented, can never be locked concurrently. For this, you need some atomic operations (operations that are guaranteed to be the only thing happening to an object at one moment) that have useful properties.
One such operation is xchg (exchange) in the x86 architecture. For instance xchg eax, [ebp] will read the value at the address ebp, write the value in eax to the address ebp and then set eax to the read value, while guaranteeing that these actions won't be interleaved with concurrent reads and writes to that address.
Now you can implement a mutex. To lock, load 1 into eax, exchange eax with the value of the mutex and look at eax. If it's 1, it was already locked, so you might want to sleep and try again later. If it's 0, you just locked the mutex. To unlock, simply write a value of 0 to the mutex.
Please note that I'm glossing over important details here. For instance, an ordinary read-modify-write such as inc [ebp] is atomic enough for pre-emptive multitasking on a single processor, but when you're sharing memory between multiple processors (e.g. in a multi-core system) it isn't enough unless you use the lock prefix (lock inc [ebp]), which ensures that only one processor can access that memory while the instruction is executed. xchg with a memory operand is a special case: it asserts the lock semantics implicitly, so it stays atomic across processors without an explicit prefix.
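Here is the same lock/unlock recipe sketched in C rather than assembly. The names my_lock/my_unlock are invented for the example, and __atomic_exchange_n is a GCC/Clang builtin that compiles to an atomically locked exchange such as x86's xchg:

static volatile int mutex = 0;   /* 0 = unlocked, 1 = locked */

void my_lock(void)
{
    /* Atomically store 1 and fetch the previous value, like xchg eax, [mutex].
       If the previous value was 1, the mutex was already held, so retry. */
    while (__atomic_exchange_n(&mutex, 1, __ATOMIC_ACQUIRE) == 1)
        ;  /* spin (a real implementation would back off or sleep) */
}

void my_unlock(void)
{
    /* Unlock by publishing 0 again. */
    __atomic_store_n(&mutex, 0, __ATOMIC_RELEASE);
}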

Related

x86 rep instructions, lock prefix, atomics and real-time

Consider the following case:
Thread A (evil code I do not control):
# Repeat some string operation for an arbitrarily long time
lock rep stosq ...
Thread B (my code, should have lock-free behaviour):
# some simple atomic operation involving a `lock` prefix
# on the beginning of the data processed by the repeating instruction
lock cmpxchg ...
Can I assume that the lock taken by rep stosq will be held for each individual element, and not for the instruction's execution as a whole? Otherwise, doesn't that mean that any code which should have real-time semantics (no loops, no syscalls, total functions, every operation terminating in finite time, etc.) can still be broken just by having some "evil" code in another thread do such a thing, which would block the cmpxchg on the other thread for an arbitrarily long time?
The threat model I'm worried about is a denial-of-service attack against other "users" (including kernel IRQ handlers) on a real-time OS, where the "service" guarantees include very low latency interrupt handling.
If not lock rep stos, is there anything else I should worry about?
The lock rep stosq (and lock rep movsd, lock rep cmpsd, ...) aren't legal instructions.
If they were legal, they'd behave more like rep (lock stosq), locking for each individual stosq.
If not lock rep stos, is there anything else I should worry about?
You might worry about very old CPUs. Specifically, the original Pentium CPUs had a flaw called the "F00F bug" (see https://en.wikipedia.org/wiki/Pentium_F00F_bug ) and old Cyrix CPUs had a flaw called the "coma bug" (see https://en.wikipedia.org/wiki/Cyrix_coma_bug ). For both of these (if the OS doesn't provide a viable work-around), unprivileged software is able to trick the CPU into locking up forever.

Is assembly instruction "inc rax" atomic?

I know that modern CPUs have instruction pipelining, so the execution of every single machine instruction is separated into several steps, for example the classic five-stage RISC pipeline. My question is whether the assembly instruction inc rax is atomic when it is executed by different threads. Is it possible that thread A is in the Instruction Execution (EX) stage, calculating the result by incrementing the current value in register rax by 1, while thread B is in the Instruction Decoding (ID) stage, reading from register rax a value that has not yet been incremented by thread A? In that case there would be a data race between threads A and B. Is this correct?
TL;DR: For a multithreaded program on x86-64, inc rax cannot cause or suffer any data race issues.
At the machine level, there are two senses of "atomic" that people usually use.
One is atomicity with respect to concurrent access by multiple cores. In this sense, the question doesn't really even make sense, because cores do not share registers; each has its own independent set. So a register-only instruction like inc rax cannot affect, or be affected by, anything that another core may be doing. There is certainly no data race issue to worry about.
Atomicity concerns in this sense only arise when two or more cores are accessing a shared resource - primarily memory.
The other is atomicity on a single core with respect to interrupts - if a hardware interrupt or exception occurs while an instruction is executing on the same core, what happens, and what machine state is observed by the interrupt handler? Here we do have to think about registers, because the interrupt handler can observe the same registers that the main code was using.
The answer is that x86 has precise interrupts, where interrupts appear to occur "between instructions". When calling the interrupt handler, the CPU pushes CS:RIP onto the stack, and the architectural state of the machine (registers, memory, etc) is as if:
the instruction pointed to by CS:RIP, and all subsequent instructions, have not begun to execute at all; the architectural state reflects none of their effects.
all instructions previous to CS:RIP have completely finished, and the architectural state reflects all of their effects.
On an old-fashioned in-order scalar CPU, this is easily accomplished by having the CPU check for interrupts as a step in between the completion of one instruction and the execution of the next. On a pipelined CPU, it takes more work; if there are several instructions in flight, the CPU may wait for some of them to retire, and abort the others.
For more details, see When an interrupt occurs, what happens to instructions in the pipeline?
There are a few exceptions to this rule: e.g. the AVX-512 scatter/gather instructions may be partially completed when an interrupt occurs, so that some of the loads/stores have been done and others have not. But they update their mask register in such a way that when the instruction is restarted, only the remaining loads/stores will be done.
From the point of view of an application on a multitasking operating system, threads can run simultaneously on several cores, or run sequentially on a single core (or some combination). In the first case, there is no problem with inc rax as the registers are not shared between cores. In the second case, each thread still has its own register set as part of its context. Your thread may be interrupted by a hardware interrupt at any time (e.g. timer tick), and the OS may then decide to schedule in a different thread. To do so, it saves your thread's context, including the register contents at the time of the interrupt - and since we have precise interrupts, these contents reflect instructions in an all-or-nothing fashion. So inc rax is atomic for that purpose; when another thread gets control, the saved context of your thread has either all the effects of inc rax or none of them. (And it usually doesn't even matter, because the only machine state affected by inc rax is registers, and other threads don't normally try to observe the saved context of threads which are scheduled out, even if the OS provides a way to do that.)
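To illustrate the distinction with a hedged example (the names below are made up): an increment of a thread-local value that typically lives in a register is harmless, while an increment of a shared memory location is a read-modify-write that does need an explicitly atomic operation such as C11's atomic_fetch_add, which compiles to a lock-prefixed instruction on x86:

#include <stdatomic.h>

long local_count(long n)
{
    long c = 0;              /* typically kept in a register; each thread has its own */
    for (long i = 0; i < n; i++)
        c++;                 /* compiles to something like inc rax: nothing shared, no race */
    return c;
}

atomic_long shared_count;    /* shared between threads */

void bump_shared(void)
{
    /* A plain increment of a non-atomic shared variable would be a racy
       load/inc/store sequence; the atomic fetch-add is a single locked RMW. */
    atomic_fetch_add(&shared_count, 1);
}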

Does an atomic instruction involve the kernel?

I'm reading this link to learn about futex of Linux. Here is something that I don't understand.
In order to acquire the lock, an atomic test-and-set instruction (such
as cmpxchg()) can be used to test for 0 and set to 1. In this case,
the locking thread acquires the lock without involving the kernel (and
the kernel has no knowledge that this futex exists). When the next
thread attempts to acquire the lock, the test for zero will fail and
the kernel needs to be involved.
I don't quite understand why "acquires the lock without involving the kernel".
I have always thought that an atomic instruction, such as test-and-set, involves the kernel.
So why doesn't the first acquisition of the lock involve the kernel? More specifically, must an atomic instruction involve the kernel, or may it?
An atomic test and set instruction is just an ordinary instruction executed by user code as normal. It doesn't involve the kernel.
Futexes provide an efficient way to perform a lock and unlock operation without involving the kernel in the fast paths. However, if a process needs to be put to sleep (to wait to acquire the lock) or woken (because it couldn't acquire the lock but now can), then the kernel has to be involved to perform the scheduling operations.
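As a hedged sketch of that split, loosely modelled on the simple lock in Ulrich Drepper's "Futexes Are Tricky" (Linux-specific, names invented, error handling omitted; for brevity the unlock path always issues the wake syscall, whereas a real implementation tracks whether there are waiters so the uncontended unlock also avoids the kernel):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked (a production lock also distinguishes "locked with waiters") */
static atomic_int futex_word = 0;

static void futex_lock(void)
{
    int expected = 0;
    /* Fast path: one atomic compare-and-swap in user space, no kernel involved. */
    while (!atomic_compare_exchange_strong(&futex_word, &expected, 1)) {
        /* Slow path: the lock was held; ask the kernel to put us to sleep
           while the value is still 1. */
        syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
        expected = 0;        /* the failed CAS overwrote 'expected'; reset before retrying */
    }
}

static void futex_unlock(void)
{
    atomic_store(&futex_word, 0);
    /* Wake at most one sleeper, if any (harmless if nobody is waiting). */
    syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}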

Taking a semaphore must be atomic. Is Pintos's sema_down safe?

This piece of code comes from Pintos source:
https://www.cs.usfca.edu/~benson/cs326/pintos/pintos/src/threads/synch.c
void
sema_down (struct semaphore *sema)
{
  enum intr_level old_level;

  ASSERT (sema != NULL);
  ASSERT (!intr_context ());

  old_level = intr_disable ();
  while (sema->value == 0)
    {
      list_push_back (&sema->waiters, &thread_current ()->elem);
      thread_block ();
    }
  sema->value--;
  intr_set_level (old_level);
}
The act of taking the semaphore is sema->value--;. For this to work, it must be a single atomic operation.
How can we know that it is in fact an atomic operation? I know that modern CPUs guarantee that aligned memory operations (for word/doubleword/quadword, depending on the CPU) are atomic. But here, I am not convinced why it is atomic.
TL;DR: Anything is atomic if you do it with interrupts disabled on a uniprocessor (UP) system, as long as you don't count system devices observing memory via DMA.
Note the intr_disable (); / intr_set_level (old_level); around the operation.
modern CPUs guarantee that aligned memory operations are atomic
For multi-threaded observers, that only applies to separate loads or stores, not read-modify-write operations.
For something to be atomic, we have to consider what potential observers we care about. What matters is that nothing can observe the operation as having partially happened. The most straightforward way to achieve that is for the operation to be physically / electrically instantaneous, and affect all the bits simultaneously (e.g. a load or store on a parallel bus goes from not-started to completed at the boundary of a clock cycle, so it's atomic "for free" up to the width of the parallel bus). That's not possible for a read-modify-write, where the best we can do is stop observers from looking between the load and the store.
My answer on Atomicity on x86 explained the same thing a different way, about what it means to be atomic.
In a uniprocessor (UP) system, the only asynchronous observers are other system devices (e.g. DMA) and interrupt handlers. If we can exclude non-CPU observers from writing to our semaphore, then it's just atomicity with respect to interrupts that we care about.
This code takes the easy way out and disables interrupts. That's not necessary (or at least it wouldn't be if we were writing in asm).
An interrupt is handled between two instructions, never in the middle of an instruction. The architectural state of the machine either includes the memory-decrement or it doesn't, because dec [mem] either ran or it didn't. We don't actually need lock dec [mem] for this.
BTW, this is the use-case for cmpxchg without a lock prefix. I always used to wonder why they didn't just make lock implicit in cmpxchg, and the reason is that UP systems often don't need lock prefixes.
The exceptions to this rule are interruptible instructions that can record partial progress, like rep movsb or vpgather / vpscatter (see Interrupting instruction in the middle of execution). These won't be atomic wrt. interrupts even when the only observer is other code on the same core. Only a single iteration of rep whatever, or a single element of a gather or scatter, will have happened or not.
Most SIMD instructions can't record partial progress, so for example vmovdqu ymm0, [rdi] either fully happens or not at all from the PoV of the core it runs on. (But not of course guaranteed atomic wrt. other observers in the system, like DMA or MMIO, or other cores. That's when the normal load/store atomicity guarantees matter.)
There's no reliable way to make sure the compiler emits dec [value] instead of something like this:
mov eax, [value]
;; interrupt here = bad
dec eax
;; interrupt here = bad
mov [value], eax
ISO C11 / C++11 doesn't provide a way to request atomicity with respect to signal handlers / interrupts, but not other threads. They do provide atomic_signal_fence as a compiler barrier (vs. thread_fence as a barrier wrt. other threads/cores), but barriers can't create atomicity, only control ordering wrt. other operations.
C11/C++11 volatile sig_atomic_t does have this idea in mind, but it only provides atomicity for separate loads/stores, not RMW. It's a typedef for int on x86 Linux. See that question for some quotes from the standard.
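The pattern volatile sig_atomic_t is designed for looks roughly like this (a minimal sketch assuming POSIX-style signal handling): a handler that only ever stores the flag and a main loop that only ever loads it, with no read-modify-write anywhere:

#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
    (void)sig;
    got_signal = 1;          /* a single store: atomic wrt. the interrupted code */
}

int main(void)
{
    signal(SIGINT, handler);
    while (!got_signal) {
        /* do work; the flag is only ever loaded or stored, never read-modify-written */
    }
    return 0;
}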
On specific implementations, gcc -Wa,-momit-lock-prefix=yes will omit all lock prefixes. (GAS 2.28 docs) This is safe for single-threaded code, or a uniprocessor machine, if your code doesn't include device-driver hardware access that needs to do an atomic RMW on a MMIO location, or that uses a dummy lock add as a faster mfence.
But this is unusable in a multi-threaded program that needs to run on SMP machines, if you have some atomic RMWs between threads as well as some between a thread and a signal handler.

How is thread synchronization implemented, at the assembly language level?

While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
In practice, these tend to be implemented with CAS and LL/SC.
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, Wikipedia gives you an example which trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, a CAS is not needed for crafting a spinlock - but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the #LOCK signal that ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted half-way. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not quite be the same - it could be preempted just between the two movs.)
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock:
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff (see the sketch after this list),
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning, so that the core you are running on can do something useful in the meantime,
you should really give up on spinning and yield control to other threads after a while
etc...
this is still user mode - if you are writing a kernel, you might have some other tools that you can use as well (since you are the one that schedules threads and handles/enables/disables interrupts).
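To make the first two bullets concrete, here is a hedged user-mode sketch of a test-and-test-and-set (TTAS) spinlock with a pause hint and a crude exponential backoff, written with GCC/Clang builtins; the function names are invented for the example:

#include <immintrin.h>   /* _mm_pause(); on non-x86 you'd use the equivalent spin hint */

static volatile int slock = 0;   /* 0 = free, 1 = held */

void ttas_lock(void)
{
    unsigned backoff = 1;
    for (;;) {
        /* "Test": spin on plain loads first, so we only read the cache line
           instead of hammering it with atomic read-modify-writes. */
        while (__atomic_load_n(&slock, __ATOMIC_RELAXED) != 0)
            _mm_pause();

        /* "Test-and-set": one atomic exchange to actually try to take it. */
        if (__atomic_exchange_n(&slock, 1, __ATOMIC_ACQUIRE) == 0)
            return;                      /* got it */

        /* Lost the race: back off exponentially before trying again. */
        for (unsigned i = 0; i < backoff; i++)
            _mm_pause();
        if (backoff < 1024)
            backoff *= 2;
        /* A real lock would eventually yield or sleep instead of spinning forever. */
    }
}

void ttas_unlock(void)
{
    __atomic_store_n(&slock, 0, __ATOMIC_RELEASE);
}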
The x86 architecture has long had an instruction called xchg which will exchange the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that could be applied to any single instruction to make that instruction atomic. Before there were multi-processor systems, all this really did was prevent an interrupt from being delivered in the middle of a locked instruction. (xchg was implicitly locked.)
This article has some sample code using xchg to implement a spinlock
http://en.wikipedia.org/wiki/Spinlock
When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize all of the memory subsystems, including the L1 cache on all of the processors. About this time, new research into locking and lockless algorithms showed that atomic CompareAndSet was a more flexible primitive to have, so more modern CPUs have that as an instruction.
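To illustrate why compare-and-set is the more flexible primitive: a CAS retry loop can make essentially any read-modify-write of a single word atomic, not just a blind exchange. Here is a hedged sketch using the GCC/Clang builtin __atomic_compare_exchange_n, with an invented function name:

/* Atomically update *target to new_val only if new_val is still bigger:
   an "atomic max", which a plain exchange cannot express. */
void atomic_store_max(volatile long *target, long new_val)
{
    long old = __atomic_load_n(target, __ATOMIC_RELAXED);
    while (old < new_val &&
           !__atomic_compare_exchange_n(target, &old, new_val,
                                        /*weak=*/0,
                                        __ATOMIC_RELEASE, __ATOMIC_RELAXED)) {
        /* CAS failed: 'old' now holds the current value; loop and re-check. */
    }
}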
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
I like to think of thread synchronization as built bottom-up, where the processor and operating system provide primitive constructs on top of which more sophisticated ones are built:
At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation ... you also have other processor constructs that allow you to disable and enable interrupts (however, they are considered dangerous ... under certain circumstances you have no option but to use them)
The operating system provides the ability to context-switch between tasks, which can happen every time a thread has used up its time slice ... or for other reasons (I will come to that)
Then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think spinning mutex) ... a spinning mutex continuously waits for the condition to become true and checks for that condition atomically
These spinning mutexes can then use the functionality provided by the OS (context switching and system calls like yield, which relinquishes control to another thread) to give us blocking mutexes
These constructs are further utilized by higher-level constructs like condition variables (which can keep track of how many threads are waiting for the mutex, and which thread to allow first when the mutex becomes available)
These constructs can then be further used to provide more sophisticated synchronization constructs ... example: semaphores, etc.
