I know that modern CPUs have instruction pipelining, so that the execution of every machine instruction is split into several stages, for example the classic five-stage RISC pipeline. My question is whether the assembly instruction inc rax is atomic when it is executed by different threads. Is it possible that thread A is in the Execute (EX) stage, incrementing the current value in register rax by 1, while thread B is in the Instruction Decode (ID) stage, reading from register rax a value that thread A has not yet incremented? In that case there would be a data race between threads A and B - is this correct?
TL;DR: For a multithreaded program on x86-64, inc rax cannot cause or suffer any data race issues.
At the machine level, there are two senses of "atomic" that people usually use.
One is atomicity with respect to concurrent access by multiple cores. In this sense, the question doesn't really even make sense, because cores do not share registers; each has its own independent set. So a register-only instruction like inc rax cannot affect, or be affected by, anything that another core may be doing. There is certainly no data race issue to worry about.
Atomicity concerns in this sense only arise when two or more cores are accessing a shared resource - primarily memory.
The other is atomicity on a single core with respect to interrupts - if a hardware interrupt or exception occurs while an instruction is executing on the same core, what happens, and what machine state is observed by the interrupt handler? Here we do have to think about registers, because the interrupt handler can observe the same registers that the main code was using.
The answer is that x86 has precise interrupts, where interrupts appear to occur "between instructions". When calling the interrupt handler, the CPU pushes CS:RIP onto the stack, and the architectural state of the machine (registers, memory, etc) is as if:
the instruction pointed to by CS:RIP, and all subsequent instructions, have not begun to execute at all; the architectural state reflects none of their effects.
all instructions previous to CS:RIP have completely finished, and the architectural state reflects all of their effects.
On an old-fashioned in-order scalar CPU, this is easily accomplished by having the CPU check for interrupts as a step in between the completion of one instruction and the execution of the next. On a pipelined CPU, it takes more work; if there are several instructions in flight, the CPU may wait for some of them to retire, and abort the others.
For more details, see When an interrupt occurs, what happens to instructions in the pipeline?
There are a few exceptions to this rule: e.g. the AVX-512 scatter/gather instructions may be partially completed when an interrupt occurs, so that some of the loads/stores have been done and others have not. But it sets the registers in such a way that when returning to execute the instruction again, only the remaining loads/stores will be done.
From the point of view of an application on a multitasking operating system, threads can run simultaneously on several cores, or run sequentially on a single core (or some combination). In the first case, there is no problem with inc rax as the registers are not shared between cores. In the second case, each thread still has its own register set as part of its context. Your thread may be interrupted by a hardware interrupt at any time (e.g. timer tick), and the OS may then decide to schedule in a different thread. To do so, it saves your thread's context, including the register contents at the time of the interrupt - and since we have precise interrupts, these contents reflect instructions in an all-or-nothing fashion. So inc rax is atomic for that purpose; when another thread gets control, the saved context of your thread has either all the effects of inc rax or none of them. (And it usually doesn't even matter, because the only machine state affected by inc rax is registers, and other threads don't normally try to observe the saved context of threads which are scheduled out, even if the OS provides a way to do that.)
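To make the contrast concrete, here is a minimal C sketch (my own names; compile with -pthread): each thread's register-allocated local counter increments race-free, while the same ++ applied to shared memory is a data race:

#include <pthread.h>
#include <stdio.h>

static int shared;                      /* plain int: the ++ below is a data race */

static void *worker(void *arg) {
    (void)arg;
    int local = 0;                      /* lives in a register: private per thread */
    for (int i = 0; i < 1000000; i++) {
        local++;                        /* register increment, like inc rax: no race possible */
        shared++;                       /* non-atomic load/add/store on memory: racy */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("shared = %d (often below 2000000 due to lost updates)\n", shared);
    return 0;
}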
Related
The mfence documentation says the following:
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.
As far as I know, there is no fence instruction in x86 that prevents the reordering of non read and non write instructions.
Now if my program has only one thread, then even if the instructions are reordered, it would still appear as if they were executing in order.
But what if my program has multiple threads, and in one of the threads the non-read and non-write instructions are reordered - will the other threads notice this reordering? (I assume the answer is no, or else there would be a fence instruction to stop such reordering, or maybe I'm missing something.)
will the other threads notice this reordering
No, other than performance (timing or direct measurement with HW performance counters). Or microarchitectural side-channels (like ALU port pressure for logical cores that share a physical core with Hyperthreading / SMT): one thread can time itself to learn something about what the other hardware thread is executing.
The only "normal" way for threads to observe anything about each other is by loading data that other threads stored.
Even load ordering is only visible indirectly (by the effect it has on what the other thread decides to later store).
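For contrast, the one memory reordering that x86 does let other threads observe is StoreLoad (via the store buffer). A minimal C11 litmus-test sketch (variable names are mine; you would need to run it many times in a loop to actually catch the reordered outcome):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;    // both start at 0
int r1, r2;

void *t1(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);  // can execute before the store drains
    return arg;
}

void *t2(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return arg;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);  // r1=0 r2=0 is allowed: each load passed the other thread's store
    return 0;
}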
As far as I know, there is no fence instruction in x86 that prevents the reordering of non read and non write instructions.
On Intel CPUs (but not AMD), lfence does this. Intel's manual says so; this is not just an implementation detail - it's actually guaranteed for future microarchitectures.
Intel's LFENCE instruction-set reference manual entry:
LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
(completed locally = retired from the out-of-order core, i.e. leaves the ROB).
lfence is not particularly useful as an actual load barrier, because x86 doesn't allow weakly-ordered loads from WB memory (only from WC). (Not even movntdqa or prefetchnta can create weakly-ordered loads from normal WB memory.) So unlike sfence, lfence is basically never needed for memory ordering, only for its special effects, like lfence ; rdtsc. Or for Spectre mitigation, to block speculative execution past it.
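As a concrete example of those special effects, here is a sketch of the lfence ; rdtsc idiom in C with intrinsics (the function name is mine; remember the OoO-blocking guarantee for lfence is Intel-specific, as noted above):

#include <x86intrin.h>   // __rdtsc, _mm_lfence (GCC/Clang on x86)
#include <stdio.h>

static unsigned long long timed_region(void) {
    _mm_lfence();                        // don't let rdtsc execute before earlier work finishes
    unsigned long long start = __rdtsc();
    _mm_lfence();                        // keep the timed region after the first rdtsc

    volatile int sink = 0;               // stand-in for the code being timed
    for (int i = 0; i < 1000; i++)
        sink += i;

    _mm_lfence();                        // wait until the region has completed locally
    unsigned long long end = __rdtsc();
    return end - start;
}

int main(void) {
    printf("~%llu reference cycles\n", timed_region());
    return 0;
}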
But as an implementation detail, on Intel CPUs including at least Skylake, mfence is a barrier for out-of-order execution. See Are loads and stores the only instructions that gets reordered? for that, and much more related stuff.
This piece of code comes from Pintos source:
https://www.cs.usfca.edu/~benson/cs326/pintos/pintos/src/threads/synch.c
void
sema_down (struct semaphore *sema)
{
  enum intr_level old_level;

  ASSERT (sema != NULL);
  ASSERT (!intr_context ());

  old_level = intr_disable ();
  while (sema->value == 0)
    {
      list_push_back (&sema->waiters, &thread_current ()->elem);
      thread_block ();
    }
  sema->value--;
  intr_set_level (old_level);
}
The act of taking the semaphore is sema->value--;. For this to work, it must be a single atomic operation.
How can we know that it is in fact an atomic operation? I know that modern CPUs guarantee that aligned memory operations (on a word/doubleword/quadword, depending on the operand size) are atomic. But here, I am not convinced why this one is atomic.
TL;DR: Anything is atomic if you do it with interrupts disabled on a UP (uniprocessor) system, as long as you don't count system devices observing memory via DMA.
Note the intr_disable (); / intr_set_level (old_level); around the operation.
modern CPU guarantees that aligned memory operation are atomic
For multi-threaded observers, that only applies to separate loads or stores, not read-modify-write operations.
For something to be atomic, we have to consider what potential observers we care about. What matters is that nothing can observe the operation as having partially happened. The most straightforward way to achieve that is for the operation to be physically / electrically instantaneous, and affect all the bits simultaneously (e.g. a load or store on a parallel bus goes from not-started to completed at the boundary of a clock cycle, so it's atomic "for free" up to the width of the parallel bus). That's not possible for a read-modify-write, where the best we can do is stop observers from looking between the load and the store.
My answer on Atomicity on x86 explained the same thing a different way, about what it means to be atomic.
In a uniprocessor (UP) system, the only asynchronous observers are other system devices (e.g. DMA) and interrupt handlers. If we can exclude non-CPU observers from writing to our semaphore, then it's just atomicity with respect to interrupts that we care about.
This code takes the easy way out and disables interrupts. That's not necessary (or at least it wouldn't be if we were writing in asm).
An interrupt is handled between two instructions, never in the middle of an instruction. The architectural state of the machine either includes the memory-decrement or it doesn't, because dec [mem] either ran or it didn't. We don't actually need lock dec [mem] for this.
BTW, this is the use-case for cmpxchg without a lock prefix. I always used to wonder why they didn't just make lock implicit in cmpxchg, and the reason is that UP systems often don't need lock prefixes.
The exceptions to this rule are interruptible instructions that can record partial progress, like rep movsb or vpgather / vpscatter; see Interrupting instruction in the middle of execution. These won't be atomic wrt. interrupts even when the only observer is other code on the same core. Only a single iteration of rep whatever, or a single element of a gather or scatter, will have happened or not.
Most SIMD instructions can't record partial progress, so for example vmovdqu ymm0, [rdi] either fully happens or not at all from the PoV of the core it runs on. (But not of course guaranteed atomic wrt. other observers in the system, like DMA or MMIO, or other cores. That's when the normal load/store atomicity guarantees matter.)
There's no reliable way to make sure the compiler emits dec [value] instead of something like this:
mov eax, [value]
;; interrupt here = bad
dec eax
;; interrupt here = bad
mov [value], eax
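(If implementation-specific extensions are an option, a GNU C inline-asm sketch can force the single-instruction RMW; value here is a hypothetical global:)

/* GNU-C-specific sketch, not portable ISO C: emit exactly one
   memory-destination dec, which is atomic wrt. interrupts on the
   same core - no lock prefix needed on a UP system. */
static volatile int value;

static inline void up_dec(void) {
    asm volatile ("decl %0" : "+m" (value));
}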
ISO C11 / C++11 doesn't provide a way to request atomicity with respect to signal handlers / interrupts, but not other threads. They do provide atomic_signal_fence as a compiler barrier (vs. thread_fence as a barrier wrt. other threads/cores), but barriers can't create atomicity, only control ordering wrt. other operations.
C11/C++11 volatile sig_atomic_t does have this idea in mind, but it only provides atomicity for separate loads/stores, not RMW. It's a typedef for int on x86 Linux. See that question for some quotes from the standard.
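For illustration, the intended use of volatile sig_atomic_t is something like this sketch (my own names; note the flag is only ever loaded or stored as a whole, never read-modify-written):

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig) {
    (void)sig;
    got_signal = 1;        // a single store: atomic wrt. the code it interrupted
}

int main(void) {
    signal(SIGINT, handler);
    while (!got_signal)    // a single load per iteration; no RMW anywhere
        ;                  // real code would do useful work here
    puts("caught SIGINT");
    return 0;
}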
On specific implementations, gcc -Wa,-momit-lock-prefix=yes will omit all lock prefixes. (GAS 2.28 docs) This is safe for single-threaded code, or a uniprocessor machine, if your code doesn't include device-driver hardware access that needs to do an atomic RMW on a MMIO location, or that uses a dummy lock add as a faster mfence.
But this is unusable in a multi-threaded program that needs to run on SMP machines, if you have some atomic RMWs between threads as well as some between a thread and a signal handler.
I'm interested in how a mutex works. I understand the purpose of mutexes, as every website I have found explains what they do, but I haven't been able to understand what happens in this case:
There are two threads running concurrently and they try to lock the mutex at the same time.
This would not be a problem on a single core, as this situation could never arise there, but on a multi-core system I see it as a problem. I can't see any way to prevent concurrency problems like this, but mechanisms to do so obviously exist.
Thanks for any help
It's not possible for two threads to lock a system-wide mutex at the same time; one will lock it and the other will be blocked.
The semantics of mutex/lock! ensure that only one thread can execute beyond the lock call at any one time. The first thread that reaches the call acquires the lock on the mutex. Any later threads will block at the call to mutex/lock! until the thread that owns the lock releases the lock with mutex/unlock!.
In terms of how it's possible to implement this, take a look at test-and-set.
In computer science, the test-and-set instruction is an instruction used to write to a memory location and return its old value as a single atomic (i.e., non-interruptible) operation. If multiple processes may access the same memory, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process is done. CPUs may use test-and-set instructions offered by other electronic components, such as dual-port RAM; CPUs may also offer a test-and-set instruction themselves.
The calling process obtains the lock if the old value was 0. It spins writing 1 to the variable until this occurs.
A mutex, when properly implemented, can never be locked concurrently. For this, you need some atomic operations (operations that are guaranteed to be the only thing happening to an object at one moment) that have useful properties.
One such operation is xchg (exchange) in the x86 architecture. For instance, xchg eax, [ebp] will read the value at the address in ebp, write the value in eax to that address, and then set eax to the value that was read, while guaranteeing that these actions won't be interleaved with concurrent reads and writes to that address.
Now you can implement a mutex. To lock, load 1 into eax, exchange eax with the value of the mutex and look at eax. If it's 1, it was already locked, so you might want to sleep and try again later. If it's 0, you just locked the mutex. To unlock, simply write a value of 0 to the mutex.
Please note that I'm glossing over important details here. For instance, x86's xchg with a memory operand is implicitly locked, so it stays atomic even when you're sharing memory between multiple processors (e.g. in a multi-core system). Other read-modify-write instructions need an explicit lock prefix for that guarantee (e.g. lock inc dword [ebp]), which ensures that only one processor can access that memory while the instruction is executed.
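Here is roughly what that xchg-based lock looks like in portable C11 (a sketch with my own names; on x86 the compiler emits xchg with a memory operand, which is implicitly locked, for atomic_exchange):

#include <stdatomic.h>

static atomic_int mutex = 0;       // 0 = unlocked, 1 = locked

static void lock(void) {
    // atomic_exchange is the C11 spelling of the xchg idiom:
    // store 1 and get the old value back in one atomic step.
    while (atomic_exchange(&mutex, 1) == 1)
        ;                          // it was already locked: spin (or sleep and retry)
}

static void unlock(void) {
    atomic_store(&mutex, 0);       // a plain atomic store releases the lock
}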
What will happen if the CPU receives an interrupt in the middle of a long instruction? Will the CPU execute the whole instruction, or only part of it?
From a programmer's point of view, either a specific instruction is retired, with all its side effects committed to registers/memory, or it isn't (and it's as if the instruction wasn't executed at all). The whole point of instruction retirement is to guarantee a coherent view of the program state at the point of external events, such as interrupts.
That's notably why instructions retire in order, so that external observers can still look at the architectural state of the CPU as if it were executing instructions sequentially.
There are exceptions to this, notably the REP-string class of instructions.
I believe this is what you asked about, but if it is not, then let me ask you: how would you observe that an instruction was "partially" executed from anywhere?
As far as I know, it depends on the processor and the instruction. More specifically, it depends on whether and when the processor samples for pending interrupts. If the processor only looks for pending interrupts after completing the current instruction, then clearly nothing will interrupt an instruction in the middle. However, if a long instruction is executing, it might be beneficial (latency-wise) to sample for interrupts several times during the instruction's execution. There is a downside to this: in that case, the processor has to roll back any changes that were made to registers and flags as a result of the instruction's partial execution, because after the interrupt completes it has to go back and reissue that instruction.
(Source: Wikipedia, http://en.wikipedia.org/wiki/Interrupt)
While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
In practice, these tend to be implemented with CAS and LL/SC.
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, Wikipedia gives you an example which trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, a CAS is not needed for crafting a spinlock - but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the #LOCK signal that ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted half-way. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not quite be the same - it could be preempted just between the two movs.)
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock:
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff,
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning - so that the core you are running on can do something useful in the meantime
you should really give up on spinning and yield control to other threads after a while
etc...
this is still user mode - if you are writing a kernel, you might have some other tools that you can use as well (since you are the one that schedules threads and handles/enables/disables interrupts).
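Putting those points together, a TTAS spinlock with a pause hint and a yield fallback might be sketched like this (names and the give-up policy are mine; exponential backoff is left out for brevity):

#include <stdatomic.h>
#include <sched.h>         // sched_yield (POSIX)
#include <immintrin.h>     // _mm_pause

static atomic_int lock_word = 0;   // 0 = free, 1 = held

static void ttas_lock(void) {
    for (;;) {
        // "Test" with plain loads first: spin read-only on the cached line
        // instead of hammering it with atomic exchanges.
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0)
            _mm_pause();                      // spin hint for Hyper-Threaded CPUs

        if (atomic_exchange(&lock_word, 1) == 0)
            return;                           // "test-and-set" succeeded: lock acquired

        sched_yield();   // still contended: give the time slice to another thread
    }
}

static void ttas_unlock(void) {
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}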
The x86 architecture has long had an instruction called xchg, which exchanges the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that could be applied to a single instruction to make that instruction atomic. Before there were multiprocessor systems, all this really did was keep other bus masters (such as DMA devices) from touching memory in the middle of a locked instruction. (xchg with a memory operand was implicitly locked.)
This article has some sample code using xchg to implement a spinlock
http://en.wikipedia.org/wiki/Spinlock
When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize the whole memory subsystem, including the L1 caches on all of the processors. About this time, new research into locking and lockless algorithms showed that an atomic CompareAndSet (CAS) was a more flexible primitive to have, so more modern CPUs have that as an instruction.
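To see why CAS is the more flexible primitive: any read-modify-write can be built from a retry loop around it. A C11 sketch (the function name is mine; on x86 this compiles to a lock cmpxchg loop):

#include <stdatomic.h>

// An arbitrary lock-free RMW - not just increment - built on CAS.
static int fetch_and_double(atomic_int *p) {
    int old = atomic_load(p);
    // If *p still equals old, replace it with old * 2. On failure,
    // old is reloaded with the current value and we simply retry.
    while (!atomic_compare_exchange_weak(p, &old, old * 2))
        ;
    return old;    // the value we doubled, as seen before our update
}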
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
I like to think of thread synchronization as bottom-up, where the processor and operating system provide constructs that serve as primitives for the more sophisticated ones:
At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation ... you also have other processor constructs that allow you to disable and enable interrupts (however, they are considered dangerous ... under certain circumstances you have no option but to use them)
the operating system provides the ability to context-switch between tasks, which can happen every time a thread has used up its time slice ... or due to other reasons (I will come to that)
then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think spinning mutex) ... these will continuously wait for the condition to become true, checking for that condition atomically
then these spinning mutexes can use the functionality provided by the OS (context switching and system calls like yield, which relinquishes control to another thread), which gives us blocking mutexes
these constructs are further utilized by higher-level constructs like condition variables (which can keep track of how many threads are waiting for the mutex, and which thread to allow in first when the mutex becomes available)
These constructs can then be further used to provide more sophisticated synchronization constructs ... for example: semaphores, etc.
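To make that layering concrete, here is a sketch of the last step - a counting semaphore built from a mutex plus a condition variable (POSIX threads; the names are mine):

#include <pthread.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             count;
} my_sem_t;

void my_sem_wait(my_sem_t *s) {
    pthread_mutex_lock(&s->mu);
    while (s->count == 0)                    // loop guards against spurious wakeups
        pthread_cond_wait(&s->cv, &s->mu);   // atomically unlocks mu and sleeps
    s->count--;                              // safe: we hold the mutex here
    pthread_mutex_unlock(&s->mu);
}

void my_sem_post(my_sem_t *s) {
    pthread_mutex_lock(&s->mu);
    s->count++;
    pthread_cond_signal(&s->cv);             // wake one waiter, if any
    pthread_mutex_unlock(&s->mu);
}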