x86 rep instructions, lock prefix, atomics and real-time - multithreading

Consider the following case:
Thread A (evil code I do not control):
# Repeat some string operation for an arbitrarily long time
lock rep stosq ...
Thread 2 (my code, should have lock-free behaviour):
# some simple atomic operation involving a `lock` prefix
# on the beginning of the data processed by the repeating instruction
lock cmpxchg ...
Can I assume that the lock held by rep stosq will be for each individual element, and not for the instruction's execution as a whole ? Otherwise doesn't that mean that every code which should have real-time semantics (no loops, no syscalls, total functions, every operation terminates in a finite time, etc) can still be broken just by having some "evil" code in another thread doing such a thing, which would block the cmpxchg on the other thread for an abritrarily long time?
The threat model I'm worried about is a denial-of-service attack against other "users" (including kernel IRQ handlers) on a real-time OS, where the "service" guarantees include very low latency interrupt handling.
If not lock rep stos, is there anything else I should worry about?

The lock rep stosq (and lock rep movsd, lock rep cmpsd, ...) aren't legal instructions.
If they were legal; they'd be more like rep (lock stosq), locking for a single stosq.
If not lock rep stos, is there anything else I should worry about?
You might worry about very old CPUs. Specifically, the original Pentium CPUs had a flaw called the "F00F bug" (see https://en.wikipedia.org/wiki/Pentium_F00F_bug ) and old Cryix CPUs has a flaw called the "Coma bug" (see https://en.wikipedia.org/wiki/Cyrix_coma_bug ). For both of these (if the OS doesn't provide a viable work-around), unprivileged software is able to trick the CPU into "lock forever".

Related

does the atomic instruction involve the kernel

I'm reading this link to learn about futex of Linux. Here is something that I don't understand.
In order to acquire the lock, an atomic test-and-set instruction (such
as cmpxchg()) can be used to test for 0 and set to 1. In this case,
the locking thread acquires the lock without involving the kernel (and
the kernel has no knowledge that this futex exists). When the next
thread attempts to acquire the lock, the test for zero will fail and
the kernel needs to be involved.
I don't quite understand why "acquires the lock without involving the kernel".
I'm always thinking that the atomic instruction, such as test-and-set, always involves the kernel.
So why does the first time of acquiring the lock won't involve the kernel? More specifically, the atomic instruction must or may involve the kernel?
An atomic test and set instruction is just an ordinary instruction executed by user code as normal. It doesn't involve the kernel.
Futexes provide an efficient way to perform a lock and unlock operation without involving the kernel in the fast paths. However, if a process needs to be put to sleep (to wait to acquire the lock) or woken (because it couldn't acquire the lock but now can), then the kernel has to be involved to perform the scheduling operations.

Taking a semaphore must be atomic. Is Pintos's sema_down safe?

This piece of code comes from Pintos source:
https://www.cs.usfca.edu/~benson/cs326/pintos/pintos/src/threads/synch.c
void
sema_down (struct semaphore *sema)
{
enum intr_level old_level;
ASSERT (sema != NULL);
ASSERT (!intr_context ());
old_level = intr_disable ();
while (sema->value == 0)
{
list_push_back (&sema->waiters, &thread_current ()->elem);
thread_block ();
}
sema->value--;
intr_set_level (old_level);
}
The fact of taking a semaphore is sema->value--;. If it works it must be an atomic one operation.
How can we know that it is atomic operation in fact? I know that modern CPU guarantees that aligned memory operation ( for word/doubleword/quadword- it depends on) are atomic. But, here, I am not convinced why it is atomic.
TL:DR: Anything is atomic if you do it with interrupts disabled on a UP system, as long as you don't count system devices observing memory with DMA.
Note the intr_disable (); / intr_set_level (old_level); around the operation.
modern CPU guarantees that aligned memory operation are atomic
For multi-threaded observers, that only applies to separate loads or stores, not read-modify-write operations.
For something to be atomic, we have to consider what potential observers we care about. What matters is that nothing can observe the operation as having partially happened. The most straightforward way to achieve that is for the operation to be physically / electrically instantaneous, and affect all the bits simultaneously (e.g. a load or store on a parallel bus goes from not-started to completed at the boundary of a clock cycle, so it's atomic "for free" up to the width of the parallel bus). That's not possible for a read-modify-write, where the best we can do is stop observers from looking between the load and the store.
My answer on Atomicity on x86 explained the same thing a different way, about what it means to be atomic.
In a uniprocessor (UP) system, the only asynchronous observers are other system devices (e.g. DMA) and interrupt handlers. If we can exclude non-CPU observers from writing to our semaphore, then it's just atomicity with respect to interrupts that we care about.
This code takes the easy way out and disables interrupts. That's not necessary (or at least it wouldn't be if we were writing in asm).
An interrupt is handled between two instructions, never in the middle of an instruction. The architectural state of the machine either includes the memory-decrement or it doesn't, because dec [mem] either ran or it didn't. We don't actually need lock dec [mem] for this.
BTW, this is the use-case for cmpxchg without a lock prefix. I always used to wonder why they didn't just make lock implicit in cmpxchg, and the reason is that UP systems often don't need lock prefixes.
The exceptions to this rule are interruptible instructions that can record partial progress, like rep movsb or vpgather / vpscatter See Interrupting instruction in the middle of execution These won't be atomic wrt. interrupts even when the only observer is other code on the same core. Only a single iteration of rep whatever, or a single element of a gather or scatter, will have happened or not.
Most SIMD instructions can't record partial progress, so for example vmovdqu ymm0, [rdi] either fully happens or not at all from the PoV of the core it runs on. (But not of course guaranteed atomic wrt. other observers in the system, like DMA or MMIO, or other cores. That's when the normal load/store atomicity guarantees matter.)
There's no reliable way to make sure the compiler emits dec [value] instead of something like this:
mov eax, [value]
;; interrupt here = bad
dec eax
;; interrupt here = bad
mov [value], eax
ISO C11 / C++11 doesn't provide a way to request atomicity with respect to signal handlers / interrupts, but not other threads. They do provide atomic_signal_fence as a compiler barrier (vs. thread_fence as a barrier wrt. other threads/cores), but barriers can't create atomicity, only control ordering wrt. other operations.
C11/C++11 volatile sig_atomic_t does have this idea in mind, but it only provides atomicity for separate loads/stores, not RMW. It's a typedef for int on x86 Linux. See that question for some quotes from the standard.
On specific implementations, gcc -Wa,-momit-lock-prefix=yes will omit all lock prefixes. (GAS 2.28 docs) This is safe for single-threaded code, or a uniprocessor machine, if your code doesn't include device-driver hardware access that needs to do an atomic RMW on a MMIO location, or that uses a dummy lock add as a faster mfence.
But this is unusable in a multi-threaded program that needs to run on SMP machines, if you have some atomic RMWs between threads as well as some between a thread and a signal handler.

How locking is implemented?

i have following code:
while(lock)
;
lock = 1;
// critical section
lock = 0;
As reading or changing lock value is in itself a multi-instruction
read lock
change value
write it
If it happens like:
1) One thread reads the lock and stops there
2) Another thread reads it and sees it is free; lock it and do something untill half
3) First thread wakes up and goes into CS
SO how would locking would be implmented in system ?
Placing variables over top of another variables is not right : it would be like Guarding the guard ?
Stopping other processors threads is also not right ?
It is 100% platform specific. Generally, the CPU provides some form of atomic operation such as exchange or compare and swap. A typical lock might work like this:
1) Create: Store 0 (unlocked) in the variable.
2) Lock: Atomically attempt to switch the value of the variable from 0 (unlocked) to 1 (locked). If we failed (because it wasn't unlocked to begin with), let the CPU rest a bit, and then retry. Use a memory barrier to ensure no future memory operations sneak behind this one.
3) Unlock: Use a memory barrier to ensure previous memory operations don't sneak past this one. Atomically write 0 (unlocked) to the variable.
Note that you really don't need to understand this unless you want to design your own synchronization primitives. And if you want to do that, you need to understand an awful lot more. It's certainly a good idea for every programmer to have a general idea of what he's making the hardware do. But this is an area filled with seriously heavy wizardry. There are so many, many ways this can go horribly wrong. So just use the locking primitives provided by the geniuses who made your platform, compiler, and threading library. Here be dragons.
For example, SMP Pentium Pro systems have an erratum that requires special handling in the unlock operation. A naive implementation of the lock algorithm will cause the branch prediction logic to expect the operation to keep spinning, incurring a massive performance penalty at the worst possible time -- when you first acquire the lock. A naive implementation of the lock algorithm may cause two cores each waiting for the same lock to saturate the bus, slowing the CPU that needs to get work done in order to release the lock to a crawl. These all require heavy wizardry and deep understanding of the hardware to deal with.
In a course I studied at Uni, a possible firmware solution for implementing locks was presented in the form of the "atomicity bit" associated to a memory operation initiated by a processor.
Basically, when locking, you'll notice that you have a sequence of operations that need to be executed atomically: test the value of the flag and, if not set, set it to locked, otherwise try again. This sequence can be made atomic by associating a bit with each memory request send by the CPU. The first N-1 operations will have the bit set, while the last one will have it unset, to mark the end of the atomic sequence.
When the memory module (there can be several modules) where the flag data is stored receives the request for the first operation in the sequence (whose bit is set), it will serve it and not take requests from any other CPU until the CPU that initiated the atomic sequence sends a request with an unset atomicity bit (since these transactions are usually short, a coarse-grain approach like this is acceptable). Note that this is usually made easier by the assembler providing specialized instructions of type "compare-and-set", that do exactly what I mentioned before.

How efficient is locking and unlocked mutex? What is the cost of a mutex?

In a low level language (C, C++ or whatever): I have the choice in between either having a bunch of mutexes (like what pthread gives me or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
(I am not sure how much differences there are between different hardware. If there is, I would also like to know about them. But mostly, I am interested about common hardware.)
The point is, by using many mutex which each cover only a part of the object instead of a single mutex for the whole object, I could safe many blocks. And I am wondering how far I should go about this. I.e. should I try to safe any possible block really as far as possible, no matter how much more complicated and how many more mutexes this means?
WebKits blog post (2016) about locking is very related to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.
I have the choice in between either having a bunch of mutexes or a single one for an object.
If you have many threads and the access to the object happens often, then multiple locks would increase parallelism. At the cost of maintainability, since more locking means more debugging of the locking.
How efficient is it to lock a mutex? I.e. how much assembler instructions are there likely and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the least overhead of a mutex - the memory/cache coherency guarantees are the main overhead. And less often a particular lock is taken - better.
Mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not and (2) wait queue.
Change of the flag is just few instructions and normally done without system call. If mutex is locked, syscall will happen to add the calling thread into wait queue and start the waiting. Unlocking, if the wait queue is empty, is cheap but otherwise needs a syscall to wake up one of the waiting processes. (On some systems cheap/fast syscalls are used to implement the mutexes, they become slow (normal) system calls only in case of contention.)
Locking unlocked mutex is really cheap. Unlocking mutex w/o contention is cheap too.
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
You can throw as much mutex variables into your code as you wish. You are only limited by the amount of memory you application can allocate.
Summary. User-space locks (and the mutexes in particular) are cheap and not subjected to any system limit. But too many of them spells nightmare for debugging. Simple table:
Less locks means more contentions (slow syscalls, CPU stalls) and lesser parallelism
Less locks means less problems debugging multi-threading problems.
More locks means less contentions and higher parallelism
More locks means more chances of running into undebugable deadlocks.
A balanced locking scheme for application should be found and maintained, generally balancing the #2 and the #3.
(*) The problem with less very often locked mutexes is that if you have too much locking in your application, it causes to much of the inter-CPU/core traffic to flush the mutex memory from the data cache of other CPUs to guarantee the cache coherency. The cache flushes are like light-weight interrupts and handled by CPUs transparently - but they do introduce so called stalls (search for "stall").
And the stalls are what makes the locking code to run slowly, often without any apparent indication why application is slow. (Some arch provide the inter-CPU/core traffic stats, some not.)
To avoid the problem, people generally resort to large number of locks to decrease the probability of lock contentions and to avoid the stall. That is the reason why the cheap user space locking, not subjected to the system limits, exists.
I wanted to know the same thing, so I measured it.
On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz),
locking and unlocking an unlocked mutex that is in its own cache line and is already cached, takes 47 clocks (13 ns).
Due to synchronization between two cores (I used CPU #0 and #1),
I could only call a lock/unlock pair once every 102 ns on two threads,
so once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here:
https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state is:
This shows the result of benchmark runs on the following code:
uint64_t do_Ndec(int thread, int loop_count)
{
uint64_t start;
uint64_t end;
int __d0;
asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
mutex.lock();
mutex.unlock();
asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");
return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock `mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than it is for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Despite that the function is slightly faster for thread 1, both loops synchronize because of the call to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom right point is a measurement with a delay loop_count of 150, and then following the points at the bottom, towards the left, the loop_count is reduced by one each measurement. When it becomes 77 the function is called every 102 ns in both threads. If subsequently loop_count is reduced even further it is no longer possible to synchronize the threads and the mutex starts to be actually locked most of the time, resulting in an increased amount of clocks that it takes to do the lock/unlock. Also the average time of the function call increases because of this; so the plot points now go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all my conclusion is that the answer to question of OP is that adding more mutexes is better as long as that results in less contention.
Try to lock mutexes as short as possible. The only reason to put them -say- outside a loop would be if that loop loops faster than once every 100 ns (or rather, number of threads that want to run that loop at the same time times 50 ns) or when 13 ns times the loop size is more delay than the delay you get by contention.
EDIT: I got a lot more knowledgable on the subject now and start to doubt the conclusion that I presented here. First of all, CPU 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy because the delays between two other cores is much larger (ie, 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, the std::mutex is implemented in way that it spin locks for a bit before actually doing system calls when it fails to immediately obtain the lock on a mutex (which no doubt will be extremely slow). So what I have measured here is the absolute most ideal situtation and in practise locking and unlocking might take drastically more time per lock/unlock.
Bottom line, a mutex is implemented with atomics. To synchronize atomics between cores an internal bus must be locked which freezes the corresponding cache line for several hundred clock cycles. In the case that a lock can not be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are in the order of 10 mircoseconds). Normally that is not really a problem because that thread has to sleep anyway-- but it could be a problem with high contention where a thread can't obtain the lock for the time that it normally spins and so does the system call, but CAN take the lock shortly there after. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, that thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).
This depends on what you actually call "mutex", OS mode and etc.
At minimum it's a cost of an interlocked memory operation. It's a relatively heavy operation (compared to other primitive assembler commands).
However, that can be very much higher. If what you call "mutex" a kernel object (i.e. - object managed by the OS) and run in the user mode - every operation on it leads to a kernel mode transaction, which is very heavy.
For example on Intel Core Duo processor, Windows XP.
Interlocked operation: takes about 40 CPU cycles.
Kernel mode call (i.e. system call) - about 2000 CPU cycles.
If this is the case - you may consider using critical sections. It's a hybrid of a kernel mutex and interlocked memory access.
I'm completely new to pthreads and mutex, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran a simple code with a thread pool in which the task was just to compute a sum in a global variable protected by a mutex lock:
y = exp(-j*0.0001);
pthread_mutex_lock(&lock);
x += y ;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.
The cost will vary depending on the implementation but you should keep in mind two things:
the cost will be most likely be minimal since it's both a fairly primitive operation and it will be optimised as much as possible due to its use pattern (used a lot).
it doesn't matter how expensive it is since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)
I just measured it on my Windows 10 system.
This is testing Single Threaded code with no contention at all.
Compiler: Visual Studio 2019, x64 release, with loop overhead subtracted from measurements.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles.
So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.

How is thread synchronization implemented, at the assembly language level?

While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
In practice, these tend to be implemented with CAS and LL/SC.
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, wikipedia gives you an example which trades CAS for lock prefixed xchg on x86/x64. So in a strict sense, a CAS is not needed for crafting a spinlock - but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the #LOCK signal that ensures that the current CPU has exclusive access to the memory. On todays CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted half-way. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not quite be the same - it could be preempted just between the two movs.)
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock:
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff,
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning - so that the core you are running on can do something useful during this
you should really give up on spinning and yield control to other threads after a while
etc...
this is still user mode - if you are writing a kernel, you might have some other tools that you can use as well (since you are the one that schedules threads and handles/enables/disables interrupts).
The x86 architecture, has long had an instruction called xchg which will exchange the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that could be applied to any a single instruction to make that instruction atomic. Before there were multi processor systems, all this really did was to prevent an interrupt from being delivered in the middle of a locked instruction. (xchg was implicitly locked).
This article has some sample code using xchg to implement a spinlock
http://en.wikipedia.org/wiki/Spinlock
When multi CPU and later multi Core systems began to be built, more sophisticated systems were needed to insure that lock and xchg would synchronize all of the memory subsystems, including l1 cache on all of the processors. About this time, new research into locking and lockless algorithms showed that atomic CompareAndSet was a more flexible primitive to have, so more modern CPUs have that as an instruction.
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
I like to think of thread synchronization as a bottom up where processor and operating system provide construct that are primitive to more sophisticated
At the processor level you have CAS and LL/SC which allow you to perform a test and store in a single atomic operation ... you also have other processor constructs that allow you to disable and enable interrupt (however they are considered dangerous ... under certain circumstances you have no other option but to use them)
operating system provides the ability to context switch between tasks which can happen every time a thread has used its time slice ... or it can happen due to otgher reasons (I will come to that)
then there are higher level constructs like mutexes which uses these primitive mechanisms provided by processor (think spinning mutex) ... which will continuously wait for the condition to become true and checks for that condition atomically
then these spinning mutex can use the functionality provided by OS (context switch and system calls like yield which relinquishes the control to another thread) and gives us mutexes
these constructs are further utilized by higher level constructs like conditional variables (which can keep track of how many threads are waiting for the mutex and which thread to allow first when the mutex becomes available)
These constructs than can be further used to provide more sophisticated synchronization constructs ... example : semaphores etc

Resources