Does an atomic instruction involve the kernel? (Linux)

I'm reading this link to learn about Linux futexes. Here is something that I don't understand.
In order to acquire the lock, an atomic test-and-set instruction (such
as cmpxchg()) can be used to test for 0 and set to 1. In this case,
the locking thread acquires the lock without involving the kernel (and
the kernel has no knowledge that this futex exists). When the next
thread attempts to acquire the lock, the test for zero will fail and
the kernel needs to be involved.
I don't quite understand why the thread "acquires the lock without involving the kernel".
I have always thought that atomic instructions, such as test-and-set, involve the kernel.
So why doesn't the first acquisition of the lock involve the kernel? More specifically, must an atomic instruction involve the kernel, or can it run entirely in user space?

An atomic test-and-set instruction is just an ordinary instruction executed by user code as normal. It doesn't involve the kernel.
Futexes provide an efficient way to perform a lock and unlock operation without involving the kernel in the fast paths. However, if a process needs to be put to sleep (to wait to acquire the lock) or woken (because it couldn't acquire the lock but now can), then the kernel has to be involved to perform the scheduling operations.
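To make the fast/slow path split concrete, here is a minimal sketch of a futex-based lock in C. The names futex_lock/futex_unlock and the two-state 0/1 protocol are illustrative, not glibc's actual implementation; a production lock (e.g. the one described in Drepper's "Futexes are Tricky") uses a third "locked with waiters" state so that an uncontended unlock can also skip the syscall.

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    static atomic_int lock_word;   /* 0 = free, 1 = locked */

    static void futex_lock(void)
    {
        int expected = 0;
        /* Fast path: a plain user-space CAS; the kernel is not involved
           and knows nothing about this futex yet. */
        while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
            /* Slow path: contention. Ask the kernel to put us to sleep,
               but only if the word still holds 1 (FUTEX_WAIT semantics). */
            syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        }
    }

    static void futex_unlock(void)
    {
        atomic_store(&lock_word, 0);
        /* Wake at most one sleeping waiter; in this simplified version the
           kernel is entered on every unlock, which a real lock avoids. */
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }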

Related

x86 rep instructions, lock prefix, atomics and real-time

Consider the following case:
Thread A (evil code I do not control):
# Repeat some string operation for an arbitrarily long time
lock rep stosq ...
Thread B (my code, should have lock-free behaviour):
# some simple atomic operation involving a `lock` prefix
# on the beginning of the data processed by the repeating instruction
lock cmpxchg ...
Can I assume that the lock held by rep stosq will be for each individual element, and not for the instruction's execution as a whole? Otherwise doesn't that mean that any code which should have real-time semantics (no loops, no syscalls, total functions, every operation terminates in a finite time, etc.) can still be broken just by having some "evil" code in another thread doing such a thing, which would block the cmpxchg on the other thread for an arbitrarily long time?
The threat model I'm worried about is a denial-of-service attack against other "users" (including kernel IRQ handlers) on a real-time OS, where the "service" guarantees include very low latency interrupt handling.
If not lock rep stos, is there anything else I should worry about?
The lock rep stosq (and lock rep movsd, lock rep cmpsd, ...) aren't legal instructions.
If they were legal, they'd behave more like rep (lock stosq), locking around each individual stosq.
If not lock rep stos, is there anything else I should worry about?
You might worry about very old CPUs. Specifically, the original Pentium CPUs had a flaw called the "F00F bug" (see https://en.wikipedia.org/wiki/Pentium_F00F_bug ) and old Cyrix CPUs had a flaw called the "Coma bug" (see https://en.wikipedia.org/wiki/Cyrix_coma_bug ). For both of these (if the OS doesn't provide a viable work-around), unprivileged software is able to trick the CPU into locking up forever.

Is test-and-set (or other atomic RMW operation) a privileged instruction on any architecture?

Hardware provides atomic instructions like test-and-set, compare-and-swap, load-linked-store-conditional. Are these privileged instructions? That is, can only the OS execute them (and thus requiring a system call)?
I thought they were not privileged and could be executed in user space. But http://faculty.salina.k-state.edu/tim/ossg/IPC_sync/ts.html seems to suggest otherwise. But then, futex(7), under certain conditions, can achieve locking without a system call, which means it must execute the instruction (like test-and-set) without privilege.
Contradiction? If so, which is right?
That page is wrong. It seems to be claiming that lock-free atomic operations are privileged on ISAs in general, but that is not the case. I've never heard of one where atomic test-and-set or any other lock-free operation required kernel mode.
If that was the case, it would require C++11 lock-free atomic read-modify-write operations to compile to system calls, but they don't on x86, ARM, AArch64, MIPS, PowerPC, or any other normal CPU. (try it on https://godbolt.org/).
It would also make "light-weight" locking (which tries to take the lock without a system call) impossible. (http://preshing.com/20111124/always-use-a-lightweight-mutex/)
Normal ISAs allow user-space to do atomic RMW operations, on memory shared between threads or even between separate processes. I'm not aware of a mechanism for disabling atomic RMW for user-space on x86. Even if there is such a thing on any ISA, it's not the normal mode of operation.
Read-only or write-only accesses are usually naturally atomic on aligned locations on all ISAs, up to a certain width (Why is integer assignment on a naturally aligned variable atomic on x86?), but atomic RMW does need hardware support.
On x86, TAS is lock bts, which is unprivileged. (Documentation for the lock prefix). x86 has a wide selection of other atomic operations, like lock add [mem], reg/immediate, lock cmpxchg [mem], reg, and even lock xadd [mem], reg which implements fetch_add when the return value is needed. (Can num++ be atomic for 'int num'?)
Most RISCs have LL/SC, including ARM, MIPS, and PowerPC, as well as all the older no longer common RISC ISAs.
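You can see this directly by compiling a trivial C11 atomic RMW and inspecting the generated assembly (as the Godbolt link above suggests); the function name below is just for illustration.

    #include <stdatomic.h>

    /* Compiled with e.g. `gcc -O2 -S`, this becomes a single unprivileged
       `lock xadd` on x86-64, and an ldaxr/stlxr (LL/SC) retry loop or an
       LSE `ldadd` on AArch64 -- no system call, no kernel mode anywhere. */
    int fetch_and_increment(atomic_int *counter)
    {
        return atomic_fetch_add(counter, 1);
    }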
futex(2) is a system call. If you call it, everything it does is in kernel mode.
It's the fallback mechanism used by light-weight locking in case there is contention, which provides OS-assisted sleep/wake. So it's not futex itself that does anything in user-space, but rather lock implementations built around futex can avoid making system calls in the uncontended or low-contention case.
(e.g. spin in user-space a few times in case the lock becomes available.)
This is what the futex(7) man page is describing. But I find it a bit weird to call it a "futex operation" if you don't actually make the system call. I guess it's operating on memory that kernel code might look at on behalf of other threads that are waiting, so the semantics required when modifying those memory locations in user space depend on futex.

What was the `FUTEX_REQUEUE` bug?

I assign the Linux FUTEX(2) man page as required reading in operating systems classes, as a warning to students not to get complacent when designing synchronization primitives.
The futex() system call is the API that Linux provides to allow user-level thread synchronization primitives to sleep and wake up when necessary. The man page describes the 5 different operations that can be invoked using the futex() system call. The two fundamental operations are FUTEX_WAIT (which a thread uses to put itself to sleep when it tries to acquire a synchronization object and someone is already holding it), and FUTEX_WAKE (which a thread uses to wake up any waiting threads when it releases a synchronization object.)
The next three operations are where the fun starts. The man page description goes like this:
FUTEX_FD (present up to and including Linux 2.6.25)
[...]
Because it was inherently racy, FUTEX_FD has been removed
from Linux 2.6.26 onward.
The paper "Futexes are Tricky" by Ulrich Dreper, 2004 describes that race condition (it's a potential missed wakeup). But there's more:
FUTEX_REQUEUE (since Linux 2.5.70)
This operation was introduced in order to avoid a
"thundering herd" effect when FUTEX_WAKE is used and all
processes woken up need to acquire another futex. [...]
FUTEX_CMP_REQUEUE (since Linux 2.6.7)
There was a race in the intended use of FUTEX_REQUEUE, so
FUTEX_CMP_REQUEUE was introduced. [...]
What was the race in FUTEX_REQUEUE? Ulrich's paper doesn't even mention it (the paper describes a function futex_requeue() that is implemented using FUTEX_CMP_REQUEUE, but not the FUTEX_REQUEUE operation).
It looks like the race condition is due to the implementation of mutexes in glibc and their disparity with futexes. FUTEX_CMP_REQUEUE seems to be needed to support the more complicated glibc mutexes:
They are much more complex because they support many more features, such as testing for deadlock, and recursive locking. Due to this, they have an internal lock protecting the extra state. This extra lock means that they cannot use the FUTEX_REQUEUE multiplex function due to a possible race.
Source: http://locklessinc.com/articles/futex_cheat_sheet/
The old requeue operation takes two addresses, addr1 and addr2: first it wakes waiters on addr1, then requeues the remaining waiters onto addr2.
The new requeue operation does all of that only after it has verified that *addr1 == user_provided_val.
To find out the possible race condition, consider the following two threads:
Thread 1 (waiter):

    wait(cv, mutex):
        lock(&cv.lock);
        cv.mutex_ref = &mutex;
        unlock(&mutex);
        let futexval = ++cv.futex;
        unlock(&cv.lock);
        FUTEX_WAIT(&cv.futex, futexval);      // --- (1)
        lock(&mutex);

Thread 2 (broadcaster):

    broadcast(cv):
        lock(&cv.lock);
        let futexval = cv.futex;
        unlock(&cv.lock);
        FUTEX_CMP_REQUEUE(&cv.futex,          // --- (2)
                          1 /*wake*/,
                          ALL /*queue*/,
                          &cv.mutex_ref.lock,
                          futexval);
Both syscalls (1) and (2) are executed without holding cv.lock, but they are required to take effect in the same total order as the lock-protected sections, so that a signal does not appear to be missed by the user.
Therefore, in order to detect a wait operation being reordered after the actual wake, the futexval read under the lock is passed to the kernel at (2).
Similarly, we pass futexval to the FUTEX_WAIT call at (1). This design is explicitly stated in the futex man page:
When executing a futex operation that requests to block a thread,
the kernel will block only if the futex word has the value that
the calling thread supplied (as one of the arguments of the
futex() call) as the expected value of the futex word. The
loading of the futex word's value, the comparison of that value
with the expected value, and the actual blocking will happen
atomically and will be totally ordered with respect to concurrent
operations performed by other threads on the same futex word.
Thus, the futex word is used to connect the synchronization in
user space with the implementation of blocking by the kernel.
Analogously to an atomic compare-and-exchange operation that
potentially changes shared memory, blocking via a futex is an
atomic compare-and-block operation.
IMHO, the reason for making call (2) outside the lock is mainly performance. Calling wake while holding the lock would lead to a "hurry up and wait" situation, where the waiter wakes up but is unable to acquire the lock.
It's also worth mentioning that the above answer is based on a historical version of the pthread implementation. The latest version of pthread_cond has removed the use of REQUEUE (check this patch for details).

Where does the wait queue for threads lie in POSIX pthread mutex lock and unlock?

I was going through the concurrency section from REMZI, and while reading the mutex section I got confused about this:
To avoid busy waiting, mutex implementations employ a park()/unpark() mechanism (on Sun OS) which puts a waiting thread in a queue along with its thread ID. Later, during pthread_mutex_unlock(), one thread is removed from the queue so that it can be picked by the scheduler. Similarly, a futex-based implementation (the mutex implementation on Linux) uses the same mechanism.
It is still unclear to me where the queue lies. Is it in the address space of the running process or somewhere inside the kernel?
Another doubt I had is regarding condition variables. Do pthread_cond_wait() and pthread_cond_signal() use normal signals and wait methods, or do they use some variant of them?
Doubt 1: But it is still unclear to me where the queue actually lies. Is it in the address space of the running process or somewhere inside the kernel?
Every mutex has an associated data structure maintained in the kernel address space; on Linux it is the futex. That data structure has an associated wait queue where threads from different processes can queue up and wait to be woken up; see the futex_wait kernel function.
Doubt 2: Another doubt I had is regarding condition variables: do pthread_cond_wait() and pthread_cond_signal() use normal signal and wait methods, or do they use some variant of them?
Modern Linux does not use signals for condition variable signaling. See NPTL: The New Implementation of Threads for Linux for more details:
The addition of the Fast Userspace Locking (futex) into the kernel enabled a complete reimplementation of mutexes and other synchronization mechanisms without resorting to interthread signaling. The futex, in turn, was made possible by the introduction of preemptive scheduling to the kernel.

How is thread synchronization implemented, at the assembly language level?

While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
In practice, these tend to be implemented with CAS and LL/SC.
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, Wikipedia gives you an example which trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, a CAS is not needed for crafting a spinlock - but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the #LOCK signal, which ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted half-way. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not quite be the same - it could be preempted just between the two movs.)
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock (a sketch follows after this list):
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff,
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning, so that the core you are running on can do something useful in the meantime
you should really give up on spinning and yield control to other threads after a while
etc...
this is still user mode - if you are writing a kernel, you might have some other tools that you can use as well (since you are the one that schedules threads and handles/enables/disables interrupts).
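As a rough illustration of those points, here is a sketch of a test-and-test-and-set (TTAS) spinlock in C. The x86-specific _mm_pause() intrinsic and the spin threshold are assumptions; a production lock would typically add proper exponential backoff and tune when to stop spinning.

    #include <stdatomic.h>
    #include <sched.h>
    #include <immintrin.h>      /* _mm_pause(); x86-specific assumption */

    typedef struct { atomic_int locked; } ttas_lock;   /* 0 = free, 1 = held */

    static void ttas_acquire(ttas_lock *l)
    {
        for (;;) {
            /* "Test" part: spin on a plain load so we stay in our own
               cache line instead of hammering the bus with atomic RMWs. */
            int spins = 0;
            while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
                _mm_pause();                    /* hint: we are busy-waiting */
                if (++spins > 1000)
                    sched_yield();              /* give up the time slice */
            }
            /* "Test-and-set" part: the actual atomic RMW. */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
        }
    }

    static void ttas_release(ttas_lock *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }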
The x86 architecture has long had an instruction called xchg which will exchange the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that could be applied to a single instruction to make that instruction atomic. Before there were multiprocessor systems, all this really did was prevent an interrupt from being delivered in the middle of a locked instruction. (xchg was implicitly locked.)
This article has some sample code using xchg to implement a spinlock
http://en.wikipedia.org/wiki/Spinlock
When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize all of the memory subsystems, including the L1 cache on all of the processors. About this time, new research into locking and lockless algorithms showed that atomic CompareAndSet was a more flexible primitive to have, so more modern CPUs have that as an instruction.
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
I like to think of thread synchronization as bottom-up, where the processor and operating system provide constructs that serve as primitives for more sophisticated mechanisms.
At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation (a compare-and-swap retry-loop sketch follows after this list) ... you also have other processor constructs that allow you to disable and enable interrupts (however, they are considered dangerous ... under certain circumstances you have no other option but to use them)
The operating system provides the ability to context-switch between tasks, which can happen every time a thread has used up its time slice ... or it can happen due to other reasons (I will come to that)
then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think of a spinning mutex) ... these will continuously wait for the condition to become true, checking for that condition atomically
then these spinning mutexes can use the functionality provided by the OS (context switching and system calls like yield, which relinquishes control to another thread), giving us blocking mutexes
these constructs are further utilized by higher-level constructs like condition variables (which can keep track of how many threads are waiting on the mutex and which thread to allow in first when the mutex becomes available)
These constructs can then be further used to provide more sophisticated synchronization constructs ... for example: semaphores, etc.
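As a sketch of how the processor-level primitive is used to build something higher-level, here is the classic compare-and-swap retry loop in C; the function name and the "atomic maximum" operation are purely illustrative.

    #include <stdatomic.h>

    /* Classic CAS retry loop: load the old value, compute the desired new
       value, and try to install it; if another thread changed the word in
       the meantime, the CAS fails and we retry with the freshly loaded value. */
    void atomic_store_max(atomic_int *word, int candidate)
    {
        int old = atomic_load(word);
        while (old < candidate &&
               !atomic_compare_exchange_weak(word, &old, candidate)) {
            /* on failure, `old` now holds the current value; loop and retry */
        }
    }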

Resources