What was the `FUTEX_REQUEUE` bug? - linux

I assign the Linux FUTEX(2) man page as required reading in operating systems classes, as a warning to students not to get complacent when designing synchronization primitives.
The futex() system call is the API that Linux provides to let user-level thread-synchronization primitives sleep and wake up when necessary. The man page describes the five operations that can be invoked through futex(). The two fundamental ones are FUTEX_WAIT (which a thread uses to put itself to sleep when it tries to acquire a synchronization object that someone else is already holding) and FUTEX_WAKE (which a thread uses to wake any waiting threads when it releases a synchronization object).
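For concreteness, here is roughly what those two operations look like from C. This is a minimal sketch, assuming you invoke the raw system call yourself (glibc exposes no futex() wrapper); the futex word and its update protocol are entirely up to the caller:

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no futex() function, so invoke the syscall directly. */
static long futex(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void futex_sleep_if(uint32_t *word, uint32_t expected)
{
    /* FUTEX_WAIT blocks only if *word still equals `expected`;
       otherwise it returns immediately with errno == EAGAIN. */
    futex(word, FUTEX_WAIT, expected);
}

static void futex_wake_one(uint32_t *word)
{
    /* Wake at most one thread sleeping on this word. */
    futex(word, FUTEX_WAKE, 1);
}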
The next three operations are where the fun starts. The man page description goes like this:
FUTEX_FD (present up to and including Linux 2.6.25)
[...]
Because it was inherently racy, FUTEX_FD has been removed
from Linux 2.6.26 onward.
The paper "Futexes are Tricky" by Ulrich Dreper, 2004 describes that race condition (it's a potential missed wakeup). But there's more:
FUTEX_REQUEUE (since Linux 2.5.70)
This operation was introduced in order to avoid a
"thundering herd" effect when FUTEX_WAKE is used and all
processes woken up need to acquire another futex. [...]
FUTEX_CMP_REQUEUE (since Linux 2.6.7)
There was a race in the intended use of FUTEX_REQUEUE, so
FUTEX_CMP_REQUEUE was introduced. [...]
What was the race in FUTEX_REQUEUE? Ulrich's paper doesn't even mention it (the paper describes a function futex_requeue() that is implemented using FUTEX_CMP_REQUEUE, but not the FUTEX_REQUEUE operation).

It looks like the race condition is due to the implementation of mutexes in glibc and their disparity with futexes. FUTEX_CMP_REQUEUE seems to be needed to support the more complicated glibc mutexes:
They are much more complex because they support many more features, such as testing for deadlock, and recursive locking. Due to this, they have an internal lock protecting the extra state. This extra lock means that they cannot use the FUTEX_REQUEUE multiplex function due to a possible race.
Source: http://locklessinc.com/articles/futex_cheat_sheet/

The old requeue operation takes two addresses, addr1 and addr2: it wakes up to a given number of waiters on addr1, then moves the remaining waiters onto the wait queue of addr2.
The new requeue operation does all of that only after verifying that *addr1 == user_provided_val.
To see where the race could arise, consider the following two threads:
wait(cv, mutex):
    lock(&cv.lock);
    cv.mutex_ref = &mutex;
    unlock(&mutex);
    let futexval = ++cv.futex;
    unlock(&cv.lock);
    FUTEX_WAIT(&cv.futex, futexval);     // --- (1)
    lock(&mutex);

broadcast(cv):
    lock(&cv.lock);
    let futexval = cv.futex;
    unlock(&cv.lock);
    FUTEX_CMP_REQUEUE(&cv.futex,         // --- (2)
                      1   /* wake */,
                      ALL /* requeue */,
                      &cv.mutex_ref.lock,
                      futexval);
Both syscalls (1) and (2) are executed without holding cv.lock, but they must still take effect in the same total order that the lock established; otherwise a signal could appear to go missing from the user's point of view.
Therefore, in order to detect a wait that was reordered after the actual wake, the futexval read while holding the lock is passed to the kernel at (2).
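To make the reordering concrete, here is one interleaving that the value check guards against. This is a sketch built from the reasoning above (not necessarily the exact historical glibc failure); B is a broadcaster, W is a late waiter, and cv.futex starts at n:

B: lock(&cv.lock); futexval = cv.futex;   // reads n
B: unlock(&cv.lock);
W: lock(&cv.lock); ++cv.futex;            // cv.futex is now n+1
W: unlock(&cv.lock);
W: FUTEX_WAIT(&cv.futex, n+1);            // blocks: the value matches
B: FUTEX_REQUEUE(&cv.futex, ...);         // old op: blindly moves W onto the
                                          //   mutex queue, even though W's wait
                                          //   is ordered *after* the broadcast
B: FUTEX_CMP_REQUEUE(&cv.futex, ..., n);  // new op: kernel sees n+1 != n, fails
                                          //   with EAGAIN, and user space
                                          //   re-reads the futex and retries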
Similarly, we pass futexval to the FUTEX_WAIT call at (1). This design is explicitly stated in the futex man page:
When executing a futex operation that requests to block a thread,
the kernel will block only if the futex word has the value that
the calling thread supplied (as one of the arguments of the
futex() call) as the expected value of the futex word. The
loading of the futex word's value, the comparison of that value
with the expected value, and the actual blocking will happen
atomically and will be totally ordered with respect to concurrent
operations performed by other threads on the same futex word.
Thus, the futex word is used to connect the synchronization in
user space with the implementation of blocking by the kernel.
Analogously to an atomic compare-and-exchange operation that
potentially changes shared memory, blocking via a futex is an
atomic compare-and-block operation.
IMHO, the reason for calling (2) outside of the lock is mainly performance. Calling wake while holding the lock would lead to a "hurry up and wait" situation in which the waiter wakes up only to find that it cannot acquire the lock.
It's also worth mentioning that the above answer is based on a historical version of the pthread implementation; the latest version of pthread_cond has removed the use of REQUEUE (check this patch for details).

Related

Terminology question: mutex lock, spin lock, sleepable lock

All over Stack Overflow and the net I see people distinguish mutexes from spinlocks roughly like this: a mutex is a mutual-exclusion lock providing acquire() and release() functions, and if the lock is already taken, acquire() allows the process to be preempted.
Nevertheless, A. Silberschatz, in his Operating System Concepts, says in section 6.5:
... The simplest of these tools is the mutex lock. (In fact, the term mutex is short for mutual exclusion.) We use the mutex lock to protect critical sections and thus prevent race conditions. That is, a process must acquire the lock before entering a critical section; it releases the lock when it exits the critical section. The acquire() function acquires the lock, and the release() function releases the lock.
and then he describes a spinlock, though adding a bit later
The type of mutex lock we have been describing is also called a spinlock because the process “spins” while waiting for the lock to become available.
So a spinlock is just one type of mutex lock, as opposed to sleepable locks that allow a process to be preempted. That is, spinlocks and sleepable locks are all mutexes: locks exposed through acquire() and release() functions.
It seems totally logical to me to define mutex locks the way Silberschatz did (though a bit implicitly).
What approach would you agree with?
The type of mutex lock we have been describing is also called a spinlock because the process “spins” while waiting for the lock to become available.
Maybe you're misreading the book (that is, "The type of mutex lock we have been describing" might not refer to the exact passage you think it does), or the book is outdated. Modern terminology is quite clear in what a mutex is, but spinlocks get a bit muddy.
A mutex is a concurrency primitive that allows one agent at a time to access its resource, while the others have to wait until the exclusive access is released. How they wait is not specified and is irrelevant: their process might go to sleep, get written to disk, spin in a loop, or, if you are using cooperative concurrency (also known as "asynchronous programming"), pass control to the event loop as its 'waiting operation'.
A spinlock does not have a clear definition. It might be used to refer to:
A synonym for mutex (this is in my opinion wrong, but it happens).
A specific mutex implementation that always waits in a busy loop.
Any sort of busy-waiting loop waiting for a resource. A semaphore, for example, might also get implemented using a 'spinlock'.
I would consider any use of the word to refer to a (part of a) specific implementation of a concurrency primitive that waits in a busy loop to be correct, if a more general term is not appropriate. That is, use mutex (or whatever primitive you desire) unless you specifically want to talk about a busy-waiting concurrency primitive.
The words that one author uses in one book or manual will not always have the same exact meaning in every book and every manual. The meanings of the words evolve over time, and it can happen fast when the words are names for new ideas.
Not every book was written at the same time. Not every author is the same age or had the same teachers. It's just something you'll have to get used to.
"Mutex" was a name for a new idea not so very long ago.
In one book, it might mean nothing more than a thing that keeps two or more threads from entering the same critical section at the same time. In another book, it might refer to a specific type of object in a certain operating system or library that is used for that same purpose.
A spinlock is a lock/mutex whose implementation relies mainly on a spinning loop.
More advanced locks/mutexes may have spinning phases in their implementation; however, those typically last no more than a few microseconds.

Lightweight mutex

The citation comes from http://preshing.com/20111124/always-use-a-lightweight-mutex/
The Windows Critical Section is what we call a lightweight mutex. It’s
optimized for the case when there are no other threads competing for
the lock. To demonstrate using a simple example, here’s a single
thread which locks and unlocks a Windows Mutex exactly one million
times
Does it mean that a lightweight mutex is just a smart heavy (kernel) mutex?
By "smart" I mean one that skips the syscall when the mutex is free?
In summary, yes: on Windows, critical sections and mutexes are similar, but critical sections are lighter weight because they avoid a system call when there is no contention.
Windows has two different mutual-exclusion primitives: critical sections and mutexes. They serve similar functions, but critical sections are significantly faster than mutexes.
Mutexes always result in a system call down to the kernel, which requires a processor ring-mode switch and entails a significant amount of overhead: the user-mode thread traps into the kernel (via a software interrupt or fast system-call instruction) and remains halted until execution returns out of kernel mode. Although they are slower, mutexes are much more powerful and flexible. They can be shared across processes, a waiting thread can specify a time-out period, and a waiting thread can also determine whether the thread that owned the mutex terminated or whether the mutex was deleted.
Critical sections are much lighter-weight objects, and therefore much faster than mutexes. In the most common case of uncontended acquires, critical sections are incredibly fast because they just atomically increment a value in user-mode and return immediately. (Internally, the InterlockedCompareExchange API is used to "acquire" the critical section.)
Critical sections only switch to kernel mode when there is contention over the acquisition. In such cases, the critical section actually allocates a semaphore internally, storing it in a dedicated field in the critical section's structure (which is originally unallocated). So basically, in cases of contention, you see performance degrade to that of a mutex because you effectively are using a mutex. The user-mode thread is suspended and kernel-mode is entered to wait on either the semaphore or an event.
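The fast-path/slow-path split described above can be sketched in a few lines. This is a hypothetical illustration in C11 atomics, not the actual CRITICAL_SECTION code: kernel_wait()/kernel_wake() are made-up stand-ins for the lazily allocated semaphore or event (stubbed as no-ops here, which degrades the slow path to spinning):

#include <stdatomic.h>

typedef struct { atomic_int state; } light_mutex;  /* 0 free, 1 held, 2 contended */

/* Hypothetical stand-ins for the kernel object a real critical
   section allocates on first contention. */
static void kernel_wait(atomic_int *s) { (void)s; /* stub: sleep here */ }
static void kernel_wake(atomic_int *s) { (void)s; /* stub: wake one sleeper */ }

static void light_lock(light_mutex *m)
{
    int expected = 0;
    /* Fast path: one user-mode compare-exchange, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;
    /* Slow path: mark the lock contended, then sleep in the kernel. */
    while (atomic_exchange(&m->state, 2) != 0)
        kernel_wait(&m->state);
}

static void light_unlock(light_mutex *m)
{
    /* Enter the kernel only if some thread recorded contention. */
    if (atomic_exchange(&m->state, 0) == 2)
        kernel_wake(&m->state);
}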
Critical sections in Windows are somewhat akin to "futexes" in Linux. A futex is a "Fast User-space muTEX" that, like a critical section, only switches to kernel-mode when arbitration is required.
The performance benefit of a critical section comes with serious caveats, including the inability to specify a wait time-out period, the inability of a thread to determine whether the owning thread was terminated before it released the critical section, the inability to determine whether the critical section was deleted, and the inability to use critical sections across processes (critical sections are process-local objects).
As such, you should keep the following guidelines in mind when deciding between critical sections and mutexes:
If you're going to use a critical section, the operation must be non-blocking. If an operation can block (such as a socket), then you shouldn't use a critical section because it does not allow the waiting thread to specify a wait time-out, which can lead to deadlock.
If it's possible that the thread might throw an exception or be terminated unexpectedly, then use a mutex. With a critical section, there is no way for waiting threads to be notified that the original thread was terminated or that the critical section was deleted.
Critical sections make the most sense when the protected operation has a relatively short duration. These are the cases where avoiding the overhead of a mutex is most important, and also the cases where you're least likely to run into problems with a critical section.
You'll find lots of benchmarks online showing the relative performance difference between critical sections and mutexes, including in the article you link, which says critical sections are 25 times faster than mutexes. I have a comment here in my class library from an article I read a long time ago that says, "On a Pentium II 300 MHz, the round-trip for a critical section (assuming no contention, so no context switching required) takes 0.29 µs. With a mutex, it takes 5.3 µs." The consensus seems to be somewhere between 15 and 30 times faster when you can avoid a kernel-mode transition. I didn't bother to benchmark it myself. :-)
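If you would rather have current numbers than decades-old ones, the uncontended round-trip is easy to measure yourself. A rough sketch (single-threaded, so neither primitive ever contends):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    enum { N = 1000000 };
    CRITICAL_SECTION cs;
    HANDLE mtx = CreateMutex(NULL, FALSE, NULL);
    LARGE_INTEGER freq, t0, t1;

    InitializeCriticalSection(&cs);
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < N; i++) {
        EnterCriticalSection(&cs);   /* user-mode interlocked op only */
        LeaveCriticalSection(&cs);
    }
    QueryPerformanceCounter(&t1);
    printf("critical section: %.0f ns/round-trip\n",
           (double)(t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / N);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < N; i++) {
        WaitForSingleObject(mtx, INFINITE);   /* kernel transition */
        ReleaseMutex(mtx);
    }
    QueryPerformanceCounter(&t1);
    printf("mutex: %.0f ns/round-trip\n",
           (double)(t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / N);

    CloseHandle(mtx);
    DeleteCriticalSection(&cs);
    return 0;
}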
Further reading:
Critical Section Objects on MSDN:
A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process. Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction). Like a mutex object, a critical section object can be owned by only one thread at a time, which makes it useful for protecting a shared resource from simultaneous access. Unlike a mutex object, there is no way to tell whether a critical section has been abandoned.
[ … ]
A thread uses the EnterCriticalSection or TryEnterCriticalSection function to request ownership of a critical section. It uses the LeaveCriticalSection function to release ownership of a critical section. If the critical section object is currently owned by another thread, EnterCriticalSection waits indefinitely for ownership. In contrast, when a mutex object is used for mutual exclusion, the wait functions accept a specified time-out interval.
INFO: Critical Sections Versus Mutexes, also on MSDN:
Critical sections and mutexes provide synchronization that is very similar, except that critical sections can be used only by the threads of a single process. There are two areas to consider when choosing which method to use within a single process:
Speed. The Synchronization overview says the following about critical sections:
... critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization.
Critical sections use a processor-specific test and set instruction to determine mutual exclusion.
Deadlock. The Synchronization overview says the following about mutexes:
If a thread terminates without releasing its ownership of a mutex object, the mutex is considered to be abandoned. A waiting thread can acquire ownership of an abandoned mutex, but the wait function's return value indicates that the mutex is abandoned.
WaitForSingleObject() will return WAIT_ABANDONED for a mutex that has been abandoned. However, the resource that the mutex is protecting is left in an unknown state.
There is no way to tell whether a critical section has been abandoned.
The article you link to in the question also links to this post on Larry Osterman's blog, which gives some more interesting details about the implementation.

Where does the wait queue for threads lie in POSIX pthread mutex lock and unlock?

I was going through the concurrency section of Remzi's book (OSTEP), and while reading the mutex section I got confused about this:
To avoid busy waiting, mutex implementations employ a park()/unpark() mechanism (on Sun OS) that puts a waiting thread into a queue along with its thread ID. Later, during pthread_mutex_unlock(), one thread is removed from the queue so that it can be picked by the scheduler. Similarly, futex (the mutex implementation on Linux) uses the same kind of mechanism.
It is still unclear to me where the queue lies. Is it in the address space of the running process or somewhere inside the kernel?
Another doubt I had is regarding condition variables. Do pthread_cond_wait() and pthread_cond_signal() use normal signal and wait methods, or do they use some variant of them?
Doubt 1: It is still unclear to me where the queue actually lies. Is it in the address space of the running process or somewhere inside the kernel?
Every mutex has an associated data structure maintained in the kernel address space; in Linux it is a futex. That data structure has an associated wait queue, where threads from different processes can queue up and wait to be woken up; see the futex_wait kernel function.
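One visible consequence of this split, with the 32-bit futex word in user memory and the wait queue in the kernel, is that a pthread mutex placed in shared memory can synchronize threads of different processes. A sketch (error handling omitted):

#include <pthread.h>
#include <sys/mman.h>

/* Create a mutex in a MAP_SHARED page so that a parent and a
   fork()ed child can both lock it. The futex word lives in the
   shared page; the queue of sleeping threads is kernel-side state
   reached through that page's identity. */
pthread_mutex_t *make_shared_mutex(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}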
Doubt 2: Another doubt I had is regarding condition variables. Do pthread_cond_wait() and pthread_cond_signal() use normal signal and wait methods, or do they use some variant of them?
Modern Linux does not use signals for condition variable signaling. See NPTL: The New Implementation of Threads for Linux for more details:
The addition of the Fast Userspace Locking (futex) into the kernel enabled a complete reimplementation of mutexes and other synchronization mechanisms without resorting to interthread signaling. The futex, in turn, was made possible by the introduction of preemptive scheduling to the kernel.

performance of pthread_cond_broadcast when no one is waiting on condition

If I call pthread_cond_broadcast and no one is waiting on the condition, will pthread_cond_broadcast cause a context switch and/or a call into the kernel?
If not, can I rely on it being very fast (by fast I mean just running a small number of assembly instructions in the current process and then returning)?
There are no guarantees in POSIX, but since your question is tagged linux and nptl an answer in that context can be given.
If there are no waiters on the condition variable, then the nptl glibc code for pthread_cond_broadcast() just takes a low-level lock protecting the internals of the condition variable itself, tests a value then unlocks the low-level lock. The low-level lock itself uses a futex, which will only enter the kernel if there is contention on that lock.
This means that unless there is a lot of contention on the condition variable itself (ie. a large number of threads frequently calling pthread_cond_broadcast() / pthread_cond_signal() on the same condition variable) there will be no system call to the kernel, and the overhead will only be a few locked instructions.
The Open Group Base Specifications state:
The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no effect if there are no threads currently blocked on cond.
To get a measure of whether or not this takes "just running a small number of assembly instructions", you'll have to get out a run-time performance-analysis tool (e.g. IBM's Quantify) and run it against your code.
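A lighter-weight alternative to a full profiler is a plain timing loop. With no thread ever waiting on the condition variable, each broadcast should stay on the user-mode fast path described above (a sketch; running it under strace -c should show no futex(2) calls coming from the loop):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { N = 1000000 };
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        pthread_cond_broadcast(&cond);   /* no waiters: no syscall expected */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per no-waiter broadcast\n", ns / N);
    return 0;
}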

How is thread synchronization implemented, at the assembly language level?

While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
    mov edx, [myThreadId]
wait:
    cmp [lock], 0
    jne wait
    mov [lock], edx
    ; I wanted an exclusive lock but the above
    ; three instructions are not an atomic operation :(
In practice, these tend to be implemented with CAS and LL/SC.
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, Wikipedia gives you an example that trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, CAS is not needed for crafting a spinlock, but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the #LOCK signal, which ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted half-way. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not quite be the same: it could be preempted just between the two movs.)
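The same exchange-based lock written with C11 atomics rather than inline assembly might look like the sketch below; on x86, atomic_exchange on an int compiles down to the (implicitly locked) xchg discussed above:

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock;   /* 0 = free, 1 = held */

static void spin_lock(spinlock *s)
{
    /* Store 1 and fetch the previous value in one atomic step; if the
       previous value was already 1, another thread holds the lock. */
    while (atomic_exchange_explicit(&s->locked, 1, memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(spinlock *s)
{
    atomic_store_explicit(&s->locked, 0, memory_order_release);
}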
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock:
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff,
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning, so that the core you are running on can do something useful while you wait,
you should really give up on spinning and yield control to other threads after a while (see the sketch after this list),
etc...
This is still user mode. If you are writing a kernel, you might have some other tools you can use as well (since you are the one who schedules threads and handles/enables/disables interrupts).
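Putting the first three list items together, a user-mode TTAS lock with a pause hint and an eventual yield might look like this sketch (the backoff policy and cutoff are arbitrary, and _mm_pause() is x86-specific):

#include <immintrin.h>   /* _mm_pause(), x86 only */
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock;

static void ttas_acquire(ttas_lock *l)
{
    int spins = 0;
    for (;;) {
        /* "Test" with a plain load first so we spin on our local cache
           line instead of hammering the bus with atomic writes. */
        if (!atomic_load_explicit(&l->locked, memory_order_relaxed) &&
            !atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
        _mm_pause();              /* hint to the core: we are busy-waiting */
        if (++spins >= 1000) {    /* arbitrary cutoff */
            sched_yield();        /* give up the time slice */
            spins = 0;
        }
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}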
The x86 architecture has long had an instruction called xchg, which exchanges the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that can be applied to certain instructions to make them atomic. Before there were multiprocessor systems, all this really did was prevent an interrupt from being delivered in the middle of a locked instruction. (xchg with a memory operand is implicitly locked.)
This article has some sample code using xchg to implement a spinlock: http://en.wikipedia.org/wiki/Spinlock
When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize all of the memory subsystems, including the L1 cache on all of the processors. Around this time, new research into locking and lock-free algorithms showed that atomic compare-and-set was a more flexible primitive to have, so more modern CPUs have that as an instruction.
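The extra flexibility is that compare-and-set can publish an arbitrary transformation of a word, not merely swap it. The canonical retry loop in C11 (a generic sketch; atomic_compare_exchange_weak maps to lock cmpxchg on x86):

#include <stdatomic.h>

/* Lock-free increment-with-cap: a policy xchg alone cannot express. */
static int bounded_increment(atomic_int *counter, int cap)
{
    int old = atomic_load(counter);
    do {
        if (old >= cap)
            return old;            /* cap reached: give up without writing */
        /* On failure, compare_exchange reloads `old` with the current
           value and we re-evaluate the condition. */
    } while (!atomic_compare_exchange_weak(counter, &old, old + 1));
    return old + 1;
}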
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
I like to think of thread synchronization as built bottom-up, where the processor and operating system provide primitive constructs out of which the more sophisticated ones are made:
At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation ... you also have other processor constructs that allow you to disable and enable interrupts (however, these are considered dangerous ... under certain circumstances you have no option but to use them).
The operating system provides the ability to context-switch between tasks, which can happen every time a thread has used up its time slice ... or for other reasons (I will come to that).
Then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think of a spinning mutex) ... continuously waiting for the condition to become true and checking that condition atomically.
These spinning mutexes can then use the functionality provided by the OS (context switching and system calls like yield, which relinquishes control to another thread) to give us sleeping mutexes.
These constructs are further utilized by higher-level constructs like condition variables (which can keep track of how many threads are waiting for the mutex and which thread to allow first when the mutex becomes available).
These constructs can then be used to provide still more sophisticated synchronization constructs ... for example: semaphores, etc.
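As an example of that last step, here is the classic construction of a counting semaphore from a mutex plus a condition variable (a sketch for illustration; glibc's real sem_t is built directly on futexes instead):

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
    unsigned        count;
} semaphore;

static void semaphore_wait(semaphore *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)          /* loop guards against spurious wakeups */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

static void semaphore_post(semaphore *s)
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);   /* wake one waiter, if any */
    pthread_mutex_unlock(&s->lock);
}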
