LIghtweight mutex - multithreading

The citation comes from http://preshing.com/20111124/always-use-a-lightweight-mutex/
The Windows Critical Section is what we call a lightweight mutex. It’s
optimized for the case when there are no other threads competing for
the lock. To demonstrate using a simple example, here’s a single
thread which locks and unlocks a Windows Mutex exactly one million
times
Does it mean that lightweight mutex is just a smart heavy (kernel) mutex?
By "smart" I mean that only when mutex is free it doesn't make a syscall?

In summary, yes: on Windows, critical sections and mutexes are similar, but critical sections are lighter weight because they avoid a system call when there is no contention.
Windows has two different mutual-exclusion primitives: critical sections and mutexes. They serve similar functions, but critical sections are significantly faster than mutexes.
Mutexes always result in a system call down to the kernel, which requires a processor ring-mode switch and entails a significant amount of overhead. (The user-mode thread raises an exception, which is then caught by the kernel thread running in ring 0; the user-mode thread remains halted until execution returns back out of kernel mode.) Although they are slower, mutexes are much more powerful and flexible. They can be shared across processes, a waiting thread can specify a time-out period, and a waiting thread can also determine whether the thread that owned the mutex terminated or if the mutex was deleted.
Critical sections are much lighter-weight objects, and therefore much faster than mutexes. In the most common case of uncontended acquires, critical sections are incredibly fast because they just atomically increment a value in user-mode and return immediately. (Internally, the InterlockedCompareExchange API is used to "acquire" the critical section.)
Critical sections only switch to kernel mode when there is contention over the acquisition. In such cases, the critical section actually allocates a semaphore internally, storing it in a dedicated field in the critical section's structure (which is originally unallocated). So basically, in cases of contention, you see performance degrade to that of a mutex because you effectively are using a mutex. The user-mode thread is suspended and kernel-mode is entered to wait on either the semaphore or an event.
Critical sections in Windows are somewhat akin to "futexes" in Linux. A futex is a "Fast User-space muTEX" that, like a critical section, only switches to kernel-mode when arbitration is required.
The performance benefit of a critical section comes with serious caveats, including the inability specify a wait time-out period, the inability of a thread to determine if the owning thread was terminated before it released the critical section, the inability to determine if the critical section was deleted, and the inability to use critical sections across processes (critical sections are process-local objects).
As such, you should keep the following guidelines in mind when deciding between critical sections and mutexes:
If you're going to use a critical section, the operation must be non-blocking. If an operation can block (such as a socket), then you shouldn't use a critical section because it does not allow the waiting thread to specify a wait time-out, which can lead to deadlock.
If it's possible that the thread might throw an exception or be terminated unexpectedly, then use a mutex. With a critical section, there is no way for waiting threads to be notified that the original thread was terminated or that the critical section was deleted.
Critical sections make the most sense when the protected operation has a relatively short duration. These are the cases where avoiding the overhead of a mutex is most important, and also the cases where you're least likely to run into problems with a critical section.
You'll find lots of benchmarks online showing the relative performance difference between critical sections and mutexes, including in the article you link, which says critical sections are 25 times faster than mutexes. I have a comment here in my class library from an article I read a long time ago that says, "On a Pentium II 300 MHz, the round-trip for a critical section (assuming no contention, so no context switching required) takes 0.29 µs. With a mutex, it takes 5.3 µs." The consensus seems to be somewhere between 15–30% faster when you can avoid a kernel-mode transition. I didn't bother to benchmark it myself. :-)
Further reading:
Critical Section Objects on MSDN:
A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process. Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction). Like a mutex object, a critical section object can be owned by only one thread at a time, which makes it useful for protecting a shared resource from simultaneous access. Unlike a mutex object, there is no way to tell whether a critical section has been abandoned.
[ … ]
A thread uses the EnterCriticalSection or TryEnterCriticalSection function to request ownership of a critical section. It uses the LeaveCriticalSection function to release ownership of a critical section. If the critical section object is currently owned by another thread, EnterCriticalSection waits indefinitely for ownership. In contrast, when a mutex object is used for mutual exclusion, the wait functions accept a specified time-out interval.
INFO: Critical Sections Versus Mutexes, also on MSDN:
Critical sections and mutexes provide synchronization that is very similar, except that critical sections can be used only by the threads of a single process. There are two areas to consider when choosing which method to use within a single process:
Speed. The Synchronization overview says the following about critical sections:
... critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization.
Critical sections use a processor-specific test and set instruction to determine mutual exclusion.
Deadlock. The Synchronization overview says the following about mutexes:
If a thread terminates without releasing its ownership of a mutex object, the mutex is considered to be abandoned. A waiting thread can acquire ownership of an abandoned mutex, but the wait function's return value indicates that the mutex is abandoned.
WaitForSingleObject() will return WAIT_ABANDONED for a mutex that has been abandoned. However, the resource that the mutex is protecting is left in an unknown state.
There is no way to tell whether a critical section has been abandoned.
The article you link to in the question also links to this post on Larry Osterman's blog, which gives some more interesting details about the implementation.

Related

What is the difference between an atomic operation and critical section?, which of the two prevents context switching?

A programming language or the processor already has "default" atomic operations and we can use them as far as I understand.
https://en.wikipedia.org/wiki/Linearizability
What is the difference between an atomic operation and critical section?
Atomic operations are instructions that guarantee atomic accesses/updates of shared (small) variables. This generally include operations like incrementation, decrementation, addition, subtraction, compare and swap (aka. CAS), exchange, logical operations (and, or, xor) as well as basic loads/stores. If you want to perform a non trivial operation that is not supported by the target platform (or one involving large variables), then you cannot use one atomic operation. This means either multiple of them is required or another mechanism should be used instead (eg. critical section, transactional memory). Note that using multiple atomic operations often makes things significantly more complex (see ABA problem). On mainstream CPUs, atomic operations are generally implemented by locking cache lines of shared caches (eg. L3) so that only one thread can access to it at a time.
Critical sections are meant to protect one or multiple instructions from being executed by multiple threads at the same time. They are generally protected using a system mutex. The thread entering the critical section lock the associated mutex and unlock it when leaving the section. System mutexes cause the thread entering a critical section to wait if the associated mutex is already locked. This is generally done using a context switch (the thread is descheduled and rescheduled later).
Critical section can be efficient when the lock is very rarely already taken by another thread. Context switches can significantly impact the performance. Atomic operation are not great either when many thread perform atomic operations on it. Contention effects can make atomic accesses significantly slower (eg. spin locks). This is especially true for atomic CAS operations. Some platform can execute atomic operation very quickly (eg. GPUs) since they have dedicated units to execute atomic operation efficiently.
which of the two prevents context switching?
None of the two prevent context switching. Modern operating systems can perform a context switching at any time. That being said, critical section generally cause context switches: a thread trying to enter into a critical section already locked by another thread will typically enter in sleeping mode and be awaken by the OS scheduler when the other thread will unlock the section. Atomic operations do not impact the scheduling of the system (at least not on mainstream platforms).
Note that the above text is also true for processes.
Speaking only to the nomenclature question:
"Atomic" means "cannot be broken down into smaller parts." In programming, an operation performed by one thread is "atomic" (as seen from other threads) if there is no possible way for the other threads to see the operation in a half-way done state. From the point of view of other threads, it's as if the entire operation happened in a single instant. It either has already happened, or it hasn't happened yet. There is no in between.
As Jérôme Richard points out, modern computer hardware provides atomic operations on simple variables. We can use those to make more complex operations seem "atomic" from the point of view of other threads either by using the hardware atomics in tricky non-blocking algorithms, or by using the hardware atomics in the implementation of mutex locks.
"Critical section" comes from a time before multi-threading. In operating system kernel code, and in "bare metal" application code, there has always been a limited form of concurrency between the main body of code and the interrupt handlers. "Critical section," back in the day, referred to a routine in the main body of code that was protected from interference by the interrupt handlers by executing it with interrupts disabled.
Systems programmers today still use "critical section" with the original meaning, but now we also sometimes say it to talk about a routine that is executed by a thread while the thread has a mutex locked.
IMO, "critical section" encourages a somewhat less useful way of thinking about mutex locks though because it's never the code that needs protection from interference. It's always about protecting the integrity of shared data. Sometimes a programmer who worries about defining The critical section can lose sight of the fact that there may be multiple routines in the program that all access the same shared data.
IMO, this is one place where an object-oriented style of programming shines, because it's easier to keep track of what needs to be protected if it is encapsulated in private members of some object and, can only be accessed through the object's thread-safe, public methods.

Terminology question: mutex lock, spin lock, sleepable lock

All over StackOverflow and the net I see folks to distinguish mutexes and spinlocks as like mutex is a mutual exclusion lock providing acquire() and release() functions and if the lock is taken, then acquire() will allow a process to be preempted.
Nevertheless, A. Silberschatz in his Operating System Concepts says in the section 6.5:
... The simplest of these tools is the mutex lock. (In fact, the term mutex is short for mutual exclusion.) We use the mutex lock to protect critical sections and thus prevent race conditions. That is, a process must acquire the lock before entering a critical section; it releases the lock when it exits the critical section. The acquire() function acquires the lock, and the release() function releases the lock.
and then he describes a spinlock, though adding a bit later
The type of mutex lock we have been describing is also called a spinlock because the process “spins” while waiting for the lock to become available.
so as spinlock is just a type of mutex lock as opposed to sleepable locks allowing a process to be preempted. That is, spinlocks and sleepable locks are all mutexes: locks by means of acquire() and release() functions.
I see totally logical to define mutex locks in the way Silberschatz did (though a bit implicitly).
What approach would you agree with?
The type of mutex lock we have been describing is also called a spinlock because the process “spins” while waiting for the lock to become available.
Maybe you're misreading the book (that is, "The type of mutex lock we have been describing" might not refer to the exact passage you think it does), or the book is outdated. Modern terminology is quite clear in what a mutex is, but spinlocks get a bit muddy.
A mutex is a concurrency primitive that allows one agent at a time access to its resource, while the others have to wait in the meantime until it the exclusive access is released. How they wait is not specified and irrelevant, their process might go to sleep, get written to disk, spin in a loop, or perhaps you are using cooperative concurrency (also known as "asynchronous programming") and passing control to the event loop as your 'waiting operation'.
A spinlock does not have a clear definition. It might be used to refer to:
A synonym for mutex (this is in my opinion wrong, but it happens).
A specific mutex implementation that always waits in a busy loop.
Any sort of busy-waiting loop waiting for a resource. A semaphore, for example, might also get implemented using a 'spinlock'.
I would consider any use of the word to refer to a (part of a) specific implementation of a concurrency primitive that waits in a busy loop to be correct, if a more general term is not appropriate. That is, use mutex (or whatever primitive you desire) unless you specifically want to talk about a busy-waiting concurrency primitive.
The words that one author uses in one book or manual will not always have the same exact meaning in every book and every manual. The meanings of the words evolve over time, and it can happen fast when the words are names for new ideas.
Not every book was written at the same time. Not every author is the same age or had the same teachers. It's just something you'll have to get used to.
"Mutex" was a name for a new idea not so very long ago.
In one book, it might mean nothing more than a thing that keeps two or more threads from entering the same critical section at the same time. In another book, it might refer to a specific type of object in a certain operating system or library that is used for that same purpose.
A spinlock is a lock/mutex whose implementation relies mainly on a spinning loop.
More advanced locks/mutexes may have spinning parts in their implementation, however those often last for no more than a few microseconds or so.

Is a spinlock lock free?

I am a little bit confused about the two concepts.
definition of lock-free on wiki:
A non-blocking algorithm is lock-free if there is guaranteed
system-wide progress
definition of non-blocking:
an algorithm is called non-blocking if failure or suspension of any
thread cannot cause failure or suspension of another thread
I thought spinlock is lock-free, or at least non-blocking. But now I'm not sure. Because by definition, "spinlock is not lock-free" also makes sense to me. Like, if the thread holding the spinlock gets suspended, then it will cause suspension of other threads spinning outside. So, by definition, spinlock is not even non-blocking, let alone lock-free.
I'm so confused now. Can anyone explain it clearly?
Anything that can be called a lock (exclude other threads from a critical section until the current thread unlocks) is by definition not lock-free. And yes, spinlocks are a kind of lock.
If a thread sleeps while holding the lock, no other thread can acquire it and make forward progress, and spinlocks can't prevent this. The OS can de-schedule a thread whenever it wants, even if it's in the middle of a critical section.
Note that "lock-free" isn't the same thing as "wait-free", so a lock-free algorithm can still have stuff like cmpxchg retry loops, but as long as one thread succeeds every time, it's lock free.
A wait-free algorithm can't even have that, and at most has to wait for cache misses / hardware arbitration of contended atomic operations. Wikipedia's non-blocking algorithm article defines wait-free and lock-free in more detail.
I think you're mixing up two definitions of "blocking".
I think you're talking about a spin_trylock function that tries to acquire a spinlock, and returns with an error if it fails instead of spinning. So this is non-blocking in the same sense as non-blocking I/O: fail with an error instead of waiting for resource availability.
That doesn't mean any thread in the system is making forward progress on the thing protected by the spinlock. It just means your thread can go and do something else before trying again, instead of needing to use separate threads to do something in parallel with waiting to acquire a lock.
Spinning in an infinite loop counts as blocking / not-making-progress. For this definition, there's no difference between a pure spinlock and one that (with OS assistance) sleeps until another thread unlocks.
The definition of lock-free isn't concerned with wasting CPU time / power to make room for independent work to happen.
Somewhat related: acquiring an uncontended spinlock doesn't require a system call, which means it's a "light-weight" lock. Some lock implementations always use a (relatively slow) system call even in the uncontended case. See Jeff Preshing's Always Use a Lightweight Mutex article. Also read Jeff's other posts to learn more about lock-free programming, because they're excellent. So good in fact that the [lock-free] tag wiki links to them.

Difference between Mutex, Semaphore & Spin Locks

I am doing experiments with IPC, especially with Mutex, Semaphore and Spin Lock.
What I learnt is Mutex is used for Asynchronous Locking (with sleeping (as per theories I read on NET)) Mechanism, Semaphore are Synchronous Locking (with Signaling and Sleeping) Mechanism, and Spin Locks are Synchronous but Non-sleeping Mechanism.
Can anyone help me to clarify these stuff deeply?
And another doubt is about Mutex, when I wrote program with thread & mutex, while one thread is running another thread is not in Sleep state but it continuously tries to acquire the Lock. So Mutex is sleeping or Non-sleeping???
First, remember the goal of these 'synchronizing objects' :
These objects were designed to provide an efficient and coherent use of 'shared data' between more than 1 thread among 1 process or from different processes.
These objects can be 'acquired' or 'released'.
That is it!!! End of story!!!
Now, if it helps to you, let me put my grain of sand:
1) Critical Section= User object used for allowing the execution of just one active thread from many others within one process. The other non selected threads (# acquiring this object) are put to sleep.
[No interprocess capability, very primitive object].
2) Mutex Semaphore (aka Mutex)= Kernel object used for allowing the execution of just one active thread from many others, within one process or among different processes. The other non selected threads (# acquiring this object) are put to sleep. This object supports thread ownership, thread termination notification, recursion (multiple 'acquire' calls from same thread) and 'priority inversion avoidance'.
[Interprocess capability, very safe to use, a kind of 'high level' synchronization object].
3) Counting Semaphore (aka Semaphore)= Kernel object used for allowing the execution of a group of active threads from many others, within one process or among different processes. The other non selected threads (# acquiring this object) are put to sleep.
[Interprocess capability however not very safe to use because it lacks following 'mutex' attributes: thread termination notification, recursion?, 'priority inversion avoidance'?, etc].
4) And now, talking about 'spinlocks', first some definitions:
Critical Region= A region of memory shared by 2 or more processes.
Lock= A variable whose value allows or denies the entrance to a 'critical region'. (It could be implemented as a simple 'boolean flag').
Busy waiting= Continuosly testing of a variable until some value appears.
Finally:
Spin-lock (aka Spinlock)= A lock which uses busy waiting. (The acquiring of the lock is made by xchg or similar atomic operations).
[No thread sleeping, mostly used at kernel level only. Ineffcient for User level code].
As a last comment, I am not sure but I can bet you some big bucks that the above first 3 synchronizing objects (#1, #2 and #3) make use of this simple beast (#4) as part of their implementation.
Have a good day!.
References:
-Real-Time Concepts for Embedded Systems by Qing Li with Caroline Yao (CMP Books).
-Modern Operating Systems (3rd) by Andrew Tanenbaum (Pearson Education International).
-Programming Applications for Microsoft Windows (4th) by Jeffrey Richter (Microsoft Programming Series).
Here is a great explanation of the difference between semaphores and mutexes:
http://blog.feabhas.com/2009/09/mutex-vs-semaphores-–-part-1-semaphores/
The short answer has to do with ownership at least with binary semaphores but I suggest you read the entire article.
Mutex is the locking mechanism while the semaphore is the wait and signal mechanism.
Both have different applications.
There is a very good explanation given by the IISC professor.
Link for video

What is the difference between semaphore and mutex in implementation?

I read that mutex and binary semaphore are different in only one aspect, in the case of mutex the locking thread has to unlock, but in semaphore the locking and unlocking thread can be different?
Which one is more efficient?
Assuming you know the basic differences between a sempahore and mutex :
For fast, simple synchronization, use a critical section.
To synchronize threads across process boundaries, use mutexes.
To synchronize access to limited resources, use a semaphore.
Apart from the fact that mutexes have an owner, the two objects may be optimized for different usage. Mutexes are designed to be held only for a short time; violating this can cause poor performance and unfair scheduling. For example, a running thread may be permitted to acquire a mutex, even though another thread is already blocked on it, creating a deadlock. Semaphores may provide more fairness, or fairness can be forced using several condition variables.

Resources