Starvation with upgrade_lock


I am trying to use Boost's upgrade_lock (using this example), but I run into a starvation issue.
I am actually using the code from this post, but I wanted an up-to-date discussion. I run 400 threads after the WorkerKiller. I run into the exact same problem as anoneironaut, the author of the mentioned post.
I have seen the proposal from Howard Hinnant, but I don't really want to include more external code (moreover, I cannot get his to compile as of now), and a comment posted 6 months later states that "Boost uses a fair implementation now" (Dec 3 '12).
The Boost 1.55 documentation states that:
Note the lack of reader-writer priority policies in shared_mutex. This is
due to an algorithm credited to Alexander Terekhov which lets the OS decide
which thread is the next to get the lock without caring whether a unique lock or
shared lock is being sought. This results in a complete lack of reader or writer
starvation. It is simply fair.
And the algorithm credited to Alexander Terekhov is the one that Howard Hinnant talks about, so I would expect the 1.55 boost implementation to behave like in Howard Hinnant's answer, which is not the case. It behaves exactly like in the question.
Why does my WorkerKiller suffer from starvation?
UPDATE: It was observed with this code on:
Debian x64, Boost 1.55 (both the Debian version and one compiled from sources), with both clang++ and g++
Ubuntu x64, Boost 1.54, with both clang++ (3.4-1ubuntu1) and g++ (4.8.1-10ubuntu9)

This is a subtle one. The difference involves the concepts of shared and upgradable ownerships, and their implementations in Boost.
Let's first get the concepts of shared ownership and upgradable ownership sorted out.
For a SharedLockable, a thread must decide beforehand whether it wants to change the object (requiring exclusive ownership) or only read from it (shared ownership suffices). If a thread with shared ownership decides it wants to change the object, it first must release its shared lock on the object and then construct a new, exclusive lock. In between these two steps, the thread holds no locks at all on the object. Attempting to construct an exclusive lock from a thread that already holds a shared lock will deadlock, as the exclusive lock constructor will block until all shared locks have been released.
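The release-then-reacquire pattern described above can be sketched as follows. This is only an illustration: it uses std::shared_mutex from C++17 instead of Boost, and the Cache type and insert_if_absent function are hypothetical names, not part of the code from the question.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical cache guarded by a plain reader/writer mutex (no upgrade support).
struct Cache {
    std::shared_mutex mtx;
    std::vector<int> values;
};

// Adds `v` unless it is already present; returns true if this call inserted it.
bool insert_if_absent(Cache& c, int v) {
    {
        std::shared_lock<std::shared_mutex> read(c.mtx);  // shared ownership
        for (int x : c.values)
            if (x == v) return false;
    }  // shared lock released here; no lock is held from this point on

    std::unique_lock<std::shared_mutex> write(c.mtx);     // exclusive ownership
    // Another thread may have inserted `v` in the unlocked window, so re-check.
    for (int x : c.values)
        if (x == v) return false;
    c.values.push_back(v);
    return true;
}
```

The second scan is the price of the unlocked window: whatever was learned under the shared lock may be stale by the time the exclusive lock is granted.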
UpgradeLockable overcomes this limitation by allowing a thread to upgrade a shared lock to an exclusive lock without releasing it. That is, the thread keeps an active lock on the mutex at all times, prohibiting other threads from obtaining an exclusive lock in the meantime. Besides that, UpgradeLockable still allows all operations from SharedLockable; the former concept is a superset of the latter. The question you linked to is only concerned with the SharedLockable concept.
Neither concept, as specified by Boost, requires an implementation to be fair. However, the shared_mutex, which is Boost's minimal implementation of a SharedLockable, does give the fairness guarantees quoted in your question. Note that this is an additional guarantee beyond what the concept actually requires.
Unfortunately, the minimal implementation for upgradable ownership, the upgrade_mutex, does not give this additional guarantee. It still implements the shared ownership concept as a requirement for upgradable ownership, but since fairness is not required for a conforming implementation, it does not provide it.
As pointed out by Howard in the comments, Terekhov's algorithm can be trivially adjusted to work with upgradable locks as well; it's just that the Boost implementation does not support this currently.

Related

Perl ithreads :shared variables - multiprocessor kernel threads - visibility

perlthrtut excerpt:
Note that a shared variable guarantees that if two or more threads try
to modify it at the same time, the internal state of the variable will
not become corrupted. However, there are no guarantees beyond this, as
explained in the next section.
Working on Linux supporting multiprocessor kernel threads.
Is there a guarantee that all threads will see the updated shared variable value ?
Consulting the perlthrtut doc as stated above there is no such guarantee.
Now the question: What can be done programmatically to guarantee that?

You ask
Is there a guarantee that all threads will see the updated shared variable value ?
Yes. :shared is that guarantee. The value will be safely and consistently and freshly updated.
The problem is simply that, without other synchronization, you don't know the order of these updates.
Consulting the perlthrtut doc as stated above there is no such guarantee.
You didn't read far enough. :)
The very next section in perlthrtut explains the kind of pitfalls you do face with perl threads: data races, which is to say, application logic races concerning shared data. Again, the shared data will be consistent and fresh and immune to corruption from (more-or-less) atomic perl opcodes. However, the high-level perl operations you perform on that shared data are not guaranteed to be atomic. $shared_var++, for instance, might be more than one atomic operation.
(If I may hazard a guess, you are perhaps thinking too much about other languages' lower level threading interfaces with their cache inconsistencies, torn words, reordered instructions, and lions and tigers and bears. Perl's model takes care of those low-level concerns for you.)

Using :shared on a variable causes all threads to reference it in the same physical memory address, so it doesn't matter which processor/core/hyper-thread they happen to be executing in. The perlthrtut talk of guarantees is in reference to race conditions, and in short, that you need to take into account that shared variables can be modified by any thread at any time. If this is a problem you'll need to make use of synchronization functions (e.g. lock() and cond_wait()) to control access.

You seem to be confused as to what :shared does. It makes it so a variable is shared by all threads.
A variable is indeed guaranteed to have the value it has, no matter which thread accesses it. It's a tautology, so nothing can be done to programmatically guarantee that.

How safe is pthread robust mutex?

I'm thinking of using POSIX robust mutexes to protect a shared resource among different processes (on Linux). However, there are some doubts about safety in different scenarios. I have the following questions:
Are robust mutexes implemented in the kernel or in user code?
If the latter, what would happen if a process happens to crash while in a call to pthread_mutex_lock or pthread_mutex_unlock and while a shared pthread_mutex data structure is getting updated?
I understand that if a process locked the mutex and dies, a thread in another process will be awakened and get EOWNERDEAD. However, what would happen if the process dies (in the unlikely case) exactly when the pthread_mutex data structure (in shared memory) is being updated? Will the mutex get corrupted in that case? What would happen to another process that has mapped the same shared memory if it were to call a pthread_mutex function?
Can the mutex still be recovered in this case?
This question applies to any pthread object with PTHREAD_PROCESS_SHARED attribute. Is it safe to call functions like pthread_mutex_lock, pthread_mutex_unlock, pthread_cond_signal, etc. concurrently on the same object from different processes? Are they thread-safe across different processes?

From the man-page for pthreads:
Over time, two threading implementations have been provided by the
GNU C library on Linux:
LinuxThreads
This is the original Pthreads implementation. Since glibc
2.4, this implementation is no longer supported.
NPTL (Native POSIX Threads Library)
This is the modern Pthreads implementation. By comparison
with LinuxThreads, NPTL provides closer conformance to the
requirements of the POSIX.1 specification and better
performance when creating large numbers of threads. NPTL is
available since glibc 2.3.2, and requires features that are
present in the Linux 2.6 kernel.
Both of these are so-called 1:1 implementations, meaning that each
thread maps to a kernel scheduling entity. Both threading
implementations employ the Linux clone(2) system call. In NPTL,
thread synchronization primitives (mutexes, thread joining, and so
on) are implemented using the Linux futex(2) system call.
And from man futex(7):
In its bare form, a futex is an aligned integer which is touched only
by atomic assembler instructions. Processes can share this integer
using mmap(2), via shared memory segments or because they share
memory space, in which case the application is commonly called
multithreaded.
An additional remark found here:
(In case you’re wondering how they work in shared memory: Futexes are keyed upon their physical address)
Summarizing, Linux decided to implement pthreads on top of their "native" futex primitive, which indeed lives in the user process address space. For shared synchronization primitives, this would be shared memory and the other processes will still be able to see it, after one process dies.
What happens in case of process termination? Ingo Molnar wrote an article called Robust Futexes about just that. The relevant quote:
Robust Futexes
There is one race possible though: since adding to and removing from the
list is done after the futex is acquired by glibc, there is a few
instructions window for the thread (or process) to die there, leaving
the futex hung. To protect against this possibility, userspace (glibc)
also maintains a simple per-thread 'list_op_pending' field, to allow the
kernel to clean up if the thread dies after acquiring the lock, but just
before it could have added itself to the list. Glibc sets this
list_op_pending field before it tries to acquire the futex, and clears
it after the list-add (or list-remove) has finished
Summary
Where this leaves you for other platforms, is open-ended. Suffice it to say that the Linux implementation, at least, has taken great care to meet our common-sense expectation of robustness.
Seeing that other operating systems usually resort to Kernel-based synchronization primitives in the first place, it makes sense to me to assume their implementations would be even more naturally robust.
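For reference, here is a minimal sketch of how the EOWNERDEAD case from the question is usually handled, assuming Linux with glibc. The helper names init_robust, lock_with_recovery and die_holding_lock are made up for this illustration, and a thread that exits without unlocking stands in for a crashed process. For real cross-process use the mutex would additionally have to live in shared memory and be created with PTHREAD_PROCESS_SHARED, which is left out here.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Initialise `m` as a robust mutex (process-shared setup omitted, see above).
int init_robust(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Lock `m`. If the previous owner died while holding it, mark the mutex
// consistent again. Returns 0 on a normal lock, EOWNERDEAD after a recovery.
int lock_with_recovery(pthread_mutex_t* m) {
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        // The protected data may be half-updated; repair it here, then:
        int rc2 = pthread_mutex_consistent(m);
        if (rc2 != 0) return rc2;
    }
    return rc;
}

// Stand-in for a crashing owner: a thread that exits while holding the lock.
void* die_holding_lock(void* arg) {
    pthread_mutex_lock(static_cast<pthread_mutex_t*>(arg));
    return nullptr;  // deliberately no unlock
}
```

The caller still owns the lock after an EOWNERDEAD return, so it must unlock as usual. If it unlocks without calling pthread_mutex_consistent, later lock attempts fail with ENOTRECOVERABLE.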

According to the documentation at http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_getrobust.html, on a fully POSIX-compliant OS a shared mutex with the robust flag will behave in the way you'd expect.
The problem obviously is that not all OS are fully POSIX compliant. Not even those claiming to be. Process shared mutexes and in particular robust ones are among those finer points that are often not part of an OS's implementation of POSIX.

Are "benaphores" worth implementing on modern OS's?

Back in my days as a BeOS programmer, I read this article by Benoit Schillings, describing how to create a "benaphore": a method of using an atomic variable to enforce a critical section that avoids the need to acquire/release a mutex in the common (no-contention) case.
I thought that was rather clever, and it seems like you could do the same trick on any platform that supports atomic-increment/decrement.
On the other hand, this looks like something that could just as easily be included in the standard mutex implementation itself... in which case implementing this logic in my program would be redundant and wouldn't provide any benefit.
Does anyone know if modern locking APIs (e.g. pthread_mutex_lock()/pthread_mutex_unlock()) use this trick internally? And if not, why not?

What your article describes is in common use today. Most often it's called a "Critical Section", and it consists of an interlocked variable, a bunch of flags and an internal synchronization object (a Mutex, if I remember correctly). Generally, in scenarios with little contention, the Critical Section executes entirely in user mode, without involving the kernel synchronization object. This guarantees fast execution. When contention is high, the kernel object is used for waiting, which releases the time slice and is conducive to faster turnaround.
Generally, there is very little sense in implementing synchronization primitives in this day and age. Operating systems come with a big variety of such objects, and they are optimized and tested in a significantly wider range of scenarios than a single programmer can imagine. It literally takes years to invent, implement and test a good synchronization mechanism. That's not to say that there is no value in trying :)
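To make the idea concrete, here is a rough C++ sketch of a benaphore. It is not the code from the article: the Semaphore class is a small stand-in built from a mutex and a condition variable (on BeOS it would be a kernel semaphore), and count_with_benaphore is a hypothetical helper that only exercises the lock.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore, standing in for the kernel object.
class Semaphore {
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
public:
    void acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }
    void release() {
        { std::lock_guard<std::mutex> lk(m_); ++count_; }
        cv_.notify_one();
    }
};

class Benaphore {
    std::atomic<int> contenders_{0};
    Semaphore sem_;
public:
    void lock() {
        // Previous value > 0: somebody is already inside, so wait our turn.
        if (contenders_.fetch_add(1) > 0) sem_.acquire();
    }
    void unlock() {
        // Previous value > 1: at least one waiter exists, so hand over the lock.
        if (contenders_.fetch_sub(1) > 1) sem_.release();
    }
};

// Hypothetical helper: several threads bump a plain int under the benaphore.
int count_with_benaphore(int threads, int iterations) {
    Benaphore lock;
    int counter = 0;  // not atomic; protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

In the uncontended case lock() and unlock() each cost one atomic operation and never touch the semaphore, which is the whole point of the trick.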

Java's AbstractQueuedSynchronizer (and its sibling AbstractQueuedLongSynchronizer) works similarly, or at least it could be implemented similarly. These types form the basis for several concurrency primitives in the Java library, such as ReentrantLock and FutureTask.
It works by way of using an atomic integer to represent state. A lock may define the value 0 as unlocked, and 1 as locked. Any thread wishing to acquire the lock attempts to change the lock state from 0 to 1 via an atomic compare-and-set operation; if the attempt fails, the current state is not 0, which means that the lock is owned by some other thread.
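The 0-to-1 compare-and-set step can be sketched like this. It is written in C++ with std::atomic rather than Java, TryLock is a made-up name, and none of the queueing that AbstractQueuedSynchronizer adds on top is shown.

```cpp
#include <atomic>
#include <cassert>

class TryLock {
    std::atomic<int> state_{0};  // 0 = unlocked, 1 = locked
public:
    // Succeeds only if the state was 0 and this call changed it to 1.
    bool try_acquire() {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }
    void release() { state_.store(0, std::memory_order_release); }
};
```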
AbstractQueuedSynchronizer also facilitates waiting on locks and notification of conditions by maintaining CLH queues, which are lock-free linked lists representing the line of threads waiting either to acquire the lock or to receive notification via a condition. Such notification moves one or all of the threads waiting on the condition to the head of the queue of those waiting to acquire the related lock.
Most of this machinery can be implemented in terms of an atomic integer representing the state as well as a couple of atomic pointers for each waiting queue. The actual scheduling of which threads will contend to inspect and change the state variable (via, say, AbstractQueuedSynchronizer#tryAcquire(int)) is outside the scope of such a library and falls to the host system's scheduler.

How to prevent writer starvation in a read write lock in pthreads

I have some questions regarding read-write locks in POSIX Pthreads on a *nix system, say Linux for example.
I wish to know what the default bias for a read-write lock is, i.e. does it prefer reads over writes or vice versa? Does it provide some API to change this default behaviour?
Does POSIX pthreads provide some API so that we could change the pthread_rwlock_t to prevent writer starvation? From what I have read (please correct me if I am wrong), the default implementation is biased towards reader threads, and so writer threads can face starvation.
I have read the sample implementation of an rw lock from the book Programming with POSIX Threads by David Butenhof.
I wish to know how POSIX pthreads handles starvation of writer threads. Is there some API with which we could set attributes of the read-write lock that would prevent writer starvation (I have never heard of one)? Or does the user have to handle this problem?
If you think that the answer is implementation-defined, then please give me an example of how it's done in Linux, because that's what I am looking for.
Please note that I just want solutions for a *nix system. Don't think that I am rude, but posting some Windows-specific code is useless for me.
Thank you all for your help and patience :)

This does indeed depend on the implementation, so since you have asked about Linux specifically, my comments refer to the current NPTL implementation of pthreads, which is used in modern glibc.
There are two related, but separate, issues here. Firstly, there is this situation:
There are read locks currently held, and writers waiting. A new thread tries to take a read lock.
The default action here is to allow the reader to proceed - effectively "jumping the queue" over the writer. You can, however, override this. If you use the pthread_rwlockattr_setkind_np() function to set the PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP flag on the attr that you pass to pthread_rwlock_init(), then your rwlock will block the reader in the above situation.
The second situation is:
The last holder releases the lock, and there are both readers and writers waiting.
In this situation, NPTL will always wake up a writer in preference to a reader.
Taken together, the above means that if you use the PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP flag, your writers shouldn't be starved (of course, now a continuous stream of writers can starve the readers. C'est la vie). You can confirm all this by checking the sources (it's all very readable [1]) in pthread_rwlock_rdlock.c and pthread_rwlock_unlock.c.
Note that there is also a PTHREAD_RWLOCK_PREFER_WRITER_NP, but it appears not to have the right effect - quite possibly a bug (or possibly not - see comment by jilles below).
[1] ...or at least it was, back when I wrote this answer in 2010. The latest versions of NPTL are considerably more complex and I haven't re-done the analysis.
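A minimal sketch of setting that flag, assuming glibc (the _np suffix marks the call as non-portable). The wrapper name init_writer_preferring_rwlock is invented for this example.

```cpp
#include <cassert>
#include <pthread.h>

// Initialise `lock` so that new readers queue behind waiting writers.
// Returns 0 on success or the error code of the first failing call.
int init_writer_preferring_rwlock(pthread_rwlock_t* lock) {
    pthread_rwlockattr_t attr;
    int rc = pthread_rwlockattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_rwlockattr_setkind_np(
        &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    if (rc == 0) rc = pthread_rwlock_init(lock, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}
```

As the NONRECURSIVE part of the name suggests, a thread must not take the read lock recursively on such a lock: the second rdlock could block behind a waiting writer that is itself waiting for the first read lock to be released.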

How can I write a lock free structure?

In my multithreaded application I see heavy lock contention, preventing good scalability across multiple cores. I have decided to use lock free programming to solve this.
How can I write a lock free structure?

Short answer is:
You cannot.
Long answer is:
If you are asking this question, you probably do not know enough to be able to create a lock free structure. Creating lock free structures is extremely hard, and only experts in this field can do it. Instead of writing your own, search for an existing implementation. When you find it, check how widely it is used, how well it is documented, whether it is well proven, and what its limitations are - even some lock free structures other people have published are broken.
If you do not find a lock free structure corresponding to the structure you are currently using, rather adapt the algorithm so that you can use some existing one.
If you still insist on creating your own lock free structure, be sure to:
start with something very simple
understand memory model of your target platform (including read/write reordering constraints, what operations are atomic)
study a lot about problems other people encountered when implementing lock free structures
do not just guess if it will work, prove it
heavily test the result
More reading:
Lock free and wait free algorithms at Wikipedia
Herb Sutter: Lock-Free Code: A False Sense of Security

Use a library such as Intel's Threading Building Blocks; it contains quite a few lock-free structures and algorithms. I really wouldn't recommend attempting to write lock-free code yourself; it's extremely error-prone and hard to get right.

Writing thread-safe lock free code is hard; but this article from Herb Sutter will get you started.

As sblundy pointed out, if all objects are immutable, read-only, you don't need to worry about locking, however, this means you may have to copy objects a lot. Copying usually involves malloc and malloc uses locking to synchronize memory allocations across threads, so immutable objects may buy you less than you think (malloc itself scales rather badly and malloc is slow; if you do a lot of malloc in a performance critical section, don't expect good performance).
When you only need to update simple variables (e.g. 32 or 64 bit ints or pointers), perform simple addition or subtraction operations on them, or just swap the values of two variables, most platforms offer "atomic operations" for that (GCC offers these as well). Atomic is not the same as thread-safe. However, atomic makes sure that if one thread writes a 64 bit value to a memory location, for example, and another thread reads from it, the reading one gets either the value before the write operation or the value after it, but never a broken value in between (e.g. one where the first 32 bits are already the new value and the last 32 bits are still the old one; this can happen if you don't use atomic access on such a variable).
However, if you have a C struct with 3 values that you want to update, even if you update all three with atomic operations, these are three independent operations, so a reader might see the struct with one value already updated and two not yet updated. Here you will need a lock if you must ensure that the reader sees either all old or all new values in the struct.
One way to make locks scale a lot better is using R/W locks. In many cases, updates to data are rather infrequent (write operations), but accessing the data is very frequent (reading the data), think of collections (hashtables, trees). In that case R/W locks will buy you a huge performance gain, as many threads can hold a read-lock at the same time (they won't block each other) and only if one thread wants a write lock, all other threads are blocked for the time the update is performed.
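As a sketch of the last two points, the following C++17 snippet guards a three-field struct with a reader/writer lock so that readers always get a consistent snapshot. The Stats and GuardedStats types are invented for this illustration, and std::shared_mutex is just one possible R/W lock.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Three values that must always be observed together.
struct Stats {
    long count;
    long sum;
    long max;
};

class GuardedStats {
    mutable std::shared_mutex mtx_;
    Stats s_{0, 0, 0};
public:
    // Writer: exclusive lock, all three fields change as one unit.
    void record(long v) {
        std::unique_lock<std::shared_mutex> w(mtx_);
        ++s_.count;
        s_.sum += v;
        if (v > s_.max) s_.max = v;
    }
    // Reader: shared lock, many readers may copy the struct concurrently.
    Stats snapshot() const {
        std::shared_lock<std::shared_mutex> r(mtx_);
        return s_;
    }
};
```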
The best way to avoid thread-issues is to not share any data across threads. If every thread deals most of the time with data no other thread has access to, you won't need locking for that data at all (also no atomic operations). So try to share as little data as possible between threads. Then you only need a fast way to move data between threads if you really have to (ITC, Inter Thread Communication). Depending on your operating system, platform and programming language (unfortunately you told us neither of these), various powerful methods for ITC might exist.
And finally, another trick to work with shared data but without any locking is to make sure threads don't access the same parts of the shared data. E.g. if two threads share an array, but one will only ever access even indexes and the other one only odd indexes, you need no locking. Or if both share the same memory block and one only uses the upper half of it, the other one only the lower one, you need no locking. That said, this is not guaranteed to lead to good performance, especially not on multi-core CPUs. Write operations of one thread to this shared data (running on one core) might invalidate the cached copy held for another thread (running on another core), an effect known as false sharing, and this cache traffic is often the bottleneck for multithreaded applications running on modern multi-core CPUs.

As my professor (Nir Shavit from "The Art of Multiprocessor Programming") told the class: Please don't. The main reason is testability: you can't test synchronization code. You can run simulations, you can even stress test. But that is a rough approximation at best. What you really need is a mathematical correctness proof. And very few people are capable of understanding them, let alone writing them.
So, as others have said: use existing libraries. Joe Duffy's blog surveys some techniques (section 28). The first one you should try is tree-splitting: break the work into smaller tasks and combine the results.

Immutability is one approach to avoid locking. See Eric Lippert's discussion and implementation of things like immutable stacks and queues.

In re. Suma's answer, Maurice Herlihy shows in The Art of Multiprocessor Programming that actually anything can be written without locks (see chapter 6). IIRC, this essentially involves splitting tasks into processing node elements (like a function closure) and enqueuing each one. Threads will calculate the state by following all nodes from the latest cached one. Obviously this could, in the worst case, result in sequential performance, but it does have important lockless properties, preventing scenarios where threads could get scheduled out for long periods of time while they are holding locks. Herlihy also achieves theoretical wait-free performance, meaning that one thread will not end up waiting forever to win the atomic enqueue (this is a lot of complicated code).
A multi-threaded queue / stack is surprisingly hard (check the ABA problem). Other things may be very simple. Become accustomed to while(true) { atomicCAS until I swapped it } blocks; they are incredibly powerful. An intuition for what's correct with CAS can help development, though you should use good testing and maybe more powerful tools (maybe SKETCH, upcoming MIT Kendo, or spin?) to check correctness if you can reduce it to a simple structure.
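For readers who have not seen such a retry block, here is about the smallest useful one in C++: an "atomic maximum", which std::atomic<int> did not offer as a single operation at the time of writing. The names atomic_store_max and max_of_parallel_updates are made up for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Raise `target` to at least `value`, retrying until the CAS wins or
// another thread has already stored something at least as large.
void atomic_store_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    // On failure compare_exchange_weak reloads `seen` with the current value,
    // so every retry compares against the freshest state.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
}

// Hypothetical driver: thread t proposes the value t * 10.
int max_of_parallel_updates(int threads) {
    std::atomic<int> best{0};
    std::vector<std::thread> pool;
    for (int t = 1; t <= threads; ++t)
        pool.emplace_back([&best, t] { atomic_store_max(best, t * 10); });
    for (auto& th : pool) th.join();
    return best.load();
}
```

Because only an integer is swapped, the ABA problem mentioned above does not matter here; it bites once the CAS is done on pointers to nodes that can be freed and reused.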
Please post more about your problem. It's difficult to give a good answer without details.
Edit: immutability is nice but its applicability is limited, if I'm understanding it right. It doesn't really overcome write-after-read hazards; consider two threads executing "mem = NewNode(mem)"; they could both read mem, then both write it, which is not correct for a classic increment function. Also, it's probably slow due to heap allocation (which has to be synchronized across threads).

Immutability would have this effect. Changes to the object result in a new object. Lisp works this way under the covers.
Item 13 of Effective Java explains this technique.

Cliff Click has done some major research on lock free data structures by utilizing finite state machines and has also posted a lot of implementations for Java. You can find his papers, slides and implementations at his blog: http://blogs.azulsystems.com/cliff/

Use an existing implementation, as this area of work is the realm of domain experts and PhDs (if you want it done right!)
For example there is a library of code here:
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/

Most lock-free algorithms or structures start with some atomic operation, i.e. a change to some memory location that once begun by a thread will be completed before any other thread can perform that same operation. Do you have such an operation in your environment?
See here for the canonical paper on this subject.
Also try this Wikipedia article for further ideas and links.

The basic principle for lock-free synchronisation is this:
whenever you are reading the structure, you follow the read with a test to see if the structure was mutated since you started the read, and retry until you succeed in reading without something else coming along and mutating while you are doing so;
whenever you are mutating the structure, you arrange your algorithm and data so that there is a single atomic step which, if taken, causes the entire change to become visible to the other threads, and arrange things so that none of the change is visible unless that step is taken. You use whatever lockfree atomic mechanism exists on your platform for that step (e.g. compare-and-set, load-linked+store-conditional, etc.). In that step you must then check to see if any other thread has mutated the object since the mutation operation began, commit if it has not and start over if it has.
There are plenty of examples of lock-free structures on the web; without knowing more about what you are implementing and on what platform it is hard to be more specific.

If you are writing your own lock-free data structures for a multi-core CPU, do not forget about memory barriers! Also, consider looking into Software Transactional Memory techniques.

Well, it depends on the kind of structure, but you have to make the structure so that it carefully and silently detects and handles possible conflicts.
I doubt you can make one that is 100% lock-free, but again, it depends on what kind of structure you need to build.
You might also need to shard the structure so that multiple threads work on individual items, and then later on synchronize/recombine.

As mentioned, it really depends on what type of structure you're talking about. For instance, you can write a limited lock-free queue, but not one that allows random access.

Reduce or eliminate shared mutable state.

In Java, utilize the java.util.concurrent packages in JDK 5+ instead of writing your own. As was mentioned above, this is really a field for experts, and unless you have a spare year or two, rolling your own isn't an option.

Can you clarify what you mean by structure?
Right now, I am assuming you mean the overall architecture. You can accomplish it by not sharing memory between processes, and by using an actor model for your processes.

Take a look at my link ConcurrentLinkedHashMap for an example of how to write a lock-free data structure. It is not based on any academic papers and doesn't require years of research as others imply. It simply takes careful engineering.
My implementation does use a ConcurrentHashMap, which is a lock-per-bucket algorithm, but it does not rely on that implementation detail. It could easily be replaced with Cliff Click's lock-free implementation. One idea I borrowed from Cliff, but used much more explicitly, is to model all CAS operations with a state machine. This greatly simplifies the model, as you'll see that I have pseudo-locks via the 'ing states. Another trick is to allow laziness and resolve as needed. You'll see this often with backtracking or letting other threads "help" to clean up. In my case, I decided to allow dead nodes on the list to be evicted when they reach the head, rather than deal with the complexity of removing them from the middle of the list. I may change that, but I didn't entirely trust my backtracking algorithm and wanted to put off a major change like adopting a 3-node locking approach.
The book "The Art of Multiprocessor Programming" is a great primer. Overall, though, I'd recommend avoiding lock-free designs in the application code. Often times it is simply overkill where other, less error prone, techniques are more suitable.

If you see lock contention, I would first try to use more granular locks on your data structures rather than completely lock-free algorithms.
For example, I currently work on a multithreaded application that has a custom messaging system (a list of queues, one for each thread; each queue contains messages for its thread to process) to pass information between threads. There is a global lock on this structure. In my case, I don't need speed so much, so it doesn't really matter. But if this lock became a problem, it could be replaced by individual locks on each queue, for example. Then adding/removing an element to/from a specific queue would not affect other queues. There would still be a global lock for adding a new queue and such, but it would not be contended as much.
Even a single multi-producer/consumer queue can be written with granular locking on each element, instead of having a global lock. This may also eliminate contention.

If you read several implementations and papers regarding the subject, you'll notice there is the following common theme:
1) Shared state objects are Lisp/Clojure style immutable: that is, all write operations are implemented by copying the existing state into a new object, making modifications to the new object and then trying to update the shared state (obtained from an aligned pointer that can be updated with the CAS primitive). In other words, you NEVER EVER modify an existing object that might be read by more than the current thread. Immutability can be optimized using Copy-on-Write semantics for big, complex objects, but that's another tree of nuts
2) You clearly specify which transitions between current and next state are valid: then validating that the algorithm is correct becomes orders of magnitude easier
3) Handle discarded references in hazard pointer lists per thread. After the reference objects are safe, reuse if possible
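Point 1 can be sketched in C++ roughly as follows. Config, raise_limit and limit_after_parallel_updates are hypothetical names for this illustration. Note the big simplification: replaced snapshots are intentionally leaked, because freeing them safely is exactly the reclamation problem from point 3 (hazard pointers), which this sketch does not attempt.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Immutable snapshot: never modified after it has been published.
struct Config {
    int version;
    int limit;
};

// Copy the current snapshot, modify the copy, then try to publish it with CAS.
void raise_limit(std::atomic<const Config*>& current, int delta) {
    const Config* old_cfg = current.load(std::memory_order_acquire);
    for (;;) {
        const Config* next =
            new Config{old_cfg->version + 1, old_cfg->limit + delta};
        if (current.compare_exchange_weak(old_cfg, next,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire))
            return;   // published; old_cfg is leaked on purpose (see above)
        delete next;  // never published, so nobody else can be reading it
        // old_cfg now holds the freshest snapshot; rebuild from it and retry.
    }
}

// Hypothetical driver: every thread raises the limit by one.
int limit_after_parallel_updates(int threads) {
    std::atomic<const Config*> current{new Config{0, 100}};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&current] { raise_limit(current, 1); });
    for (auto& th : pool) th.join();
    return current.load()->limit;
}
```

Since no snapshot is ever freed, an address is never reused and the pointer CAS cannot suffer from ABA here; a real implementation that reclaims memory has to deal with that.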
See another related post of mine where some code implemented with semaphores and mutexes is (partially) reimplemented in a lock-free style:
Mutual exclusion and semaphores
