Releasing multiple locks without causing priority inversion - multithreading

Short version: How do I release multiple locks from a single thread, without being preempted halfway through?
I have a program which is designed to run on an N-core machine. It consists of one main thread and N worker threads. Each thread (including the main thread) has a semaphore it can block on. Normally, each worker thread is blocked on decrementing its semaphore, and the main thread is running. Every now and then, though, the main thread should wake up the worker threads to do their thing for a certain amount of time, then block on its own semaphore waiting for them all to go back to sleep. Like so:
def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_semaphore = semaphore(0)
    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_semaphore[i].increment()  # wake up worker i
        for i in workers_to_wake:
            main_semaphore.decrement()       # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement()      # wait to be woken
        ...do some work...
        main_semaphore.increment()           # report done with step
All well and good. The problem is, one of the woken workers may end up preempting the main thread halfway through waking the workers: This can happen, for instance, when the Windows scheduler decides to boost that worker's priority. This doesn't lead to deadlock, but it is inefficient, because the remainder of the threads stay asleep until the preempting worker finishes its work. It's basically priority inversion, with the main thread waiting on one of the workers, and some of the worker threads waiting on the main thread.
I can probably figure out OS- and scheduler-specific hacks for this, such as disabling priority boosting under Windows, and fiddling about with thread priorities and processor affinities, but I'd like something cross-platform-ish and robust and clean. So: How can I wake up a bunch of threads atomically?

TL;DR
If you really have to get as much as you can out of your workers, just use an event semaphore, a control block and a barrier instead of your semaphores. Note, however, that this is a more fragile solution, so you need to balance any potential gains against that downside.
Context
First I need to summarize the broader context in our discussion...
You have a Windows graphical application. It has a desired frame rate and so you need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. This means you have very tight constraints on the start and execution times for each thread. In addition, your worker threads are not all identical, so you can't just use a single work queue.
The problem
Like any modern operating system, Windows has a variety of synchronization primitives. However, none of these directly provides a mechanism for notifying multiple primitives at once. Looking through other operating systems, I see a similar pattern; they all provide ways of waiting on multiple primitives, but none provide an atomic way of triggering them.
So what can we do instead? The problems you need to solve are:
Precisely timing the start of all required workers.
Prodding the workers that actually need to run in the next frame.
Options
The most obvious solution for issue 1 is just to use a single event semaphore, but you could also use a read/write lock (by acquiring the write lock after the workers have finished and getting the workers to use a read lock). All other options are no longer atomic and so will need further synchronization to force the threads to do what you want - like lossleader's suggestion for locks inside your semaphores.
But we want an optimal solution that reduces context switches as much as possible, given the tight time constraints on your application, so let's see if either of these can be used to solve problem 2... How can the main thread pick which workers should run if all we have is an event semaphore or read/write lock?
Well... A read/write lock is a great way for one thread to write some critical data to a control block and for many others to read from it. Why not just have a simple array of boolean flags (one for each worker thread) that your main thread updates each frame? Sadly you still need to stop execution of the workers until the timer pops. In short we're back at the semaphore and lock solution again.
However, owing to the nature of your application, you can make one more step. You can rely on the fact that you know your workers are not running outside of your time slicing and use an event semaphore as a crude form of lock instead.
A final optimization (if your environment supports them), is to use a barrier instead of the main semaphore. You know that all n threads need to be idle before you can continue, so just insist on it.
A solution
Applying the above, your pseudo-code would then look something like this:
def main_thread(n):
    main_event = event()
    for i = 1 to n:
        worker_scheduled[i] = False
        spawn_thread(worker_thread, i)
    main_barrier = barrier(n+1)
    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_scheduled[i] = True
        main_event.set()
        main_barrier.enter()   # wait for all workers
        main_event.reset()

def worker_thread(i):
    while True:
        main_event.wait()
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            ...do some work...
        main_barrier.enter()   # report finished for this frame
        main_event.reset()     # to catch the case that a worker is scheduled before the main thread
Since there is no explicit policing of the worker_scheduled array, this is a much more fragile solution.
I would therefore personally only use it if I had to squeeze every last ounce of processing out of my CPU, but it sounds like that is exactly what you are looking for.
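For concreteness, here is a minimal runnable translation of that pseudo-code into Python's threading module (the event maps to threading.Event, the barrier to threading.Barrier); the worker count, the sleep call and the random sample are illustrative stand-ins for the real work and foo(), and it inherits exactly the fragility described above:

import threading, random, time

N = 4
main_event = threading.Event()
main_barrier = threading.Barrier(N + 1)
worker_scheduled = [False] * N

def worker_thread(i):
    while True:
        main_event.wait()                # block until the frame starts
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            time.sleep(0.001)            # placeholder for the real work
        main_barrier.wait()              # report finished for this frame
        main_event.clear()               # also reset here, in case a worker runs before main

def main_thread(frames=100):
    for i in range(N):
        threading.Thread(target=worker_thread, args=(i,), daemon=True).start()
    for _ in range(frames):
        for i in random.sample(range(N), 2):   # stand-in for foo()
            worker_scheduled[i] = True
        main_event.set()                 # one call wakes every worker
        main_barrier.wait()              # wait for all N workers
        main_event.clear()

main_thread()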

An atomic wake-up is not possible when you use multiple synchronization objects (semaphores); the wake-up algorithm's complexity is inherently O(n). There are a few ways to work around this, though.
release all at once
I'm not sure whether Python has the necessary method (is your question Python-specific?), but in general, semaphores have operations whose argument specifies the number of increments/decrements. Thus, you just put all your threads on the same semaphore and wake them all together. A similar approach is to use a condition variable and notify all waiters.
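In Python, for example, this could look like the following sketch (note that Semaphore.release only accepts a count argument from Python 3.9 onwards; names here are illustrative):

import threading

n = 4
start = threading.Semaphore(0)

def worker():
    while True:
        start.acquire()   # every worker blocks on the same shared semaphore
        ...               # do this cycle's work

for _ in range(n):
    threading.Thread(target=worker, daemon=True).start()

# main thread, waking all n workers in a single call:
start.release(n)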
event loops
If you still want to be able to control each thread individually but like the one-to-many notification approach, try libraries for asynchronous I/O like libuv (and its Python counterpart). Here, you can create one single event that wakes all the threads at once, and also create an individual event for each thread, then wait on both (or more) event objects in an event loop in each thread.
Another library is pevents, which implements WaitForMultipleObjects on top of pthreads' condition variables.
delegate waking up
Another approach is to replace your O(n) algorithm with a tree-like algorithm (O(log n)) where each thread wakes up only a fixed number of other threads, but delegates them to wake up the others. In the edge case, the main thread can wake up only one other thread, which then wakes up everyone else or starts the chain reaction. This is useful if you want to reduce latency for the main thread at the expense of the wake-up latencies of the other threads.
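As a rough Python sketch of the delegation idea, assuming for simplicity that every worker participates in each cycle (binary fan-out; all names are illustrative):

import threading

N = 8
semaphores = [threading.Semaphore(0) for _ in range(N)]

def worker(i):
    while True:
        semaphores[i].acquire()              # wait to be woken
        for child in (2 * i + 1, 2 * i + 2):
            if child < N:
                semaphores[child].release()  # delegate further wake-ups first
        ...                                   # then do this worker's own work

for i in range(N):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

# main thread wakes only worker 0; the rest cascade in O(log n) steps
semaphores[0].release()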

Reader/Writer Lock
The solution I would normally use on POSIX systems for a one-to-many relationship is a reader/writer lock. It surprises me that they aren't completely universal, but most languages either implement a version or at least have a package available that implements them on whatever primitives exist; for example, Python's prwlock:
from prwlock import RWLock

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_lock = RWLock()
    while True:
        main_lock.acquire_write()
        ...do some work...
        workers_to_wake = foo()
        # The above acquire could be moved as low as here,
        # depending on how independent the above processing is.
        for i in workers_to_wake:
            worker_semaphore[i].increment()  # wake up worker i
        main_lock.release()

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement()      # wait to be woken
        main_lock.acquire_read()
        ...do some work...
        main_lock.release()                  # report done with step
Barriers
Barriers seem like Python's closest intended built-in mechanism to hold up all the threads until they are all alerted, but:
They are a pretty unusual solution, so they would make your code/experience harder to translate to other languages.
I wouldn't like to use them for this case, where the number of threads to wake keeps changing. Given that your n sounds small, I would be tempted to use a constant Barrier(n) and notify all threads to check whether they are running this cycle. But:
I would be concerned that using a barrier would backfire, since any thread being held up by something external will hold them all up, and even a scheduler with resource-dependency boosting might miss this relationship. Needing all n to reach the barrier could only make this worse.

Peter Brittain's solution, plus Anton's suggestion of a "tree-like wakeup", led me to another solution: chained wakeups. Basically, rather than the main thread doing all the wakeups, it only wakes up one thread; each thread is then responsible for waking up the next one. The elegant bit here is that there's only ever one suspended thread ready to run, so threads rarely end up switching cores. In fact, this works fine with strict processor affinities, even if one of the worker threads shares affinity with the main thread.
The other thing I did was to use an atomic counter that worker threads decrement before sleeping; that way, only the last one wakes the main thread, so there's also no chance of the main thread being woken several times just to do more semaphore waiting.
workers_to_wake = []
main_semaphore = semaphore(0)
num_woken_workers = atomic_integer()

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    while True:
        ...do some work...
        workers_to_wake = foo()
        num_woken_workers.atomic_set(len(workers_to_wake))  # set completion countdown
        one_to_wake = workers_to_wake.pop()
        worker_semaphore[one_to_wake].increment()           # wake the first worker
        main_semaphore.decrement()                          # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement()                     # wait to be woken
        if len(workers_to_wake) > 0:                        # more pending wakeups
            one_to_wake = workers_to_wake.pop()
            worker_semaphore[one_to_wake].increment()       # wake the next worker
        ...do some work...
        if num_woken_workers.atomic_decrement() == 0:       # see whether we're the last one
            main_semaphore.increment()                      # report all done with step
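Here is a runnable Python approximation of this pseudo-code. Since Python's standard library has no atomic integer, a small lock guards the shared countdown and list as a stand-in; the worker count and the hard-coded wake list (in place of foo()) are illustrative:

import threading

N = 4
worker_semaphore = [threading.Semaphore(0) for _ in range(N)]
main_semaphore = threading.Semaphore(0)
state_lock = threading.Lock()   # stands in for the atomic operations above
workers_to_wake = []
remaining = 0

def worker_thread(i):
    global remaining
    while True:
        worker_semaphore[i].acquire()          # wait to be woken
        with state_lock:                       # chained wakeup of the next worker
            next_one = workers_to_wake.pop() if workers_to_wake else None
        if next_one is not None:
            worker_semaphore[next_one].release()
        ...                                    # do some work
        with state_lock:
            remaining -= 1
            last = (remaining == 0)
        if last:
            main_semaphore.release()           # last worker reports completion

def main_thread():
    global remaining, workers_to_wake
    for i in range(N):
        threading.Thread(target=worker_thread, args=(i,), daemon=True).start()
    while True:
        with state_lock:
            workers_to_wake = [0, 2]           # stand-in for foo()
            remaining = len(workers_to_wake)
            first = workers_to_wake.pop()
        worker_semaphore[first].release()      # wake only the first worker
        main_semaphore.acquire()               # wait for the whole chain to finish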

Related

Java Thread Live Lock

I have an interesting problem related to Java thread live lock. Here it goes.
There are four global locks - L1,L2,L3,L4
There are four threads - T1, T2, T3, T4
T1 requires locks L1,L2,L3
T2 requires locks L2
T3 requires locks L3,L4
T4 requires locks L1,L2
So, the pattern of the problem is: any of the threads can run and acquire the locks in any order. If any thread detects that a lock it needs is not available, it releases all the locks it had previously acquired and waits for a fixed time before retrying. The cycle repeats, giving rise to a livelock condition.
So, to solve this problem, I have two solutions in mind
1) Let each thread wait for a random period of time before retrying.
OR,
2) Let each thread acquire all the locks in a particular order (even if a thread does not require all the locks).
I am not convinced that these are the only two options available to me. Please advise.
Have all the threads enter a single mutex-protected state machine (SM) whenever they acquire and release their set of locks. The threads should expose methods that return the set of locks they require to continue, and that signal/wait on a private semaphore. The SM should contain a bool for each lock and a 'Waiting' queue/array/vector/list/whatever container to store waiting threads.
If a thread enters the SM mutex to get locks and can immediately get its lock set, it can reset its bool set, exit the mutex and continue on.
If a thread enters the SM mutex and cannot immediately get its lock set, it should add itself to 'Waiting', exit the mutex and wait on its private semaphore.
If a thread enters the SM mutex to release its locks, it sets the lock bools to 'return' its locks and iterates 'Waiting' in an attempt to find a thread that can now run with the set of locks available. If it finds one, it resets the bools appropriately, removes the thread it found from 'Waiting' and signals the 'found' thread semaphore. It then exits the mutex.
You can twiddle with the algorithm used to match the available lock bools with waiting threads as you wish. Maybe you should release the thread that requires the largest set of matches, or perhaps you would like to 'rotate' the 'Waiting' container elements to reduce starvation. Up to you.
A solution like this requires no polling (with its performance-sapping CPU use and latency) and no continual acquire/release of multiple locks.
It's much easier to develop such a scheme with an OO design. The methods/member functions to signal/wait the semaphore and return the set of locks needed can usually be stuffed somewhere in the thread class inheritance chain.
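A minimal Python sketch of such a state machine, assuming a counted semaphore per waiting thread is acceptable; the first-fit matching policy here is just one of the choices discussed above, it wakes at most one waiter per release for simplicity, and all names are illustrative:

import threading

class LockManager:
    def __init__(self, lock_names):
        self.mutex = threading.Lock()
        self.free = {name: True for name in lock_names}  # one bool per lock
        self.waiting = []                                # (needed_locks, private_semaphore)

    def acquire(self, needed):
        with self.mutex:
            if all(self.free[n] for n in needed):
                for n in needed:
                    self.free[n] = False                 # got the whole set at once
                return
            sem = threading.Semaphore(0)
            self.waiting.append((set(needed), sem))
        sem.acquire()                                    # park until a releaser hands us our set

    def release(self, held):
        with self.mutex:
            for n in held:
                self.free[n] = True                      # 'return' our locks
            for idx, (needed, sem) in enumerate(self.waiting):
                if all(self.free[n] for n in needed):    # this waiter can now run
                    for n in needed:
                        self.free[n] = False
                    del self.waiting[idx]
                    sem.release()                        # signal its private semaphore
                    break

mgr = LockManager(["L1", "L2", "L3", "L4"])
# e.g. in T1:  mgr.acquire({"L1", "L2", "L3"}); ...; mgr.release({"L1", "L2", "L3"})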
Unless there is a good reason (performance-wise) not to do so, I would unify all the locks into one lock object. This is similar to your solution 2, only simpler in my opinion. And by the way, not only is this solution simpler and less bug-prone, it might also perform better than your solution 1.
Personally, I have never heard of Option 1, but I am by no means an expert on multithreading. After thinking about it, it sounds like it will work fine.
However, the standard way to deal with threads and resource locking is somewhat related to Option 2. To prevent deadlocks, resources always need to be acquired in the same order; if every thread locks the resources in that same order, you won't have any issues.
Go with 2a) Let each thread acquire all of the locks that it needs (NOT all of the locks) in a particular order; if a thread encounters a lock that isn't available then it releases all of its locks
As long as threads acquire their locks in the same order you can't have deadlock; however, you can still have starvation (a thread might run into a situation where it keeps releasing all of its locks without making forward progress). To ensure that progress is made you can assign priorities to threads (0 = lowest priority, MAX_INT = highest priority) - increase a thread's priority when it has to release its locks, and reduce it to 0 when it acquires all of its locks. Put your waiting threads in a queue, and don't start a lower-priority thread if it needs the same resources as a higher-priority thread - this way you guarantee that the higher-priority threads will eventually acquire all of their locks. Don't implement this thread queue unless you're actually having problems with thread starvation, though, because it's probably less efficient than just letting all of your threads run at once.
You can also simplify things by implementing omer schleifer's condense-all-locks-to-one solution; however, unless threads other than the four you've mentioned are contending for these resources (in which case you'll still need to lock the resources from the external threads), you can more efficiently implement this by removing all locks and putting your threads in a circular queue (so your threads just keep running in the same order).
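For illustration, here is a small Python sketch of the ordered-acquisition rule (lock names as in the question; everything else is illustrative):

import threading

locks = {name: threading.Lock() for name in ("L1", "L2", "L3", "L4")}

def acquire_ordered(names):
    ordered = sorted(names)        # one global order shared by every thread
    for n in ordered:
        locks[n].acquire()
    return ordered

def release_all(names):
    for n in names:
        locks[n].release()

# e.g. T1, which needs L1, L2 and L3:
held = acquire_ordered({"L1", "L2", "L3"})
# ...critical section...
release_all(held)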

Sleeping a PThread other than the one doing the calling

So I have a bunch of pthreads, where one is the "main" thread and determines whether a worker thread should be running or sleeping. But the POSIX definition for sleep says: "The sleep() function shall cause the calling thread to be suspended from execution..."
Obviously I could do something clumsy like have each worker thread check to see if a flag is set, but I'm looking for something a little better. I'm hoping I'm missing something obvious, because this is throwing a wrench in my plans.
How about having each pthread acquire a semaphore unit before dequeueing (or stealing) a work object, and releasing it after doing the work? There may be a little latency, sure, but the number of threads available to do work will match the number of units signaled to the semaphore. To reduce the number of available threads by N from your control thread, wait for and acquire N units, choking off N work threads. To start 'em again, signal N units.
Would this work for your system?
Rgds,
Martin
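A rough Python sketch of this throttling scheme (process() and the queue contents are placeholders, and Semaphore.release needs Python 3.9+ for a count argument):

import threading, queue

N_WORKERS = 8
work_queue = queue.Queue()
slots = threading.Semaphore(N_WORKERS)   # permits: how many workers may take work

def worker():
    while True:
        slots.acquire()                  # take a permit before dequeueing
        item = work_queue.get()
        process(item)                    # placeholder for the real work
        slots.release()                  # hand the permit back afterwards

for _ in range(N_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

def throttle(k):                         # control thread: choke off k workers
    for _ in range(k):
        slots.acquire()

def unthrottle(k):                       # let k workers run again
    slots.release(k)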

Advantages of using condition variables over mutex

I was wondering what is the performance benefit of using condition variables over mutex locks in pthreads.
What I found is : "Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met. This can be very resource consuming since the thread would be continuously busy in this activity. A condition variable is a way to achieve the same goal without polling." (https://computing.llnl.gov/tutorials/pthreads)
But it also seems that mutex calls are blocking (unlike spin-locks). Hence if a thread (T1) fails to get a lock because some other thread (T2) has the lock, T1 is put to sleep by the OS, and is woken up only when T2 releases the lock and the OS gives T1 the lock. The thread T1 does not really poll to get the lock. From this description, it seems that there is no performance benefit of using condition variables. In either case, there is no polling involved. The OS anyway provides the benefit that the condition-variable paradigm can provide.
Can you please explain what actually happens?
A condition variable allows a thread to be signaled when something of interest to that thread occurs.
By itself, a mutex doesn't do this.
If you just need mutual exclusion, then condition variables don't do anything for you. However, if you need to know when something happens, then condition variables can help.
For example, if you have a queue of items to work on, you'll have a mutex to ensure the queue's internals are consistent when accessed by the various producer and consumer threads. However, when the queue is empty, how will a consumer thread know when something is in there for it to work on? Without something like a condition variable it would need to poll the queue, taking and releasing the mutex on each poll (otherwise a producer thread could never put something on the queue).
Using a condition variable lets the consumer, on finding the queue empty, just wait on the condition variable indicating that the queue has had something put into it. No polling: that thread does nothing until a producer puts something in the queue and then signals the condition that the queue has a new item.
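In Python, the producer/consumer queue described above looks like this (the deque and the names are illustrative):

import threading
from collections import deque

items = deque()
lock = threading.Lock()
not_empty = threading.Condition(lock)   # condition variable sharing the queue's mutex

def produce(item):
    with not_empty:                     # acquires the mutex
        items.append(item)
        not_empty.notify()              # wake one waiting consumer; no polling involved

def consume():
    with not_empty:
        while not items:                # loop guards against spurious wakeups
            not_empty.wait()            # releases the mutex while sleeping
        return items.popleft()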
You're looking for too much overlap in two separate but related things: a mutex and a condition variable.
A common implementation approach for a mutex is to use a flag and a queue. The flag indicates whether the mutex is held by anyone (a single-count semaphore would work too), and the queue tracks which threads are in line waiting to acquire the mutex exclusively.
A condition variable is then implemented as another queue bolted onto that mutex. Threads that got in line to wait to acquire the mutex can—usually once they have acquired it—volunteer to get out of the front of the line and get into the condition queue instead. At this point, you have two separate sets of waiters:
Those waiting to acquire the mutex exclusively
Those waiting for the condition variable to be signaled
When a thread holding the mutex exclusively signals the condition variable, for which we'll assume for now that it's a singular signal (unleashing no more than one waiting thread) and not a broadcast (unleashing all the waiting threads), the first thread in the condition variable queue gets shunted back over into the front (usually) of the mutex queue. Once the thread currently holding the mutex—usually the thread that signaled the condition variable—relinquishes the mutex, the next thread in the mutex queue can acquire it. That next thread in line will have been the one that was at the head of the condition variable queue.
There are many complicated details that come into play, but this sketch should give you a feel for the structures and operations in play.
If you are looking for performance, then start reading about "non-blocking / non-locking" thread synchronization algorithms. They are based on atomic operations, which gcc is kind enough to provide. Look up gcc atomic operations. Our tests showed we could increment a global value with multiple threads using atomic operations orders of magnitude faster than locking with a mutex. Here is some sample code that shows how to add and remove items from a linked list from multiple threads at the same time without locking.
For sleeping and waking threads, signals are much faster than condition variables. You use pthread_kill to send the signal and sigwait to put the thread to sleep. We tested this too, with the same kind of performance benefits. Here is some example code.

Multithreading, when to yield versus sleep

To clarify terminology, yield is when a thread gives up its time slice.
My platform of interest is POSIX threads, but I think the question is general.
Suppose I have a consumer/producer pattern. If I want to throttle either the consumer or the producer, which is better to use: sleep or yield? I am mostly interested in the efficiency of using either function.
The "right" way to code a producer / consumer is to have the consumer wait for the producer's data. You can achieve this by using a synchronization object such as a Mutex. The consumer will Wait on the mutex, which blocks it from executing until data is available. In turn, the producer will signal the mutex when data is available, which will wake up the consumer thread so it can begin processing. This is more efficient than sleep in terms of both:
CPU utilization (no cycles are wasted), and
Run Time (execution begins as soon as data is available, not when a thread is scheduled to wake up).
That said, here is the analysis of yield vs. sleep that you asked for. You may need such a scheme if, for some reason, waiting for the producer's output is not feasible:
It depends how much traffic you are receiving: if data is constantly being received and processed, you might consider doing a yield. However, in most cases this will result in a "busy" loop that spends most of its time needlessly waking up the thread to check if anything is ready.
You will probably want to either sleep for a short period of time (perhaps less than a second, using usleep) or, even better, use a synchronization object such as an event or condition variable to signal that data is available.
sleep and yield are not the same. When calling sleep, the process/thread gives the CPU to another process/thread for the given amount of time.
yield relinquishes the CPU to another thread, but may return immediately if there are no other threads waiting for CPU time.
So if you want to throttle, for example when streaming data at regular intervals, then sleep or nanosleep are the functions to use.
If synchronization between producer/consumer is needed, you should use a mutex/conditional wait.
One good reason to sleep instead of yield is when there is too much contention at a specific critical section. Let's say, for example, you try to acquire two locks and there is a lot of contention on both. Here you can use sleep to employ an exponential back-off: each failed attempt backs off for a pseudo-random interval, allowing other threads to succeed.
Yielding in this situation doesn't really help as much, because it lacks the random back-off that reduces the likelihood of thread starvation.
Edit: Though I know this isn't necessarily Java-specific, Java's implementation of Thread.sleep(0) has the same effect as Thread.yield(). At that point it's more a matter of style.
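A sketch of that back-off pattern in Python (the initial delay and the cap are arbitrary illustrative values):

import threading, random, time

lock_a, lock_b = threading.Lock(), threading.Lock()

def locked_work():
    delay = 0.001                              # illustrative starting back-off
    while True:
        if lock_a.acquire(blocking=False):
            if lock_b.acquire(blocking=False):
                try:
                    ...                        # critical section holding both locks
                finally:
                    lock_b.release()
                    lock_a.release()
                return
            lock_a.release()                   # couldn't get both: back out entirely
        time.sleep(delay * random.random())    # pseudo-random jitter de-synchronizes retries
        delay = min(delay * 2, 0.1)            # exponential growth, capped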
In Java, some JVM implementations treat Thread.yield() as a no-op, meaning it may have no effect. Calling Thread.sleep() does not necessarily mean that the scheduler should yield the CPU to another thread; this is implementation-dependent too. It may context-switch to another waiting thread, or it may not, in order to amortize the cost associated with a context switch.

pthreads - how to parallelize a job

I need to parallelize a simple password cracker, for use on an n-processor system. My idea is to create n threads and to feed them more and more work as they finish.
What is the best way to know when a thread has finished? A mutex? Isn't it expensive to check this mutex constantly while other threads are running?
You can have a simple queue structure - use any data structure you like - and then just use a mutex when you add/remove items from it.
Provided your threads grab the work they need to do in big enough "chunks", then there will be very little contention on the mutex, so very little overhead.
For example, if each thread was to grab approximately 1 second of work at a time and work independently for 1 second, then there would be very few operations on the mutex.
The threads could exit when they had no more work; the main thread could then wait using pthread_join.
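A sketch of this in Python (try_password(), generate_chunks() and the chunk size are placeholders for your cracker's real code):

import threading, queue

N_THREADS = 4
jobs = queue.Queue()          # Queue handles the mutex around add/remove internally

def worker():
    while True:
        chunk = jobs.get()
        if chunk is None:     # sentinel meaning "no more work"
            return
        for candidate in chunk:
            try_password(candidate)        # placeholder for the real check

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for chunk in generate_chunks(10000):       # placeholder: yields big chunks of candidates
    jobs.put(chunk)
for _ in threads:
    jobs.put(None)            # one sentinel per thread
for t in threads:
    t.join()                  # the pthread_join step mentioned above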
Use message queues between the threads :-
Master -> Process (saying go with this).
Process -> Master (saying I'm done - give me more, or, I've found the result!)
Using this, the thread only closes down when the system does - otherwise it's either processing data or waiting on a message queue.
This way, the MCP (I've always wanted to say that!) simply processes messages and hands jobs out to threads that are waiting for more work.
This may be more efficient than creating and destroying threads all the time.
Normally you would use a "condition variable" for this kind of thing, where you want to wait for an asynchronous job to finish.
Condition variables are basically mutex-protected, simple signals. Pthreads has condition variables (see e.g. the pthread_cond_init(...) function).
