Sleeping a PThread other than the one doing the calling

So I have a bunch of pthreads, where one is the "main" thread and determines if a worker thread should be running or sleeping. But the POSIX definition for sleep says: "The sleep() function shall cause the calling thread to be suspended from execution..."
Obviously I could do something clumsy like have each worker thread check to see if a flag is set, but I'm looking for something a little better. I'm hoping I'm missing something obvious, because this is throwing a wrench in my plans.

If you're hacking Cilk anyway, I guess you can do whatever you want
How about having each pthread acquire a semaphore unit before dequeueing, (or stealing), a work object, and releasing it after doing the work? There may be a little latency, sure, but the number of threads available to do work will match the number of units signaled to the semaphore. To reduce the number of available threads by N from your control thread, wait for and acquire N units, so choking off N work threads. To start 'em again, signal N units.
Would this work for your system?
Rgds,
Martin
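A minimal Python sketch of the semaphore idea above, for illustration only. The names ThrottledPool, pause and resume are hypothetical. One deliberate difference from the description: each worker takes its unit after dequeueing rather than before, so that an idle worker never sits on a unit the control thread is waiting for.

```python
import queue
import threading


class ThrottledPool:
    """Worker pool whose number of active workers is capped by a semaphore."""

    def __init__(self, n_workers, handler):
        self.permits = threading.Semaphore(n_workers)  # one unit per runnable worker
        self.work = queue.Queue()
        self.handler = handler
        for _ in range(n_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            item = self.work.get()      # idle workers block here, holding no unit
            with self.permits:          # hold one unit only while working
                self.handler(item)
            self.work.task_done()

    def pause(self, n):
        """Control thread: take n units, choking off n workers."""
        for _ in range(n):
            self.permits.acquire()

    def resume(self, n):
        """Control thread: hand n units back so n workers can run again."""
        for _ in range(n):
            self.permits.release()
```

As in the answer, a paused worker only stops at its next work item, so there is some latency before pause() returns when all workers are busy.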

Related

Releasing multiple locks without causing priority inversion

Short version: How do I release multiple locks from a single thread, without being preempted halfway through?
I have a program which is designed to run on an N-core machine. It consists of one main thread and N worker threads. Each thread (including the main thread) has a semaphore it can block on. Normally, each worker thread is blocked on decrementing its semaphore, and the main thread is running. Every now and then, though, the main thread should wake up the worker threads to do their thing for a certain amount of time, then block on its own semaphore waiting for them all to go back to sleep. Like so:
def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_semaphore = semaphore(0)
    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker i
        for i in workers_to_wake:
            main_semaphore.decrement() # wait for all workers
def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        ...do some work...
        main_semaphore.increment() # report done with step
All well and good. The problem is, one of the woken workers may end up preempting the main thread halfway through waking the workers: This can happen, for instance, when the Windows scheduler decides to boost that worker's priority. This doesn't lead to deadlock, but it is inefficient, because the remainder of the threads stay asleep until the preempting worker finishes its work. It's basically priority inversion, with the main thread waiting on one of the workers, and some of the worker threads waiting on the main thread.
I can probably figure out OS- and scheduler-specific hacks for this, such as disabling priority boosting under Windows, and fiddling about with thread priorities and processor affinities, but I'd like something cross-platform-ish and robust and clean. So: How can I wake up a bunch of threads atomically?

TL; DR
If you really have to get as much as you can out of your workers, just use an event semaphore, a control block and a barrier instead of your semaphores. Note however, that this is a more fragile solution and so you need to balance any potential gains against this downside.
Context
First I need to summarize the broader context in our discussion...
You have a Windows graphical application. It has a desired frame rate and so you need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. This means you have very tight constraints on the start and execution times for each thread. In addition, your worker threads are not all identical, so you can't just use a single work queue.
The problem
Like any modern operating system, Windows has a variety of synchronization primitives. However, none of these directly provides a mechanism for notifying multiple primitives at once. Looking through other operating systems, I see a similar pattern; they all provide ways of waiting on multiple primitives, but none provide an atomic way of triggering them.
So what can we do instead? The problems you need to solve are:
1) Precisely timing the start of all required workers.
2) Prodding the workers that actually need to run in the next frame.
Options
The most obvious solution for issue 1 is just to use a single event semaphore, but you could also use a read/write lock (by acquiring the write lock after the workers have finished and getting the workers to use a read lock). All other options are no longer atomic and so will need further synchronization to force the threads to do what you want - like lossleader's suggestion for locks inside your semaphores.
But we want an optimal solution that reduces context switches as much as possible due to the tight time constraints on your application, so let's see if either of these can be used to solve problem 2... How can you pick which worker threads should run from the main if we just have an event semaphore or read/write lock?
Well... A read/write lock is a great way for one thread to write some critical data to a control block and for many others to read from it. Why not just have a simple array of boolean flags (one for each worker thread) that your main thread updates each frame? Sadly you still need to stop execution of the workers until the timer pops. In short we're back at the semaphore and lock solution again.
However, owing to the nature of your application, you can make one more step. You can rely on the fact that you know your workers are not running outside of your time slicing and use an event semaphore as a crude form of lock instead.
A final optimization (if your environment supports them), is to use a barrier instead of the main semaphore. You know that all n threads need to be idle before you can continue, so just insist on it.
A solution
Applying the above, your pseudo-code would then look something like this:
def main_thread(n):
    main_event = event()
    for i = 1 to n:
        worker_scheduled[i] = False
        spawn_thread(worker_thread, i)
    main_barrier = barrier(n+1)
    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_scheduled[i] = True
        main_event.set()
        main_barrier.enter() # wait for all workers
        main_event.reset()
def worker_thread(i):
    while True:
        main_event.wait()
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            ...do some work...
        main_barrier.enter() # report finished for this frame.
        main_event.reset() # to catch the case that a worker is scheduled before the main thread
Since there is no explicit policing of the worker_scheduled array, this is a much more fragile solution.
I would therefore personally only use it if I had to squeeze every last ounce of processing out of my CPU, but it sounds like that is exactly what you are looking for.

This is not possible when you use multiple synchronization objects (semaphores), because the wake-up algorithm is then O(n). There are a few ways to solve it, though.
release all at once
I'm not sure whether Python has the necessary method (is your question Python-specific?), but generally, semaphores have operations that take an argument specifying how much to decrement or increment. Thus, you just put all your threads on the same semaphore and wake them all together. A similar approach is to use a condition variable and notify all.
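For illustration, here is a rough Python sketch of the "condition variable and notify all" variant. The class FrameGate and its fields are hypothetical names, not something from the question. A generation counter tells workers that a new step has started, and a set records which of them are scheduled for it.

```python
import threading


class FrameGate:
    """Wake a chosen subset of workers with a single notify_all()."""

    def __init__(self, n, job):
        self.cond = threading.Condition()
        self.generation = 0        # bumped once per step by the main thread
        self.scheduled = set()     # worker ids that should run this step
        self.outstanding = 0       # scheduled workers that have not finished yet
        self.job = job
        for i in range(n):
            threading.Thread(target=self._worker, args=(i,), daemon=True).start()

    def run_frame(self, to_wake):
        with self.cond:
            self.scheduled = set(to_wake)
            self.outstanding = len(self.scheduled)
            self.generation += 1
            self.cond.notify_all()                               # one wake-up for everyone
            self.cond.wait_for(lambda: self.outstanding == 0)    # wait for all workers

    def _worker(self, i):
        seen = 0
        while True:
            with self.cond:
                self.cond.wait_for(lambda: self.generation != seen)
                seen = self.generation
                mine = i in self.scheduled
            if mine:
                self.job(i)                                      # work happens outside the lock
                with self.cond:
                    self.outstanding -= 1
                    self.cond.notify_all()                       # lets the main thread re-check
```

Every worker wakes up on each step, including those that are not scheduled, which is the price of having only one object to notify.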
event loops
If you still want to be able to control each thread individually but like the approach with one-to-many notification, try libraries for asynchronous I/O like libuv (and its Python counterpart). Here, you can create one single event that wakes all the threads at once and also create an individual event for each thread, then just wait on both (or more) event objects in the event loop of each thread.
Another library is pevents, which implements WaitForMultipleObjects on top of pthreads' condition variables.
delegate waking up
Another approach is to replace your O(n) algorithm with a tree-like algorithm ( O(log n) ) where each thread wakes up only a fixed number of other threads but delegates to them the job of waking up the others. In the edge case, the main thread can wake up only one other thread, which will wake up everyone else or start a chain reaction. It can be useful if you want to reduce latency for the main thread at the expense of the wake-up latencies of other threads.

Reader/Writer Lock
The solution I would normally use on POSIX systems for a one to many relationship is a reader/writer lock. It is a surprise to me that they aren't a complete universal, but most languages either implement a version, or at least have a package available to implement them on whatever primitives exist, for example, python's prwlock:
from prwlock import RWLock
def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_lock = RWLock()
    while True:
        main_lock.acquire_write()
        ...do some work...
        workers_to_wake = foo()
        # The above acquire could be moved as low as here,
        # depending on how independent the above processing is..
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker i
        main_lock.release()
def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        main_lock.acquire_read()
        ...do some work...
        main_lock.release() # report done with step
Barriers
Barriers seem like Python's closest intended built-in mechanism to hold up all the threads until they are all alerted, but:
They are a pretty unusual solution, so they would make your code/experience harder to translate to other languages.
I wouldn't like to use them for this case where the number of threads to wake keeps changing. Given that your n sounds small, I would be tempted to use constant Barrier(n) and notify all threads to check if they are running this cycle. But:
I would be concerned that using a barrier would backfire since any of the threads being held up by something external will hold them all up and even a scheduler with resource dependency boosting might miss this relationship. Needing all n to reach the barrier could only make this worse.

Peter Brittain's solution, plus Anton's suggestion of a "tree-like wakeup", led me to another solution: Chained wakeups. Basically, rather than the main thread doing all the wakeups, it only wakes up one thread; and then each thread is then responsible for waking up the next one. The elegant bit here is that there's only ever one suspended thread ready to run, so threads rarely end up switching cores. In fact, this works fine with strict processor affinities, even if one of the worker threads shares affinity with the main thread.
The other thing I did was to use an atomic counter that worker threads decrement before sleeping; that way, only the last one wakes the main thread, so there's also no chance of the main thread being woken several times just to do more semaphore waiting.
workers_to_wake = []
main_semaphore = semaphore(0)
num_woken_workers = atomic_integer()
def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_semaphore = semaphore(0)
    while True:
        ...do some work...
        workers_to_wake = foo()
        num_woken_workers.atomic_set(len(workers_to_wake)) # set completion countdown
        one_to_wake = workers_to_wake.pop()
        worker_semaphore[one_to_wake].increment() # wake the first worker
        main_semaphore.decrement() # wait for all workers
def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        if workers_to_wake.len() > 0: # more pending wakeups
            one_to_wake = workers_to_wake.pop()
            worker_semaphore[one_to_wake].increment() # wake the next worker
        ...do some work...
        if num_woken_workers.atomic_decrement() == 0: # see whether we're the last one
            main_semaphore.increment() # report all done with step
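A runnable Python rendering of the chained wake-up, as a sketch only. Python has no built-in atomic integer, so this version assumes a lock-protected counter in its place. The class and method names (ChainedWakeup, run_step) are made up for the example.

```python
import threading


class ChainedWakeup:
    """Main wakes one worker; each woken worker wakes the next pending one."""

    def __init__(self, n, job):
        self.job = job
        self.worker_sems = [threading.Semaphore(0) for _ in range(n)]
        self.main_sem = threading.Semaphore(0)
        self.pending = []                  # workers still to be woken this step
        self.remaining = 0                 # completion countdown
        self.count_lock = threading.Lock() # stands in for the atomic counter
        for i in range(n):
            threading.Thread(target=self._worker, args=(i,), daemon=True).start()

    def _wake_next(self):
        # Only one thread touches `pending` at a time: whoever was woken last.
        if self.pending:
            self.worker_sems[self.pending.pop()].release()

    def run_step(self, to_wake):
        if not to_wake:
            return
        self.pending = list(to_wake)
        self.remaining = len(to_wake)
        self._wake_next()                  # wake the first worker only
        self.main_sem.acquire()            # woken once, by the last finisher

    def _worker(self, i):
        while True:
            self.worker_sems[i].acquire()  # wait to be woken
            self._wake_next()              # pass the wake-up along the chain
            self.job(i)
            with self.count_lock:
                self.remaining -= 1
                last = self.remaining == 0
            if last:
                self.main_sem.release()    # report all done with step
```

A call such as ChainedWakeup(4, job).run_step([0, 2, 3]) returns only after workers 0, 2 and 3 have each run job once.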

Java Thread Live Lock

I have an interesting problem related to Java thread live lock. Here it goes.
There are four global locks - L1,L2,L3,L4
There are four threads - T1, T2, T3, T4
T1 requires locks L1,L2,L3
T2 requires locks L2
T3 requires locks L3,L4
T4 requires locks L1,L2
So, the pattern of the problem is: any of the threads can run and acquire the locks in any order. If a thread detects that a lock it needs is not available, it releases all the locks it had previously acquired and waits for a fixed time before retrying. The cycle repeats, giving rise to a livelock condition.
So, to solve this problem, I have two solutions in mind
1) Let each thread wait for a random period of time before retrying.
OR,
2) Let each thread acquire all the locks in a particular order (even if a thread does not require all the locks)
I am not convinced that these are the only two options available to me. Please advise.

Have all the threads enter a single mutex-protected state-machine whenever they require and release their set of locks. The threads should expose methods that return the set of locks they require to continue and also to signal/wait for a private semaphore signal. The SM should contain a bool for each lock and a 'Waiting' queue/array/vector/list/whatever container to store waiting threads.
If a thread enters the SM mutex to get locks and can immediately get its lock set, it can reset its bool set, exit the mutex and continue on.
If a thread enters the SM mutex and cannot immediately get its lock set, it should add itself to 'Waiting', exit the mutex and wait on its private semaphore.
If a thread enters the SM mutex to release its locks, it sets the lock bools to 'return' its locks and iterates 'Waiting' in an attempt to find a thread that can now run with the set of locks available. If it finds one, it resets the bools appropriately, removes the thread it found from 'Waiting' and signals the 'found' thread semaphore. It then exits the mutex.
You can twiddle with the algorithm that you use to match up the available set lock bools with waiting threads as you wish. Maybe you should release the thread that requires the largest set of matches, or perhaps you would like to 'rotate' the 'Waiting' container elements to reduce starvation. Up to you.
A solution like this requires no polling (with its performance-sapping CPU use and latency) and no continual acquire/release of multiple locks.
It's much easier to develop such a scheme with an OO design. The methods/member functions to signal/wait the semaphore and return the set of locks needed can usually be stuffed somewhere in the thread class inheritance chain.
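The state machine described above could be sketched roughly like this. It is written in Python rather than Java purely for brevity, and LockSetManager is a hypothetical name. The matching policy is an assumption too: on release, every waiter that now fits is granted its set, in arrival order.

```python
import threading


class LockSetManager:
    """One mutex guards a table of 'free' flags plus a list of waiting threads."""

    def __init__(self, names):
        self.mutex = threading.Lock()
        self.free = {name: True for name in names}
        self.waiting = []   # entries are (needed lock names, private semaphore)

    def _try_take(self, needed):
        # Caller must hold self.mutex.
        if all(self.free[name] for name in needed):
            for name in needed:
                self.free[name] = False
            return True
        return False

    def acquire(self, needed):
        needed = frozenset(needed)
        with self.mutex:
            if self._try_take(needed):
                return                       # got the whole set immediately
            sem = threading.Semaphore(0)
            self.waiting.append((needed, sem))
        sem.acquire()   # the releasing thread marks the locks as taken for us

    def release(self, held):
        with self.mutex:
            for name in held:
                self.free[name] = True
            still_waiting = []
            for needed, sem in self.waiting:
                if self._try_take(needed):
                    sem.release()            # wake exactly the thread that can run
                else:
                    still_waiting.append((needed, sem))
            self.waiting = still_waiting
```

A thread never holds a partial set, so there is nothing to back out of and nothing to retry, which is what removes the livelock. Starvation of a thread with a large set is still possible under this simple policy.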

Unless there is a good reason (performance-wise) not to do so, I would unify all locks into one lock object. This is similar to solution 2 you suggested, only simpler in my opinion.
And by the way, not only is this solution simpler and less bug-prone, its performance might also be better than that of solution 1 you suggested.

Personally, I have never heard of Option 1, but I am by no means an expert on multithreading. After thinking about it, it sounds like it will work fine.
However, the standard way to deal with threads and resource locking is somewhat related to Option 2. To prevent deadlocks, resources need to always be acquired in the same order. For example, if you always lock the resources in the same order, you won't have any issues.
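To make the "same order" rule concrete, here is a small Python sketch (the question is about Java, so treat this as illustration only). The helper hold() and the NEEDS table are hypothetical. Each thread takes only the locks it needs, but always in sorted name order.

```python
import threading
from contextlib import contextmanager

LOCKS = {name: threading.Lock() for name in ("L1", "L2", "L3", "L4")}

# Lock sets taken from the question.
NEEDS = {
    "T1": ("L1", "L2", "L3"),
    "T2": ("L2",),
    "T3": ("L3", "L4"),
    "T4": ("L1", "L2"),
}


@contextmanager
def hold(*names):
    """Acquire the named locks in one global order, release in reverse."""
    ordered = sorted(set(names))
    for name in ordered:
        LOCKS[name].acquire()          # blocking is safe: no cycle can form
    try:
        yield
    finally:
        for name in reversed(ordered):
            LOCKS[name].release()


def run_all(needs, rounds=100):
    """Run one thread per entry; count how often each lock was held."""
    counts = {name: 0 for name in LOCKS}

    def body(lock_names):
        for _ in range(rounds):
            with hold(*lock_names):
                for name in lock_names:
                    counts[name] += 1  # protected by the lock of the same name

    threads = [threading.Thread(target=body, args=(names,)) for names in needs.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Because every thread blocks instead of releasing and retrying, there is no retry cycle left to turn into a livelock.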

Go with 2a) Let each thread acquire all of the locks that it needs (NOT all of the locks) in a particular order; if a thread encounters a lock that isn't available then it releases all of its locks
As long as threads acquire their locks in the same order you can't have deadlock; however, you can still have starvation (a thread might run into a situation where it keeps releasing all of its locks without making forward progress). To ensure that progress is made you can assign priorities to threads (0 = lowest priority, MAX_INT = highest priority) - increase a thread's priority when it has to release its locks, and reduce it to 0 when it acquires all of its locks. Put your waiting threads in a queue, and don't start a lower-priority thread if it needs the same resources as a higher-priority thread - this way you guarantee that the higher-priority threads will eventually acquire all of their locks. Don't implement this thread queue unless you're actually having problems with thread starvation, though, because it's probably less efficient than just letting all of your threads run at once.
You can also simplify things by implementing omer schleifer's condense-all-locks-to-one solution; however, unless threads other than the four you've mentioned are contending for these resources (in which case you'll still need to lock the resources from the external threads), you can more efficiently implement this by removing all locks and putting your threads in a circular queue (so your threads just keep running in the same order).

Mutex lock: what does "blocking" mean?

I've been reading up on multithreading and shared resources access and one of the many (for me) new concepts is the mutex lock. What I can't seem to find out is what is actually happening to the thread that finds a "critical section" is locked. It says in many places that the thread gets "blocked", but what does that mean? Is it suspended, and will it resume when the lock is lifted? Or will it try again in the next iteration of the "run loop"?
The reason I ask is that I want to have system-supplied events (mouse, keyboard, etc.), which (apparently) are delivered on the main thread, handled in a very specific part of the run loop of my secondary thread. So whatever event is delivered, I queue in my own data structure. Obviously, the data structure needs a mutex lock because it's being modified by both threads. The missing puzzle piece is: what happens when an event gets delivered in a function on the main thread, I want to queue it, but the queue is locked? Will the main thread be suspended, or will it just jump over the locked section and go out of scope (losing the event)?

Blocked means execution gets stuck there; generally, the thread is put to sleep by the system and yields the processor to another thread. When a thread is blocked trying to acquire a mutex, execution resumes when the mutex is released, though the thread might block again if another thread grabs the mutex before it can.
There is generally a try-lock operation that grabs the mutex if possible and, if not, returns an error. But you are eventually going to have to move the current event into that queue. Also, if you delay moving the events to the thread where they are handled, the application will become unresponsive regardless.
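As an illustration of try-lock, here is a short Python sketch. The names enqueue_event and deferred are invented for the example. If the queue is locked, the event is parked in a list that only the delivering thread touches, and it is moved into the queue on the next successful attempt, so nothing is lost.

```python
import threading
from collections import deque

queue_lock = threading.Lock()
events = deque()    # shared with the consumer thread, guarded by queue_lock
deferred = []       # touched only by the delivering (main) thread


def enqueue_event(event):
    """Try to queue an event without blocking; park it if the queue is busy."""
    if queue_lock.acquire(blocking=False):   # try-lock: returns False instead of waiting
        try:
            events.extend(deferred)          # flush anything parked earlier, in order
            deferred.clear()
            events.append(event)
        finally:
            queue_lock.release()
        return True
    deferred.append(event)                   # queue is locked: keep the event for later
    return False
```

Parked events only reach the queue when another event arrives, which is the unresponsiveness the answer warns about. A plain blocking acquire is usually simpler.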
A queue is actually one case where you can get away with not using a mutex. For example, Mac OS X (and possibly also iOS) provides the OSAtomicEnqueue() and OSAtomicDequeue() functions (see man atomic or <libkern/OSAtomic.h>) that exploit processor-specific atomic operations to avoid using a lock.
But, why not just process the events on the main thread as part of the main run loop?

The simplest way to think of it is that the blocked thread is put in a wait ("sleeping") state until the mutex is released by the thread holding it. At that point the operating system will "wake up" one of the threads waiting on the mutex and let it acquire it and continue. It's as if the OS simply puts the blocked thread on a shelf until it has the thing it needs to continue. Until the OS takes the thread off the shelf, it's not doing anything. The exact implementation -- which thread gets to go next, whether they all get woken up or they're queued -- will depend on your OS and what language/framework you are using.

Too late to answer but I may facilitate the understanding. I am talking more from implementation perspective rather than theoretical texts.
The word "blocking" is kind of technical homonym. People may use it for sleeping or mere waiting. The term has to be understood in context of usage.
Blocking means Waiting - Assume that on an SMP system a thread B wants to acquire a spinlock held by some other thread A. One mechanism is to disable preemption and keep spinning on the processor until B gets it. Another, probably more efficient, mechanism is to allow other threads to use the processor in case B does not get the lock after a few easy attempts. Therefore we schedule out thread B (as preemption is enabled) and give the processor to some other thread C. In this case thread B just waits in the scheduler's queue and comes back when its turn arrives. Understand that B is not sleeping, just waiting passively instead of busy-waiting and burning processor cycles. On BSD and Solaris systems there are data structures like turnstiles to implement this situation.
Blocking means Sleeping - If thread B had instead made a system call like read(), waiting for data from a network socket, it cannot proceed until it gets the data. Therefore, some texts casually use the term blocking as in "... blocked for I/O" or "... in blocking system call". Actually, thread B is rather sleeping. There are specific data structures known as sleep queues - much like luxury waiting rooms at airports :-). The thread will be woken up when the OS detects availability of data, much like an attendant of the waiting room would do.

Blocking means just that. It is blocked. It will not proceed until able. You don't say which language you're using, but most languages/libraries have lock objects where you can "attempt" to take the lock and then carry on and do something different depending on whether you succeeded or not.
But in, for example, Java synchronized blocks, your thread will stall until it is able to acquire the monitor (mutex, lock). The java.util.concurrent.locks.Lock interface describes lock objects which have more flexibility in terms of lock acquisition.

Multithreading, when to yield versus sleep

To clarify terminology, yield is when a thread gives up its time slice.
My platform of interest is POSIX threads, but I think the question is general.
Suppose I have consumer/producer pattern. If I want to throttle either consumer or producer, which is better to use, sleep or yield? I am mostly interested in efficiency of using either function.

The "right" way to code a producer / consumer is to have the consumer wait for the producer's data. You can achieve this by using a synchronization object such as a condition variable (paired with a mutex) or a semaphore. The consumer waits on it, which blocks the consumer from executing until data is available. In turn, the producer signals it when data is available, which wakes up the consumer thread so it can begin processing. This is more efficient than sleep in terms of both:
CPU utilization (no cycles are wasted), and
Run Time (execution begins as soon as data is available, not when a thread is scheduled to wake up).
That said, here is an analysis of yield vs sleep that you asked for. You may need to use such a scheme if for some reason waiting for output is not feasible:
It depends how much traffic you are receiving - if data is constantly being received and processed, you might consider doing a yield. However in most cases this will result in a "busy" loop that spends most of its time needlessly waking up the thread to check if anything is ready.
You will probably want to either sleep for a short period of time (perhaps for less than a second, using usleep) or, even better, use a synchronization object such as a condition variable or semaphore to signal that data is available.

sleep and yield are not the same. When calling sleep the process/thread gives CPU to another process/thread for the given amount of time.
yield relinquishes the CPU to another thread, but may return immediately if there are no other threads waiting for the CPU.
So if you want to throttle, for example when streaming data at regular intervals, then sleep or nanosleep are the functions to use.
If synchronization between producer and consumer is needed, you should use a mutex with a condition variable.
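A minimal Python sketch of the mutex plus condition-wait approach (the question is about POSIX threads, where pthread_cond_wait plays the same role). Channel and CLOSED are names made up for this example.

```python
import threading
from collections import deque

CLOSED = object()   # sentinel returned by get() once the channel is drained and closed


class Channel:
    """Unbounded producer/consumer queue built on a condition variable."""

    def __init__(self):
        self.items = deque()
        self.cond = threading.Condition()   # a mutex plus a wait queue
        self.closed = False

    def put(self, item):
        with self.cond:
            self.items.append(item)
            self.cond.notify()              # wake one waiting consumer

    def close(self):
        with self.cond:
            self.closed = True
            self.cond.notify_all()          # let every consumer see the shutdown

    def get(self):
        with self.cond:
            # Loop rather than `if`: a wake-up does not guarantee data is there.
            while not self.items and not self.closed:
                self.cond.wait()            # sleeps without burning CPU
            if self.items:
                return self.items.popleft()
            return CLOSED
```

The consumer uses no CPU while the queue is empty and starts running as soon as put() is called, so neither sleep nor yield is needed.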

One good reason to sleep instead of yield is when there is too much contention at a specific critical section. Let's say, for example, you try to acquire two locks and there is a lot of contention on both. Here you can use sleep to implement an exponential back-off, so that each failed attempt backs off for a pseudo-random time and gives other threads a chance to succeed.
Yielding doesn't help as much in this situation, because it is the random back-off that reduces the likelihood of thread starvation.
Edit: I know this isn't necessarily Java-specific, but Java's implementation of Thread.sleep(0) has the same effect as Thread.yield(). At that point it's more a matter of style.
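The back-off idea might look like this in Python. This is a sketch under assumed parameters: the base delay, the cap and the attempt limit are arbitrary, and backoff_delay and acquire_both are hypothetical helper names.

```python
import random
import threading
import time


def backoff_delay(attempt, base=0.001, cap=0.1):
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def acquire_both(first, second, max_attempts=20):
    """Take two locks; on failure release everything and sleep a random while."""
    for attempt in range(max_attempts):
        first.acquire()
        if second.acquire(blocking=False):   # try-lock the second one
            return True                      # caller now holds both locks
        first.release()                      # never wait while holding a lock
        time.sleep(backoff_delay(attempt))   # the window doubles on each failure
    return False
```

The randomness is what keeps two contending threads from retrying in lockstep, and a plain yield cannot provide that.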

In Java, some JVM implementations treat Thread.yield() as a no-op, meaning it may have no effect. Calling Thread.sleep() does not necessarily mean that the scheduler will yield the CPU to another thread; this is implementation-dependent too. It may context-switch to another thread that is waiting, or it may not, in order to amortize the cost associated with a context switch.

prevent linux thread from being interrupted by scheduler

How do you tell the thread scheduler in linux to not interrupt your thread for any reason? I am programming in user mode. Does simply locking a mutex accomplish this? I want to prevent other threads in my process from being scheduled when a certain function is executing. They would block and I would be wasting cpu cycles with context switches. I want any thread executing the function to be able to finish executing without interruption even if the thread's timeslice is exceeded.

How do you tell the thread scheduler in linux to not interrupt your thread for any reason?
Can't really be done; you need a real-time system for that. The closest thing you'll get with Linux is to set the scheduling policy to a real-time scheduler, e.g. SCHED_FIFO, and also set the PTHREAD_EXPLICIT_SCHED attribute. Even then, IRQ handlers and other stuff will still interrupt your thread and run.
However, if you only care about the threads in your own process not being able to do anything, then yes, having them block on a mutex your running thread holds is sufficient.
The hard part is to coordinate all the other threads to grab that mutex whenever your thread needs to do its thing.

You should architect your software so you're not dependent on the scheduler doing the "right" thing from your app's point of view. The scheduler is complicated. It will do what it thinks is best.
Context switches are cheap. You say
I would be wasting cpu cycles with context switches.
but you should not look at it that way. Use the multi-threaded machinery of mutexes and blocked / waiting processes. The machinery is there for you to use...

You can't. If you could, what would prevent your thread from never releasing the CPU and starving other threads?
The best you can do is set your thread's priority so that the scheduler will prefer it over lower-priority threads.

Why not simply let the competing threads block, then the scheduler will have nothing left to schedule but your living thread? Why complicate the design second guessing the scheduler?

Look into real-time scheduling under Linux. I've never done it, but if you really do NEED this, it is as close as you can get in user application code.
What you seem to be scared of isn't really that big of a deal though. You can't stop the kernel from interrupting your programs for real interrupts or if a higher-priority task wants to run, but with regular scheduling the kernel uses its own computed priority value, which pretty much handles most of what you are worried about. If thread A is holding resource X exclusively (X could be a lock) and thread B is waiting on resource X to become available, then A's effective priority will be at least as high as B's priority. It also takes into account whether a process is using up lots of CPU or spending lots of time sleeping when computing the priority. Of course, the nice value goes in there too.
