Ways to detect deadlock in a live application - multithreading

What are the ways to detect deadlocks in a live multi-threaded application?
If we found there is a deadlock, are there any ways to resolve it, without taking down/restarting the application?

There are two popular ways to detect deadlocks.
One is to have threads set checkpoints. For example, if you have a thread that has a work loop, you set a timer at the beginning of doing work that's set for longer than you think the work could possibly take. If the timer fires, you assume the thread is deadlocked. When the work is done, you cancel the timer.
Another (sometimes used in combination) is to have things that a thread might block on track what other resources a thread might hold. This can directly detect an attempt to acquire one lock while holding another one when other threads have acquired those locks in the opposite order.
This can even detect deadlock risk without the deadlock actually occurring. If one thread acquires lock A then B and another acquires lock B then A, there is no deadlock unless they overlap. But this method can detect it.
Advanced deadlock detection is typically only used during debugging. Other than coding the application to check each blocking lock for a possible deadlock and knowing what to do if it happens, the only thing you can do after a deadlock is tear the application down. You can't release locks blindly because the resources they protect may be in an inconsistent state.
Sometimes you deliberately write code that you know can deadlock and specifically code it to avoid the problem. For example, if you know lots of threads take lock A and then try to acquire lock B, and some other thread needs to do the reverse, you can code it do a non-blocking attempt to lock B and release lock A if it fails.
Typically, it's more useful to spend your effort making deadlocks impossible rather than making the code detect and work around deadlocks.

Python has a feature called the faulthandler that's very useful for dealing with deadlocks:
import faulthandler
faulthandler.register(signal.SIGUSR1)
If you're using C++ or any compiler that uses glibc, you can use the backtrace() functions in execinfo.h to print a stacktrace and exit gracefully when you get a signal. You can take a deadlocked program, send it a signal and get a list of all the threads.
In Java, use jstack <pid> on the stuck process.

Related

Is there any reason to use a regular lock over a recursive lock?

When a thread tries to acquire a recursive lock again that it already holds, rlock.acquire() allows the thread to continue and does not block the thread.
When, on the other hand, a thread tries to acquire a regular lock that it already holds then the thread is then just stuck in a deadlock.
The second case seems to me like just a source of trouble since it is a situation that cannot be easily recovered from (the thread is just stuck on the lock.acquire()) and that is kinda hard to diagnose (no exception is thrown or anything, the thread is just stuck).
I have seen it quite a few times so far that someone actually wanted to use an RLock but instead used a regular Lock and spent a while debugging that problem. While on the other hand I never encountered a situation where a Lock would have actually been better. It could arguably be used when there is a really critical part of the code that should not be accessed by the same thread twice at a time, but for that to happen the code inside that critical part would need to call itself, which would be something that should be quite obvious to the programmer.
So, is there any case where an Lock is better than an RLock? And if not, should language designers keep providing the regular Lock at all?
Assuming these are Python lock objects the documentation shows that they are quite different. The main differences between the two are:
A Lock can be released by any thread not just the thread that acquired it
An Rlock can only be released by the thread that acquired
An Rlock must be released once for each time it is acquired by the thread
So a Lock allows you to build threading schemes where the lock is acquired in one thread but released in another thread. One example might be a pipeline of threads processing a piece of work, the work distributer gets the lock but it's released by the last thread in the pipeline.

Make thread wait for condition but allow thread to remain usable while waiting or listening for a signal

Given a situation where thread A had to dispatch work to thread B, is there any synchronisation mechanism that allows thread A to not return, but remain usable for other tasks, until thread B is done, of which then thread A can return?
This is not language specific, but simple c language would be a great choice in responding to this.
This could be absolutely counterintuitive; it actually sounds as such, but I have to ask before presuming...
Please Note This is a made up hypothetical situation that I'm interested in. I am not looking for a solution to an existing problem, so alternative concurrency solutions are completely pointless. I have no code for it, and if I were in it I can think of a few alternative code engineering solutions to avoid this setup. I just wish to know if a thread can be usable, in some way, while waiting for a signal from another thread, and what synchronisation mechanism to use for that.
UPDATE
As I mentioned above, I know how to synchronise threads etc. Im only interested in the situation that I have presented here. Mutexes, semaphores and locks all kinds of mechanisms will all synchronise access to resources, synchronise order of events, synchronise all kinds of concurrently issues, yes. But Im not interested in how to do it properly. I just have this made up situation that I wish to know if it can be addressed with a mechanism as described prior.
UPDATE 2
It seems I have opened up a portal for people that think they are experts in concurrency to teleport and lecture at chance how they think the rest of world does not know how threading works. I simply asked if there is a mechanism for this situation, not a work around solution, not 'the proper way to synchronise', not a better way to do it. I already know what I would do and never be in this made up situation. It's simply hypothetical.
After much research, thought, and overview, I have come to the conclusion that its like asking:
If a calculator has the ability for me simply enter a series of 5 digits and automatically get their sum on the screen.
No, it does not have such a mode ready. But I can still get the sum with a few extra clicks using the plus and eventually the equal button.
If i really wanted a thread that can continue while listening for a condition of some sort, I could easily implement a personal class or object around the OS/kernel/SDK thread or whatever and make use of that.
• So at a low level, my answer is no, there is no such mechanism •
If a thread is waiting, then it's waiting. If it can continue executing then it is not really 'waiting', in the concurrency meaning of waiting. Otherwise there would be some other term for this state (Alert Waiting, anyone?). This is not to say it is not possible, just not with one simple low level predefined mechanism similar to a mutex or semaphore etc. One could wrap the required functionality in some class or object etc.
Having said that, there are Interrupts and Interrupt handlers, which come close to addressing this situation. However, an interrupt has to be defined, with its handler. The interrupts may actually be running on another thread (not to say a thread per interrupt). So a number of objects are involved here.
You have a misunderstanding about how mutexes are typically used.
If you want to do some work, you acquire the mutex to figure out what work you need to do. You do this because "what work you need to do" is shared between the thread that decide what work needed to be done and the thread that's going to do the work. But then you release the mutex that protects "what work you need to do" while you do the work.
Then, when you finish the work, you acquire the mutex that protects your report that the work is done. This is needed because the status of the work is shared with other threads. You set that status to "done" and then you release the mutex.
Notice that no thread holds the mutex for very long, just for the microscopic fraction of a second it needs to check on or modify shared state. So to see if work is done, you can acquire the mutex that protects the reporting of the status of that work, check the status, and then release the mutex. The thread doing the work will not hold that mutex for longer than the tiny fraction of a second it needs to change that status.
If you're holding mutexes so long that you worry at all about waiting for them to be released, you're either doing something wrong or using mutexes in a very atypical way.
So use a mutex to protect the status of the work. If you need to wait for work to be done, also use a condition variable. Only hold that mutex while changing, or checking, the status of the work.
But, If a thread attempts to acquire an already acquired mutex, that thread will be forced to wait until the thread that originally acquired the mutex releases it. So, while that thread is waiting, can it actually be usable. This is where my question is.
If you consider any case where one thread might slow another thread down to be "waiting", then you can never avoid waiting. All that has to happen is one thread accesses memory and that might slow another thread down. So what do you do, never access memory?
When we talk about one thread "waiting" for another, what we mean is waiting for the thread to do actual work. We don't worry about the microscopic overhead of inter-thread synchronization both because there's nothing we can do about it and because it's negligible.
If you literally want to find some way that one thread can never, ever slow another thread down, you'll have to re-design pretty much everything we use threads for.
Update:
For example, consider some code that has a mutex and a boolean. The boolean indicates whether or not the work is done. The "assign work" flow looks like this:
Create a work object with a mutex and a boolean. Set the boolean to false.
Dispatch a thread to work on that object.
The "do work" flow looks like this:
Do work. (The mutex is not held here.)
Acquire mutex.
Set boolean to true.
Release mutex.
The "is work done" flow looks like this:
Acquire mutex.
Copy boolean.
Release mutex.
Look at copied value.
This allows one thread to do work and another thread to check if the work is done any time it wants to while doing other things. The only case where one thread waits for the other is the one-in-a-million case where a thread that needs to check if the work is done happens to check right at the instant that the work has just finished. Even in that case, it will typically block for less than a microsecond as the thread that holds the mutex only needs to set one boolean and release the mutex. And if even that bothers you, most mutexes have a non-blocking "try to lock" function (which you would use in the "check if work is done" flow so that the checking thread never blocks).
And this is the normal way mutexes are used. Actual contention is the exception, not the rule.

Ending a thread that might be joined or dereferenced

I'm having a problem deciding on what to do in this situation, I want to have a detached thread, but still be able to join it in case I want to abort it early, presumably before starting a new instance of it, to make sure I don't have the thread still accessing things when it shouldn't.
This means I shouldn't detach the thread right after calling it, so then I have a few options:
Self-detach the thread when it's reaching the end of its execution, but then wouldn't this cause problems if I try to join it from the main thread? This would be my prefered solution if the problem of trying to join it after it's self-detached could be solved. I could dereference the thread handle that the main thread has access to from the self-detaching thread before self-detaching it, however in case the main thread tries to join right before the handle is dereferenced and the thread self-detached this could cause problems, so I'd have to protect the dereferencing in the thread and however (I don't know how, I might need to create a variable to indicate this) I would check if I should join in the main thread with a mutex, which complicates things. Somehow I have a feeling that this isn't the right way to do it.
Leave the thread hanging until eventually I join it, which could take a long time to happen, depending on how I organise things it could be not before I get rid of what it made (e.g. joining the thread right before freeing an image that was loaded/processed by the thread when I don't need it anymore)
Have the main thread poll periodically to know when the thread has done its job, then join it (or detach it actually) and indicate not to try joining it again?
Or should I just call pthread_exit() from the thread, but then what if I try to join it?
If I sound a bit confused it's because I am. I'm writing in C99 using TinyCThread, a simple wrapper to pthread and Win32 API threading. I'm not even sure how to dereference my thread handles, on Windows the thread handle is HANDLE, and setting a handle to NULL seems to do it, I'm not sure that's the right way to do it with the pthread_t type.
Epilogue: Based on John Bollinger's answer I chose to go with detaching the thread, putting most of that thread's code in a mutex, this way if any other thread wants to block until the thread is practically done it can use that mutex.
The price of using an abstraction layer such as TinyCThreads is that you can rely only on the defined characteristics of the abstraction. Both Windows and POSIX provide features and details that are not necessarily reflected by TinyCThreads. On the other hand, this may force you to rely on a firmer foundation than you might otherwise hack together with the help of implementation-specific features.
Anyway, you say,
I want to have a detached thread, but still be able to join it in case I want to abort it early,
but that's inconsistent. Once you detach a thread, you cannot join it. I suspect you meant something more like, "I want a thread that I can join as long as it is running, but that I don't have to join when it terminates." That's at least consistent, but it focuses on mechanism.
What I think you actually want would be described better as a thread that you can cancel synchronously as long as it is running, but that you otherwise don't need to join when it terminates. I note, however, that the whole idea presupposes a way to make the thread terminate early, and it does not appear that TinyCThread provides any built-in facility for that. It will also require a mechanism to determine whether a given thread is still alive, and TinyCThread does not provide that, either.
First, then, you need some additional per-thread shared state that tracks thread status (running / abort requested / terminated). Because the state is shared, you'll need a mutex to protect it, and that will probably need to be per-thread, too. Furthermore, in order to enable one thread (e.g. the main one) to wait for that state to change when it cancels a thread, it will need a per-thread condition variable.
With that in place, the new thread can self-detach, but it must periodically check whether an abort has been requested. When the thread ends its work, whether because it discovers an abort has been requested or because it reaches the normal end of its work, it performs any needed cleanup, sets the status to "terminated", broadcasts to the CV, and exits.
Any thread that wants to cancel another locks the associated mutex, and checks whether the thread is already terminated. If not, it sets the thread status to "abort requested", and waits on the condition variable until the status becomes "terminated". If desired, you could use a timed wait to allow the cancellation request to time out. After successfully canceling the thread, it may be possible to clean up the mutex, cv, and shared variable.
I note that all of that hinges on my interpretation of your request, and in particular, on the prospect that what you're after is aborting / canceling threads. None of the alternatives you floated seem to address that; for the most part they abandon the unwanted thread, which does not serve your expressed interest in preventing it from making unwanted changes to shared state.
It's not clear to me what you want, but you can use a condition variable to implement basically arbitrary joining semantics for threads. The POSIX Rationale contains an example of this, showing how to implement pthread_join with a timeout (search for timed_thread).

Know how many are waiting on a pthread mutex lock

I would like to know how many threads are waiting on a lock so I would be able to destroy it safely.
The problem is that I can't destroy the lock when someone holds it or someone is waiting on it.
My program can make sure that no new requests are made to acquire the lock, but how can I know when all the threads that waited on it are done with it?
I thought about a conditional variable but I suspect it will create problems..
dlv, could you add some code snippet to your description.
I hope you should be using condition variables,
Each thread will block in pthread_cond_wait() until the other thread signals it to wake up. This will not cause a deadlock. It can easily be extended to many threads, by allocating one int, pthread_cond_t and pthread_mutex_t per thread.
pthread_cond_wait() blocks the calling thread until the specified condition is signalled. This routine should be called while mutex is locked, and it will automatically release the mutex while it waits. After signal is received and thread is awakened, mutex will be automatically locked for use by the thread. The programmer is then responsible for unlocking mutex when the thread is finished with it.
The pthread_cond_signal() routine is used to signal (or wake up) another thread which is waiting on the condition variable. It should be called after mutex is locked, and must unlock mutex in order for pthread_cond_wait() routine to complete.
The pthread_cond_broadcast() routine should be used instead of pthread_cond_signal() if more than one thread is in a blocking wait state.
It is a logical error to call pthread_cond_signal() before calling pthread_cond_wait().
Proper locking and unlocking of the associated mutex variable is essential when using these routines. For example:
Failing to lock the mutex before calling pthread_cond_wait() may cause it NOT to block.
Failing to unlock the mutex after calling pthread_cond_signal() may not allow a matching pthread_cond_wait() routine to complete (it will remain blocked).
If threads that can use the mutex still exist or might be created in the future then don't delete it.
You do know and are tracking what threads are created, right?
If, for some reason, you cannot keep track of the threads using a resource, your only way out is to leak the resource. It can never be safely deleted because you never know when you are done using it.
Say you had a counter that counted the threads using a mutex. That counter would need its own mutex. Then how do you decide when to delete that one?
That way of thinking is the road that leads to hell. You could do what you want with condition variables, but the result would be an extremely weak design.
Assuming you managed to create such a monster, it would basically allow you to kill "safely" any other thread regardless of its internal state. Except for a quick and dirty panic exit (in case of some internal software error), this is the worst possible way of solving synchronization issues.
A design relying on such tricks would have to create implicit synchronizations between tasks to make sure the terminations occur in the proper order. A lot of software are designed that way, and most of them allow mediocre programmers to make a living by maintaining the pile of crap they created in the first place.
Task termination should be an issue solved at global design level, not by a toolbox of wonky objects that allow you to twist synchronization any odd way.

Is Deadlock recovery possible in MultiThread programming?

Process has some 10 threads and all 10 threads entered DEADLOCK state( assume all are waiting for Mutex variable ).
How can you free process(threads) from DEADLOCK state ? .
Is there any way to kill lower priority thread ?( in Multi process case we can kill lower priority process when all processes in deadlock state).
Can we attach that deadlocked process to the debugger and assign proper value to the Mutex variable ( assume all the threads are waiting on a mutex variable MUT but it is value is 0 and can we assign MUT value to 1 through debugger ) .
If every thread in the app is waiting on every other, and none are set to time out, you're rather screwed. You might be able to run the app in a debugger or something, but locks are generally acquired for a reason -- and manually forcing a mutex to be owned by a thread that didn't legitimately acquire it can cause some big problems (the thread that previously owned it is still going to try and release it, the results of which can be unpredictable if the mutex is unexpectedly yanked away. Could cause an unexpected exception, could cause the mutex to be unlocked while still in use.) Anyway it defeats the whole purpose of mutexes, so you're just covering up a much bigger problem.
There are two common solutions:
Instead of having threads wait forever, set a timeout. This is slightly harder to do in languages like Java that embed mutexes into the language via synchronized or lock blocks, but it's almost always possible. If you time out waiting on the lock, release all the locks/mutexes you had and try later.
Better, but potentially much more complex, is to figure out why everything's fighting for the resource and remove that contention. If you must lock, lock consistently. But if there's 10 threads blocking on a single mutex, that could be a clue either that your operations are badly chunked (ie: that your threads are doing too much or too little at once before trying to acquire a lock), or that there's unnecessary locking going on. Don't lock unless you have to. Some synchronization could be obviated by using collections and algorithms specifically designed to be "lock-free" while still offering thread-safety.
Adding another answer because I don't agree with the solutions proposed by cHao earlier - the analysis is fine.
First, why I disagree with the two solutions offered:
Reduce contention
Contention doesn't lead to deadlocks. It just causes poor performance. Deadlock means no performance whatsoever. Therefore, reducing contention does not solve deadlocks.
timeout on mutex.
A mutex protects a resource, and a thread locks the mutex because it needs the resource. With a timeout, you won't be able to acquire the resource, and your thread fails. Does it solve the deadlock problem? Only if the failing thread releases another resource that was blocking the other threads.
But in that case, there's a much better solution. Mutexes should have a partial ordering. If there is at least one thread that can both mutex A and B, you should decide whether A or B is acquired first, and then stick with that. This must be a transitive order: if you lock A before B, and B before C, then obviously you must lock A before C.
This is a perfect solution to deadlocks. Look back at the timeout example: it only works if the thread that times out waiting on A then releases its lock on B, to release another thread that was waiting on B. In the most simple case, that other thread was itself directly locking A. Thus, the mutexes A and B are not properly ordered. You should have consistently locked either A or B first.
The timeout case could also be the result of a cyclic order problem; one thread locks A then B, another B then C, and a third C then A, with the deadlock happening when each thread owns one lock. The solution again is the same; order the locks.
Alternatively said, mutex lock orders can be described by a directed graph. If a thread locks A before B, there's an arc from A to B. Deadlocks appear if the directed graph is cyclic, and then the arcs of that cycle are the deadlocked threads.
This theory can be a bit complex, but there are some simple insights to be found. For instance, from the graph theory, we know that trees are acyclic graphs. Hence, neither "leaf mutexes" (those that are always locked last) nor "root mutexes" (those that are always locked first) can cause deadlocks. Leaf mutexes are excluded because no thread ever blocks holding them, and root mutexes are excluded because the thread that holds them will be able to lock all subsequent mutexes in due time.

Resources