What are the possible ways to debug deadlocking threads in a MT program, other than gdb?
On some platforms deadlock detection tools may help you find already observed and not yet observed deadlocks, as well as other bugs.
On Solaris, try LockLint.
On Linux, try Helgrind or DRD.
If you're using POSIX, try investigating PTHREAD_MUTEX_ERRORCHECK.
I've always invested some time into writing or grafting a flexible logging facility onto projects I've worked on, and it always paid off handsomely in turning difficult bugs into easy ones. At the very least, wrapping locking primitives in functions or methods that log before and after locking, and that display the object being locked and the thread doing the locking, has always helped me zero in on the offending thread in a matter of minutes - assuming that the problem can be reproduced at all, of course.
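A minimal sketch of that lock-wrapping idea, written in Python purely for illustration (the answer itself names no language). `LoggedLock`, `worker` and `queue_lock` are hypothetical names, not part of any library. The wrapper logs before and after each acquire and reports the lock name and the thread involved.

```python
import logging
import threading

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s] %(message)s",
)
log = logging.getLogger("locks")


class LoggedLock:
    """Wraps threading.Lock and logs every acquire and release."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.owner = None  # name of the thread currently holding the lock

    def acquire(self):
        # Reading self.owner here is unsynchronized; it is a diagnostic hint only.
        log.debug("waiting for %s (held by %s)", self.name, self.owner)
        self._lock.acquire()
        self.owner = threading.current_thread().name
        log.debug("acquired %s", self.name)

    def release(self):
        log.debug("releasing %s", self.name)
        self.owner = None
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False


def worker(lock, results):
    with lock:
        results.append(threading.current_thread().name)


queue_lock = LoggedLock("queue_lock")
results = []
threads = [
    threading.Thread(target=worker, args=(queue_lock, results), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the program hangs, the last "waiting for ..." line without a matching "acquired ..." line points at the stuck thread and the lock it wanted.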
Loading the program under a debugger is actually a pretty limited method of determining what happened once a process deadlocks, since all it can give you is a snapshot of how badly you messed up, rather than a step by step explanation of how you screwed up, which I find a lot more helpful.
Or get the Intel Thread Checker. Fine work.
Related
This is the first time I am trying to profile a multi-threaded program.
I suspect the problem is that it is waiting for something, but I have no clue what: the program never reaches 100% of CPU, GPU, RAM or I/O use.
Until recently, I've only worked on projects with single-threading, or where the threads were very simple (example: usually an extra thread just to ensure the UI is not locked while the program works, or once I made a game engine with a separate thread to handle .XM and .IT files music, so that the main thread could do everything, while the other thread in another core could take care of decoding those files).
This program has several threads, and they don't do parallel work on the same tasks; each thread has its own completely separate purpose (for example, one thread is dedicated to handling all sound-related API calls to the OS).
I downloaded the Microsoft performance tools; there is a blog by an ex-Valve employee that explains that they can be used for this. Although I even managed to make some profiles and whatnot, I don't really understand what I am seeing; it is only a bunch of pretty graphs to me (except the CPU use graph, which I already knew from doing sample-based profiling on single-threaded apps). So, how do I find why the program is waiting on something? Or how do I find what it is waiting for? How do I find which thread is blocking the others?
I look at it as an alternation between two things:
a) measuring overall time, for which all you need is some kind of timer, and
b) finding speedups, which does not mean measuring, in spite of what a lot of people have been told.
Each time you find a speedup, you time the results and do it again.
That's the alternation.
To find speedups, the method I and many people use is random pausing.
The idea is, you get the program running under a debugger and manually interrupt it, several times.
Each time, you examine the state of every thread, including the call stack.
It is very crude, and it is very effective.
The reason this works is that the only way the program can go faster is if it is doing an activity that you can remove, and if that saves a certain fraction of time, you are at least that likely to see it on every pause.
This works whether it is doing I/O, waiting for something, or computing.
It sees things that profilers do not expose, because they make summaries from which speedups can easily hide.
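The answer above does this by hand in a debugger. As a rough in-process analogue, here is a Python sketch that takes one "pause" by capturing the call stack of every live thread. It assumes CPython, where `sys._current_frames()` is available; `snapshot_stacks`, `wait_for_signal` and the thread name are hypothetical names chosen for the example.

```python
import sys
import threading
import time
import traceback


def snapshot_stacks():
    """Return {thread name: formatted call stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    return {
        names.get(ident, str(ident)): "".join(traceback.format_stack(frame))
        for ident, frame in sys._current_frames().items()
    }


def wait_for_signal(event):
    event.wait()  # stands in for a thread that is blocked on something


stop = threading.Event()
blocked = threading.Thread(target=wait_for_signal, args=(stop,), name="blocked-worker")
blocked.start()
time.sleep(0.1)  # give the worker time to reach its wait

# One "pause": look at what every thread is doing right now.
stacks = snapshot_stacks()
for name, stack in stacks.items():
    print(f"--- {name} ---\n{stack}")

stop.set()
blocked.join()
```

Taking several such snapshots at random moments and reading them by eye is the point; a stack that keeps showing up in the same wait or the same call is the candidate for a speedup.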
The Performance Wizard in the Visual Studio Performance and Diagnostics Hub has a "Resource contention data" profiling mode, which lets you analyze contention among threads, i.e. how the overall performance of a program is impacted by threads waiting on other threads. Please refer to this blog post for more details.
PerfView is an extremely powerful profiling tool that lets you analyze the impact of service threads and tasks on the overall performance of the program. A PerfView Tutorial is available.
I'm designing an application for an asphalt batch mix plant, using a thread to run the mixing process, several timers to read system states and perform control actions.
If the "Hyper-Threading" feature is disabled, the application runs smoothly and everything is OK; otherwise it brings up a dialog complaining that a memory access is invalid, and aborts immediately after I click "OK".
I don't know why. Maybe something is wrong with the IDE version, since Delphi 5 was released on 10th August 1999; maybe the thread unit in Delphi 5.0 cannot deal with new CPU technology?
Maybe memory management has some bugs, or maybe the threading model is not suitable for the new era?
I want to upgrade the IDE, but since many years have passed, I have no idea which would be the best choice:
Delphi 7? Delphi 2007 (which supports OmniThreadLibrary)? RAD Studio XE6/7? I hope someone can help.
The most plausible explanation is that your program has a bug related to threading. You happen to get away with the flaw in your code when hyperthreading is disabled, but enabling it is sufficient to make the error in your code manifest.
Threading bugs are just like this. They will manifest if threads execute specific code in a particular order, with respect to the other threads. And the relative ordering is unpredictable. Which is part and parcel of parallel computation. Code that is broken can appear to be correct when running under one environment, but then fail under another. Whilst it is tempting to blame the tools, always check in the mirror first.
Changing development environment is not the solution. What you need to do is to find and then fix the error in your code. Getting a good stack trace will help, and I can recommend a tool like madExcept for that.
How can I get stack traces across all threads of an already running process, on Linux x64, in a way that is as non-invasive and low-impact as possible?
Things I've thought of till now:
gdb - I'm afraid it would slow the process too much, and for too long;
strace+ - no idea what performance it has, any experience anybody? still, IIUC, it traces only syscalls, and I can't even expect each thread enters a syscall, specifically some threads may be already hanging;
force crash & get a coredump - yeah... if I could do that easily, I would probably already be busy debugging... please, let's assume there's no elephant in the room, for the purpose of this question, ok?... pretty please...
There's a gcore utility that comes with gdb. You don't need to force a crash to get a core dump.
That's exactly what pstack does. See http://www.linuxcommand.org/man_pages/pstack1.html
What are some tips for debugging hard to reproduce concurrency bugs that only happen, say, once every thousand runs of a test? I have one of these and I have no idea how to go about debugging it. I can't put print statements or debugger watches all over the place to observe internal state, because that would change timings and produce overwhelming amounts of information when the bug is not successfully reproduced.
Here is my technique: I generally use a lot of assert() calls to check data consistency/validity as often as possible. When an assert fails, the program crashes, generating a core file. Then I use a debugger with the core file to understand what thread configuration led to the data corruption.
This might not help you but will probably help someone seeing this question in the future.
If you're using a .NET language you can use the CHESS project from Microsoft Research. It runs unit tests with every kind of thread interleaving and shows you which ones cause the bug to happen.
There may be a similar tool for the language you're using.
It highly depends on the nature of the problem. Commonly useful are bisection (to narrow down the search space) plus code "instrumentation" with assertions for accessing thread IDs, lock/unlock counts, locking order, etc., in the hope that the next time the problem reproduces, the application will either log a verbose message or core-dump, giving you the solution.
One method for finding data corruption caused by a concurrency bug:
Add an atomic counter for that data or buffer.
Leave all the existing synchronization code as is - don't modify it, on the assumption that you're going to fix the bug in the existing code, whereas the new atomic counter will be removed once the bug is fixed.
When starting to modify the data, increment the atomic counter. When finished, decrement.
Core dump as soon as you find that the counter is greater than one (using something similar to InterlockedIncrement)
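The steps above can be sketched in Python as follows. Python has no direct equivalent of InterlockedIncrement, so this sketch assumes a small private lock to make the counter update atomic; `OverlapDetector`, `modify_shared_data` and the barrier are hypothetical names, and the barrier exists only to force the two threads to overlap so the demo is repeatable. Instead of dumping core, the sketch records the violation.

```python
import threading


class OverlapDetector:
    """Counts threads currently inside a region that should be exclusive."""

    def __init__(self):
        self._guard = threading.Lock()  # makes the counter update atomic
        self._inside = 0
        self.violations = 0

    def enter(self):
        with self._guard:
            self._inside += 1
            if self._inside > 1:
                # Two writers at once: a real build might call os.abort() here
                # to get a core dump at the moment of the overlap.
                self.violations += 1

    def leave(self):
        with self._guard:
            self._inside -= 1


detector = OverlapDetector()
both_inside = threading.Barrier(2)


def modify_shared_data():
    detector.enter()    # "when starting to modify the data, increment"
    both_inside.wait()  # demo only: stands in for the buggy, unprotected update
    detector.leave()    # "when finished, decrement"


threads = [threading.Thread(target=modify_shared_data) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("overlapping modifications seen:", detector.violations)
```

Because the detector uses its own guard and leaves the existing synchronization untouched, it can be deleted once the real bug is fixed.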
Targeted unit test code is time-consuming but effective, in my experience.
Narrow down the failing code as much as you can. Write test code that's specific to the apparent culprit code and run it in a debugger for as long as it takes to reproduce the problem.
One of the strategies I use is to simulate interleaving of the threads by introducing spin waits. The caveat is that you should not use the standard spin wait mechanisms for your platform, because they will likely introduce memory barriers. If the issue you are trying to troubleshoot is caused by a missing memory barrier (it is difficult to get the barriers correct when using lock-free strategies), then the standard spin wait mechanisms will just mask the problem. Instead, place an empty loop at the points where you want your code to stall for a moment. This can increase the probability of reproducing a concurrency bug, but it is not a magic bullet.
If the bug is a deadlock, simply attaching a debugging tool (like gdb or strace) to the program after the deadlock happens, and observing where each thread is stuck, can often get you enough information to track down the source of the error quickly.
A little chart I've made with some techniques to keep in mind when debugging multithreaded code. The chart is growing, so please leave comments and tips to be added. http://adec.altervista.org/blog/multithreading-debugging-chart/
Leading on from the answer to another question about bugs in the Delphi IDE, does anyone know if there is a way to improve the multi-threaded debugging functionality of the IDE, or if not, at least why it is so bad on occasion?
When you have multiple threads within a program, stepping through the code with F7 or F8 can often lead to either very long pauses or the whole IDE locking up. This is especially apparent when you leave or enter a method or procedure. The debugger always seems to be fine for single-threaded applications.
PS. The version I'm using is 2007
From my experience multi threaded debugging is much nicer using Vista and Delphi 2009 than XP with Delphi 2007.
First, the IDE is significantly more stable.
Second, in Delphi 2009 on Vista the debugger can show you where deadlocks are occurring.
If you have to use Delphi 2007, I would strongly recommend debugging your code in a single threaded unit test if possible, then using your by now tested code in the main program. ;)
When the application itself has not deadlocked, try to be very aware of which thread you're in. Keep the thread list up in the debugger, and consider using named threads.
There are times when it will be impossible to interactively debug an application which itself deadlocks. When this happens, you can use tools such as WinDbg and Adplus in order to work with memory dumps. Yes, this is a lot harder than using the interactive debugger, but it beats having no debugger at all. There are sample applications, demos, and instructions, on Tess Ferrandez's blog. I would start with this page. The labs are .NET-centric, but don't let that keep you away; the ideas are the same.
When I want to debug a multithreaded operation I often use a log file (that I analyse after the application has run) instead of the interactive Debugger.
For example, with the function 'OutputDebugString'. The output appears in the Delphi event log. If you start your program outside of Delphi, you can use DebugView from SysInternals to display the log. Take care to add the thread ID to each output (GetCurrentThreadID).
Be aware that there could be a thread switch just before writing to the log. But at places where several threads interact you will probably have a critical section (or another synchronization object), so that should not be a problem.
Yes, debugging a multithreaded application is a hassle, because you are constantly swapping from one thread to another.
Another idea that I have never tried, because I've only just thought of it: if you are interested in debugging one thread and just want to avoid being disturbed by the other threads, it might be possible to suspend some threads temporarily.
Process Explorer from SysInternals offers a possibility to suspend and resume threads (in the tab called "Threads" in the properties of a process). But as I said I've never tested it until now.