Tips for debugging hard-to-reproduce concurrency bugs? - multithreading

What are some tips for debugging hard-to-reproduce concurrency bugs that only happen, say, once every thousand runs of a test? I have one of these and I have no idea how to go about debugging it. I can't put print statements or debugger watches all over the place to observe internal state, because that would change the timings, and it would produce overwhelming amounts of information on all the runs where the bug does not reproduce.

Here is my technique: I generally use a lot of assert() calls to check data consistency/validity as often as possible. When an assert fails, the program crashes and generates a core file. Then I use a debugger with the core file to understand what thread configuration led to the data corruption.
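A minimal sketch of the idea in C++ (the SharedList type and its invariant are hypothetical stand-ins for your own data):

#include <cassert>

// Hypothetical shared structure with an invariant: 'count' must equal
// the number of nodes actually linked from 'head'.
struct Node { Node* next; };
struct SharedList { Node* head; int count; };

// Cheap consistency check, sprinkled wherever the invariant should hold.
// A failed assert() aborts the process and (with core dumps enabled)
// leaves a core file, so a debugger can show every thread's state at the
// exact moment the corruption was detected.
void check_invariant(const SharedList& list) {
    int n = 0;
    for (Node* p = list.head; p != nullptr; p = p->next) ++n;
    assert(n == list.count && "list corrupted by concurrent access");
}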

This might not help you but will probably help someone seeing this question in the future.
If you're using a .NET language, you can use the CHESS project from Microsoft Research. It runs unit tests under every kind of thread interleaving and shows you which ones cause the bug to happen.
There may be a similar tool for the language you're using.

It depends highly on the nature of the problem. Commonly useful are bisection (to narrow down the search space) plus "instrumenting" the code with assertions that check thread IDs, lock/unlock counts, locking order, and so on, in the hope that the next time the problem reproduces, the application will either log a verbose message or core-dump, handing you the solution.
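As a sketch of what such instrumentation can look like in C++ (CheckedMutex is a hypothetical debugging wrapper; adapt to your own lock types):

#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical "checked" mutex: it records the owning thread's id so an
// assert can fire (and dump core) on a recursive lock or an unlock by a
// non-owner - two common signs of broken locking discipline.
class CheckedMutex {
    std::mutex m_;
    std::atomic<std::thread::id> owner_{};   // default id = "no owner"
public:
    void lock() {
        assert(owner_.load() != std::this_thread::get_id()
               && "recursive lock attempt");
        m_.lock();
        owner_.store(std::this_thread::get_id());
    }
    void unlock() {
        assert(owner_.load() == std::this_thread::get_id()
               && "unlocked by a thread that does not hold the lock");
        owner_.store(std::thread::id{});
        m_.unlock();
    }
};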

One method for finding data corruption caused by a concurrency bug:
1) Add an atomic counter for the suspect data or buffer.
2) Leave all the existing synchronization code as it is - don't modify it, since you're going to fix the bug in the existing code, whereas the new atomic counter will be removed once the bug is fixed.
3) When starting to modify the data, increment the atomic counter. When finished, decrement it.
4) Core-dump as soon as you see that the counter is greater than one (using something similar to InterlockedIncrement, so the increment-and-check is atomic).
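A sketch of the counter in C++, using std::atomic as the portable stand-in for InterlockedIncrement (modify_shared_data is a hypothetical writer function):

#include <atomic>
#include <cstdlib>

// Debug-only counter attached to the suspect buffer (removed once the
// bug is fixed); the existing synchronization stays untouched.
std::atomic<int> writers{0};

void modify_shared_data() {
    // fetch_add returns the previous value: if it was already nonzero,
    // another thread is inside this supposedly exclusive section.
    if (writers.fetch_add(1) != 0)
        std::abort();                  // core-dump with both threads live
    // ... mutate the shared data exactly as before ...
    writers.fetch_sub(1);
}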

Targeted unit-test code is time-consuming but effective, in my experience.
Narrow down the failing code as much as you can. Write test code specific to the apparent culprit and run it in a debugger for as long as it takes to reproduce the problem.
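A minimal harness along those lines (run_suspect_code is a hypothetical placeholder for your narrowed-down test):

#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the targeted test; returns false when the
// bug manifests. Replace the body with your actual culprit code.
bool run_suspect_code() { return true; }

int main() {
    // Hammer the suspect code until it fails; run this under a debugger
    // (or with core dumps enabled) so the failing run is caught in the act.
    for (long i = 0; ; ++i) {
        if (!run_suspect_code()) {
            std::fprintf(stderr, "reproduced on iteration %ld\n", i);
            std::abort();              // leave a core file to inspect
        }
    }
}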

One strategy I use to simulate interleaving of the threads is introducing spin waits. The caveat is that you should not use the standard spin wait mechanisms for your platform, because they will likely introduce memory barriers. If the issue you are trying to troubleshoot is caused by a missing memory barrier (because it is difficult to get the barriers correct when using lock-free strategies), then the standard spin wait mechanisms will just mask the problem. Instead, place an empty loop at the points where you want your code to stall for a moment. This can increase the probability of reproducing a concurrency bug, but it is not a magic bullet.
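One way to write such a stall, as a sketch:

// A deliberately "dumb" stall: no library sleep/yield calls and no
// atomics, so no memory barrier is introduced that could mask the bug.
// The volatile counter stops the compiler from deleting the loop.
inline void debug_stall(long iterations) {
    for (volatile long i = 0; i < iterations; ++i) {
        // empty - exists only to perturb thread timing
    }
}

// usage: call debug_stall(100000); between the suspect operations and
// vary the count until the failure rate goes up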

If the bug is a deadlock, simply attaching a debugging tool (like gdb or strace) to the program after the deadlock happens and observing where each thread is stuck (in gdb, thread apply all bt prints every thread's stack) can often give you enough information to track down the source of the error quickly.

Here is a little chart I've made with some debugging techniques to keep in mind when debugging multithreaded code. The chart is growing; please leave comments and tips to be added. http://adec.altervista.org/blog/multithreading-debugging-chart/

Related

Using D for a realtime application?

I am considering using D for my ongoing graphics engine. The one thing that puts me off is the GC.
I am still a young programmer and I probably have a lot of misconceptions about GCs, and I hope you can clarify some concerns.
I am aiming for low latency, and timing in general is crucial. From what I know, GCs are pretty unpredictable; for example, my application could render a frame every 16.6ms, and when the GC kicks in a frame could take any amount of time, say 30ms, because collection is not deterministic. Right?
I read that you can turn off the GC in D, but then you can't use the majority of D's standard library, and the GC is not completely off. Is this true?
Do you think it makes sense to use D in a timing-crucial application?
Short answer: it requires a lot of customization and can be really difficult if you are not an experienced D developer.
List of issues:
Memory management itself is not that big a problem. In real-time applications you never ever want to allocate memory in the main loop. Having pre-allocated memory pools for all the main data is pretty much the de facto standard way to write such applications. In that sense, D is no different - you still call C malloc directly to get some heap for your pools, and this memory won't be managed by the GC; it won't even know about it.
However, certain language features and large parts of Phobos do use the GC automagically. For example, you can't really concatenate slices without some form of automatically managed allocation. And Phobos simply has not had a strong policy about this for quite a long time.
A few language-triggered allocations won't be a problem on their own, as most memory used is managed via pools anyway. However, there is a killer issue for real-time software in stock D: the default D garbage collector is stop-the-world. Even if there is almost no garbage, your whole program will hit a latency spike when a collection cycle runs, as all threads get blocked.
What can be done:
1) Use GC.disable() to switch off collection cycles. This solves the stop-the-world issue, but now your program will start to leak memory in some cases, as GC-based allocations still happen.
2) Track down hidden GC allocations. There was a pull request for a -vgc switch which I can't find right now, but in its absence you can compile your own druntime version that prints a backtrace upon each gc_malloc() call. You may want to run this as part of an automated test suite.
3) Avoid Phobos entirely and use something like https://bitbucket.org/timosi/minlibd as an alternative.
Doing all this should be enough to meet the soft real-time requirements typical of game dev, but as you can see, it is not simple at all and requires stepping outside the stock D distribution.
Future alternative:
Once Leandro Lucarella ports his concurrent garbage collector to D2 (which is planned, but not scheduled), the situation will become much simpler. A small amount of GC-managed memory plus a concurrent implementation will make it possible to meet soft real-time requirements even without disabling the GC. Even Phobos can be used once it is stripped of the most annoying allocations. But I don't think that will happen any time soon.
But what about hard real-time?
You'd better not even try. But that is yet another story to tell.
If you do not like the GC - disable it.
Here is how:
import core.memory;
void main(string[] args) {
    GC.disable; // turn off collection cycles; GC allocations still work
    // your code here
}
Naturally, you will then have to do the memory management yourself. It is doable, and there are several articles about it. It has been discussed here too; I just do not remember the thread.
dlang.org also has useful information about this. This article, http://dlang.org/memory.html , touches on real-time programming and you should read it.
Yet another good article: http://3d.benjamin-thaut.de/?p=20 .

Runtime integrity check of executed files

I just finished writing a Linux security module which verifies the integrity of executable files at the start of their execution (using digital signatures). Now I want to dig a little bit deeper and check the files' integrity at run-time (i.e. periodically check them - since I am mostly dealing with processes that get started and run forever...), so that an attacker is not able to change the file within main memory without being identified (at least after some time).
The problem here is that I have absolutely no clue how to check the file's current memory image. My authentication method mentioned above makes use of an mmap hook which gets called whenever a file is mmaped before its execution, but as far as I know the LSM framework does not provide tools for periodic checks.
So my question: are there any hints on how I should start this? How can I read a memory image and check its integrity?
Thank you
I understand what you're trying to do, but I'm really worried that this may be a security feature that gives you a warm fuzzy feeling for no good reason; and those are the most dangerous kinds of security features to have. (Another example of this might be the LSM sitting right next to yours, SELinux. Although I think I'm in the minority on this opinion...)
The program data of a process is not the only thing that affects its behavior. Stack overflows, where malicious code is written into the stack and jumped into, make integrity checking of the original program text moot. Not to mention the fact that an attacker can use the original unchanged program text to his advantage.
Also, there are probably some performance issues you'll run into if you are constantly computing DSA signatures inside the kernel. And you're adding that much more to the long list of privileged kernel code that could possibly be exploited later on.
In any case, to address the question: you could possibly write a kernel module that starts a kernel thread which, on a timer, walks through each process and checks its integrity. This can be done by using the page tables for each process, mapping in the read-only pages, and integrity-checking them. This may not work, though, as each memory page would probably need its own signature, unless you concatenate them all together somehow.
A good thing to note is that shared libraries only need to be integrity-checked once per sweep, since they are mapped into all the processes that use them. It takes sophistication to implement this, though, so maybe put it under the "nice-to-have" section of your design.
If you disagree with my rationale that this may not be a good idea, I'd be very interested in your thoughts. I ran into this idea at work a while ago, and it would be nice to bring fresh ideas to our discussion.

Threading run time without adding extra lines in program

Is there any threading library that can parse through code, find blocks that can be threaded, and add the required threading instructions accordingly?
Also, I want to compare the performance of a multithreaded program against its single-threaded version. For this I would need to monitor CPU usage (how much each processor is being used). Is there any tool available to do this?
I'd say the decision whether or not a given block of code can be rewritten to be multi-threaded is way too hard for an automated process to make. To make matters worse, multi-threaded code typically accesses resources outside its own scope, such as pulling data over the network, loading large files, waiting for events, executing database queries, etc.; without detailed information about all these external factors, it is impossible to decide where to go multithreaded, simply because not all the required information is in the code.
Also, a lot of code that is multi-threadable in theory will not run faster when multi-threaded, and may in fact slow down.
Some compilers (such as recent versions of the Intel compiler and gcc) can automatically parallelize simple loops, but anything beyond that is too complex. On the other hand, there are task libraries that use thread pools, automatically scale the number of threads to the available processors, and divide the work between them. Of course, using such a library will require rewriting your code, along the lines of the sketch below.
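To make that rewrite concrete, here is a minimal sketch of manually dividing work between threads with plain std::thread (no particular task library; the function and names are illustrative):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split an independent per-element computation across the available
// cores. Nothing here is discovered automatically: the programmer
// decided this loop is safe to parallelize.
void parallel_square(std::vector<double>& data) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;                         // fallback if unknown
    std::size_t chunk = (data.size() + n - 1) / n;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, data.size());
        if (begin >= end) break;
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= data[i];            // no shared writes
        });
    }
    for (auto& w : workers) w.join();          // wait for all chunks
}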
Structuring your application to make best use of multithreading is not a simple matter, and requires careful thought about which parts of your application can best make use of it. This is not something that can be automated.
Consider multi-threading as an approach to make full use of the available resources. That is when it works best. Now consider an application which has multiple modules/areas that are multi-threadable. If all of them are made multi-threaded, the available resources might go down substantially, which could at times be detrimental to the application itself. Thus, multi-threading has to be used very carefully.
As Chris mentioned, there are a lot of profilers that do this for a given combination of OS and language.
The first thing you need to do is profile your code in a single thread and see if the areas you think are good candidates for multithreading are actually a problem. It's easy to waste a lot of time multithreading working code only to end up with a buggy mess that's slower than the original implementation if you don't carefully consider the problem first.

Debugging deadlocking threads in a MT program?

What are the possible ways to debug deadlocking threads in an MT program, other than gdb?
On some platforms deadlock detection tools may help you find already observed and not yet observed deadlocks, as well as other bugs.
On Solaris, try LockLint.
On Linux, try Helgrind or DRD.
If you're using POSIX, try investigating PTHREAD_MUTEX_ERRORCHECK.
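A small demonstration of that POSIX attribute (standard pthreads API; with an error-checking mutex, a self-deadlock is reported as an error code instead of hanging):

#include <cstdio>
#include <pthread.h>

int main() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);
    int rc = pthread_mutex_lock(&m);   // relock by the same thread:
                                       // returns EDEADLK, does not hang
    std::printf("second lock returned %d (EDEADLK)\n", rc);

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}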
I've always invested some time into writing or grafting a flexible logging facility onto the projects I've worked on, and it has always paid off handsomely in turning difficult bugs into easy ones. At the very least, wrapping the locking primitives in functions or methods that log before and after locking, and that show the object being locked and the thread doing the locking, has always helped me zero in on the offending thread in a matter of minutes - assuming that the problem can be reproduced at all, of course.
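A sketch of such a wrapper in C++ (names are illustrative; a real project would route this through its own logging facility):

#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>

// Every lock and unlock records which thread touched which lock, so
// after a hang the log shows exactly who held what, in what order.
class LoggedMutex {
    std::mutex m_;
    const char* name_;         // human-readable lock name for the log
    static std::size_t tid() { // printable id of the current thread
        return std::hash<std::thread::id>{}(std::this_thread::get_id());
    }
public:
    explicit LoggedMutex(const char* name) : name_(name) {}
    void lock() {
        std::printf("[thread %zu] waiting for %s\n", tid(), name_);
        m_.lock();
        std::printf("[thread %zu] acquired %s\n", tid(), name_);
    }
    void unlock() {
        std::printf("[thread %zu] releasing %s\n", tid(), name_);
        m_.unlock();
    }
};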
Loading the program under a debugger is actually a pretty limited way of determining what happened once a process deadlocks, since all it can give you is a snapshot of how badly you messed up, rather than a step-by-step account of how you screwed up, which I find a lot more helpful.
Or get the Intel Thread Checker. Fine work.

Is firing off a Thread a valid answer to simplifying code?

As multi-processor and multi-core computers become more and more ubiquitous, is simply firing off a new thread a (relatively) simple and painless way of simplifying code? For instance, in a current personal project, I have a network server listening on a port. Since this is just a personal project, it's just a desktop app, with a GUI integrated into it for configuration. So, the app reads something like this:
Main()
    Read configuration
    Start listener thread
    Run GUI

Listener Thread
    While the app is running
        Wait for a new connection
        Run a client thread for the new connection

Client Thread
    Write synchronously
    Read synchronously
    ... ad infinitum, or till they disconnect
This approach means that while I have to worry about a lot of locking, with the potential issues that involves, I avoid a lot of the spaghetti code that comes with asynchronous calls, etc.
A slightly more insidious version of this came up today when I was working on the startup code. Startup was quick, but it used lazy loading for a lot of the configuration, which meant that while startup was fast, actually connecting to and using the service was painful because of the lag while different sections loaded (this was actually measurable in real time, up to 3-10 seconds sometimes). So I moved to a different strategy: on startup, loop through everything and force the lazy loading to kick in... but this made startup prohibitively slow; get-up-and-go-get-a-coffee slow. Final solution: throw the loop into a separate thread, with feedback in the system tray while it's still loading.
Is this "Meh, throw it in another thread, it'll be fine" attitude ok? At what point do you start getting diminishing returns and/or even reduced performance?
Multithreading does a lot of things, but I don't think "simplification" is ever one of them.
It's a great way to introduce bugs into code.
Using multiple threads properly is not easy. It should not be attempted by new developers.
In my opinion, multi-threaded programming is pretty high up on the difficulty (and complexity) scale, along with memory management. To me, the "Meh, throw it in another thread, it'll be fine" attitude is a bit too casual. Think long and hard you must, before forking threads you do.
No.
Plainly and simply, multithreading increases complexity and is a nearly trivial way to add bugs to code. There are concurrency issues such as synchronization, deadlock, race conditions, and priority inversion to name a few.
Secondly, the performance gains are not automatic. Recently, there was an excellent article in MSDN Magazine along these lines. The salient details are that a certain operation took 46 seconds per ten iterations coded as a single-threaded operation. The author parallelized the operation naively (one thread per core, on four cores) and the operation dropped to 30 seconds per ten iterations. Sounds great until you consider that the operation now eats 300% more processing power but only ran about 34% faster. It's not worth consuming all available processing power for a gain like that.
This gives you the extra job of debugging race conditions and handling locks and synchronisation issues.
I would not use this unless there was a real need.
Read up on Amdahl's law, best summarized by "The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program."
As it turns out, if only a small part of your app can run in parallel, you won't get much of a gain, but you can pick up many hard-to-debug bugs.
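To put numbers on that (a worked example, not from the thread): Amdahl's law gives

    speedup = 1 / ((1 - P) + P / N)

where P is the fraction of the program that can run in parallel and N is the number of processors. If only half of the program parallelizes (P = 0.5), then four cores give 1 / (0.5 + 0.5/4) = 1.6x, and no number of cores can ever push it past 2x.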
I don't mean to be flip, but what's in that configuration file that takes so long to load? That's the origin of your problem, right?
Before spawning another thread to handle it, perhaps it can be pared down? Reduced, or perhaps put in another data format that would be quicker to parse, etc.?
How often does it change? Is it something you can parse once at the beginning of the day and put the variables in shared memory so subsequent runs of your main program can just attach and get the needed values from there?
While I agree with everyone else here in saying that multithreading does not simplify code, it can be used to greatly simplify the user experience of your application.
Consider an application that has a lot of interactive widgets (I am currently developing one where this helps) - in the workflow of my application, a user can "build" the current project they are working on. This requires disabling the interactive widgets my application presents to the user and presenting a dialog with an indeterminate progress bar and a friendly "please wait" message.
The "build" occurs on a background thread; if it were to happen on the UI thread it would make the user experience less enjoyable - after all, it's no fun not being able to tell whether or not you are able to click on a widget in an application while a background task is running (cough, Visual Studio). Not to say that VS doesn't use background threads, I'm just saying their user experience could use some improvement. But I digress.
The one thing I take issue with in the title of your post is the idea of firing off a thread whenever you need to perform a task - I generally prefer to reuse threads; in .NET, I favor using the system thread pool over creating a new thread each time I want to do something, for the sake of performance.
I'm going to provide some balance against the unanimous "no".
DISCLAIMER: Yes, threads are complicated and can cause a whole bunch of problems. Everyone else has pointed this out.
From experience, a sequence of blocking reads/writes to a socket (which requires a separate thread) is much simpler than non-blocking ones. With blocking calls, you can tell the state of the connection just by looking at where you are in the function. With non-blocking calls, you need a bunch of variables to record the state of the connection, and you must check and modify them every time you interact with the connection. With blocking calls, you can just say "read the next X bytes" or "read until you find X" and it will actually do it (or fail). With non-blocking calls, you have to deal with fragmented data, which usually means keeping temporary buffers and filling them as necessary, and checking whether you've received enough data every time a little more arrives. Plus you have to keep a list of open connections and handle unexpected closes for all of them.
It doesn't get much simpler than this:
void WorkerThreadMain(Connection connection) {
    // Blocking read: the thread simply waits here until a full
    // request arrives or the connection fails.
    Request request = ReadRequest(connection);
    if (!request) return;

    Reply reply = ProcessRequest(request);

    // The client may have hung up while we were working.
    if (!connection.isOpen) return;

    SendReply(reply, connection);
    connection.close();
}
I'd like to note that this "listener spawns off a worker thread per connection" pattern is how many web servers are designed, and I assume it's how a lot of request/response sorts of server applications are designed.
So in conclusion, I have experienced the asynchronous socket spaghetti code you mentioned, and spawning off worker threads for every connection ended up being a good solution. Having said all this, throwing threads at a problem should usually be your last resort.
I think you have no choice but to deal with threads, especially with networking and concurrent connections. Do threads make code simpler? I don't think so. But without them, how would you program a server that can handle more than one client at the same time?
