How can I get stack traces across all threads of an already running process on Linux x64, in a way that is as minimally invasive and disruptive as possible?
Things I've thought of so far:
gdb - I'm afraid it would slow the process down too much, and for too long;
strace+ - no idea what performance impact it has; any experience, anybody? Still, IIUC, it traces only syscalls, and I can't even expect each thread to enter a syscall; in particular, some threads may already be hanging;
force a crash & get a coredump - yeah... if I could do that easily, I would probably already be busy debugging... please, let's assume there's no elephant in the room, for the purpose of this question, ok?... pretty please...
There's a gcore utility that comes with gdb, so you don't need to force a crash to get a core dump: gcore <pid> briefly pauses the process and writes out a core.<pid> file, which you can then load into gdb and inspect with thread apply all where.
That's exactly what pstack does: it attaches to the process just long enough to print a stack trace of every thread (pstack <pid>). See http://www.linuxcommand.org/man_pages/pstack1.html
I have a process running on Linux which creates a lot of pthreads (each of them has its own purpose). Let's say that for some reason one of the threads crashes. Sometimes the crash might be caused by some other thread, and it would be good to know which threads were running before the crashed one.
So the question is:
Is there a way to ask the Linux scheduler which threads were scheduled last?
Any help is really appreciated.
Thanks.
Maybe you are aware of the Linux "top" command, which can show you all the threads opened by your process:
top -H -p <pid of your process>
It may help you identify how many threads are running, and which ones are stopped or crashed. The sketch below shows how to pull the same per-thread information out of /proc yourself.
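For completeness, here is a minimal sketch of what top -H reads under the hood, assuming a Linux /proc filesystem: it lists a process's threads from /proc/<pid>/task and prints each thread's scheduler state.

    /* List the threads of a process via /proc/<pid>/task and show each
     * thread's state: R running, S sleeping, D uninterruptible,
     * T stopped, Z zombie. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        char task_dir[64];
        snprintf(task_dir, sizeof task_dir, "/proc/%s/task", argv[1]);

        DIR *dir = opendir(task_dir);
        if (!dir) { perror(task_dir); return 1; }

        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;                 /* skip "." and ".." */

            char stat_path[128], comm[64];
            int tid;
            char state;
            snprintf(stat_path, sizeof stat_path, "%s/%s/stat",
                     task_dir, de->d_name);

            FILE *f = fopen(stat_path, "r");
            if (!f)
                continue;                 /* the thread may have exited */
            /* stat starts with: <tid> (<comm>) <state> ...
             * (naive parse; a comm containing spaces would break it) */
            if (fscanf(f, "%d %63s %c", &tid, comm, &state) == 3)
                printf("tid %d  %-18s state %c\n", tid, comm, state);
            fclose(f);
        }
        closedir(dir);
        return 0;
    }

Run it with the process ID as its argument; D or T states are often the interesting ones when something looks hung.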
You would have to make changes in the kernel code to gather scheduling data at each context switch and keep writing it to some place in memory; it is somewhat similar to the flight-recorder functionality available in the PNE kernel.
Fixed:
Well, this seems a bit silly. It turns out top was not displaying correctly, and the programs actually continue to run. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine, and this whole question was moot.
Thanks (and sorry for the silly question).
Original Q:
I am running a simulation on a computer running Ubuntu Server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By "stall", I mean that the program no longer gets any CPU time, but it still holds all of its information in memory. To run these simulations, I SSH in, nohup the program, and pipe any output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read from or write to the hard drive until completion; the computation is done entirely in memory. The program has not been killed, as it still has a PID after it stalls. I am using OpenMP; I have raised the maximum number of processes, and the maximum CPU time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK Fortran library.
Any thoughts on what is causing this behavior or how to resume my currently stalled program?
Thanks
I assume from your tags that this is an OpenMP program, though you never actually state so. Is ARPACK thread-safe?
It sounds like you are hitting a deadlock (more common in MPI programs than in OpenMP, but definitely possible). The first thing to do is to compile with debugging flags on; then, the next time you hit the problem, attach a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and do thread apply all where.
If all your threads are blocked waiting for some mutex, you have a deadlock.
If they are waiting for something else (e.g. read), then you need to figure out what prevents the operation from completing.
Generally on UNIX you don't need to rebuild with debug flags to get a meaningful stack trace. You won't get file and line numbers, but they may not be necessary to diagnose the problem.
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with gdb program pid (which works well only when the program has been compiled with debugging enabled, i.e. with -g), or to use strace on it with strace -p pid. The strace command is a utility (technically, a specialized debugger built on top of the ptrace system call interface) which shows you all the system calls made by a program or process.
There is also a variant, called ltrace, that intercepts calls to functions in dynamic libraries.
To get a feeling for it, try for instance strace ls
Of course, strace won't help you much if the running program is not doing any system calls.
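To illustrate the ptrace mechanism that strace is built on, here is a minimal sketch (assuming x86-64 Linux, with most error handling trimmed) that attaches to a process, dumps its registers, and detaches. Real tracers do this per thread ID and decode each syscall:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atol(argv[1]);

        /* Attaching stops the target, just as strace -p does. */
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
            perror("PTRACE_ATTACH");
            return 1;
        }
        waitpid(pid, NULL, 0);  /* wait until the target is actually stopped */

        /* Read the (main) thread's registers; a multi-threaded tracer
         * would repeat this for every thread ID in /proc/<pid>/task. */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        printf("rip = %llx, rsp = %llx\n",
               (unsigned long long)regs.rip, (unsigned long long)regs.rsp);

        /* Let the target run again. */
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;
    }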
Regards.
Basile Starynkevitch
I'm working on Linux and I'm using a pthread_rwlock which is stored in shared memory and shared across multiple processes. This mostly works fine, but when I kill a process (SIGKILL) while it is holding a lock, it appears that the lock is still held (regardless of whether it's a read lock or a write lock).
Is there any way to recognize such a state, and possibly even repair it?
The real answer is to find a decent way to stop a process. Killing it with SIGKILL is not a decent way to do it.
This feature is specified for mutexes, where it is called robustness (PTHREAD_MUTEX_ROBUST), but not for rwlocks. The standard doesn't provide it, and kernel.org doesn't even have a page on rwlocks. So, like I said (a sketch of the mutex version follows the list):
Find another way to stop the process (perhaps another signal that can be handled?)
Release the lock when you exit
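For reference, a minimal sketch of the robust-mutex feature mentioned above, assuming glibc 2.12 or later; the mutex must live in the shared memory segment, and the function names here are illustrative:

    #include <errno.h>
    #include <pthread.h>

    /* Initialize a process-shared, robust mutex (mutexes only;
     * POSIX offers no equivalent for rwlocks). */
    int init_robust_mutex(pthread_mutex_t *m)  /* m must be in shared memory */
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        int rc = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return rc;
    }

    int lock_robust_mutex(pthread_mutex_t *m)
    {
        int rc = pthread_mutex_lock(m);
        if (rc == EOWNERDEAD) {
            /* The previous owner died (e.g. SIGKILL) while holding the
             * lock. Repair the protected data here, then mark the mutex
             * consistent so other waiters can use it again. */
            rc = pthread_mutex_consistent(m);
        }
        return rc;
    }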
@cnicutar - that "real answer" is pretty dubious. It's the kernel's job to handle cross-process responsibilities like freeing resources and making sure things are marked consistent; userspace can't effectively do that job when stuff goes wrong.
Granted, if everybody plays nice, the robust features will not be needed; but for a robust system you want to make sure the whole thing doesn't go down because of one buggy client process.
Can someone please tell me interview-type questions related to multithreading and GDB?
I already know about deadlocks, race conditions, synchronization, and the basics of threads.
Thanks in advance
Some sample questions:
How do you list out all the threads?
How do you set breakpoints in individual threads?
How do you see the stack trace of a particular thread?
Your program is in a deadlock; how do you find the root cause using gdb?
There is no end to such questions. I would suggest that the best way to learn is to get knee-deep in the dirt and play for yourself:
Make a sample multi-threaded program, debug it, and try to find out all possible information about all the threads.
Introduce a deadlock situation, and then debug it (a starter program is sketched after this list).
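If you want something to practice on, here is a minimal program (an illustrative sketch, not from the original posts) that reliably deadlocks by taking two locks in opposite orders:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        pthread_mutex_lock(&b);
        sleep(1);                /* give main() time to take `a` */
        pthread_mutex_lock(&a);  /* stuck: main() holds `a`, wants `b` */
        return arg;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_mutex_lock(&a);
        sleep(1);
        pthread_mutex_lock(&b);  /* stuck: worker holds `b`, wants `a` */
        pthread_join(t, NULL);
        return 0;
    }

Compile with gcc -g -pthread, attach gdb once it hangs, and practice info threads, thread <n>, and thread apply all bt.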
What are the possible ways to debug deadlocking threads in an MT program, other than gdb?
On some platforms, deadlock-detection tools may help you find both already-observed and not-yet-observed deadlocks, as well as other bugs.
On Solaris, try LockLint.
On Linux, try Helgrind or DRD.
If you're using POSIX, try investigating PTHREAD_MUTEX_ERRORCHECK (a sketch follows this list).
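A minimal sketch of what PTHREAD_MUTEX_ERRORCHECK buys you: a self-deadlock becomes an EDEADLK error return instead of a silent hang.

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t m;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&m, &attr);

        pthread_mutex_lock(&m);
        /* A normal mutex would hang forever here; an error-checking
         * mutex reports the relock attempt instead. */
        if (pthread_mutex_lock(&m) == EDEADLK)
            fprintf(stderr, "deadlock detected\n");

        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }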
I've always invested some time in writing, or grafting on, a flexible logging facility in the projects I've worked on, and it has always paid off handsomely by turning difficult bugs into easy ones. At the very least, wrapping the locking primitives in functions or methods that log before and after locking, and that show the object being locked and the thread doing the locking, has always helped me zero in on the offending thread in a matter of minutes - assuming the problem can be reproduced at all, of course. A sketch of such a wrapper is below.
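Such a wrapper can be as small as this (an illustrative sketch; the names are made up). In a deadlock's log, the last "locking" line without a matching "locked" line points at the guilty thread and mutex:

    #include <pthread.h>
    #include <stdio.h>

    static void log_lock_event(pthread_mutex_t *m, const char *what)
    {
        /* pthread_t is printed as a number here, which is fine on
         * Linux/glibc but not strictly portable. */
        fprintf(stderr, "[thread %lu] %s mutex %p\n",
                (unsigned long)pthread_self(), what, (void *)m);
    }

    void logged_mutex_lock(pthread_mutex_t *m)
    {
        log_lock_event(m, "locking");
        pthread_mutex_lock(m);
        log_lock_event(m, "locked");
    }

    void logged_mutex_unlock(pthread_mutex_t *m)
    {
        pthread_mutex_unlock(m);
        log_lock_event(m, "unlocked");
    }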
Running the program under a debugger is actually a pretty limited way of determining what happened once a process deadlocks, since all it can give you is a snapshot of how badly you messed up, rather than a step-by-step explanation of how you got there, which I find a lot more helpful.
Or get the Intel Thread Checker; it does fine work.