Why would gdb hang? - linux

I have an application that I am debugging, and I'm trying to understand how gdb works and why I am sometimes unable to step through the application. The problem is that, while I am stepping through the program, gdb hangs and the process it is attached to enters a defunct state. After gdb hangs, I have to kill it to free the terminal (Ctrl-C does not work; I have to do this from a different terminal window by getting the process ID of that gdb session and using kill -9).
I'm guessing that gdb is hanging because it's waiting for the application to stop at the next instruction and somehow the application finished execution without gdb identifying this. But that's just speculation on my part from the behavior I've observed thus far. So my question is if anyone has seen this type of behavior before and/or could suggest what the cause might be. I think that might help me improve my debugging strategy.
In case it matters I'm using g++ 4.4.3, gdb 7.1, running on Ubuntu 10.04 x86_64.

I had a similar problem and solved it by sending a CONT signal to the process being debugged.
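A minimal sketch of that workaround in Python, assuming the debugged process is merely stopped (state "T" in /proc) rather than dead. The helper names process_state and resume_if_stopped are hypothetical; from a shell, kill -CONT <pid> does the same thing.

```python
import os
import signal


def process_state(pid: int) -> str:
    """Return the one-letter state from /proc/<pid>/status (e.g. 'S', 'T', 'Z')."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("State:"):
                # The line looks like "State:\tT (stopped)".
                return line.split()[1]
    raise RuntimeError(f"no State line found for pid {pid}")


def resume_if_stopped(pid: int) -> bool:
    """Send SIGCONT when the process is job-control stopped; report whether we did."""
    if process_state(pid) == "T":
        os.kill(pid, signal.SIGCONT)
        return True
    return False
```

A process shown as "Z" (defunct) has already exited and cannot be resumed this way.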

I'd say the debugged process wouldn't sit idle if it were the cause of the hang. Every time GDB completes a step, it has to update any expressions you asked it to display. That may involve following pointers and so on, and in some cases it may fail there (although I don't recall a real "hang" from that). It also typically tries to update your stack trace. If the stack has been corrupted and is no longer coherent, GDB could get trapped in an endless loop. Attaching strace to gdb to see what kind of activity is going on during the hang could be a good way to get one step closer to figuring out the problem.
(e.g. accessing sources through a no-longer-working NFS/SSHFS mount is one of the most frequent reasons for gdb to hang, here :P)

Related

Can a Linux process/thread terminate without passing through do_exit()?

To verify the behavior of a third-party, binary-only piece of software I'd like to use, I'm implementing a kernel module whose objective is to keep track of each child this software creates and terminates.
The target binary is produced by Golang, and it is heavily multithreaded.
The kernel module I wrote installs hooks on the kernel functions _do_fork() and do_exit() to keep track of each process/thread this binary produces and terminates.
The LKM works, more or less.
Under some conditions, however, I see a scenario I'm not able to explain.
It seems like a process/thread could terminate without passing through do_exit().
The evidence I collected by putting printk() shows the process creation but does not indicate the process termination.
I'm aware that printk() can be slow, and I'm also aware that messages can be lost in such situations.
Trying to prevent message loss due to slow console (for this particular application, serial tty 115200 is used), I tried to implement a quicker console, and messages have been collected using netconsole.
The described setup seems to confirm that a process can terminate without passing through the do_exit() function.
But because I wasn't sure that my messages couldn't be lost in the printk() infrastructure, I decided to repeat the same test, replacing printk() with ftrace_printk(), which should be a leaner alternative to printk().
Still the same result: occasionally I see processes not passing through do_exit(), and when I check whether the PID is still running, I find that it is not.
Also note that I put my hook in the do_exit() kernel function as the first instruction to ensure the function flow does not terminate inside a called function.
My question is then the following:
Can a Linux process terminate without its flow passing through the do_exit() function?
If so, can someone give me a hint of what this scenario can be?
After a long debug session, I'm finally able to answer my own question.
That's not all; I'm also able to explain why I saw the strange behavior I described in my scenario.
Let's start from the beginning: I was monitoring a heavily multithreaded application. I observed rare cases where a PID suddenly stopped existing without my observing its flow pass through the Linux kernel's do_exit() function.
Hence my original question:
Can a Linux process terminate without passing through the do_exit() function?
To the best of my current knowledge, which I would by now consider reasonably extensive, a Linux process cannot end its execution without passing through the do_exit() function.
But this answer is in contrast with my observations, and the problem leading me to this question is still there.
Someone here suggested that the strange behavior I saw was because my observations were somehow wrong, implying that my method was inaccurate, and my conclusions with it.
My observations were correct, and the process I watched didn't pass through the do_exit() but terminated.
To explain this phenomenon, I want to put on the table another question that I think internet searchers may find somehow useful:
Can two processes share the same PID?
If you'd asked me this a month ago, I would surely have answered: "definitely no, two processes cannot share the same PID."
Linux is more complex, though.
There's a situation in which, in a Linux system, two different processes can share the same PID!
https://elixir.bootlin.com/linux/v4.19.20/source/fs/exec.c#L1141
Surprisingly, this behavior does not harm anyone; when this happens, one of these two processes is a zombie.
updated to correct an error
The circumstances of this duplicate PID are more intricate than those I described previously. When a thread of a multithreaded process calls execve() to execute a new text, the kernel must first flush the previous exec context: it calls the flush_old_exec() function, which in turn calls de_thread(). All the other threads of the process are eliminated as a result. If the thread calling execve() is not the task leader, it waits for the old leader to exit and then takes over the leader's PID. The old leader, by then a zombie, also keeps using that PID until it is released, for example because it still has to wait for an operation to complete.
end of the update
That was what I was watching: the PID I was monitoring did not pass through do_exit() because, when the corresponding thread terminated, it no longer had the PID it started with, but its leader's.
For people who know the Linux kernel's mechanics very well, this is nothing to be surprised about; this behavior is intended and hasn't changed since 2.6.17.
The current kernel, 5.10.3, still behaves this way.
I hope this is useful to internet searchers; I'd also like to add that it answers the following questions:
Question: Can a Linux process/thread terminate without passing through do_exit()? Answer: No, do_exit() is the only way a process can end its execution, whether intentionally or unintentionally.
Question: Can two processes share the same PID? Answer: Normally not. There are some rare cases in which two schedulable entities have the same PID.
Question: Does the Linux kernel have scenarios where a process changes its PID? Answer: Yes, there is at least one scenario where a process changes its PID.
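Part of the confusion is that what the kernel calls a task's PID is, seen from user space, a per-thread ID, while getpid() reports the ID of the thread group leader. The following sketch illustrates only that distinction and does not reproduce the execve() case. It assumes Python 3.8 or later on Linux, and collect_thread_ids is a hypothetical helper name.

```python
import os
import threading


def collect_thread_ids(n_threads: int = 3):
    """Return (process ID, kernel thread IDs of n_threads worker threads)."""
    ids = []
    lock = threading.Lock()
    # The barrier keeps every worker alive at the same moment, so the kernel
    # cannot hand a finished worker's ID to a later one.
    barrier = threading.Barrier(n_threads)

    def worker():
        barrier.wait()
        with lock:
            ids.append(threading.get_native_id())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return os.getpid(), ids
```

Every worker reports its own kernel ID, and none of them equals the value returned by os.getpid(), which belongs to the thread group leader. A hook keyed on the per-task ID therefore sees several distinct IDs for one user-space "process".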
Can a Linux process terminate without its flow passing through the do_exit() function?
Probably not, but you should study the source code of the Linux kernel to be sure. Ask on KernelNewbies. Kernel threads and udev or systemd related things (or perhaps modprobe or the older hotplug) are probable exceptions. When your /sbin/init of pid 1 terminates (that should not happen) strange things would happen.
The LKM works, more or less.
What does that mean? How could a kernel module half-work?
And in real life, it does sometimes happen that your Linux kernel panics or crashes (and that could happen with your LKM if it has not been peer-reviewed by the Linux kernel community). In such a case, there is no longer any notion of processes, since they are an abstraction provided by a living Linux kernel.
See also dmesg(1), strace(1), proc(5), syscalls(2), ptrace(2), clone(2), fork(2), execve(2), waitpid(2), elf(5), credentials(7), pthreads(7)
Look also inside the source code of your libc, e.g. GNU libc or musl-libc
Of course, see Linux From Scratch and Advanced Linux Programming
And verifying if the PID is currently running,
This can be done in user land with /proc/, or using kill(2) with a 0 signal (and maybe also pidfd_send_signal(2)...)
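As a rough illustration of those two user-land checks, here is a small Python sketch. The function names are hypothetical. Both checks also report a zombie as existing, because a zombie still owns its PID until it is reaped.

```python
import os


def pid_exists_via_kill(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, only the checks run."""
    if pid <= 0:
        # 0 and negative values address process groups in kill(2), not one PID.
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True   # EPERM: the process exists but belongs to someone else
    return True


def pid_exists_via_proc(pid: int) -> bool:
    """Check for the /proc/<pid> directory (Linux with procfs mounted)."""
    return os.path.isdir(f"/proc/{pid}")
```

Neither check says anything about which program owns the PID, so a recycled PID can produce a false positive.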
PS. I still don't understand why you need to write a kernel module or change the kernel code. My intuition would be to avoid doing that when possible.

gdb coredump - invoke function or continue execution

I have looked for a similar question to mine but the closest I found was Continue debugging after SegFault in GDB
My goal is to invoke a function in GDB from a coredump. I have a C++ type which has the operator<< defined and I would like to pretty print this type without having to write a pretty printer in python.
From what I have found so far it seems that GDB treats coredumps differently from regular inferiors which have active execution and hence cannot invoke functions (as is confirmed by this answer which states you can't use continue for a coredump).
Though I can see why that might be the case by default, could GDB be patched to load a coredump into memory and invoke it as a running process? If the program terminated due to a Seg. Fault then it is possible to manually alter the broken path after loading the core in GDB and potentially even continue executing the program after.
Though I can see why that might be the case by default, could GDB be patched to load a coredump into memory and invoke it as a running process?
In theory, this is possible. In practice, you'll find that this is a major undertaking, which will (likely) take you at least several weeks, if not several months.
I would like to pretty print this type without having to write a pretty printer in python.
I can assure you that writing a pretty-printer in Python is a lot less work.
If the program terminated due to a Seg. Fault then it is possible to manually alter the broken path after loading the core in GDB and potentially even continue executing the program after.
Not if the program uses any OS-level resources which are not captured in the core dump. All sockets, files, SysV IPCs, etc. etc. will be gone. This answer agrees.

What's the exact difference between gdb and actual OS environment for multiprocess?

I've been debugging a multi-process job that creates multiple threads at program initialization. I found that under gdb the threads are all set up successfully, but when I run the program directly in a Linux environment, it gets stuck after only some of the threads have been created. I suspect a scheduling problem between thread sleep and wakeup, but I haven't figured that out yet.
And although the threads are created successfully under gdb, the program quits with an unexpected segmentation fault in a glibc function after a thread kills itself:
res_thread_freeres () at res_init.c:642
642 if (_res.nscount == 0)
which is also weird, because I can check the value of _res.nscount and it definitely didn't overflow.
So.. Does anybody have a clue about the execution difference between an actual os and gdb debug environment? Thanks!
Update:
I've traced the problem to the pthreads being set to SCHED_FIFO; after I removed this, it works fine. But I still don't know why the program works fine in the gdb environment. In fact, the thread state of the program changes the moment it is attached to gdb.

Can the operating system restart a process that is stuck in infinite loop?

The other day, while testing on a Linux server, we observed that under some conditions one process could die and then start again. After checking the code, we found it was caused by an infinite loop.
This aroused my curiosity: how did the process die and then get started again? Is it the OS that detects the abnormal process and restarts it? If so, how does that work?
Let's assume you won't be able to fix your code... And let's ignore all crazy options like attaching gdb via script or so.
You can either check CPU usage (most accidental infinite loops that I've written used 100% of CPU for hours :) ), or (the more likely option) use strace to check what the software is doing right now and implement your own signature tracing (if those 20 APIs repeat 20 times, let's assume an infinite loop, or so).
For example:
#!/bin/bash
# strace writes its trace to stderr, so redirect it into the pipe
strace -p "$(cat your_app.pid)" 2>&1 | ./your_signature_evaluator
# Or
strace -p 12345 2>&1 | ./your_signature_evaluator
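What your_signature_evaluator could look like is left open above. Below is one possible sketch in Python, not a definitive design. It assumes strace's default one-line-per-call output, and the names syscall_name and find_repeating_tail, along with the thresholds of 20, are illustrative choices taken from the "20 APIs repeat 20 times" rule of thumb.

```python
import re

# Matches the syscall name at the start of a strace line, with an optional
# "[pid 1234] " prefix as printed when following several tasks.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def syscall_name(line):
    """Return the syscall name of one strace output line, or None."""
    match = SYSCALL_RE.match(line)
    return match.group(1) if match else None


def find_repeating_tail(calls, max_period=20, min_repeats=20):
    """Return the shortest period p (1..max_period) such that the last
    p * min_repeats calls are one block of p calls repeated; else None."""
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(calls) < needed:
            break  # longer periods need even more history
        tail = calls[-needed:]
        if all(tail[i] == tail[i % period] for i in range(needed)):
            return period
    return None
```

A driver would read lines from standard input, append each non-None syscall_name to a list, and raise an alarm once find_repeating_tail returns a period. A loop that makes no system calls at all produces no strace output, so this approach cannot see it.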
As for automatic system recognition... It seems normal for a program to crash after calling things in a loop uncontrollably (for example malloc() until memory is depleted, opening files...), but I've never seen the system (kernel) restart the app (correct me in a comment if I'm wrong). I think one of these applies:
the program itself contains logic (signal handling, whatever) that helps it recover
you're running a watchdog (check every 20 seconds that <pid> is running and, if not, start a new instance)
you're running a distribution that provides service/program configuration with restart-if-stopped
But I really doubt that Linux would be so nice to your application on its own.
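To make the watchdog option concrete, here is a minimal sketch of such a supervisor in Python. It is an assumption about what a watchdog might look like, not a description of any particular tool; supervise, check_interval and max_restarts are hypothetical names.

```python
import subprocess
import time


def supervise(argv, check_interval=20.0, max_restarts=3):
    """Run argv and start a new instance whenever the current one has exited.

    Returns the number of restarts performed once max_restarts is reached
    and the last instance has exited too.
    """
    restarts = 0
    proc = subprocess.Popen(argv)
    while True:
        if proc.poll() is not None:  # the child is no longer running
            if restarts >= max_restarts:
                return restarts
            proc = subprocess.Popen(argv)
            restarts += 1
        time.sleep(check_interval)
```

Seen from outside, a supervisor like this makes a crashing program appear to "die and start again". It reacts only to the process exiting, so a process that spins forever without dying is never restarted.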
If it could, the person who wrote that kernel would have solved the halting problem.
PS: Vytor - Web servers are in an infinite loop and do not use 100% CPU.

Program stalls during long runs

Fixed:
Well this seems a bit silly. Turns out top was not displaying correctly and programs actually continue to run. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine and this whole question was moot.
Thanks (and sorry for the silly question).
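One way to cross-check what top displays is to read the CPU counters straight from /proc. The sketch below assumes Linux with procfs mounted; cpu_seconds and is_consuming_cpu are hypothetical helper names, and the field positions follow the layout of /proc/<pid>/stat documented in proc(5).

```python
import os
import time


def cpu_seconds(pid: int) -> float:
    """Total user + system CPU time consumed by a process, in seconds."""
    with open(f"/proc/{pid}/stat") as stat:
        data = stat.read()
    # Field 2 (comm) is wrapped in parentheses and may contain spaces, so
    # split only what follows the last ')'. fields[0] is then field 3.
    fields = data[data.rindex(")") + 2:].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15
    return (utime + stime) / os.sysconf("SC_CLK_TCK")


def is_consuming_cpu(pid: int, interval: float = 5.0) -> bool:
    """Sample the counter twice and report whether it advanced in between."""
    before = cpu_seconds(pid)
    time.sleep(interval)
    return cpu_seconds(pid) > before
```

If the counter keeps growing between samples, the program is still being scheduled, whatever the display suggests.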
Original Q:
I am running a simulation on a computer running Ubuntu server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By stall, I mean that the program no longer gets any CPU time, but it still holds all information in memory. In order to run these simulations, I SSH and nohup the program and pipe any output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done completely in memory. The program is not killed, as it still has a PID after it stalls. I am using openmp, but have increased the max number of processes and the max time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK fortran library.
Any thoughts on what is causing this behavior or how to resume my currently stalled program?
Thanks
I assume this is an OpenMP program from your tags, though you never actually state this. Is ARPACK threadsafe?
It sounds like you are hitting a deadlock (more common in MPI programs than OpenMP, but it's definitely possible). The first thing to do is to compile with debugging flags on, then the next time you find this problem, attach with a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and do thread apply all where.
If all your threads are blocked waiting for some mutex, you have a deadlock.
If they are waiting for something else (e.g. read), then you need to figure out what prevents the operation from completing.
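For a long unattended run it can be handy to script that attach step. The following is a sketch only: it assumes gdb is installed and that you have permission to ptrace the target, and the function names are hypothetical. It relies on gdb's -p, -batch and -ex command-line options.

```python
import subprocess


def gdb_backtrace_command(pid: int):
    """Build the argv that attaches gdb to pid, dumps all stacks, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all where"]


def dump_all_stacks(pid: int) -> str:
    """Run gdb in batch mode and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid), capture_output=True, text=True
    )
    return result.stdout
```

Batch mode makes gdb exit after the last command, which detaches from the process and lets it keep running.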
Generally on UNIX you don't need to rebuild with debug flags on to get a meaningful stack trace. You wouldn't get file/line numbers, but they may not be necessary to diagnose the problem.
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with gdb program *pid* (which works well only when the program has been compiled with debugging enabled with -g), or to use strace on it, using strace -p *pid*. The strace command is a utility (technically, a specialized debugger built on top of the ptrace system call interface) which shows you all the system calls made by a program or a process.
There is also a variant, called ltrace, that intercepts calls to functions in dynamic libraries.
To get a feeling of it, try for instance strace ls
Of course, strace won't help you much if the running program is not doing any system calls.
Regards.
Basile Starynkevitch
