gdb coredump - invoke function or continue execution - linux

I have looked for a similar question to mine, but the closest I found was "Continue debugging after SegFault in GDB".
My goal is to invoke a function in GDB from a coredump. I have a C++ type that defines operator<<, and I would like to pretty-print values of this type without having to write a pretty printer in Python.
From what I have found so far, it seems that GDB treats coredumps differently from regular inferiors (which have active execution), and hence cannot invoke functions on them (as confirmed by this answer, which states you can't use continue for a coredump).
Though I can see why that might be the case by default, could GDB be patched to load a coredump into memory and invoke it as a running process? If the program terminated due to a segfault, then it is possible to manually alter the broken path after loading the core in GDB, and potentially even continue executing the program afterwards.

Though I can see why that might be the case by default, could GDB be patched to load a coredump into memory and invoke it as a running process?
In theory, this is possible. In practice, you'll find that this is a major undertaking, which will (likely) take you at least several weeks, if not several months.
I would like to pretty print this type without having to write a pretty printer in python.
I can assure you that writing a pretty-printer in Python is a lot less work.
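For comparison, a minimal pretty-printer really is only a few lines. This is a sketch for a hypothetical C++ `Point` type with integer members `x` and `y` (the type name and members are assumptions; substitute your own):

```python
# Sketch of a GDB pretty-printer for a hypothetical C++ `Point` type
# with members `x` and `y` -- adapt the names to your own type.
class PointPrinter:
    def __init__(self, val):
        self.val = val            # a gdb.Value; supports val['member']

    def to_string(self):
        return "Point({}, {})".format(self.val['x'], self.val['y'])

def point_lookup(val):
    # Return a printer for matching types, None otherwise.
    if str(val.type.strip_typedefs()) == 'Point':
        return PointPrinter(val)
    return None

# Register only when actually running inside GDB
# (e.g. via `source point_printer.py` or a .gdbinit line).
try:
    import gdb
    gdb.pretty_printers.append(point_lookup)
except ImportError:
    pass
```

Once sourced, `print my_point` in GDB (including on a core dump) goes through to_string(), which only reads memory and never executes inferior code; that is exactly why it works where calling operator<< cannot.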
If the program terminated due to a segfault, then it is possible to manually alter the broken path after loading the core in GDB, and potentially even continue executing the program afterwards.
Not if the program uses any OS-level resources that are not captured in the core dump. All sockets, files, SysV IPCs, etc. will be gone. This answer agrees.

Related

Can a Linux process/thread terminate without passing through do_exit()?

To verify the behavior of a third-party binary-distributed software I'd like to use, I'm implementing a kernel module whose objective is to keep track of each child this software creates and terminates.
The target binary is produced by Go, and it is heavily multithreaded.
The kernel module I wrote installs hooks on the kernel functions _do_fork() and do_exit() to keep track of each process/thread this binary creates and terminates.
The LKM works, more or less.
Under some conditions, however, I have a scenario I'm not able to explain.
It seems that a process/thread can terminate without passing through do_exit().
The evidence I collected by inserting printk() calls shows the process creation but gives no indication of the process termination.
I'm aware that printk() can be slow, and I'm also aware that messages can be lost in such situations.
To prevent message loss due to a slow console (for this particular application, a 115200 serial tty is used), I implemented a quicker console, and messages have been collected using netconsole.
The described setup seems to confirm that a process can terminate without passing through the do_exit() function.
But because I wasn't sure my messages couldn't be lost in the printk() infrastructure, I repeated the same test replacing printk() with trace_printk(), which should be a leaner alternative.
Still the same result: occasionally I see processes not passing through do_exit(), and when I verify whether the PID is still running, I have to face the fact that it is not.
Also note that I put my hook at the first instruction of the do_exit() kernel function, to ensure the flow is observed even if do_exit() never returns from one of its callees.
My question is then the following:
Can a Linux process terminate without its flow passing through the do_exit() function?
If so, can someone give me a hint of what this scenario can be?
After a long debug session, I'm finally able to answer my own question.
That's not all: I'm also able to explain the strange behavior I described in my scenario.
Let's start from the beginning: while monitoring a heavily multithreaded application, I observed rare cases in which a PID suddenly stopped existing without its flow ever passing through the Linux kernel's do_exit() function.
Hence my original question:
Can a Linux process terminate without passing through the do_exit() function?
To the best of my current knowledge, which I would by now consider reasonably extensive, a Linux process cannot end its execution without passing through the do_exit() function.
But this answer contradicts my observations, and the problem that led me to this question was still there.
Someone here suggested that the strange behavior I observed was because my observations were somehow wrong, implying my method, and therefore my conclusions, were inaccurate.
My observations were correct: the process I watched terminated without passing through do_exit().
To explain this phenomenon, I want to put another question on the table, one that I think searchers may find useful:
Can two processes share the same PID?
If you'd asked me this a month ago, I'd surely have answered: "definitely not, two processes cannot share the same PID."
Linux is more complex, though.
There's a situation in which, in a Linux system, two different processes can share the same PID!
https://elixir.bootlin.com/linux/v4.19.20/source/fs/exec.c#L1141
Surprisingly, this behavior does no harm; when it happens, one of the two processes is a zombie.
updated to correct an error
The circumstances of this duplicate PID are more intricate than those described previously, and involve flushing the previous exec context. If a thread other than the thread-group leader invokes execve() to execute a new text, the kernel must first call the flush_old_exec() function, which in turn calls de_thread(); de_thread() eliminates every thread in the process except the caller. The calling thread then takes over the leader's PID, so from that moment on it runs under a PID that is not the one it started with, while the old leader lingers as a zombie until it is reaped.
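The userland-visible consequence can be observed without a kernel module. Here is a sketch (my own demo, assuming Linux and Python 3.8+ for threading.get_native_id()): a worker thread has a TID distinct from the process PID, yet if that thread calls execve(), the fresh image continues under the leader's PID:

```python
import os, subprocess, sys, tempfile, textwrap

# Child script: a worker thread (whose TID differs from the process PID)
# calls execv(); after the exec, the process still reports the leader's
# PID, because de_thread() hands the leader's identity to the exec'ing
# thread.
child_src = textwrap.dedent("""\
    import os, sys, threading
    if len(sys.argv) > 1:
        print("after-exec pid=%d" % os.getpid(), flush=True)
    else:
        print("leader pid=%d" % os.getpid(), flush=True)
        def worker():
            print("worker tid=%d" % threading.get_native_id(), flush=True)
            os.execv(sys.executable, [sys.executable, sys.argv[0], "x"])
        t = threading.Thread(target=worker)
        t.start()
        t.join()   # never reached: execv replaces the whole process
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(child_src)
    path = f.name
out = subprocess.run([sys.executable, path], capture_output=True, text=True)
os.unlink(path)
print(out.stdout, end="")
```

The worker's TID is what a _do_fork() hook would have recorded at creation time, but at do_exit() time the same schedulable entity reports the leader's PID, which is exactly the disappearing-PID effect described above.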
end of the update
That is what I was watching: the PID I was monitoring did not pass through do_exit() because, by the time the corresponding thread terminated, it no longer had the PID it started with, but its leader's.
For people who know the Linux kernel's mechanics very well, this is nothing to be surprised about; this behavior is intended and hasn't changed since 2.6.17.
The current 5.10.3 still behaves this way.
Hoping this is useful to other searchers, I'd also like to add that it answers the following:
Question: Can a Linux process/thread terminate without passing through do_exit()? Answer: No; do_exit() is the only way a process can end its execution, whether intentional or unintentional.
Question: Can two processes share the same PID? Answer: Normally, no; but there are rare cases in which two schedulable entities have the same PID.
Question: Does the Linux kernel have scenarios where a process changes its PID? Answer: Yes, there is at least one scenario in which a process changes its PID.
Can a Linux process terminate without its flow passing through the do_exit() function?
Probably not, but you should study the source code of the Linux kernel to be sure. Ask on KernelNewbies. Kernel threads and udev- or systemd-related things (or perhaps modprobe or the older hotplug) are probable exceptions. If your /sbin/init of pid 1 terminates (which should not happen), strange things would happen.
The LKM works, more or less.
What does that mean? How could a kernel module half-work?
And in real life, it does sometimes happen that your Linux kernel panics or crashes (and it could happen with your LKM, if it has not been peer-reviewed by the Linux kernel community). In such a case, there is no longer any notion of processes, since they are an abstraction provided by a living Linux kernel.
See also dmesg(1), strace(1), proc(5), syscalls(2), ptrace(2), clone(2), fork(2), execve(2), waitpid(2), elf(5), credentials(7), pthreads(7)
Look also inside the source code of your libc, e.g. GNU libc or musl-libc
Of course, see Linux From Scratch and Advanced Linux Programming
And when I verify whether the PID is still running,
This can be done in userland with /proc/, or by using kill(2) with a 0 signal (and maybe also pidfd_send_signal(2)...)
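The kill(2)-with-signal-0 probe looks like this in Python (a sketch; signal 0 performs the existence and permission checks without delivering anything):

```python
import os

def pid_alive(pid):
    """Probe a PID using kill(2) with signal 0: nothing is actually
    delivered, but existence and permissions are still checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # it exists, but belongs to another user
    return True
```

Note the PermissionError branch: EPERM means the process exists, so treating it as "alive" is the correct reading.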
PS. I still don't understand why you need to write a kernel module or change the kernel code. My intuition would be to avoid doing that when possible.

Can I relaunch a process while it is generating a core dump?

I have a monitor script that checks a specified process; if it crashes, the script relaunches it without waiting for the core dump to finish writing. Does this incur bad things? Will it affect the core dump file or the relaunched process?
Yes, you can. A process is a different thing than a program. Just as you can have several instances of the ls command running in parallel on Unix, nothing prevents you from relaunching the same program (as a different, new process) while it is saving the core file. The only difference from a normal process writing a file is that a process writing a core does it in kernel mode. Nothing else.
The core dump is performed by the killed process executing in kernel mode, as a final task before dying. For the purposes of process state, the process is in the exiting state and nothing can affect it until the core dump is finished (it can only be interrupted by a write error on the dump file, or perhaps this is an interruptible state).
The only problem you might have is that the next instance you launch, as it tries to write the same core file name, will have to wait for it to end (I think the inode is only locked on a per-write basis, not for the whole file), and you get a bunch of processes dying and writing to the same core file. That's not the case if the core goes to a new, different file (the file is unlinked before creating it), but that depends on the implementation. A possible exploit would be a DoS attack: begin generating cores at a high pace, to make the writing of core files queue a lot of processes in uninterruptible state. But I think this is difficult to achieve... most probably you'll just get high load from many processes writing different core files, only for them to be erased next (as a consequence of the unlink system call made by the next core-generating task).
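A sketch of the kind of monitor the question describes (the function name and restart policy are my assumptions, not the asker's actual script): subprocess.call() returns a negative value, -signum, when the child dies from a signal, which makes crashes easy to distinguish from normal exits:

```python
import subprocess

def supervise(cmd, max_restarts=3):
    """Relaunch cmd whenever it dies from a signal (a crash such as a
    segfault shows up as a negative return value, -signum)."""
    restarts = 0
    while True:
        ret = subprocess.call(cmd)
        if ret >= 0:
            return ret            # normal exit status: stop supervising
        if restarts >= max_restarts:
            return ret            # still crashing: give up, report -signum
        restarts += 1
        print("child killed by signal %d; relaunching" % -ret)
```

The kernel finishes writing the previous instance's core regardless of the relaunch; as noted above, the only interaction to think about is the core file name colliding between instances.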
A core(5) dump is very bad, and you should fix its root cause. It is generally the result of some unexpected and unhandled signal(7) (perhaps some memory corruption giving a SIGSEGV, etc...; read also about undefined behavior and be very scared of UB).
if it crashes, the script relaunches it without waiting for the core dump to finish writing.
So your approach is flawed, except as a temporary measure. BTW, in many cases, the virtual address space of the faulty process is small enough for the core to be dumped in a small fraction of a second. In some cases, the dumping of the core might take many minutes (think of a big HPC process dealing with hundreds of gigabytes of data on a supercomputer).
It is rumored that, in the previous century, some huge core files took half an hour to be dumped on Cray supercomputers.
You really should fix your program to avoid dumping core.
We don't know at all what is your buggy program which dumps core. But if it has some persistent state (e.g. in some database or some file) which you care about, your approach is very wrong: the core dump might perhaps happen in the code which produces that state, and then, if you restart the same program, it could reuse that faulty state.
Does this incur bad things?
Yes in general. Perhaps not in your specific case (but we don't know what your program is doing).
So you'd better understand why that core dump is happening. In general, you would compile your program with all warnings and debug info (so gcc -Wall -Wextra -g with GCC) and use gdb to analyze the core dump post-mortem (see this).
You really should not write programs which dump core (even if that happens to all of us; it is a serious bug that should be fixed ASAP). And you should not accept core dumps as acceptable behavior from your programs.
Core dumps are there to help the developer fix some serious problem. Read also about the Unix philosophy. It is socially unacceptable to consider a core dump "normal"; it is definitely abnormal program behavior.
(There are several ways to avoid core dumps, but that makes for a different question; you would need to explain what kind of programs you are writing and monitoring, and why and how they dump core.)

execution behavior is different when the program is run without debugger and with it

I am running a program on Linux. The behavior of the program is different when I run it under the ddd debugger than without it; that is, the program halts at different points. Why is that? Is it debugger-dependent, or can it happen with any debugger?
Your problem description is not very precise but it sounds like a memory access issue.
When your code performs an invalid memory access, the behavior is undefined and may differ with gdb attached. For memory errors, you should try running your program under a memory-error detector such as Valgrind.

Why would gdb hang?

I have an application that I am debugging, and I'm trying to understand how gdb works and why I am sometimes unable to step through the application. The problem I am experiencing is that gdb hangs and the process it is attached to enters a defunct state while I am stepping through the program. After gdb hangs, I have to kill it to free the terminal (Ctrl-C does not work; I have to do this from a different terminal window by getting the process id of that gdb session and using kill -9).
I'm guessing that gdb is hanging because it's waiting for the application to stop at the next instruction, and somehow the application finished execution without gdb identifying this. But that's just speculation on my part based on the behavior I've observed so far. So my question is whether anyone has seen this type of behavior before and/or could suggest what the cause might be. I think that might help me improve my debugging strategy.
In case it matters I'm using g++ 4.4.3, gdb 7.1, running on Ubuntu 10.04 x86_64.
I had a similar problem and solved it by sending a CONT signal to the process being debugged.
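That fix is just delivering SIGCONT to the inferior, i.e. kill -CONT <pid> from another terminal. A self-contained sketch of the same idea (using a scratch child process rather than a real wedged inferior):

```python
import os, signal, subprocess, sys, time

# Spawn a sleeping child, stop it (roughly the state a wedged debugger
# can leave an inferior in), then resume it with SIGCONT.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
os.kill(child.pid, signal.SIGSTOP)
time.sleep(0.1)                      # give the stop a moment to land
os.kill(child.pid, signal.SIGCONT)   # the `kill -CONT <pid>` equivalent
child.terminate()                    # clean up the scratch child
child.wait()
```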
I'd say the debugged process wouldn't sit idle if it were the cause of the hang. Every time GDB completes a step, it has to update any expressions you asked it to print. That may involve following pointers and so on, and in some cases it may fail there (although I don't recall a real "hang"). It also typically tries to update your stack trace; if the stack trace has been corrupted and is no longer coherent, GDB could be trapped in an endless loop. Running strace on the gdb process to see what kind of activity is going on during the hang could be a good way to go one step further in figuring out the problem.
(e.g. accessing sources through a no-longer-working NFS/SSHFS mount is one of the most frequent reasons for gdb to hang, here :P)

Program stalls during long runs

Fixed:
Well, this seems a bit silly. It turns out top was not displaying correctly and the programs actually continue to run. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine and this whole question was moot.
Thanks (and sorry for the silly question).
Original Q:
I am running a simulation on a computer running Ubuntu Server 10.04.3. Short runs (<24 hours) work fine, but long runs eventually stall. By "stall", I mean that the program no longer gets any CPU time, but it still holds all its information in memory. To run these simulations, I SSH in, nohup the program, and pipe any output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done entirely in memory. The program is not killed, as it still has a PID after it stalls. I am using OpenMP, but have increased the maximum number of processes, and the maximum CPU time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK Fortran library.
Any thoughts on what is causing this behavior or how to resume my currently stalled program?
Thanks
I assume this is an OpenMP program from your tags, though you never actually state it. Is ARPACK thread-safe?
It sounds like you are hitting a deadlock (more common in MPI programs than OpenMP, but definitely possible). The first thing to do is to compile with debugging flags on; then, the next time you hit this problem, attach a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and do thread apply all where.
If all your threads are blocked waiting for some mutex, you have a deadlock.
If they are waiting for something else (e.g. read), then you need to figure out what prevents the operation from completing.
Generally, on UNIX you don't need to rebuild with debug flags to get a meaningful stack trace. You won't get file/line numbers, but they may not be necessary to diagnose the problem.
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with gdb program pid (which works well only when the program has been compiled with debugging enabled, i.e. with -g), or to use strace on it with strace -p pid. The strace command is a utility (technically, a specialized debugger built on top of the ptrace system call interface) which shows you all the system calls made by a program or process.
There is also a variant, called ltrace, that intercepts calls to functions in dynamic libraries.
To get a feeling for it, try for instance strace ls
Of course, strace won't help you much if the running program is not doing any system calls.
Regards.
Basile Starynkevitch
