setting a gdb exit breakpoint not working? - linux

I've set breakpoints on exit and _exit, and my program (a multithreaded app running on Linux 2.6.16.46-0.12, SLES10) is somehow still exiting in a way I can't locate:
(gdb) c
...
[New Thread 47513671297344 (LWP 15279)]
[New Thread 47513667103040 (LWP 15280)]
[New Thread 47513662908736 (LWP 15281)]
Program exited with code 0177.
(gdb)
The exit functions reside in libc, so there are no deferred-load shared library issues. Does anybody know of some other mysterious trigger for exit that can't be caught?
EDIT: the problem is now only academic. I tried binary-search debugging, backing out a subset of my changes (the problem went away). After I applied them again in sequence, I can no longer reproduce the problem, even with things restored to the original state.
EDIT2: I found one reason for this sort of error recently, which may have been the original source for this problem. For historical reasons our product uses the evil linker flag -Bsymbolic. Among the side effects of this is that when a symbol is undefined but called, the GLIBC runtime linker will bomb in exactly this way, and you see it in the debugger as a process exited with 0177. When the runtime linker aborts this way, I'd guess it makes the syscall to _exit directly (rather than using the C runtime library exit() or _exit()). That would be consistent with the fact that I was unable to catch this with the exit breakpoints in the debugger.

There are two common reasons for an _exit breakpoint to "miss" -- either GDB didn't set the breakpoint in the right place, or the program performs (a moral equivalent of) syscall(SYS_exit, ...).
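For illustration, here is a minimal sketch of the second case (my own hypothetical example, not the asker's code); the process terminates through the raw system call, so no libc exit symbol is ever executed:
/* Sketch: bypassing the libc wrappers entirely, so breakpoints on
   exit()/_exit() never fire. */
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* SYS_exit_group terminates every thread in the process;
       SYS_exit would terminate only the calling thread. */
    syscall(SYS_exit_group, 127);
    return 0; /* never reached */
}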
What do info break and disassemble _exit say?
You might be able to convince GDB to set the breakpoint correctly with break *&_exit. Alternatively, GDB-7.0 supports catch syscall. Something like this should work (assuming Linux/x86_64; note that on ix86 the numbers will be different) regardless of how the program exits:
(gdb) catch syscall 60
Catchpoint 3 (syscall 'exit' [60])
(gdb) catch syscall 231
Catchpoint 4 (syscall 'exit_group' [231])
(gdb) c
Catchpoint 4 (call to syscall 'exit_group'), 0x00007ffff7912f3d in _exit () from /lib/libc.so.6
Update:
Your comment indicates that the _exit breakpoint is set correctly, so it's likely that your process just doesn't execute _exit.
That leaves syscall(SYS_exit, ...) and one other possibility (which I missed before): all threads executing pthread_exit. You might want to set a breakpoint on pthread_exit as well (and execute info threads each time you hit it -- the last thread to do pthread_exit will cause the process to terminate).
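For reference, a minimal sketch of that last scenario (my own illustration, not taken from the original program); no user code ever calls exit() or _exit(), yet the process still terminates:
/* Sketch: every thread ends via pthread_exit(), so breakpoints on
   exit()/_exit() never fire, yet the process still terminates. */
#include <pthread.h>

static void *worker(void *arg)
{
    (void)arg;
    pthread_exit(NULL);      /* thread-level exit */
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_exit(NULL);      /* main thread exits too; the process ends
                                when the last thread finishes */
}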
Edit:
It's also worth noting that you can use mnemonic names rather than syscall numbers, and that you can add multiple syscalls to the catch list at once, like so:
(gdb) catch syscall exit exit_group
Catchpoint 2 (syscalls 'exit' [1] 'exit_group' [252])

Setting the breakpoint on _exit was a good idea.
You might also try linking statically, just to take a stack of potential gdb complications off the table.
0177 looks suspiciously like the wait status wait(2) returns for a stopped child, but gdb is printing the exit status, which is a different thing, so that's probably a real exit argument.

It might be that you have some lazy references unresolved in a shared library loaded into the process. I had exactly the same situation where "someone somewhere" exited the process, and it turned out to be an unresolved reference.
Check your process with the "ldd -r" option.
It looks like ld.so (or whatever does the lazy resolving) resolves such symbols to a uniform exit function (which should really be abort, IMHO).
My situation:
$ ldd -r ./program
undefined symbol: XXXX (/usr/lib/libYYY.so)
$ ./program
program: started!
...
<program is running regardless of undefined references>
The exit appeared when I invoked a scenario that used the undefined function. The program always exited with exit code 127, and gdb reported 0177.

Related

Why does a panic while panicking result in an illegal instruction?

Consider the following code that purposely causes a double panic:
use scopeguard::defer; // 1.1.0

fn main() {
    defer!{ panic!() };
    defer!{ panic!() };
}
I know this typically happens when a Drop implementation panics while unwinding from a previous panic, but why does it cause the program to issue an illegal instruction? That sounds like the code is corrupted or jumped somewhere unintended. I figured this might be system- or code-generation-dependent, but I tested on various platforms and they all report similar errors with the same cause:
Linux:
thread panicked while panicking. aborting.
Illegal instruction (core dumped)
Windows (with cargo run):
thread panicked while panicking. aborting.
error: process didn't exit successfully: `target\debug\tests.exe` (exit code: 0xc000001d, STATUS_ILLEGAL_INSTRUCTION)
The Rust Playground:
thread panicked while panicking. aborting.
timeout: the monitored command dumped core
/playground/tools/entrypoint.sh: line 11: 8 Illegal instruction timeout --signal=KILL ${timeout} "$#"
What's going on? What causes this?
This behavior is intended.
From a comment by Jonas Schievink in Why does panicking in a Drop impl cause SIGILL?:
It calls intrinsics::abort(), which LLVM turns into a ub2 instruction, which is illegal, thus SIGILL
I couldn't find any documentation for how double panics are handled, but a paragraph for std::intrinsics::abort() lines up with this behavior:
The current implementation of intrinsics::abort is to invoke an invalid instruction, on most platforms. On Unix, the process will probably terminate with a signal like SIGABRT, SIGILL, SIGTRAP, SIGSEGV or SIGBUS. The precise behaviour is not guaranteed and not stable.
Curiously, this behavior is different from calling std::process::abort(), which always terminates with SIGABRT.
The illegal instruction of choice on x86 is UD2 (the "ub2" in the comment above is, I think, a typo), a.k.a. the undefined instruction, which is paradoxically reserved and documented to not be an instruction. So there is no corruption or invalid jump, just a quick and loud way to tell the OS that something has gone very wrong.

GDB Conditional Breakpoint in Multithreaded application

I'm using gdb to debug a multithreaded program.
The program produces a lot of threads that repeatedly process a string (identical thread function).
In that loop, I wish to break if the string in any of these threads equals a predefined constant string. For this I'm using break myfile:1234 if $_streq(managedStr->value(), "testString").
But this breakpoint never gets hit, even though I know for sure that managedStr takes the value "testString".
When I leave out the condition, it breaks, but when I then try to evaluate the condition manually, I get an error:
The program stopped in another thread while making a function call from GDB.
From this link I learned to set the scheduler-locking option while the program is paused, but setting this option from the very beginning didn't help with the breakpoint not being hit. Instead I then get:
Error in testing breakpoint condition:
The program stopped in another thread while making a function call from GDB.
Evaluation of the expression containing the function
(ManageString::value()) will be abandoned.
When the function is done executing, GDB will silently stop.
[Switching to Thread 0x7ffd35fb3700 (LWP 24354)]
What can I do to achieve the desired behaviour?
break myfile:1234 if $_streq(managedStr->value(), "testString")
That should work, and the fact that it doesn't work implies a bug in GDB.
That said, this method of debugging is exceedingly slow, because GDB has to stop every thread at line 1234 and then examine the memory values (i.e. evaluate the condition).
Usually it is much faster to insert a little bit of helper code into the program:
if (strcmp(managedStr->value(), "testString") == 0) {
    printf("DEBUG: match\n"); // set breakpoint here!
}
and rebuild it.
Now instead of stopping every thread all the time and evaluating the condition, GDB will set a breakpoint on code that is only reached when the condition is true. This will be 100 to 1000 times faster (depending on number of threads).
This technique could be trivially extended if you need to vary the actual value you are looking for from run to run, or even within a single run, by adding a global variable:
const char *str_to_match = "testString"; // variable can be overwritten in GDB
if (strcmp(managedStr->value(), str_to_match) == 0) {
    ...
}
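For example, you could then set a breakpoint on the printf line inside the helper block and change the value to look for from within GDB (a sketch; the line number is hypothetical, and assigning a string literal to a char* from GDB needs malloc available in the inferior so GDB can allocate the new literal):
(gdb) break myfile:1236
(gdb) set var str_to_match = "otherString"
(gdb) c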

What could cause trace/breakpoint trap (core dumped)? [duplicate]

I know this question has been asked before, but I have read all the threads and I didn't find an answer.
From the moment I execute run to start debugging my project, I get this: Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]. When I do Ctrl+C, gdb tells me: Program received signal SIGINT, Interrupt.
0x00000000 in ?? ()
Usually it'll tell me which file and which function it got interrupted at, not 0x00000000 in ?? ().
GDB no longer hits breakpoints, and what makes matters crazier is the fact that a colleague and I are sharing the same session (the debugging is done using Cygwin with a remote machine) and it works fine for them but not for me.
When I try to get info about the threads using info threads, here's what I get:
[New Thread 20]
[New Thread 21]
[New Thread 22]
Id Target Id Frame
4 Thread 22 (ssp=0xa9004d5c) 0x00000000 in ?? ()
3 Thread 21 (ssp=0xa9002e64) 0x00000010 in ?? ()
2 Thread 20 (ssp=0xa9000ef4) 0x00000000 in ?? ()
The current thread <Thread ID 1> has terminated. See `help thread'
There's no thread 6, and there's no * to indicate which thread gdb is using. I don't even know if that's linked to the problem.
Can anyone please help me?
You are not providing nearly enough info to help you. Details matter, and you are withholding them. Versions of GDB and gdbserver matter, how you invoke GDB and gdbserver matter, what warnings you receive from GDB (if any) matter.
Now, this error message:
Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]
usually means that gdbserver has not attached one of the threads of your process, and that thread has tried to execute a breakpoint instruction (you do have breakpoints set before this happens, don't you?).
One of the reasons this may happen is when your GDB loads "wrong" libthread_db.so (one that doesn't match the target libc.so.6).
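If that's the case, you can check and override where GDB looks for libthread_db (a sketch; the path is only an example):
(gdb) show libthread-db-search-path
(gdb) set libthread-db-search-path /path/to/matching/libthread_db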
what makes matter crazier is the fact that a colleague and I, are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
I am not sure what you mean by "same session", but it's probably not "when he types commands, they work; but when I type the same commands into the same GDB, they don't".
One difference between you and your colleague could be the LD_LIBRARY_PATH environment variable setting. Another could be in ~/.gdbinit or in ./.gdbinit.
I suggest running gdb -nx to get rid of the latter, and unsetting LD_LIBRARY_PATH to get rid of the former.
The problem with the whole thing, which for some reason no one seemed to notice, is this:
This is how I was calling gdb: /usr/local/build/gdbx.y/gdb/gdb. What I should have been doing is this: /usr/local/build/gdbx.y/build/gdb/gdb.
It was a path problem.

How to debug a futex contention shown in strace?

I'm debugging an issue in a multi-threaded Linux process, where a certain thread appears not to execute for a few seconds. Looking at the strace output revealed that it waits on a futex, e.g.
1673109 14:36:28.600329 futex(0x44b8d20, FUTEX_WAIT_PRIVATE,
1673109 14:36:33.221850 <... futex resumed> ) = 0 <4.621514>
How can I find out what this futex(0x44b8d20) refers to in user space, i.e., how do I map this to a locking construct which is using the futex internally?
I would use a simple SystemTap script to quickly find the addresses of contended futex locks. When I say address, I am referring to the first argument of the futex() syscall.
You can download a simple SystemTap script that finds contended user-space locks here:
https://sourceware.org/systemtap/examples/process/futexes.stp
If you have SystemTap installed on your system, just start the script: stap futexes.stp
Capture the strace output as you did before.
When you end the SystemTap script with Ctrl-C, it will print the contended futexes.
Finally, in your strace output, search for futex calls in which the second argument (the operation type) is FUTEX_WAIT.
For example: futex(0x7f58a31999d0, FUTEX_WAIT, 4508, NULL) = 0
Then search for the first argument in the SystemTap output.
You'll see something like: ome[4489] lock 0x7f58a31999d0 contended 1 times, 7807 avg us
The script prints the process name and process/thread ID for you, which makes it easy to find what you are looking for.
One note, however: executing the SystemTap script will hook the syscall system-wide.
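As a complementary, admittedly rough check of my own (an assumption about glibc internals, not part of the SystemTap workflow): the futex word normally lives inside the pthread_mutex_t itself, so printing a suspect mutex's address lets you compare it against the first argument of the futex() calls seen in strace.
/* Sketch: print a mutex address to match against strace's futex() calls.
   Assumes glibc places the futex word at the start of the mutex. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
    /* Compare this address with the first futex() argument in strace. */
    printf("mutex at %p\n", (void *)&lock);
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
    return 0;
}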

when a process is killed is this information recorded anywhere?

Question:
When a process is killed, is this information recorded anywhere (i.e., in the kernel), such as in syslog (or can it be configured to be recorded via syslog.conf)?
Is information such as the killer's PID, the time and date of the kill, and the reason recorded?
Update: you have all given me some insight, thank you very much.
If your Linux kernel is compiled with the process accounting (CONFIG_BSD_PROCESS_ACCT) option enabled, you can start recording process accounting info using the accton(8) command and use sa(8) to access the recorded info. The recorded information includes the 32-bit exit code, which includes the signal number.
(This stuff is not widely known / used these days, but I still remember it from the days of 4.x BSD on VAXes ...)
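A minimal sketch of that workflow (the accounting file path is only an example and is distro-dependent):
# turn on accounting, writing records to the given file
accton /var/log/account/pacct
# ... run and kill the process of interest ...
# summarize the recorded accounting data, per command and user
sa -u /var/log/account/pacct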
Amended:
In short, the OS kernel does not care that the process is killed; whether anything is recorded depends on whether the process itself logs something. All the kernel cares about at this stage is reclaiming memory. But read on for how to catch it and log it...
As per caf and Stephen C's mention on their comments...
If you are running the BSD process accounting module in the kernel, everything gets logged. Thanks to Stephen C for pointing this out! I did not realize that functionality existed, as I have it switched off/disabled.
In hindsight, as per caf's comment, the two signals that cannot be caught are SIGKILL and SIGSTOP. Also, in the atexit example I described in the code below, that should have been exit(0)... oops. Thanks caf!
Original
The best way to catch the kill signal is to use a signal handler that handles several signals, not just SIGKILL on its own: SIGABRT (abort), SIGQUIT (terminal program quit), SIGSTOP and SIGHUP (hangup). Those signals together are what would catch the kill command on the command line. The signal handler can then log the information in /var/log/messages (environment- or Linux-distribution-dependent). For further reference, see here.
Also, see here for an example of how to use a signal handler using the sigaction function.
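A minimal sketch of such a handler (my own example, not from the linked reference): it installs the same handler for a few catchable termination signals with sigaction() and records the signal via syslog before exiting. SIGKILL and SIGSTOP still cannot be caught.
/* Sketch: log catchable termination signals before exiting. */
#include <signal.h>
#include <syslog.h>
#include <unistd.h>

static void handler(int signo)
{
    /* syslog() is not async-signal-safe; acceptable for a debugging sketch,
       but prefer write() or a self-pipe in production code. */
    syslog(LOG_WARNING, "terminating on signal %d", signo);
    _exit(128 + signo);
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGHUP,  &sa, NULL);
    sigaction(SIGQUIT, &sa, NULL);

    openlog("myprog", LOG_PID, LOG_USER);
    pause(); /* wait for a signal */
    return 0;
}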
It would also be a good idea to use the atexit function; then, when the code exits at runtime, the runtime will execute that function last, before returning to the command line. The reference for atexit is here.
When the C function exit is used and executed, the registered atexit function pointer gets executed, as in the example below. Thanks to caf for this!
An example usage of atexit:
#include <stdio.h>
#include <stdlib.h>

void myexitfunc(void)
{
    fprintf(stdout, "Goodbye cruel world...\n");
}

int main(int argc, char **argv)
{
    atexit(myexitfunc); /* register immediately, right after declarations */
    /* Rest of code */
    exit(0);            /* atexit handlers also run on a normal return from main() */
}
Hope this helps,
Best regards,
Tom.
I don't know of any logging of signals sent to processes, unless the OOM killer is doing it.
If you're writing your own program you can catch the kill signal and write to a logfile before actually dying. This doesn't work with kill -9 though, just the normal kill.
You can see some details over thisaway.
If you use sudo, it will be logged. Other than that, the killed process can log some information (unless it's being terminated with extreme prejudice). You could even hack the kernel to log signals.
As for recording the reason a process was killed, I've yet to see a psychic program.
Kernel hacking is not for the weak of heart, but hella fun. You'd need to patch the signal dispatch routines to log information using printk(9) when kill(3), sigsend(2) or the like is called. Read "The Linux Signals Handling Model" for more information on how signals are handled.
If the process is getting it via kill(2), then unless the process is already logging the only external trace would be a kernel mod. It's pretty simple; just do a printk(), it's like printf(). Find the output in dmesg.
If the process is getting it via /bin/kill, then it would be a relatively easy matter to install a wrapper executable that did logging. But this (signal delivery via /bin/kill) is unlikely because kill is also a bash built-in.
By the way, when a process is killed with a signal, this is announced by the kernel to the parent process via the wait(2) system call. The status reported by this call encodes the terminating signal (in the low bits) or, for a normal exit, the child's exit status (in the next byte). See wait(2) for more information.
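For illustration, a minimal sketch (my own example) of how a parent observes that its child was killed by a signal, using the standard wait-status macros:
/* Sketch: the parent kills the child and decodes the wait status. */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        pause();            /* child: wait to be killed */
        _exit(0);
    }
    kill(pid, SIGTERM);     /* parent kills the child */

    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("child killed by signal %d\n", WTERMSIG(status));
    else if (WIFEXITED(status))
        printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}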
