I know this question has been asked before, but I have read all the threads and I didn't find an answer.
From the moment I execure run to start debugging my project, I get this : Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]. When I do ctrl+c, gdb tells me : Program received signal SIGINT, Interrupt.
0x00000000 in ?? ()
Usually it'll tell me which file and which function it got interrupted at not 0x00000000 in ?? ()
GDB no longer hits breakpoints, and what makes matter crazier is the fact that a colleague and I, are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
when I try to get info about the threads using info threads here's what I get :
[New Thread 20]
[New Thread 21]
[New Thread 22]
Id Target Id Frame
4 Thread 22 (ssp=0xa9004d5c) 0x00000000 in ?? ()
3 Thread 21 (ssp=0xa9002e64) 0x00000010 in ?? ()
2 Thread 20 (ssp=0xa9000ef4) 0x00000000 in ?? ()
The current thread <Thread ID 1> has terminated. See `help thread'
there's no thread 6, there's no * to indicate which thread gdb is using. And I don't even know if that's linked to the problem.
Can anyone please help me?
You are not providing nearly enough info to help you. Details matter, and you are withholding them. Versions of GDB and gdbserver matter, how you invoke GDB and gdbserver matter, what warnings you receive from GDB (if any) matter.
Now, this error message:
Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]
usually means that gdbserver has not attached one of the threads of your process, and that thread has tried to execute breakpoint instruction (you do have breakpoints set before this happens, don't you?).
One of the reasons this may happen is when your GDB loads "wrong" libthread_db.so (one that doesn't match the target libc.so.6).
what makes matter crazier is the fact that a colleague and I, are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
I am not sure what you mean by "same session", but it's probably not "when he types commands, they work; but when I type the same commands into the same GDB, they don't".
One difference between you and your colleague could be LD_LIBRATY_PATH environment variable setting. Another could be in ~/.gdbinit or in ./.gdbinit.
I suggest running gdb -nx to get rid of the latter, and unsetting LD_LIBRARY_PATH to get rid of the former.
The problem with the whole thing and for some reason no one seemed to notice it is this :
this is how I call gdb /usr/local/build/gdbx.y/gdb/gdb what I should be doing is this : /usr/local/build/gdbx.y/build/gdb/gdb
It was a path problem.
Related
I noticed that one thread's backtrace looks like:
Thread 8 (Thread 0x7f385f185700 (LWP 12861)):
#0 0x00007f38655a3cf4 in __mcount_internal (frompc=4287758, selfpc=4287663) at mcount.c:72
#1 0x00007f38655a4ac4 in mcount () at ../sysdeps/x86_64/_mcount.S:47
#2 0x0000000000000005 in ?? ()
#3 0x00007f382c02ece0 in ?? ()
#4 0x000000000000002d in ?? ()
#5 0x000000000000ffff in ?? ()
#6 0x0000000000000005 in ?? ()
#7 0x0000000000000005 in ?? ()
#8 0x0000000000000000 in ?? ()
It seems to be exited thread but I am not sure.
I would like know how to understand it. Especially, I don't understand what does it mean LWP and Thread 0x7f385f185700 ( what is that address)?
Moreover, I noticed that profiler indicates that __mcount_internal takes relatively a lot of time. What is it and why it is time-consuming? Especially, what are frompc and selfpc counters?
The my kernel is Linux 4.4.0 and thread are compatible with POSIX ( C++11 implementaion).
LWP = Light Weight Process, and means thread. Linux threads each have their own thread-ID, from the same sequence as PID numbers, even though they're not a separate process. If you look in /proc/PID/task for a multi-threaded process, you will see the entries for each thread ID.
0x7f385f185700 is the pthread ID, as from pthread_self(3). This is a pointer to a pthread_t.
This thread is stopped at RIP = 0x00007f38655a3cf4, the address in frame #0.
frompc and selfpc are function arguments to the __mcount_internal() glibc function.
Your backtrace can show names and args them because you have debug symbols installed for glibc. You just get ?? for the parent functions because you don't have debug info installed for the program or library containing them. (Compile your own program with -g, and install packages like qtbase5-dbg or libglib2.0-0-dbg to get debug symbols for libraries packed by your distro).
mcount seems to be related to profiling (i.e. code generated by -fprofile-generate or -pg). That would explain why it takes Program Counter values as args.
Why do applications compiled by GCC always contain the _mcount symbol?
That thread has not exited. You wouldn't see as many details if it had. (And probably wouldn't see it at all.)
I'm trying to analyze the core dump of one of my applications, but I'm not able to find the reason for the crash.
When I run gdb binary file corefile I see the following output:
Program terminated with signal SIGKILL, Killed.
#0 0xfedcdf74 in _so_accept () from /usr/lib/libc.so.1
(gdb)
But I am pretty sure that no one has executed kill -9 <pid>. With info thread, I can see all the threads launched by the application, but I can see nothing special about any thread.
By running bt full or maint info sol-threads I don't find anything that leads to the bug. I just see the stack trace for each thread without any information about the bug.
Finally I've found a thread which causes the kill signal.
#0 0xfedcebd4 in _lwp_kill () from /usr/lib/libc.so.1
#1 0xfed67bb8 in raise () from /usr/lib/libc.so.1
#2 0xfed429f8 in abort () from /usr/lib/libc.so.1
#3 0xff0684a8 in __cxxabiv1::__terminate(void (*)()) () from /usr/local/lib/libstdc++.so.5
#4 0xff0684f4 in std::terminate() () from /usr/local/lib/libstdc++.so.5
#5 0xff068dd8 in __cxa_pure_virtual () from /usr/local/lib/libstdc++.so.5
#6 0x00017f40 in A::method (this=0x2538d8) at A.cc:351
Class A inherits of an abstact class and in the line 351 a virtual function declared in the abstract class and defined in A is called. I donĀ“t understand why if object A exists the call to the virtual base function crashes.
That SIGKILL could be caused by your app exceeding some resource limit. Try to get the system log and see if there are any resource limit exceeded messages.
References
Solaris Administration Guide: Resource Controls
I am trying to debug a crash in gdb where is core dumped on this thread. There is other 40+ threads going on at the same time. How do I figure out where this thread 42 is started from?
Also, why the last line (frame #0) is not showing up?
Thread 42 (Thread 0x2aaba65ce940 (LWP 15854)):
#0 0x0000003a95605b03 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#1 0x0000003a9560684b in start_thread () from /lib64/libpthread.so.0
#2 0x0000003a946d526d in clone () from /lib64/libc.so.6
#3 0x0000000000000000 in ?? ()
I am using gdb version 7.7
How do I figure out where this thread 42 is started from?
You can't: neither GDB, nor the OS keeps track of "who started this thread". (It is also often quite useless to know where a particular thread was created).
What you could do is either put instrumentation into your own calls to pthread_create and log "thread X created thread Y", or use catch syscall clone, and print creation stack traces in GDB, then match them later to the crashed thread (match its LWP to the return value of clone earler).
Also, why the last line (frame #0) is not showing up?
You mean frame #3. It doesn't exist -- clone is where the thread is borne (comes to existence).
P.S. Installing libc debug symbols so you can see where inside __nptl_deallocate_tsd the thread crashed is more likely to provide clues than knowing thread creation details.
Question:
When a process is killed, is this information recorded anywhere (i.e., in kernel), such as syslog (or can be configured to be recorded syslog.conf)
Is the information of the killer's PID, time and date when killed and reason
update - you have all giving me some insight, thank you very much|
If your Linux kernel is compiled with the process accounting (CONFIG_BSD_PROCESS_ACT) option enabled, you can start recording process accounting info using the accton(8) command and use sa(8) to access the recorded info. The recorded information includes the 32 bit exit code which includes the signal number.
(This stuff is not widely known / used these days, but I still remember it from the days of 4.x Bsd on VAXes ...)
Amended:
In short, the OS kernel does not care if the process is killed. That is dependant on whether the process logs anything. All the kernel cares about at this stage is reclaiming memory. But read on, on how to catch it and log it...
As per caf and Stephen C's mention on their comments...
If you are running BSD accounting daemon module in the kernel, everything gets logged. Thanks to Stephen C for pointing this out! I did not realize that functionality as I have this switched off/disabled.
In my hindsight, as per caf's comment - the two signals that cannot be caught are SIGKILL and SIGSTOP, and also the fact that I mentioned atexit, and I described in the code, that should have been exit(0);..ooops Thanks caf!
Original
The best way to catch the kill signal is you need to use a signal handler to handle a few signals , not just SIGKILL on its own will suffice, SIGABRT (abort), SIGQUIT (terminal program quit), SIGSTOP and SIGHUP (hangup). Those signals together is what would catch the command kill on the command line. The signal handler can then log the information stored in /var/log/messages (environment dependant or Linux distribution dependant). For further reference, see here.
Also, see here for an example of how to use a signal handler using the sigaction function.
Also it would be a good idea to adopt the usage of atexit function, then when the code exits at runtime, the runtime will execute the last function before returning back to the command line. Reference for atexit is here.
When the C function exit is used, and executed, the atexit function will execute the function pointer where applied as in the example below. - Thanks caf for this!
An example usage of atexit as shown:
#include <stdlib.h>
int main(int argc, char **argv){
atexit(myexitfunc); /* Beginning, immediately right after declaration(s) */
/* Rest of code */
return 0;
exit(0);
}
int myexitfunc(void){
fprintf(stdout, "Goodbye cruel world...\n");
}
Hope this helps,
Best regards,
Tom.
I don't know of any logging of signals sent to processes, unless the OOM killer is doing it.
If you're writing your own program you can catch the kill signal and write to a logfile before actually dying. This doesn't work with kill -9 though, just the normal kill.
You can see some details over thisaway.
If you use sudo, it will be logged. Other than that, the killed process can log some information (unless it's being terminated with extreme prejudice). You could even hack the kernel to log signals.
As for recording the reason a process was killed, I've yet to see a psychic program.
Kernel hacking is not for the weak of heart, but hella fun. You'd need to patch the signal dispatch routines to log information using printk(9) when kill(3), sigsend(2) or the like is called. Read "The Linux Signals Handling Model" for more information on how signals are handled.
If the process is getting it via kill(2), then unless the process is already logging the only external trace would be a kernel mod. It's pretty simple; just do a printk(), it's like printf(). Find the output in dmesg.
If the process is getting it via /bin/kill, then it would be a relatively easy matter to install a wrapper executable that did logging. But this (signal delivery via /bin/kill) is unlikely because kill is also a bash built-in.
By the way, if a process is killed with a signal is announced by the kernel to the parent process via de wait(2) system call. The value returned by this call is the exit status of the child (the lower byte) and some signal related info in the upper byte in case this process has been killed. See wait(2) for more information.
I've set breakpoints on exit and _exit and my program (multithreaded app, running on linux 2.6.16.46-0.12 sles10), is somehow still exiting in a way I can't locate
(gdb) c
...
[New Thread 47513671297344 (LWP 15279)]
[New Thread 47513667103040 (LWP 15280)]
[New Thread 47513662908736 (LWP 15281)]
Program exited with code 0177.
(gdb)
the exit functions reside in libc so there's no deferred load shared library issues. Anybody know of some other mysterious trigger for exit that can't be caught?
EDIT: the problem is now academic only. I tried binary search debugging, backing out a subset of my changes (the problem went away). After I applied them again in sequence, I can no longer repro the problem, even with things restored to the original state.
EDIT2: I found one reason for this sort of error recently, which may have been the original source for this problem. For historical reasons our product uses the evil linker flag -Bsymbolic. Among the side effects of this is that when a symbol is undefined but called, the GLIBC runtime linker will bomb in exactly this way, and you see it in the debugger as a process exited with 0177. When the runtime linker aborts this way, I'd guess it makes the syscall to _exit directly (rather than using the C runtime library exit() or _exit()). That would be consistent with the fact that I was unable to catch this with an the exit breakpoints in the debugger.
There are two common reasons for _exit breakpoint to "miss" -- either GDB didn't set the breakpoint in the right place, or the program performs (a moral equivalent of) syscall(SYS_exit, ...)
What do info break and disassemble _exit say?
You might be able to convince GDB to set the breakpoint correctly with break *&_exit. Alternatively, GDB-7.0 supports catch syscall. Something like this should work (assuming Linux/x86_64; note that on ix86 the numbers will be different) regardless of how the program exits:
(gdb) catch syscall 60
Catchpoint 3 (syscall 'exit' [60])
(gdb) catch syscall 231
Catchpoint 4 (syscall 'exit_group' [231])
(gdb) c
Catchpoint 4 (call to syscall 'exit_group'), 0x00007ffff7912f3d in _exit () from /lib/libc.so.6
Update:
Your comment indicates that _exit breakpoint is set correctly, so it's likely that your process just doesn't execute _exit.
That leaves syscall(SYS_exit, ...) and one other possibility (which I missed before): all threads executing pthread_exit. You might want to set a breakpoint on pthread_exit as well (and execute info thread each time you hit it -- the last thread to do pthread_exit will cause the process to terminate).
Edit:
Also worth noting that you can use mnemonic names, rather than syscall numbers. You can also simultaneously add multiple syscalls to the catch list like so:
(gdb) catch syscall exit exit_group
Catchpoint 2 (syscalls 'exit' [1] 'exit_group' [252])
Setting the breakpoint on _exit was a good idea.
You might also try linking statically, just to take a stack of potential gdb complications off the table.
0177 is suspiciously like the wait status wait(2) returns for child stopped, but gdb is printing the exit status, which is a different thing, so that's probably a real exit argument.
It might be that you have some lazy references unresolved in some shared library loaded into process. I have exactly the same situation that "someone somewhere" exited process and that appeared to be unresolved reference.
Check your process with "ldd -r" option.
Looks like ld.so or whatever does lazy resolving of some symbols to uniform exit function (which should be abort IMHO).
My situation:
$ ldd ./program
undefined symbol: XXXX (/usr/lib/libYYY.so)
$./program
program: started!
...
<program is running regardless of undefined references>
Now exit appeared when I've invoked some scenario that used function that was undefined. It always exited with exitcode=127 and gdb reported 0177.