How to know who started a thread - multithreading

I am trying to debug a crash in gdb where the core was dumped on this thread. There are 40+ other threads running at the same time. How do I figure out where this thread 42 is started from?
Also, why is the last line (frame #0) not showing up?
Thread 42 (Thread 0x2aaba65ce940 (LWP 15854)):
#0 0x0000003a95605b03 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#1 0x0000003a9560684b in start_thread () from /lib64/libpthread.so.0
#2 0x0000003a946d526d in clone () from /lib64/libc.so.6
#3 0x0000000000000000 in ?? ()
I am using gdb version 7.7

How do I figure out where this thread 42 is started from?
You can't: neither GDB, nor the OS keeps track of "who started this thread". (It is also often quite useless to know where a particular thread was created).
What you could do is either put instrumentation into your own calls to pthread_create and log "thread X created thread Y", or use catch syscall clone and print creation stack traces in GDB, then match them later to the crashed thread (match its LWP to the return value of clone earlier).
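The instrumentation idea can be sketched roughly as follows, assuming Linux with glibc. The names logged_pthread_create, StartInfo, trampoline and current_lwp are hypothetical and not part of any library. The wrapper records the creator's kernel thread ID, and the new thread logs both IDs before running the real start routine, so the log can later be matched against the LWP number GDB prints for the crashed thread.

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>

// Kernel thread ID of the caller (the "LWP" number GDB shows). Linux-specific.
static long current_lwp() { return syscall(SYS_gettid); }

// Hypothetical helper: the real start routine plus the creator's LWP.
struct StartInfo {
    void *(*fn)(void *);
    void *arg;
    long creator_lwp;
};

// Runs in the new thread: log who created it, then call the real routine.
static void *trampoline(void *p) {
    StartInfo *info = static_cast<StartInfo *>(p);
    StartInfo copy = *info;
    delete info;
    std::fprintf(stderr, "thread LWP %ld created thread LWP %ld\n",
                 copy.creator_lwp, current_lwp());
    return copy.fn(copy.arg);
}

// Hypothetical drop-in replacement for your own pthread_create calls.
int logged_pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                          void *(*fn)(void *), void *arg) {
    assert(fn != nullptr);
    StartInfo *info = new StartInfo{fn, arg, current_lwp()};
    int rc = pthread_create(thread, attr, trampoline, info);
    if (rc != 0) delete info;  // the thread never started, so free it here
    return rc;
}
```

This only covers threads created through the wrapper; threads started inside third-party libraries would still need the catch syscall clone approach.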
Also, why is the last line (frame #0) not showing up?
You mean frame #3. It doesn't exist -- clone is where the thread is born (comes into existence).
P.S. Installing libc debug symbols so you can see where inside __nptl_deallocate_tsd the thread crashed is more likely to provide clues than knowing thread creation details.

Related

How to access data written in a socket without reading it

I'm working on an embedded system having Linux.
A client thread is writing some data to the socket, but what the server thread reads on the other side isn't the same as what was written. This causes the thread (and the parent process) to crash.
I'm new to networking and Linux.
I have dumped every piece of data which is being written, it's all fine.
The function trace in gdb shows the following information.
(gdb)
#0 0x00007f62be8e8670 in getenv () from /lib/libc.so.6
#1 0x00007f62be92057a in __libc_message () from /lib/libc.so.6
#2 0x00007f62be99f927 in __fortify_fail () from /lib/libc.so.6
#3 0x00007f62be99f8f0 in __stack_chk_fail () from /lib/libc.so.6
#4 0x0000000000406471 in reading (sockFd=15) at __line_number_in_the_program__
#5 0x793bcf318b18bb01 in ?? ()
#6 0x117d0300942ff567 in ?? ()
#7 0x0000000100000000 in ?? ()
..
..
..
It goes till #785 with some [random] address.
reading() is the function which processes the read data in the server thread.
I suspect there is something going wrong inside the socket.
Is there any way to see the data which is in the socket's (client/server) buffer without reading it?
Or any other way to debug it further with gdb?
There are already some checks to handle the read data properly but those are also not helping.
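You have a stack buffer overflow problem. The __stack_chk_fail frame in your backtrace is the GCC stack protector reporting that reading() overwrote its stack canary, and the garbage addresses in frames #5 onward are what is left of the overwritten stack. If you have never heard of the GCC stack protector, now is the time to look it up. Whilst Wireshark is the obvious tool for looking at data in flight, this is not the locus of your problem. Your server should be proof against any and all malicious data read from the network. This is basic good server design and implementation practice. You have a stack buffer overflow problem.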
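A minimal sketch of what being proof against bad network data can look like in a reader. The question does not show the real protocol, so this assumes a hypothetical framing of a 4-byte length in host byte order followed by the payload; read_exact and read_frame are illustrative names. The point is only that a length supplied by the peer is checked against the buffer capacity before anything is copied onto the stack.

```cpp
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>

// Read exactly n bytes; returns false on error or if the peer closes early.
static bool read_exact(int fd, void *buf, size_t n) {
    unsigned char *p = static_cast<unsigned char *>(buf);
    while (n > 0) {
        ssize_t got = read(fd, p, n);
        if (got < 0 && errno == EINTR) continue;  // interrupted, retry
        if (got <= 0) return false;
        p += got;
        n -= static_cast<size_t>(got);
    }
    return true;
}

// Hypothetical framing: 4-byte length (host byte order), then the payload.
// Returns the payload length, or -1 if the frame is truncated or too large.
long read_frame(int fd, unsigned char *buf, size_t cap) {
    assert(buf != nullptr);
    uint32_t len = 0;
    if (!read_exact(fd, &len, sizeof len)) return -1;
    if (len > cap) return -1;  // never trust a length supplied by the peer
    if (!read_exact(fd, buf, len)) return -1;
    return static_cast<long>(len);
}
```

After read_frame rejects an oversized length the stream is no longer in sync, so a caller would normally close the connection rather than keep reading.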

What could cause trace/breakpoint trap (core dumped)? [duplicate]

I know this question has been asked before, but I have read all the threads and I didn't find an answer.
From the moment I execute run to start debugging my project, I get this: Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]. When I press Ctrl+C, gdb tells me: Program received signal SIGINT, Interrupt.
0x00000000 in ?? ()
Usually it tells me which file and function it was interrupted in, not 0x00000000 in ?? ().
GDB no longer hits breakpoints, and what makes matters crazier is the fact that a colleague and I are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
When I try to get info about the threads using info threads, here's what I get:
[New Thread 20]
[New Thread 21]
[New Thread 22]
Id Target Id Frame
4 Thread 22 (ssp=0xa9004d5c) 0x00000000 in ?? ()
3 Thread 21 (ssp=0xa9002e64) 0x00000010 in ?? ()
2 Thread 20 (ssp=0xa9000ef4) 0x00000000 in ?? ()
The current thread <Thread ID 1> has terminated. See `help thread'
There's no thread 6, and there's no * to indicate which thread gdb is using. I don't even know if that's linked to the problem.
Can anyone please help me?
You are not providing nearly enough info to help you. Details matter, and you are withholding them. Versions of GDB and gdbserver matter, how you invoke GDB and gdbserver matter, what warnings you receive from GDB (if any) matter.
Now, this error message:
Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]
usually means that gdbserver has not attached one of the threads of your process, and that thread has tried to execute breakpoint instruction (you do have breakpoints set before this happens, don't you?).
One of the reasons this may happen is when your GDB loads "wrong" libthread_db.so (one that doesn't match the target libc.so.6).
what makes matters crazier is the fact that a colleague and I are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
I am not sure what you mean by "same session", but it's probably not "when he types commands, they work; but when I type the same commands into the same GDB, they don't".
One difference between you and your colleague could be the LD_LIBRARY_PATH environment variable setting. Another could be in ~/.gdbinit or in ./.gdbinit.
I suggest running gdb -nx to get rid of the latter, and unsetting LD_LIBRARY_PATH to get rid of the former.
The problem with the whole thing, which for some reason no one seemed to notice, is this:
this is how I call gdb: /usr/local/build/gdbx.y/gdb/gdb. What I should be doing is this: /usr/local/build/gdbx.y/build/gdb/gdb
It was a path problem.

Backtrace for exited thread.

I noticed that one thread's backtrace looks like:
Thread 8 (Thread 0x7f385f185700 (LWP 12861)):
#0 0x00007f38655a3cf4 in __mcount_internal (frompc=4287758, selfpc=4287663) at mcount.c:72
#1 0x00007f38655a4ac4 in mcount () at ../sysdeps/x86_64/_mcount.S:47
#2 0x0000000000000005 in ?? ()
#3 0x00007f382c02ece0 in ?? ()
#4 0x000000000000002d in ?? ()
#5 0x000000000000ffff in ?? ()
#6 0x0000000000000005 in ?? ()
#7 0x0000000000000005 in ?? ()
#8 0x0000000000000000 in ?? ()
It seems to be exited thread but I am not sure.
I would like to know how to interpret it. In particular, I don't understand what LWP means, or what Thread 0x7f385f185700 is (what is that address?).
Moreover, I noticed that the profiler indicates that __mcount_internal takes a relatively large amount of time. What is it, and why is it time-consuming? In particular, what are the frompc and selfpc counters?
My kernel is Linux 4.4.0 and the threads are POSIX-compatible (C++11 implementation).
LWP = Light Weight Process, and means thread. Linux threads each have their own thread-ID, from the same sequence as PID numbers, even though they're not a separate process. If you look in /proc/PID/task for a multi-threaded process, you will see the entries for each thread ID.
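0x7f385f185700 is the pthread ID, as returned by pthread_self(3). In glibc, pthread_t is an integer type whose value is the address of the thread's internal descriptor, which is why it looks like a pointer.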
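To see the two identifiers side by side, a small sketch like the following can help. It assumes Linux with glibc, where pthread_t is an unsigned long; ThreadIds, current_thread_ids and format_thread_ids are hypothetical names made up for this example. It formats the IDs the same way GDB labels a thread.

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical holder for the two IDs GDB prints for each thread.
struct ThreadIds {
    unsigned long pthread_id;  // value of pthread_self()
    long lwp;                  // kernel thread ID, as listed in /proc/PID/task
};

ThreadIds current_thread_ids() {
    ThreadIds ids;
    // Assumption: glibc, where pthread_t is an unsigned long.
    ids.pthread_id = static_cast<unsigned long>(pthread_self());
    ids.lwp = syscall(SYS_gettid);  // Linux-specific
    assert(ids.lwp > 0);
    return ids;
}

// Writes e.g. "Thread 0x... (LWP ...)" into out; returns snprintf's result.
int format_thread_ids(const ThreadIds &ids, char *out, size_t cap) {
    return std::snprintf(out, cap, "Thread 0x%lx (LWP %ld)",
                         ids.pthread_id, ids.lwp);
}
```

Calling current_thread_ids() from each thread at startup and logging the formatted line gives values that can be compared directly with the thread headers in a GDB backtrace.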
This thread is stopped at RIP = 0x00007f38655a3cf4, the address in frame #0.
frompc and selfpc are function arguments to the __mcount_internal() glibc function.
Your backtrace can show names and args for them because you have debug symbols installed for glibc. You just get ?? for the parent functions because you don't have debug info installed for the program or library containing them. (Compile your own program with -g, and install packages like qtbase5-dbg or libglib2.0-0-dbg to get debug symbols for libraries packaged by your distro).
mcount is related to profiling: GCC inserts calls to it when code is compiled with -pg. That would explain why it takes Program Counter values as args.
See also: Why do applications compiled by GCC always contain the _mcount symbol?
That thread has not exited. You wouldn't see as many details if it had. (And probably wouldn't see it at all.)

How do I view the crash reason in a core dump?

I'm trying to analyze the core dump of one of my applications, but I'm not able to find the reason for the crash.
When I run gdb binary file corefile I see the following output:
Program terminated with signal SIGKILL, Killed.
#0 0xfedcdf74 in _so_accept () from /usr/lib/libc.so.1
(gdb)
But I am pretty sure that no one has executed kill -9 <pid>. With info thread, I can see all the threads launched by the application, but I can see nothing special about any thread.
By running bt full or maint info sol-threads I don't find anything that leads to the bug. I just see the stack trace for each thread without any information about the bug.
Finally I've found a thread which causes the kill signal.
#0 0xfedcebd4 in _lwp_kill () from /usr/lib/libc.so.1
#1 0xfed67bb8 in raise () from /usr/lib/libc.so.1
#2 0xfed429f8 in abort () from /usr/lib/libc.so.1
#3 0xff0684a8 in __cxxabiv1::__terminate(void (*)()) () from /usr/local/lib/libstdc++.so.5
#4 0xff0684f4 in std::terminate() () from /usr/local/lib/libstdc++.so.5
#5 0xff068dd8 in __cxa_pure_virtual () from /usr/local/lib/libstdc++.so.5
#6 0x00017f40 in A::method (this=0x2538d8) at A.cc:351
Class A inherits from an abstract class, and line 351 calls a virtual function that is declared in the abstract class and defined in A. I don't understand why, if object A exists, the call to the virtual base function crashes.
That SIGKILL could be caused by your app exceeding some resource limit. Try to get the system log and see if there are any resource limit exceeded messages.
References
Solaris Administration Guide: Resource Controls

How to stop gdb from showing "New Thread" and "Thread exited"?

Is there a way to tell gdb not to show messages of the form
[New Thread 0x7fffc8ff9700 (LWP 32104)]
[Thread 0x7fffc8ff9700 (LWP 32104) exited]
I have to debug an application that prints millions of these messages, which slows everything down, and I can't seem to get to the problematic code.
(gdb) set print thread-events off
This setting is documented in the GDB manual's section on debugging programs with multiple threads.