Two questions regarding threads - multithreading

If a library referenced by the core is missing or corrupted when GDB loads it, how do I isolate that?
I have also read that a thread could have overwritten its own stack. How do I detect that?
How do I isolate the above problems from the backtrace below?
/etc/gdb/gdbinit:105: Error in sourced command file:
Error while executing Python code.
Reading symbols from /opt/hsp/bin/addrman...done.
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
warning: Corrupted shared library list: 0x0 != 0x4ed700
warning: no loadable sections found in added symbol-file system-supplied DSO at
0x7ffd50ff6000
Core was generated by `addrman --notification-socket
/opt/hsp/sockets/memb_notify.socket'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
292 ./ps/ps_lock_track.h: No such file or directory.
(gdb) bt
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
#1 0x0000000000000000 in ?? ()

It looks like the core file reflects heap or stack corruption: note that this=0x348 in frame #0 is not a plausible object address, and frame #1 is a null return address. Corruption like this is often the result of a buffer overflow or other undefined behavior.
If you are running on Linux, I would try Valgrind. It can often spot corruption very quickly. Windows has some similar tools.
Yes, a multithreaded application can overflow a thread's stack. Each thread is allocated only a limited amount of stack space. This usually happens only when the function call chain is very deep or when large local objects are allocated on the stack.
Some interesting information here and here on setting the stack size for Linux applications.
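On glibc/Linux, the default stack size for new threads is typically taken from the process stack limit at startup; a quick way to see that limit (the 8192 kB shown here is a common default, not a guarantee) is:
$ ulimit -s
8192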
Faced with your problem, I would:
Check all the callers of the lktrk_locker_set method. Carefully investigate each, if possible, to see if there is an obvious stack overflow or heap corruption
Try to use Valgrind or a similar tool to spot the issue (a sample invocation is sketched after this list)
Add debug logging to isolate the issue
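For example, a minimal Memcheck run over the crashing binary might look like this (the paths and arguments are taken from the crash banner above; adjust them for your environment):
$ valgrind --tool=memcheck --track-origins=yes \
      /opt/hsp/bin/addrman --notification-socket /opt/hsp/sockets/memb_notify.socket
Memcheck reports invalid reads and writes at the moment they happen, which usually points much closer to the original overflow than the eventual crash site does.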

warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
The above error is usually a sign that you gave GDB different system libraries (or the main binary) from the ones used when the core dump was produced.
Either you are analyzing a "production" core dump on a development machine, or you've upgraded system libraries between the time the core dump was produced and the time you are analyzing it, or you've rebuilt the main binary.
See this answer for what to do if one of the above is correct.
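If so, the usual fix is to point GDB at copies of the libraries (and the exact main binary) that were in use when the core was produced, for example (paths are illustrative):
$ gdb /opt/hsp/bin/addrman
(gdb) set sysroot /path/to/copy/of/production/root
(gdb) set solib-search-path /path/to/copy/of/production/libs
(gdb) core-file /path/to/core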


How to collect minimum debug data from a truncated core of a Linux C program, with or without GDB?
I spent some hours trying to reproduce the segmentation fault with the same program, but I did not succeed; I only managed to get it once. I don't know how to extract even minimal information with GDB. I thought I was experienced with GDB, but now I think I have a lot to learn again...
Perhaps someone knows a debugger other than GDB that could help. By minimum information I mean the calling function that produced the segfault.
My core file is 32 MB, but GDB indicates that the memory allocated by the process was about 700 MB, so the core is truncated. How can I tell whether the stack frames involved in the segfault actually made it into the core file?
From the name of the core file I already know which thread was involved, but that is not enough to debug the program.
I collected the /proc/$PID/maps file of the main process, but I don't know whether it is useful for recovering the faulting function.
Moreover, how can I tell whether the segfault was raised from inside that thread or delivered to it from outside?
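One rough way to see what survived the truncation (a sketch; the core file name is assumed) is to compare the core's LOAD segments with its actual size:
$ ls -l core
$ readelf --segments core
Any LOAD segment whose file offset plus file size lies beyond the end of the 32 MB file was never written out; if the segment covering the faulting thread's stack is among them, GDB cannot reconstruct that backtrace.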

Game renderer thread backtrace with no symbols on Linux

I have a game application running on Linux. We are a gaming company. I am seeing a random crash that occurs roughly once every 24-48 hours. The last time it occurred I tried to see the backtrace of the thread where it crashed, but GDB showed the stack as corrupted, with no symbols.
Now, when I run the game and interrupt it under GDB, sometimes I can see the function call stack for this thread, but most of the time I do not see any symbols. The thread is a renderer thread.
Some of the game libraries we are using are proprietary third-party libraries with no debugging symbols. So I was wondering: could it be that the renderer thread's call stack is deep inside these libraries (various calls within them) without symbols, and that is why I do not see the call stack? If that is true, how can I fix it?
If not, any idea what the cause could be?
(gdb) bt
#0 0x9f488882 in ?? ()
Also, I did an info proc mappings, and for the address above in the bt I found the following:
0x9f488000 0x9f48a000 0x2000 0x0 /tmp/glyFI8DP (deleted)
This means that your third-party library is using just-in-time compilation: it generates some code, mmaps it into your process, and then deletes the backing file.
On x86_64, GDB needs unwind descriptors to unwind the stack, but it can't get them from the deleted file, so you get no stack trace.
You have a few options:
contact the third-party developers and ask them "how can we get stack traces in this situation?"
dump the contents of the region with GDB's dump command:
(gdb) dump memory /tmp/gly.so 0x9f488000 0x9f48a000
If you are lucky, the resulting file will actually be an ELF image (it doesn't have to be) and may have symbols and unwind descriptors in it. Use readelf --all /tmp/gly.so to look inside.
If it is an ELF file, you can let GDB know that that's what's mapped at 0x9f488000. You'll need to find the address ($tstart below) of the .text section in it (it should be in the readelf output), then:
(gdb) add-symbol-file /tmp/gly.so 0x9f488000+$tstart
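For example, if readelf reported the .text section at file address 0x1040 (a made-up value for illustration), the commands would look like:
$ readelf -S /tmp/gly.so | grep '\.text'
  [11] .text    PROGBITS    0000000000001040  ...
(gdb) add-symbol-file /tmp/gly.so 0x9f488000+0x1040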

Linux coredump backtrace missing frames

I got a core dump from a segmentation-fault crash of a multi-threaded process. While inspecting the core file with GDB, I found that some threads (not all) have a backtrace like this:
Thread 4 (LWP 3344):
#0 0x405ced04 in select () from /lib/arm-linux-gnueabi/libc.so.6
#1 0x405cecf8 in select () from /lib/arm-linux-gnueabi/libc.so.6
#2 0x000007d0 in ?? ()
#3 0x000007d0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I checked our source code and found that those threads do eventually call select(). I'd like to understand why/how the middle frames are omitted.
The same pattern also occurs with read() calls.
Any idea what is going on here? I'm afraid this indicates something wrong with our core-dump configuration. Thanks in advance for the help!
Edit:
Thanks for all the responses. I apologize for not giving enough info; here is more:
The executable is built with -g and without any optimizations (-O0).
We generally use less than half of our RAM (300-400 MB out of 1 GB).
Actually, I have also seen this backtrace pattern in different core files (dumped for different faults).
What makes this symptom really weird (different from ordinary stack corruption) is that more than one thread has this backtrace pattern, with frames #0 and #1 exactly the same as the ones above, while the addresses in frames #2 and #3 may differ.
It looks like you might be compiling without debugging enabled.
Make sure you pass in -g when compiling.
Also, as Joachim Pileborg mentioned in his comment, the repeated stack frame implies that you probably corrupted your stack somewhere. That is usually the result of writing past the end of an array, into the stored frame pointer.
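One manual way to look for that (a sketch) is to dump the raw stack words around the corrupt frame and scan them for ASCII text or repeated values left behind by the overrun:
(gdb) thread 4
(gdb) x/64xw $sp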
If you want to investigate segmentation faults caused by memory-related problems, or check for memory leaks, it is better to use Valgrind, which reports detailed information about both.

Understanding GDB and Segfault Messages

I was recently debugging an application that was segfaulting on a regular basis--I solved the problem, which was relatively mundane (reading from a null pointer), but I have a few residual questions I've been unable to solve on my own.
The gdb stack trace began like this in most cases:
0x00007fdff330059f in __strlen_sse42 () from /lib64/libc.so.6
Using information from /proc/[my proc id]/maps to obtain the base address of the shared library, I could see that the problem always occurred at the same instruction within the library--at offset 0x13259f, which is
pcmpeqb (%rdi),%xmm1
So far, so good. But then the OS (Linux) would also write an error message to /var/log/messages that looks like this:
[3540502.783205] node[24638]: segfault at 0 ip 00007f8abbe6459f sp 00007fff7bf2f148 error 4 in libc-2.12.so[7f8abbd32000+189000]
Which confuses me. On the one hand, the kernel correctly identifies the fault (a user-mode protection fault), and, by subtracting the base address of the shared library from the instruction pointer, we arrive at the same relative offset--0x13259f--as we do in GDB. But the library the kernel identifies is different, the address of the instruction is different, and the function and instruction within that library are different. That is, the instruction within libc-2.12.so is
0x13259f <__memset_sse2+911>: movdqa %xmm0,-0x43(%edx)
So, my question is, how can gdb and the kernel message agree on the type of fault, and on the offset of the instruction relative to the base address of the shared library, but disagree on the address of the instruction pointer and the shared library being used?
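Concretely, the arithmetic from the kernel message is:
0x7f8abbe6459f (faulting ip) - 0x7f8abbd32000 (libc base in the message) = 0x13259f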
But the library the kernel identifies is different,
No, it isn't. Do ls -l /lib64/libc.so.6, and you'll see that it's a symlink to libc-2.12.so.
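Typical output (the date and permissions shown here are illustrative):
$ ls -l /lib64/libc.so.6
lrwxrwxrwx 1 root root 12 Mar  4  2015 /lib64/libc.so.6 -> libc-2.12.so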
the address of the instruction is different
The kernel message is for a different execution from the one you've observed in GDB, and address randomization caused libc-2.12.so to be loaded at a different base address.
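You can confirm that address-space layout randomization is active on the machine (2 is the usual default and means full randomization; 0 would disable it):
$ cat /proc/sys/kernel/randomize_va_space
2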
and the function and instruction within that library is different. That is, the instruction within libc-2-12.so is 0x13259f <__memset_sse2+911>: movdqa %xmm0,-0x43(%edx)
It is likely that you looked at a different libc-2.12.so from the one that is actually used.
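To make sure you are disassembling the libc build that the process actually mapped, you can check from inside GDB with the core loaded; the command prints the full path of each matching library:
(gdb) info sharedlibrary libc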

Porting Unix Ada app to Linux: Seg fault before program begins

I am an intern who was offered the task of porting a test application from Solaris to Red Hat. The application is written in Ada. It works just fine on the Unix side. I compiled it on the Linux side, but now it is giving me a seg fault. I ran the debugger to see where the fault was and got this:
Warning: In non-Ada task, selecting an Ada task.
=> runtime tasking structures have not yet been initialized.
<non-Ada task> with thread id 0b7fe46c0
process received signal "Segmentation fault" [11]
task #1 stopped in _dl_allocate_tls
at 0870b71b: mov edx, [edi] ;edx := [edi]
This seg fault happens before any calls are made or anything is initialized. I have been told that 'tasks' in Ada get started before the rest of the program, so the problem could be with a task that is running.
But here is the kicker. This program just generates some code for another program to use. The OTHER program, when compiled under Linux, gives me the same kind of seg fault with the same kind of error message. This leads me to believe there might be some little tweak I can use to fix all of this, but I just don't have enough knowledge about Unix, Linux, and Ada to figure this one out all by myself.
This is a total shot in the dark, but tasks can blow up like this at startup if they try to allocate too much local memory on the stack. Your main program can safely use the system stack, but tasks have to have their stacks allocated at startup from dynamic memory, so typically your runtime has a default stack size for tasks. If your task tries to allocate a large array, it can easily blow past that limit. I've had it happen to me before.
There are multiple ways to fix this. One way is to move all your task-local data into package global areas. Another is to dynamically allocate it all.
If you can figure out how much memory would be enough, you have a couple more options. You can make the task a task type, and then use a
for My_Task_Type_Name'Storage_Size use Some_Huge_Number;
statement. You can also place a pragma Storage_Size inside the task definition, but I think the "for" statement is preferred.
Lastly, with Gnat you can also change the default task stack size with the -d flag to gnatbind.
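For example, with GNAT the bind step might look like this (my_main.ali is a placeholder for your main unit's ALI file; -d8m requests an 8 MB default stack for every task):
$ gnatbind -d8m my_main.ali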
Off the top of my head, if the code was used on SPARC machines and you're now running on an x86 machine, you may be running into endianness problems.
It's not much help, but it is a common gotcha when going multi-platform.
Hunch: the linking step didn't go right. Perhaps the wrong run-time startup library got linked in?
(How likely to find out what the real trouble was, months after the question was asked?)
