Game renderer thread backtrace with no symbols on Linux

I have a game application running on Linux (we are a gaming company). I am hitting a random crash that occurs roughly once every 24-48 hours. The last time it occurred I tried to look at the backtrace of the thread where it crashed, but gdb showed that the stack was corrupted, with no symbols.
Now, when I run the game and interrupt it in gdb, sometimes I am able to see the function call stack for this thread, but most of the time I do not see any symbols. The thread is a renderer thread.
Some of the game libraries we use are proprietary third-party libraries with no debugging symbols. So I was wondering: could it be that the renderer thread's call stack is deep inside these libraries (various calls within the library) which have no symbols, and that is why I do not get to see the call stack? If that is true, how can I fix this?
If not, any idea what the cause could be?

(gdb) bt
#0 0x9f488882 in ?? ()
Also, I did an info proc mappings, and for the address above in the bt I found the following:
0x9f488000 0x9f48a000 0x2000 0x0 /tmp/glyFI8DP (deleted)
This means that your third-party library is using just-in-time compilation: it generates some code, mmaps it into your process, and then deletes the backing file.
On x86_64, GDB needs unwind descriptors to unwind the stack, but it can't get them from the deleted file, so you get no stack trace.
You have a few options:
contact the third-party developers and ask them "how can we get stack traces in this situation?"
dump the contents of the region with the GDB dump command:
(gdb) dump binary memory /tmp/gly.so 0x9f488000 0x9f48a000
If you are lucky, the resulting file will actually be an ELF (it doesn't have to be), and may have symbols and unwind descriptors in it. Use readelf --all /tmp/gly.so to look inside.
If it is an ELF file, you can let GDB know that that is what's mapped at 0x9f488000. You'll need to find the address ($tstart below) of the .text section in it (it should be in the readelf output), then:
(gdb) add-symbol-file /tmp/gly.so 0x9f488000+$tstart
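For illustration, if readelf reported the .text section of the dumped file at 0x5c0 (a made-up value; use whatever your readelf output actually shows), the session would look roughly like this:
readelf -S /tmp/gly.so | grep '\.text'
(gdb) add-symbol-file /tmp/gly.so 0x9f488000+0x5c0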

Related

How to get the gdb call stack trace?

I have a core dump and a file where debug information is stored, can I use gdb without using an executable file to get a call stack with the name of functions and lines?
can I use gdb without using an executable file to get a call stack with the name of functions and lines?
At least on Linux/x86_64, the answer is no: the info saved after objcopy --only-keep-debug is not sufficient; you also need the executable file.
This is happening (at least in part) because the debug_file does not have the .eh_frame section, which is necessary for unwinding on x86_64.
If you are debugging the core dumps yourself, there is no reason to create debug_file -- just keep the original executable with full debug info for debugging (you can still ship a smaller stripped file to execution machines).
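For reference, a typical split-debug workflow looks like the sketch below (myprog is a placeholder name); the point of the answer above is the last step: gdb still needs the executable, the debug file by itself is not enough.
objcopy --only-keep-debug myprog myprog.debug      # debug info; contents of allocatable sections such as .eh_frame are dropped here
objcopy --strip-debug myprog myprog.stripped       # what you ship; keeps code and .eh_frame
objcopy --add-gnu-debuglink=myprog.debug myprog.stripped
gdb myprog.stripped core                           # executable (plus debug file found via the debuglink) and the core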

How to collect minimum debug data from a truncated core of a Linux C program, with or without GDB?

How can I collect minimum debug data from a truncated core of a Linux C program?
I tried for some hours to reproduce the segmentation-fault signal with the same program, but I did not succeed; I only managed to get it once. I don't know how to extract minimum information with gdb. I thought I was experienced with gdb, but now I think I have a lot to learn again...
Perhaps someone knows a debugger other than GDB to try something with. By minimum information I mean the calling function which produced the segfault signal.
My core file is 32 MB, and GDB indicates the core's allocated memory required is 700 MB. How can I know whether the deepest stack function involved in the segfault is present inside the core file or not?
From the name of the core file I already know the name of the thread concerned, but that is not enough to debug the program.
I collected the /proc/$PID/maps file of the main program, but I don't know if it is useful for retrieving the segfault function.
Moreover, how can I know whether the segfault signal was produced from inside the thread or whether it came from outside it?
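One way to check what actually made it into a truncated core (standard readelf usage, not something from this thread) is to compare the core's program headers with its real size and with the saved maps file:
readelf -l core | grep -A1 LOAD     # the offsets and sizes the kernel intended to write
ls -l core                          # the real size; anything past it was truncated away
grep stack maps                     # the stack region(s) recorded at crash time
If the segment covering the faulting thread's stack begins beyond the truncation point, the deepest frames are simply not in the file.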

Two questions regarding threads

If there is a missing/corrupted library in the gdb core, how do I isolate it?
I also read that there is a possibility the thread could have overwritten its own stack; how do I detect that?
How do I isolate the above problems with the bt below?
/etc/gdb/gdbinit:105: Error in sourced command file:
Error while executing Python code.
Reading symbols from /opt/hsp/bin/addrman...done.
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
warning: Corrupted shared library list: 0x0 != 0x4ed700
warning: no loadable sections found in added symbol-file system-supplied DSO at
0x7ffd50ff6000
Core was generated by `addrman --notification-socket
/opt/hsp/sockets/memb_notify.socket'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
292 ./ps/ps_lock_track.h: No such file or directory.
(gdb) bt
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
#1 0x0000000000000000 in ?? ()
It looks like the core file is corrupt, likely due to heap or stack corruption. Corruption is oftentimes the result of a buffer overflow or other undefined behavior.
If you are running on Linux, I would try valgrind. It can oftentimes spot corruption very quickly. Windows has some similar tools.
Yes, a multithreaded application can overflow the stack. Each thread is only allocated a limited amount. This usually only happens if you have a very deep function call stack or you are allocating large local objects on the stack.
There is useful information available on setting the stack size for Linux applications; a minimal sketch follows.
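As a minimal sketch (assuming glibc, where the default stack size of new threads is taken from the soft stack rlimit), one quick experiment is to raise the limit before launching the program; pthread_attr_setstacksize() is the per-thread API if you need finer control:
ulimit -s              # show the current soft limit in KiB (commonly 8192)
ulimit -s 65536        # raise it to 64 MiB for this shell and its children
./your-program         # placeholder; relaunch the real binary under the larger limit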
Faced with your problem, I would:
Check all the callers of the lktrk_locker_set method. Carefully investigate each, if possible, to see if there is an obvious stack overflow or heap corruption.
Try Valgrind or a similar tool to spot the issue (see the example invocation after this list).
Add debug logging to isolate the issue.
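A possible Memcheck invocation, using the command line shown in the core above (the flags are standard Valgrind options; adjust to taste):
valgrind --tool=memcheck --track-origins=yes \
    /opt/hsp/bin/addrman --notification-socket /opt/hsp/sockets/memb_notify.socket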
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
The above error is usually a sign that you gave GDB different system libraries (or the main binary) from the ones used when the core dump was produced.
Either you are analyzing a "production" core dump on a development machine, or you've upgraded system libraries between the time the core dump was produced and the time you are analyzing it, or you've rebuilt the main binary.
See this answer for what to do if one of the above is correct.

Why is "echo l > /proc/sysrq-trigger" call trace output always similar?

According to the official kernel.org documentation, echo l > /proc/sysrq-trigger is supposed to give me the current call trace of all CPUs. But when I do this a couple of times and look into dmesg afterwards, the call traces look completely similar. Why is that?
Why the backtraces look the same
In your case, the CPU #0 backtrace shows that it is executing your sysrq command (judging by the write_sysrq_trigger() function):
delay_tsc+0x1f/0x70
arch_trigger_all_cpu_backtrace+0x10a/0x140
__handle_sysrq+0xfc/0x160
write_sysrq_trigger+0x2b/0x30
proc_reg_write+0x39/0x70
vfs_write+0xb2/0x1f0
SyS_write+0x42/0xa0
system_call_fast_compare_end+0x10/0x15
and the CPU #1 backtrace shows that it is in the idle state (judging by the cpuidle_enter_state() function):
cpuidle_enter_state+0x40/0xc0
cpu_startup_entry+0x2f8/0x400
start_secondary+0x20f/0x2d0
Try to load your system very intensively, and then execute your sysrq command to get new backtraces. You will see that one CPU is executing your sysrq command, and the second CPU is no longer idle, but doing some actual work.
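For example (a sketch; stress is just one convenient load generator, any busy loop will do, and the sysrq write needs root):
stress --cpu $(nproc) --timeout 30 &
echo l > /proc/sysrq-trigger
dmesg | tail -n 60     # the other CPUs should now show real call chains instead of cpuidle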
User-space backtrace
As for user-space functions in the kernel backtrace: although the system call is executing (in kernel space) on behalf of a user-space process (see Comm: bash in your backtrace for CPU0), it's not possible to print the user-space process backtrace using the standard kernel backtrace mechanism (which is implemented in the dump_stack() function). The problem is that the kernel stack doesn't contain any user-space process calls, which is why you can see only kernel functions in your backtraces.
User-space calls can be found on the user-space stack of the corresponding process. For this purpose I would recommend the OProfile profiler (see also the gdb alternative after the links below). Of course, it will give you just a binary stack; in order to obtain actual function names you will need to provide symbol information to gdb.
Details:
[1] kernel stack and user-space stack
[2] how to dump kernel stack in syscall
[3] How to print the userspace stack trace in linux kernelspace
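As a lighter-weight alternative to OProfile (not from the original answer), attaching gdb to the process gives a one-off user-space backtrace with symbol names, provided the binary has them:
gdb -p <PID> -batch -ex "thread apply all bt"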

Access a process's kernel stack given a process ID in kernel debugging

I have Linux running on VMware, and I use gdb on the host machine to attach to it when debugging. While running, my kernel causes some of the processes to hang, and I would like to investigate further.
What the kernel gives me is the process ID of the hung process along with a stack trace. However, without the arguments being passed, the stack trace is not very useful, so I want to gather more information. I have two questions:
Given the PID, how can I get the task_struct corresponding to the process? I tried to do p find_task_by_pid_ns(2533, &init_pid_ns) under gdb; however, it hangs.
Once I have the task_struct and the stack pointer, my ultimate goal is to reproduce the stack trace (with the arguments of each function called). Is there a tool to do that? Can gdb take a stack pointer and print the stack trace for me?
Thanks.
KDB will be helpful in this case. I don't know which kernel version you are using, but if you are on linux-2.6.35 or later, you can switch from gdb to kdb using the following command:
maintenance packet 3
Once you are in kdb, you can use the ps command to find the process descriptor address and the bt command to trace a stack. Alternatively, you can run kdb commands from within gdb using gdb's 'monitor' command. For example, to use kdb's 'ps' command, type the following in gdb:
(gdb) monitor ps
You can get the list of kdb command using the following command.
(gdb) monitor help
Once you know the process descriptor, you can use the following documentation to trace any process's stack.
http://www.emntech.com/documentation/debugging/kdb.pdf
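Putting the pieces together, a possible session for the hung process from the question (PID 2533; btp is kdb's backtrace-by-PID command) could look like:
(gdb) monitor ps           # list tasks and their task_struct addresses
(gdb) monitor btp 2533     # backtrace the task with PID 2533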
