Different Stack Pointer at the same point for multiple runs - linux

I am encountering the following behavior for multiple runs of a program on a Linux x86 machine:
the command run is env -i ./prog; the program is crafted to fail with a SIGSEGV
the output from dmesg shows that the Stack Pointer at the time of the SIGSEGV varies from run to run, even though the program's flow is exactly the same
I have disabled ASLR (see the sketch below)
Why is the SP varying after each execution?
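For reference, disabling ASLR and verifying that it is really off usually looks something like this (a sketch; the sysctl path is standard on mainline kernels, and setarch -R affects only the one run):

cat /proc/sys/kernel/randomize_va_space           # 0 means randomization is fully off
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
setarch "$(uname -m)" -R env -i ./prog            # -R disables randomization for this run only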

Related

How to find reason for SEGFAULT in large multi-thread C program?

I am working with a multi-threaded program for Linux embedded systems that crashes very randomly with a SEGFAULT signal.
I need to find the issue WITHOUT using gdb, since the crash occurs only in the production environment, never during testing.
I have the program's symbol table and I'm using sigaction() and backtrace() in the main thread, but I don't get enough information: the backtraced frames come from the signal handler itself. I allow 50 frames to be captured and compile with gcc's -g flag:
Caught segfault at address 0xe76acc
Obtained 3 stack frames.
./mbca(_Z11print_tracev+0x20) [0x37530]
./mbca(_Z18segfault_sigactioniP7siginfoPv+0x34) [0x375f4]
/lib/libc.so.6(__default_rt_sa_restorer_v2+0) [0x40db5a60]
As the program runs 15 threads, I would like a clue about which one the signal is coming from so I can narrow down the possibilities. FYI, the main thread forks, and that forked process creates the remaining 14 threads.
How can I achieve this? What can I do with the information I already have?
Thank you all for your help.
PS: I also tried getting a core-dump file, but it is not generated because that option was not included in the kernel build and I cannot modify it.
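For illustration, a handler along the lines described above might look like this (a sketch, not the asker's actual code): syscall(SYS_gettid) is Linux-specific and identifies the faulting thread, because a synchronous SIGSEGV is delivered to the thread that triggered it. Build with something like gcc -g -rdynamic -pthread so backtrace_symbols_fd() can resolve names.

#define _GNU_SOURCE
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* A synchronous SIGSEGV is delivered to the thread that faulted, so the
 * handler runs on that thread and gettid() identifies it. */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    void *frames[50];                    /* 50 frames, as in the question */
    int n = backtrace(frames, 50);

    (void)ctx;
    /* fprintf() is not async-signal-safe; tolerable here only because the
     * process is about to terminate anyway. */
    fprintf(stderr, "Caught signal %d at address %p in thread %ld\n",
            sig, info->si_addr, (long)syscall(SYS_gettid));
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  /* avoids malloc() */
    _exit(EXIT_FAILURE);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);       /* signal dispositions are process-wide */

    /* ... fork, start the 14 worker threads, run the program ... */
    return 0;
}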

Why is a segmentation fault not reproducible in gdb?

I have a situation where I run a number of unit tests and one of them triggers a segmentation fault. The symptom seems related to another test case run roughly 30 test cases prior to the failing one. Obviously there is some dependency between the test cases and I can easily turn on and off the segmentation fault by commenting out the earlier test case. Google Test/Mock 1.6.0 is used as test framework. The test binary is written entirely in C++ (gcc 4.6.3). It is single threaded (unless Google Test creates threads).
However, when I run all test cases in gdb there is no segmentation fault and this is what puzzles me.
What are realistic reasons why there would be a segmentation fault when running a binary in a terminal, but not when running the very same binary through gdb? I guess everything is slightly slower when gdb runs the code, but I don't see how this would affect the outcome.
I just do this to see no fault:
gdb MyBinary
run
Last lines of terminal printout:
[ PASSED ] 368 tests.
[Inferior 1 (process 28349) exited normally]
And this to see the fault:
MyBinary
Last line of terminal printout:
Segmentation fault
What are realistic reasons why there would be a segmentation fault when running a binary in a terminal, but not when running the very same binary through gdb?
The two most common ones are:
GDB disables address space randomization. If you are reading some uninitialized pointer, that pointer may always happen to be NULL under GDB but non-NULL when ASLR is active.
You have a data race, and GDB slows down thread creation enough to hide that race (GDB has to do a lot of work to keep track of all threads).
You can prevent GDB from disabling ASLR with set disable-randomization off.
You should probably check your tests with MemorySanitizer (for uninitialized reads) and ThreadSanitizer (for data races).
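For example (a sketch: set disable-randomization off is a standard GDB setting; MemorySanitizer requires clang, ThreadSanitizer needs gcc 4.8+ or clang, and tests.cpp stands in for the real sources):

(gdb) set disable-randomization off
(gdb) run

clang++ -g -fsanitize=memory -fno-omit-frame-pointer tests.cpp -o MyBinary   # uninitialized reads
g++ -g -fsanitize=thread tests.cpp -o MyBinary                               # data races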

Why is "echo l > /proc/sysrq-trigger" call trace output always similar?

According to the official kernel.org documentation, echo l > /proc/sysrq-trigger is supposed to give me the current call trace of all CPUs. But when I do this a couple of times and then look at dmesg, the call traces look practically identical. Why is that?
The same backtrace explanation
In your case, your CPU #0 backtrace is showing that it's executing your sysrq command (judging by write_sysrq_trigger() function):
delay_tsc+0x1f/0x70
arch_trigger_all_cpu_backtrace+0x10a/0x140
__handle_sysrq+0xfc/0x160
write_sysrq_trigger+0x2b/0x30
proc_reg_write+0x39/0x70
vfs_write+0xb2/0x1f0
SyS_write+0x42/0xa0
system_call_fast_compare_end+0x10/0x15
and CPU #1 backtrace is showing that it's in IDLE state (judging by cpuidle_enter_state() function):
cpuidle_enter_state+0x40/0xc0
cpu_startup_entry+0x2f8/0x400
start_secondary+0x20f/0x2d0
Try putting your system under heavy load and then execute the sysrq command again to get new backtraces. You will see that one CPU is executing your sysrq command, and the second CPU is no longer idle but doing some actual work.
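For example (assuming sysrq is enabled; the yes loops are just stand-ins for real load):

echo 1 > /proc/sys/kernel/sysrq       # make sure sysrq is allowed
yes > /dev/null & yes > /dev/null &   # keep a couple of CPUs busy
echo l > /proc/sysrq-trigger
dmesg | tail -n 60                    # the other CPUs' backtraces should now show the busy work
kill %1 %2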
User-space backtrace
As for user-space functions in a kernel backtrace: although the system call is executing (in kernel space) on behalf of a user-space process (see Comm: bash in your backtrace for CPU #0), it's not possible to print the user-space process's backtrace using the standard kernel backtrace mechanism (implemented in the dump_stack() function). The problem is that the kernel stack doesn't contain any user-space calls, which is why you see only kernel functions in your backtraces.
User-space calls can be found on the user-space stack of the corresponding process. For this purpose I would recommend the OProfile profiler. Of course, it will give you only raw addresses; to obtain the actual function names you will need to provide the symbol information to gdb.
Details:
[1] kernel stack and user-space stack
[2] how to dump kernel stack in syscall
[3] How to print the userspace stack trace in linux kernelspace

Generating core dumps

From time to time my Go program crashes.
I tried a few things in order to get core dumps generated for this program:
setting the ulimit on the system: I tried both ulimit -c unlimited and ulimit -c 10000, just in case. After launching my panicking program, I get no core dump.
I also added recover() support in my program and added code to log to syslog in case of panic but I get nothing in syslog.
I am running out of ideas right now.
I must have overlooked something but I do not find what, any help would be appreciated.
Thanks ! :)
Note that a core dump is generated by the OS when a condition from a certain set is met. These conditions are pretty low-level: trying to access unmapped memory, trying to execute an opcode the CPU does not know, and so on. Under a POSIX operating system such as Linux, when a process does one of these things, an appropriate signal is sent to it. Some of these signals, if not handled by the process, have a default action of generating a core dump, which the OS performs unless it is prohibited by a certain limit.
Now observe that this machinery treats a process at the lowest possible level (machine code), but the binaries a Go compiler produces are higher-level than those a C compiler (or an assembler) produces, which means certain errors in a process built by a Go compiler are handled by the Go runtime rather than by the OS. For instance, a typical NULL pointer dereference in a process produced by a C compiler usually results in the process being sent the SIGSEGV signal, which then typically results in an attempt to dump the process's core and terminate it. In contrast, when this happens in a process compiled by a Go compiler, the Go runtime kicks in and panics, producing a nice stack trace for debugging purposes.
With these facts in mind, I would try to do this:
Wrap your program in a shell script which first relaxes the limit for core dumps (but see below) and then runs your program with its standard error stream redirected to a file (or piped to the logger binary, etc.); a minimal sketch of such a wrapper appears after this list.
The limits a user can tweak have a hierarchy: there are soft and hard limits; see this and this for an explanation. So check that your system does not have the hard limit for core dump size set to 0, as that would explain why your attempt to raise the limit has no effect.
At least on my Debian systems, when a program dies due to SIGSEGV, this fact is logged by the kernel and is visible in the syslog log files, so try grepping them for hints.
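A minimal sketch of the first two points (file names are illustrative, and the wrapper only helps if the hard limit permits raising the soft one):

ulimit -Hc    # hard limit on core size; if this prints 0, raising the soft limit will not help
ulimit -Sc    # current soft limit

#!/bin/sh
# wrapper.sh (hypothetical): run the Go binary with core dumps allowed and stderr captured
ulimit -c unlimited                     # relax the soft limit, if the hard limit permits
exec ./myprog 2>>/var/log/myprog.err    # Go panics are printed to stderr, so keep it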
First, please make sure all errors are handled.
For core dumps, you can refer to generate a core dump in linux.
You can use a supervisor to restart the program when it crashes.

access a process's kernel stack given process id in kernel debugging

I have a linux running on VMWare, and I use gdb in the host machine to attach to it when debugging. While running, my kernel will cause some of the processes hang, and I would like to investigate more.
What the kernel gives me is the process id of the hung process along with a stack trace. However, without the arguments being passed, the stack trace is not very useful, so I want to gather more information. I have two questions:
Given the pid, how can I get the task_struct corresponding to the process? I tried to do "p find_task_by_pid_ns(2533, &init_pid_ns)" under gdb, but it hangs.
Once I have the task_struct and the stack pointer, my ultimate goal is to reproduce the stack trace (with the arguments of each function called). Is there a tool to do that? Can gdb take a stack pointer and print the stack trace for me?
Thanks.
KDB will be helpful in this case. I don't know which kernel version you are using, but if you are on linux-2.6.35 or later, you can switch from gdb to kdb using the following command:
maintenance packet 3
Once you are in kdb, you can use the ps command to find a process descriptor address and the bt command to trace a stack. Alternatively, you can run kdb commands from gdb using gdb's 'monitor' command. For example, to use kdb's 'ps' command, type the following in gdb:
(gdb) monitor ps
You can get the list of kdb commands with:
(gdb) monitor help
Once you know the process descriptor, you can use the following documentation to trace any process's stack.
http://www.emntech.com/documentation/debugging/kdb.pdf
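A hypothetical session putting this together (2533 is the pid from the question; ps and btp are standard kdb commands):

(gdb) monitor ps          # lists tasks along with their task_struct addresses
(gdb) monitor btp 2533    # kernel-stack backtrace for pid 2533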
