Segfault on stack overflow - linux

Why does the Linux kernel generate a segfault on stack overflow? This can make debugging very awkward when alloca in C, or Fortran's creation of temporary arrays, overflows the stack. Surely it must be possible for the runtime to produce a more helpful error.

You can actually catch a stack overflow using signal handlers.
To do this, you must do two things:
Set up a signal handler for SIGSEGV (the segfault) using sigaction, setting the SA_ONSTACK flag. This instructs the kernel to use an alternate stack when delivering the signal.
Call sigaltstack() to set up the alternate stack that the handler for SIGSEGV will use.
Then when you overflow the stack, the kernel will switch to your alternate stack before delivering the signal. Once in your signal handler, you can examine the address that caused the fault and determine if it was a stack overflow, or a regular fault.

The "kernel" (it's actually not the kernel running your code, it's the CPU) doesn't know how your code is referencing the memory it's not supposed to be touching. It only knows that you tried to do it.
The code:
char *x = alloca(100);
char y = x[150];
can't really be evaluated by the CPU as you trying to access beyond the bounds of x.
You may hit the exact same address with:
char y = *((char*)(0xdeadbeef));
BTW, I would discourage the use of alloca, since the stack tends to be much more limited than the heap (use malloc instead).

A stack overflow is reported as a segmentation fault: you've gone outside the bounds of the memory that you were initially allocated. The stack is of finite size, and you have exceeded it. You can read more about it on Wikipedia.
Additionally, one thing I've done for projects in the past is write my own signal handler for SIGSEGV (look at the signal(2) man page). I usually caught the signal and wrote out "Fatal error has occurred" to the console. I did some further stuff with checkpoint flags and debugging.
In order to debug segfaults you can run a program in GDB. For example, the following C program (segfault.c) will segfault:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
    printf("Starting\n");
    void *foo=malloc(1000);
    memcpy(foo, 0, 100); //this line will segfault
    exit(0);
}
If I compile it like so:
gcc -g -o segfault segfault.c
and then run it like so:
$ gdb ./segfault
GNU gdb 6.7.1
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run
Starting program: /tmp/segfault
Starting
Program received signal SIGSEGV, Segmentation fault.
0x4ea43cbc in memcpy () from /lib/libc.so.6
(gdb) bt
#0 0x4ea43cbc in memcpy () from /lib/libc.so.6
#1 0x080484cb in main () at segfault.c:8
(gdb)
I find out from GDB that there was a segmentation fault on line 8. Of course there are more complex ways of handling stack overflows and other memory errors, but this will suffice.

Simply use Valgrind. It will point out all your memory allocation mistakes with excruciating precision.

A stack overflow does not necessarily yield a crash. It may silently trash data of your program but continue to execute.
I wouldn't use SIGSEGV handler kludges but instead fix the original problem.
If you want automated help, you can use gcc's -fstack-protector option, which will spot some overflows at runtime and abort the program. (-Wstack-protector, by contrast, only warns about functions that are left unprotected.)
Valgrind is good for dynamic memory allocation bugs, but not for stack errors.

Some of the comments are helpful, but the problem is not one of memory allocation errors. That is, there is no mistake in the code. It's quite a nuisance in Fortran, where the runtime allocates temporary values on the stack. Thus a command such as
write(fp)x,y,z
can trigger a segfault with no warning. Technical support for the Intel Fortran compiler says that there is no way for the runtime library to print a more helpful message. However, if Miguel is right, then this should be possible, as he suggests. So thanks a lot. The remaining question, then, is how do I first find the address of the segfault and then figure out whether it came from a stack overflow or some other problem?
For others who hit this problem: there is a compiler flag which puts temporary variables above a certain size on the heap.
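One possible way to answer the remaining question is sketched below. It is a heuristic under stated assumptions, not a definitive test: inside an SA_SIGINFO handler the faulting address is available as info->si_addr, the stack pointer can be read from the ucontext argument (gregs[REG_RSP] on x86-64 glibc), and the permitted stack range has to be obtained separately (for example from pthread_getattr_np and pthread_attr_getstack on glibc). The struct stack_bounds type and the function is_stack_overflow are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Permitted stack range: [low, high). Both values are supplied by the caller. */
struct stack_bounds {
    uintptr_t low;  /* lowest address the stack may legally reach */
    uintptr_t high; /* address just past the top of the stack */
};

/*
 * Heuristic: the fault is a stack overflow if the faulting address lies
 * below the permitted stack range but at (or just under) the stack pointer,
 * i.e. the program moved the stack pointer past the limit and then touched
 * the new frame. A wild pointer such as NULL is far below the stack pointer
 * and is therefore classified as an ordinary fault.
 */
static int is_stack_overflow(uintptr_t fault, uintptr_t sp, struct stack_bounds b)
{
    const uintptr_t red_zone = 128; /* x86-64 ABI: leaf code may use 128 bytes below sp */

    assert(b.low < b.high);
    if (fault >= b.low)
        return 0; /* inside the permitted stack, or above it */
    return fault >= sp || sp - fault <= red_zone;
}
```

A large Fortran temporary or alloca typically moves the stack pointer far below the limit in one step, so comparing against the stack pointer rather than against a fixed distance from the limit is the design choice here.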

Related

Two questions regarding threads

If there is a missing/corrupted library in the gdb core, how do I isolate it?
I also read that there is a possibility the thread could have overwritten its own stack. How do I detect that?
How do I isolate the above problems with the backtrace below?
/etc/gdb/gdbinit:105: Error in sourced command file:
Error while executing Python code.
Reading symbols from /opt/hsp/bin/addrman...done.
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
warning: Corrupted shared library list: 0x0 != 0x4ed700
warning: no loadable sections found in added symbol-file system-supplied DSO at
0x7ffd50ff6000
Core was generated by `addrman --notification-socket
/opt/hsp/sockets/memb_notify.socket'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
292 ./ps/ps_lock_track.h: No such file or directory.
(gdb) bt
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
#1 0x0000000000000000 in ?? ()
It looks like the core file is corrupt, likely due to heap or stack corruption. Corruption is oftentimes the result of a buffer overflow or other undefined behavior.
If you are running on Linux, I would try valgrind. It can oftentimes spot corruption very quickly. Windows has some similar tools.
Yes, a multithreaded application can overflow the stack. Each thread is only allocated a limited amount. This usually only happens if you have a very deep function call stack or you are allocating large local objects on the stack.
Some interesting information here and here on setting the stack size for Linux applications.
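As a rough illustration of setting a thread's stack size from code, the sketch below uses the POSIX thread attribute calls. The function names start_thread_with_stack and big_frame_worker, the 4 MiB local buffer and any stack size passed in are assumptions chosen for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Example worker with a deliberately large stack frame (4 MiB). */
static void *big_frame_worker(void *arg)
{
    volatile char scratch[4 * 1024 * 1024];

    (void)arg;
    scratch[0] = 1;                  /* touch both ends of the frame */
    scratch[sizeof scratch - 1] = 1;
    return (void *)(uintptr_t)(scratch[0] + scratch[sizeof scratch - 1]);
}

/* Starts fn on a thread with a stack of stack_bytes; returns 0 or an errno value. */
static int start_thread_with_stack(pthread_t *tid, size_t stack_bytes,
                                   void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc;

    assert(tid != NULL && fn != NULL);
    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}
```

For instance, start_thread_with_stack(&tid, 16 * 1024 * 1024, big_frame_worker, NULL) gives the worker a 16 MiB stack. Build with -pthread.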
Faced with your problem, I would:
Check all the callers of the lktrk_locker_set method. Carefully investigate each, if possible, to see if there is an obvious stack overflow or heap corruption
Try to use Valgrind or similar tools to spot the issue
Add debug logging to isolate the issue
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
The above error is usually a sign that you gave GDB different system libraries (or the main binary) from the ones used when the core dump was produced.
Either you are analyzing a "production" core dump on a development machine, or you've upgraded system libraries between the time core dump was produced and when you are analyzing it, or you've rebuilt the main binary.
See this answer for what to do if one of the above is correct.

game renderer thread backtrace with no symbols on linux

I have a game application running on Linux. We are a gaming company. I am having a random crash that occurs about once every 24-48 hours. The last time it occurred I tried to look at the backtrace of the thread where it crashed; however, gdb showed that the stack was corrupted, with no symbols.
Now, when I run the game and interrupt it in gdb, sometimes I am able to see the function call stack for this thread, but most of the time I do not see any symbols. The thread is a renderer thread.
Some of the game libraries we are using are proprietary third-party libraries with no debugging symbols. So I was wondering: could it be that the renderer thread's call stack is deep inside these libraries without symbols (various calls within the library), and so I do not get to see the call stack? If that is true, how can I fix this?
If not, any idea what could be the cause?
(gdb) bt
#0 0x9f488882 in ?? ()
Also, I ran info proc mappings, and for the address above in bt I found the following:
0x9f488000 0x9f48a000 0x2000 0x0 /tmp/glyFI8DP (deleted)
This means that your third-party library is using just-in-time compilation: it generates some code into a file, mmaps it into your process, and then deletes the file.
On x86_64, GDB needs unwind descriptors to unwind the stack, but it can't get them from the deleted file, so you get no stack trace.
You have a few options:
contact the third-party developers and ask them "how can we get stack traces in this situation?"
dump the contents of the region with the GDB dump memory command:
(gdb) dump memory /tmp/gly.so 0x9f488000 0x9f48a000
If you are lucky, the resulting binary would actually be an ELF (it doesn't have to be), and may have symbols and unwind descriptors in it. Use readelf --all /tmp/gly.so to look inside.
If it is an ELF file, you can let GDB know that that's what's mapped at 0x9f488000. You'll need to find the address ($tstart below) of .text section in it (should be in readelf output), then:
(gdb) add-symbol-file /tmp/gly.so 0x9f488000+$tstart

Why is "echo l > /proc/sysrq-trigger" call trace output always similar?

According to the official kernel.org documentation, echo l > /proc/sysrq-trigger is supposed to give me the current call trace of all CPUs. But when I do this a couple of times and look at dmesg afterwards, the call traces look almost identical. Why is that?
The same backtrace explanation
In your case, your CPU #0 backtrace is showing that it's executing your sysrq command (judging by write_sysrq_trigger() function):
delay_tsc+0x1f/0x70
arch_trigger_all_cpu_backtrace+0x10a/0x140
__handle_sysrq+0xfc/0x160
write_sysrq_trigger+0x2b/0x30
proc_reg_write+0x39/0x70
vfs_write+0xb2/0x1f0
SyS_write+0x42/0xa0
system_call_fast_compare_end+0x10/0x15
and CPU #1 backtrace is showing that it's in IDLE state (judging by cpuidle_enter_state() function):
cpuidle_enter_state+0x40/0xc0
cpu_startup_entry+0x2f8/0x400
start_secondary+0x20f/0x2d0
Try to load your system very intensively, and then execute your sysrq command to get new backtraces. You will see that one CPU is executing your sysrq command, and second CPU is not in IDLE anymore, but doing some actual work.
User-space backtrace
As for user-space functions in the kernel backtrace: although the system call is executing (in kernel space) on behalf of a user-space process (see Comm: bash in your backtrace for CPU0), it's not possible to print the user-space process backtrace using the standard kernel backtrace mechanism (which is implemented in the dump_stack() function). The problem is that the kernel stack doesn't contain any user-space process calls (that's why you can see only kernel functions in your backtraces).
User-space process calls can be found in the user-space stack of the corresponding process. For this purpose I would recommend using the OProfile profiler. Of course, it will give you just a binary stack. In order to obtain actual function names you will need to provide symbol information to gdb.
Details:
[1] kernel stack and user-space stack
[2] how to dump kernel stack in syscall
[3] How to print the userspace stack trace in linux kernelspace

Get instruction pointer on segmentation fault or crash (for x86 JIT compiler project)?

I'm implementing the backend for a JavaScript JIT compiler that produces x86 code. Sometimes, as the result of bugs, I get segmentation faults. It can be quite difficult to trace back what caused them. Hence, I've been wondering if there would be some "easy" way to trap segmentation faults and other such crashes, and get the address of the instruction that caused the fault. This way, I could map the address back to compiled x86 assembly, or even back to source code.
This needs to work on Linux, but ideally on any POSIX compliant system. In the worst case, if I can't catch the seg fault and get the IP in my running JIT, I'd like to be able to trap it outside (kernel log?), and perhaps just have the compiler dump a big file with mappings of addresses to instructions, which I could match with a Python script or something.
Any ideas/suggestions are appreciated. Feel free to share your own debugging tips if you've ever worked on a compiler project of your own.
If you use sigaction, you can define a signal handler that takes 3 arguments:
void (*sa_sigaction)(int signum, siginfo_t *info, void *ucontext)
The third argument passed to the signal handler is a pointer to an OS- and architecture-specific data structure. On Linux, it's a ucontext_t, which is defined in the <sys/ucontext.h> header file. Within that, uc_mcontext is an mcontext_t (machine context), which for x86 contains all the registers at the time of the signal in gregs. After casting the void * argument to ucontext_t *, you can access
((ucontext_t *)ucontext)->uc_mcontext.gregs[REG_EIP] (32 bit mode)
((ucontext_t *)ucontext)->uc_mcontext.gregs[REG_RIP] (64 bit mode)
to get the instruction pointer of the faulting instruction. With glibc, the REG_* names are only visible if _GNU_SOURCE is defined before the headers are included.
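A small sketch of such a handler is shown below, assuming Linux with glibc. The names captured_ip, record_ip and install_ip_recorder are invented for the example, and the aarch64 branch is an addition beyond the x86 case discussed here. The handler only records the instruction pointer, so it is safe to try with a harmless signal such as SIGUSR1. A real SIGSEGV handler must not simply return, because the faulting instruction would be executed again.

```c
#define _GNU_SOURCE /* exposes REG_RIP / REG_EIP in <sys/ucontext.h> */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

/* Instruction pointer at the moment the last handled signal arrived. */
static volatile uintptr_t captured_ip;

static void record_ip(int signum, siginfo_t *info, void *context)
{
    ucontext_t *uc = (ucontext_t *)context;

    (void)signum;
    (void)info;
#if defined(__x86_64__)
    captured_ip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
#elif defined(__i386__)
    captured_ip = (uintptr_t)uc->uc_mcontext.gregs[REG_EIP];
#elif defined(__aarch64__)
    captured_ip = (uintptr_t)uc->uc_mcontext.pc; /* not x86; added for portability */
#else
#error "this sketch only covers x86 and aarch64 Linux"
#endif
}

/* Installs record_ip for signum; returns 0 on success, -1 on failure. */
static int install_ip_recorder(int signum)
{
    struct sigaction sa;

    assert(signum > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_ip;
    sa.sa_flags = SA_SIGINFO; /* request the three-argument handler form */
    sigemptyset(&sa.sa_mask);
    return sigaction(signum, &sa, NULL);
}
```

In a JIT, the recorded address could then be looked up in the compiler's own table of generated code ranges to find the offending instruction.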

How to force abort on "glibc detected *** free(): invalid pointer"

In Linux environment, when getting "glibc detected *** free(): invalid pointer" errors, how do I identify which line of code is causing it?
Is there a way to force an abort? I recall there being an ENV var to control this?
How to set a breakpoint in gdb for the glibc error?
I believe if you setenv MALLOC_CHECK_ to 2, glibc will call abort() when it detects the "free(): invalid pointer" error. Note the trailing underscore in the name of the environment variable.
If MALLOC_CHECK_ is 1, glibc will print "free(): invalid pointer" (and similar messages for other errors). If MALLOC_CHECK_ is 0, glibc will silently ignore such errors and simply return. If MALLOC_CHECK_ is 3, glibc will print the message and then call abort(). I.e. it's a bitmask.
You can also call mallopt(M_CHECK_ACTION, arg) with an argument of 0-3, and get the same result as with MALLOC_CHECK_.
Since you're seeing the "free(): invalid pointer" message I think you must already be setting MALLOC_CHECK_ or calling mallopt(). By default glibc does not print those messages.
As for how to debug it, installing a handler for SIGABRT is probably the best way to proceed. You can set a breakpoint in your handler or deliberately trigger a core dump.
I recommend you get valgrind:
valgrind --tool=memcheck --leak-check=full ./a.out
In general, it looks like you might have to recompile glibc, ugh.
You don't say what environment you're running on, but if you can recompile your code for OS X, then its version of libc has a free() that listens to this environment variable:
MallocErrorAbort If set, causes abort(3) to be called if an
error was encountered in malloc(3) or
free(3) , such as a calling free(3) on a
pointer previously freed.
The man page for free() on OS X has more information.
If you're on Linux, then try Valgrind, it can find some impossible-to-hunt bugs.
How to set a breakpoint in gdb?
(gdb) b filename:linenumber
// e.g. b main.cpp:100
Is there a way to force an abort? I recall there being an ENV var to control this?
I was under the impression that it aborted by default. Make sure you have the debug version installed.
Or use libdmalloc5: "Drop in replacement for the system's malloc, realloc, calloc, free and other memory management routines while providing powerful debugging facilities configurable at runtime. These facilities include such things as memory-leak tracking, fence-post write detection, file/line number reporting, and general logging of statistics."
Add this to your link command
-L/usr/lib/debug/lib -ldmallocth
gdb should automatically return control when glibc triggers an abort.
Or you can set up a signal handler for SIGABRT to dump the stack trace to an fd (file descriptor). Below, mp_logfile is a FILE*
#include <execinfo.h> // backtrace, backtrace_symbols_fd
#include <stdio.h> // fileno
void *array[512 / sizeof(void *)]; // the capacity is arbitrary; increase it if you want deeper backtraces.
int size;
size = backtrace (array, 512 / sizeof(void *));
backtrace_symbols_fd (array, size, fileno(mp_logfile));

Resources