Why is a segmentation fault not reproducible in gdb? - linux

I have a situation where I run a number of unit tests and one of them triggers a segmentation fault. The symptom seems related to another test case that runs roughly 30 test cases before the failing one. There is evidently some dependency between the test cases: I can turn the segmentation fault on and off simply by commenting out the earlier test case. Google Test/Mock 1.6.0 is used as the test framework. The test binary is written entirely in C++ (gcc 4.6.3). It is single-threaded (unless Google Test creates threads).
However, when I run all test cases in gdb there is no segmentation fault and this is what puzzles me.
What are realistic reasons why there would be a segmentation fault when running a binary in a terminal, but not when running the very same binary through gdb? I guess everything is slightly slower when gdb runs the code, but I don't see how this would affect the outcome.
I see no fault when I just do this:
gdb MyBinary
run
Last lines of terminal printout:
[ PASSED ] 368 tests.
[Inferior 1 (process 28349) exited normally]
And this to see the fault:
MyBinary
Last line of terminal printout:
Segmentation fault

What are realistic reasons why there would be a segmentation fault when running a binary in a terminal, but not when running the very same binary through gdb?
The two most common ones are:
GDB disables address space randomization by default. If you are reading some uninitialized pointer, that pointer may always happen to be NULL under GDB, but may not be NULL when the binary runs with ASLR enabled.
You have a data race, and GDB slows down thread creation enough to hide that race (GDB has to do a lot of work to keep track of every thread).
You can prevent GDB from disabling ASLR with set disable-randomization off.
You should probably check your tests using MemorySanitizer and ThreadSanitizer.
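For completeness, here is roughly what both suggestions look like in practice (a sketch: it assumes a compiler new enough to ship the sanitizers, which the gcc 4.6.3 from the question is not, and MemorySanitizer additionally requires Clang; tests.cpp stands in for whatever sources build the test binary):

gdb MyBinary
set disable-randomization off
run

g++ -g -fsanitize=thread tests.cpp -o MyBinary              (ThreadSanitizer build)
clang++ -g -fsanitize=memory tests.cpp -o MyBinary          (MemorySanitizer build, Clang only)
g++ -g -fsanitize=address,undefined tests.cpp -o MyBinary   (AddressSanitizer/UBSan, often the first thing to try for a stray segfault)
./MyBinary

Each sanitizer needs its own build, and the sanitized binary typically prints a stack trace at the point of the bad access instead of just "Segmentation fault".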

Related

How to find reason for SEGFAULT in large multi-thread C program?

I am working with a multi-threaded program for Linux embedded systems that crashes seemingly at random with a SIGSEGV.
I need to find the issue WITHOUT using gdb, since the crash occurs only in the production environment, never during testing.
I have the program's symbol table and I am using sigaction() and backtrace() in the main thread, but I don't get enough information: the backtraced frames come from the signal handler itself. I allow 50 frames to be captured and compile with gcc's -g flag:
Caught segfault at address 0xe76acc
Obtained 3 stack frames.
./mbca(_Z11print_tracev+0x20) [0x37530]
./mbca(_Z18segfault_sigactioniP7siginfoPv+0x34) [0x375f4]
/lib/libc.so.6(__default_rt_sa_restorer_v2+0) [0x40db5a60]
As the program runs 15 threads, I would like a clue about which thread the signal is coming from, so I can narrow down the possibilities. FYI, the main thread creates a fork, and that fork creates the remaining 14 threads.
How can I achieve this? What can I do with the information I already have?
Thank you all for your help
PS: I also tried getting a core dump file, but it is not generated because that option was not included in the kernel compilation and I cannot modify it.
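No answer is included above, but one standard trick is worth sketching here: a fault signal is delivered to the thread that caused it, so logging the thread id from inside the SA_SIGINFO handler already tells you which of the 15 threads crashed. A minimal sketch (names and layout are illustrative, not taken from the original program):

#define _GNU_SOURCE                    /* for syscall() / SYS_gettid */
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    void *frames[50];
    int n = backtrace(frames, 50);

    /* SIGSEGV is delivered to the offending thread, so this tid identifies it.
       fprintf/backtrace are not async-signal-safe, but the process is about to die anyway. */
    fprintf(stderr, "Caught SIGSEGV at address %p in thread tid=%ld\n",
            info->si_addr, (long)syscall(SYS_gettid));
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  /* link with -rdynamic to get function names */
    _exit(1);
}

void install_segv_handler(void)        /* call once, early in main() */
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}

The "backtraced frames come from the signal handler itself" problem can be worked around by reading the program counter out of the handler's ucontext argument instead, which is exactly what the answer to the JIT question further down describes.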

Different Stack Pointer at the same point for multiple runs

I am encountering the following behavior for multiple runs of a program on a Linux x86 machine:
- the command run is env -i ./prog; the program is crafted to fail with a SIGSEGV
- the output from dmesg shows that the stack pointer at the time of the SIGSEGV varies from one execution to the next, even though the program's flow is exactly the same
- I have disabled ASLR
Why is the SP varying after each execution?
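No answer is reproduced here, but the variation is easy to observe without dmesg. A purely illustrative sketch (not the original prog) that prints where its stack lives and then faults on purpose:

#include <stdio.h>

int main(int argc, char **argv)
{
    int local = 0;                                      /* an object on the stack */
    printf("stack variable at %p\n", (void *)&local);
    printf("argv[0] string at %p\n", (void *)argv[0]);  /* argv/envp/auxv data sits above the initial SP */
    *(volatile int *)0 = argc;                          /* deliberate SIGSEGV, visible in dmesg */
    return 0;
}

Comparing these addresses across runs, with and without env -i and with ASLR toggled through /proc/sys/kernel/randomize_va_space, helps separate address randomization proper from run-to-run differences in what the kernel places on the initial stack.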

Get instruction pointer on segmentation fault or crash (for x86 JIT compiler project)?

I'm implementing the backend for a JavaScript JIT compiler that produces x86 code. Sometimes, as the result of bugs, I get segmentation faults. It can be quite difficult to trace back what caused them. Hence, I've been wondering if there would be some "easy" way to trap segmentation faults and other such crashes, and get the address of the instruction that caused the fault. This way, I could map the address back to compiled x86 assembly, or even back to source code.
This needs to work on Linux, but ideally on any POSIX compliant system. In the worst case, if I can't catch the seg fault and get the IP in my running JIT, I'd like to be able to trap it outside (kernel log?), and perhaps just have the compiler dump a big file with mappings of addresses to instructions, which I could match with a Python script or something.
Any ideas/suggestions are appreciated. Feel free to share your own debugging tips if you've ever worked on a compiler project of your own.
If you use sigaction, you can define a signal handler that takes 3 arguments:
void (*sa_sigaction)(int signum, siginfo_t *info, void *ucontext)
The third argument passed to the signal handler is a pointer to an OS- and architecture-specific data structure. On Linux, it's a ucontext_t, which is defined in the <sys/ucontext.h> header file. Within that, uc_mcontext is an mcontext_t (machine context), which for x86 holds all the registers at the time of the signal in its gregs array. So you can access
ucontext->uc_mcontext.gregs[REG_EIP] (32 bit mode)
ucontext->uc_mcontext.gregs[REG_RIP] (64 bit mode)
to get the instruction pointer of the faulting instruction.
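Put together, a minimal self-contained sketch (Linux/x86 specific; the only detail beyond the answer itself is the _GNU_SOURCE define, which glibc needs in order to expose the REG_* indices from <sys/ucontext.h>):

#define _GNU_SOURCE                 /* exposes REG_RIP / REG_EIP in <sys/ucontext.h> */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = (ucontext_t *)ctx;
#if defined(__x86_64__)
    void *ip = (void *)uc->uc_mcontext.gregs[REG_RIP];   /* 64-bit mode */
#else
    void *ip = (void *)uc->uc_mcontext.gregs[REG_EIP];   /* 32-bit mode */
#endif
    fprintf(stderr, "SIGSEGV: fault address %p, faulting instruction at %p\n",
            info->si_addr, ip);
    /* A JIT would look up `ip` in its address-to-source mapping here. */
    abort();
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 1;          /* deliberately fault to exercise the handler */
    return 0;
}

Since JIT-generated code may have clobbered its own stack by the time it faults, handling the signal on an alternate stack (sigaltstack plus the SA_ONSTACK flag) is usually worth adding on top of this.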

Program runs with gdb but doesn't run with ./ProgramName

I am writing an editor in 64-bit assembly on Linux. It runs correctly when I debug it in GDB, but it does not run correctly when I run it normally; that is, it has runtime errors when I launch it with ./programName.
You're probably accessing uninitialized data or have some kind of memory corruption problem. This would explain the program behaving differently when run in the debugger - you're seeing the results of undefined behavior.
Run your program through valgrind's memcheck tool and see what it outputs. Valgrind is a powerful tool that will identify many runtime errors on Linux, including a full stack trace to the error.
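A typical invocation might look like this (standard valgrind options; --track-origins makes reports about uninitialized values far more useful):

valgrind --leak-check=full --track-origins=yes ./programName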
If GDB disabling ASLR is what makes it work, perhaps set disable-randomization off in GDB will let you reproduce a crash inside GDB so you can debug it. (See also: Force gdb to load shared library at randomized address.)
Otherwise enable core dumps from your program, and use GDB on the core dump.
gdb ./prog core.1234.
On x86, you can insert a ud2 instruction in your asm source to intentionally cause a crash at whatever point you want in your code, if you want to get a coredump to examine registers/memory at some point before it crashes on its own. All architectures have an undefined instruction you can use, but I only know the mnemonic for x86's off the top of my head.
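For the core-dump route, the usual sequence is roughly the following (the core file's exact name depends on /proc/sys/kernel/core_pattern):

ulimit -c unlimited        # allow core dumps in this shell
./programName              # the crash now leaves a core file behind
gdb ./programName core     # then inspect with bt, info registers, x/16gx $rsp, ...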

Porting Unix ada app to Linux: Seg fault before program begins

I am an intern who was offered the task of porting a test application from Solaris to Red Hat. The application is written in Ada. It works just fine on the Unix side. I compiled it on the Linux side, but now it is giving me a seg fault. I ran the debugger to see where the fault was and got this:
Warning: In non-Ada task, selecting an Ada task.
=> runtime tasking structures have not yet been initialized.
<non-Ada task> with thread id 0b7fe46c0
process received signal "Segmentation fault" [11]
task #1 stopped in _dl_allocate_tls
at 0870b71b: mov edx, [edi] ;edx := [edi]
This seg fault happens before any calls are made or anything is initialized. I have been told that 'tasks' in Ada get started before the rest of the program, and the problem could be with a task that is running.
But here is the kicker: this program just generates some code for another program to use. The OTHER program, when compiled under Linux, gives me the same kind of seg fault with the same kind of error message. This leads me to believe there might be some little tweak I can use to fix all of this, but I just don't have enough knowledge about Unix, Linux, and Ada to figure this one out all by myself.
This is a total shot in the dark, but tasks can blow up like this at startup if they try to allocate too much local memory on the stack. Your main program can safely use the system stack, but tasks have to have their stacks allocated at startup from dynamic memory, so typically your runtime has a default stack size for tasks. If your task tries to allocate a large array, it can easily blow past that limit. I've had it happen to me before.
There are multiple ways to fix this. One way is to move all your task-local data into package global areas. Another is to dynamically allocate it all.
If you can figure out how much memory would be enough, you have a couple more options. You can make the task a task type, and then use a
for My_Task_Type_Name'Storage_Size use Some_Huge_Number;
statement. You can also use a Storage_Size pragma (a "pragma Storage_Size (Some_Huge_Number);" placed inside the task definition), but I think the "for" statement is preferred.
Lastly, with GNAT you can also change the default task stack size with the -d flag to gnatbind.
Off the top of my head: if the code was used on SPARC machines and you're now running on an x86 machine, you may be running into endianness problems.
It's not much help, but it is a common gotcha when going multi-platform.
Hunch: the linking step didn't go right. Perhaps the wrong run-time startup library got linked in?
(How likely are we to find out what the real trouble was, months after the question was asked?)

Resources