On Linux, what can cause dlopen to emit SIGFPE?

I have a library of dubious origin which file identifies as a 32-bit executable. However, when I try to dlopen it on a 32-bit CentOS 4.4 machine, dlopen terminates with SIGFPE. Surely if something were wrong with the format of the binary, dlopen would report an error rather than crash?
So the question is: What kinds of problems can cause dlopen to emit SIGFPE?

Some possible reasons are:
Division by zero (rule this out with gdb)
Architecture mismatch (did you compile the DSO yourself on the same architecture? or is it prebuilt?)
ABI compatibility problems (loading a DSO built for one Linux distro on a different one).
Here is an interesting discussion regarding hash generation in the ELF format on GNU systems, where an ABI mismatch can cause SIGFPE when you mix and match DSOs not built for that distro/system.
Run GDB against your executable with:
$ gdb ./my_executable
(gdb) run
When the program crashes, get a backtrace with
(gdb) bt
If the backtrace ends in do_lookup_x(), then you likely have the same problem and should ensure the DSO is correct for the system you are trying to load it on. Since you say it has dubious origins, the problem is probably an ABI mismatch similar to the one described above.
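If it helps to have a minimal reproducer to put under gdb instead of your full application, a sketch along these lines isolates the dlopen() call (the path ./libdubious.so is a placeholder for your actual library):

/* dltest.c -- sketch; build with: gcc -g -o dltest dltest.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("./libdubious.so", RTLD_NOW);
    if (!handle) {
        /* Reached only when dlopen fails gracefully; the loader-side
           SIGFPE described above kills the process before this point. */
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    dlclose(handle);
    return 0;
}

A format problem that dlopen can detect comes back through dlerror(); the hash/ABI mismatch above crashes inside the dynamic loader (do_lookup_x) before dlopen ever returns.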
Get a non-dubious library / executable! ;)
Good Luck

Related

"Illegal instruction" when running ARM code targeting my CPU

I'm compiling a rather large project for ARM. I'm using an AT91SAM9G25-EK as a devboard running a Debian ARM image. All libraries and executables in the image seem to be compiled for the armv4t instruction set.
My CPU is an ARM926EJ-S, which should run armv5tej code.
I'm using GCC to cross compile for my board. My CXX flags look like the following:
set(CMAKE_CXX_FLAGS "--signed-char --sysroot=${SYSROOT} -mcpu=arm926ej-s -mtune=arm926ej-s -mfloat-abi=softfp" CACHE STRING "" FORCE)
If I try to run this on my board, I get an Illegal Instruction signal (SIGILL) during initialization of one of my dependencies (using armv4t).
If I enable thumb mode (-mthumb -mthumb-interwork) it works, but uses Thumb for all the code, which in my case runs slower (I'm doing some serious number crunching).
In this case, if I specify one function to be compiled for ARM mode (using __attribute__((target("arm")))) it will run fine until that function is called, then exit with SIGILL.
I'm lost. Is it that bad I'm linking against libraries using armv4t? Am I misunderstanding the way ARM modes work? Is it something in the linux kernel?
What softfp means is to use the soft-float calling convention between functions, but still use the hardware FPU within them. Assuming your cross-compiler is configured with a default -mfpu option other than "none" (run arm-whatever-gcc -v and look for --with-fpu= to check), then you've got a problem, because as far as I can see from the Atmel datasheet, the SAM9G25 doesn't have an FPU.
My first instinct would be to slap GDB on there, catch the signal and disassemble the offending instruction to make sure, but the fact that Thumb code works OK is already a giveaway (Thumb before ARMv6T2 doesn't include any coprocessor instructions, and thus can't make use of an FPU).
In short, use -mfloat-abi=soft to ensure the ARM code actually uses software floating-point and avoids poking a non-existent FPU. And if the "serious number crunching" involves a lot of floating-point, perhaps consider getting a different MCU...
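Applied to the flag line from the question, that change would look roughly like this (only -mfloat-abi is altered; every other flag is carried over verbatim):

set(CMAKE_CXX_FLAGS "--signed-char --sysroot=${SYSROOT} -mcpu=arm926ej-s -mtune=arm926ej-s -mfloat-abi=soft" CACHE STRING "" FORCE)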

Debug a futex lock

I have a process waiting on a futex:
# strace -p 5538
Process 5538 attached - interrupt to quit
futex(0x7f86c9ed6a0c, FUTEX_WAIT, 20, NULL
How can I best debug such a situation? Can I identify who holds the futex? Are there any tools similar to ipcs and ipcrm but for futexes?
Try using gdb -p *PID* and then run where or bt to see a backtrace.
It won't be spectacularly useful with binaries and libraries that have had their debugging symbols stripped, but you may be able to deduce a fair bit from the context. It might be able to indicate to you which part of a complex process is hanging, and then you could examine the right part of the sources to search for the lock.
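A session might look roughly like the following (5538 is the PID from the strace output above; thread apply all bt is worth running because the holder of the lock is often another thread in the same process):

# gdb -p 5538
(gdb) bt                   # backtrace of the thread blocked in FUTEX_WAIT
(gdb) info threads         # list all threads in the process
(gdb) thread apply all bt  # backtraces of every thread, to spot a likely holder
(gdb) detach
(gdb) quit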
I have the same problem with a piece of C++ code, running Ubuntu 12.10 64-bit. It looks like a similar problem from 2007, where the libc was buggy (and maybe still is?).
I start a pthread which runs a traceroute in a system() call. Printfs before and after the call indicate that the process hangs on the system() call, WITHOUT executing the traceroute.
I am not sure if my Linux is broken once again because of the Ubuntu update, or if it's a libc-related bug. Since a lot of applications seem to have "similar" problems, I assume it's stuck somewhere in userspace.
My C++ code runs perfectly on 32-bit systems and even 64-bit OS X, so I assume the Ubuntu 12.10 + 64-bit libc combination is broken.

How does linker know what to link with a system call?

When I was trying to compile Squid by hand on a RHEL 5.5 server, I ran configure and got
configure: WARNING: Eep! Cannot find epoll, kqueue, /dev/poll, poll or select!
configure: WARNING: Will try select and hope for the best.
configure: Using select for the IO loop.
Looks like the kernel is not configured with CONFIG_EPOLL. So I tried to compile this example epoll program to check whether it works.
On my Gentoo box (where CONFIG_EPOLL is enabled), it compiles without any problem.
On the server, I got
/tmp/cc8PhJh0.o: In function 'main':
epoll-exmaple.c:(.text+0x262): undefined reference to 'epoll_create1'
collect2: ld returned 1 exit status
We all know that for a C program the compiler looks for declarations in *.h files and the linker resolves them against *.so files.
My question is: epoll_create1 is a system call into the kernel. Which file exactly does the linker search to locate the implementation of that system call?
Thanks.
It looks in the system C library (normally; a small handful of system calls are in other special libraries like librt). The C library provides a C API for userspace programs that handles making the system call for you. Sometimes this can be a very thin wrapper around the system call that just takes care of setting up and returning the arguments, but more frequently it has various glue that you don't want to have to worry about, such as differences in data sizes between userspace and the kernel, differences in implementation for the different architectures, backward or forward compatibility for changes in the kernel system call API, and so forth.
% readelf -s /lib/i386-linux-gnu/libc.so.6 | grep epoll_create1
1837: 000d5280 52 FUNC GLOBAL DEFAULT 12 epoll_create1@@GLIBC_2.9
If you look at the C library as above, you can see the C function that the linker is linking code against.
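If the target's glibc simply predates the epoll_create1 wrapper (it was added in glibc 2.9, alongside kernel 2.6.27), one hedged workaround is to fall back to the older epoll_create() interface, or to invoke the raw system call yourself with syscall(). A sketch, assuming <sys/syscall.h> defines SYS_epoll_create1 on your system:

/* epoll_compat.c -- sketch of creating an epoll fd when the
   epoll_create1() wrapper may be missing from the C library. */
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <unistd.h>

static int my_epoll_create(void)
{
#ifdef SYS_epoll_create1
    /* Ask the kernel directly, bypassing the glibc wrapper. */
    long fd = syscall(SYS_epoll_create1, 0);
    if (fd >= 0)
        return (int)fd;
#endif
    /* Older interface: the size argument is only a hint on modern
       kernels, but it must be greater than zero. */
    return epoll_create(1);
}

Note that whether the running kernel implements the system call is a separate question from whether glibc exports the wrapper; if the kernel is too old, the syscall() path fails with ENOSYS and the sketch above falls back to epoll_create().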

glibc: elf file OS ABI invalid

I downloaded and compiled glibc-2.13. When I try to run a sample C program which does a malloc(), I get the following error:
elf file OS ABI invalid
Can anybody please give me any pointers helpful in resolving this issue? Please note that my kernel version is linux-2.6.35.9.
It's not your kernel version that's the problem.
The loader on your system does not support the new Linux ABI. Until relatively recently, Linux ELF binaries used the System V ABI. More recently, in support of STT_GNU_IFUNC, the Linux ABI was added. You would have to update your system C library to get a loader that supports STT_GNU_IFUNC; it will then also recognize ELF objects with the Linux ABI type.
See Dave Miller's blog entry on STT_GNU_IFUNC for Sparc (archived) to gain an understanding of what STT_GNU_IFUNC does, if you care.
If you get your hands on the loader from a newer system, you might be able to make it work using that, but you'll have to carry the loader wherever your program goes. You can either compile your program to use that loader as explained here, or compile your program and patch it later using patchelf, in a way similar to what I mention here. I was able to run a program that was giving me the OS ABI invalid error on a Linux 2.6.18 system (older than yours) that had ld-2.5.so, by copying an ld-2.15.so from somewhere else.
NOTE: do NOT overwrite your system ld*.so or ld-linux. ;-/
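To check which OS/ABI an ELF file declares, and, if you go the patchelf route, to point the binary at a newer loader copied to a private location, the commands look roughly like this (the paths are placeholders). A binary using the new ABI is reported as "UNIX - GNU" (or "UNIX - Linux" by older binutils) instead of "UNIX - System V":

$ readelf -h ./myprog | grep 'OS/ABI'
$ patchelf --set-interpreter /opt/newglibc/ld-2.15.so ./myprog
$ ./myprog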
It is possible your glibc was built with the --enable-multiarch flag, which forces the use of ifunc and the new Linux ABI.
From what I can tell, --enable-multiarch is the default setting, and you should disable it by passing --enable-multiarch=no.

How to increase probability of Linux core dumps matching symbols?

I have a very complex cross-platform application. Recently my team and I have been running stress tests and have encountered several crashes (and core dumps accompanying them). Some of these core dumps are very precise, and show me the exact location where the crash occurred with around 10 or more stack frames. Others sometimes have just one stack frame with ?? being the only symbol!
What I'd like to know is:
Is there a way to increase the probability of core dumps pointing in the right direction?
Why isn't the number of stack frames reported consistent?
Any best-practice advice for managing core dumps?
Here's how I compile the binaries (in release mode):
Compiler and platform: g++ with glibc-2.3.2-95.50 on CentOS 3.6 x86_64 -- This helps me maintain compatibility with older versions of Linux.
All files are compiled with the -g flag.
Debug symbols are stripped from the final binary and saved in a separate file.
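For reference, the usual recipe for that split/strip step is something like the following objcopy sequence (my_app is a placeholder; your build system may do the equivalent differently):

$ objcopy --only-keep-debug my_app my_app.debug      # save full debug info separately
$ objcopy --strip-debug my_app                       # strip it from the shipping binary
$ objcopy --add-gnu-debuglink=my_app.debug my_app    # record where the symbols live

With the debuglink in place, gdb can locate my_app.debug automatically when it sits next to the binary or in a configured debug directory.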
When I have a core dump, I use GDB with the executable which created the core, and the symbols file. GDB never complains that there's a mismatch between the core/binary/symbols.
Yet I sometimes get core dumps with no symbols at all! It's understandable that I'm linking against non-debug versions of libstdc++ and libgcc, but it would be nice if at least the stack trace showed me where in my code the faulty call originated (even if it ultimately ends in ??).
Others sometimes have just one stack frame with "??" being the only symbol!
There can be many reasons for that, among others:
the stack frame was trashed (overwritten)
EBP/RBP (on x86/x64) is currently not holding any meaningful value — this can happen e.g. in units compiled with -fomit-frame-pointer or asm units that do so
Note that the second point may occur simply because, for example, glibc itself was compiled that way. Having the debug info for such system libraries installed can mitigate this (something like the glibc-debug{info,source} packages on openSUSE).
gdb has more control over the program than glibc, so glibc's backtrace call would naturally be unable to print a backtrace if gdb cannot do so either.
But shipping the source would be much easier :-)
As an alternative, on a glibc system, you could use the backtrace function call (or backtrace_symbols or backtrace_symbols_fd) and filter out the results yourself, so only the symbols belonging to your own code are displayed. It's a bit more work, but then, you can really tailor it to your needs.
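A minimal sketch of that approach: install a handler for fatal signals and write a raw backtrace to stderr with backtrace_symbols_fd (which, unlike backtrace_symbols, does not call malloc). Link with -rdynamic so your own function names show up:

/* crashtrace.c -- sketch; build with: g++ -g -rdynamic crashtrace.c */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void crash_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    signal(sig, SIG_DFL);   /* restore the default action ... */
    raise(sig);             /* ... and re-raise so a core dump is still produced */
}

int main(void)
{
    signal(SIGSEGV, crash_handler);
    signal(SIGABRT, crash_handler);
    /* ... application code ... */
    return 0;
}

You can then post-process the raw addresses with addr2line against your separate symbols file to map them back to source lines.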
Have you tried installing debugging symbols of the various libraries that you are using? For example, my distribution (Ubuntu) provides libc6-dbg, libstdc++6-4.5-dbg, libgcc1-dbg etc.
If you're building with optimisation enabled (e.g. -O2), then the compiler can blur the boundary between stack frames, for example by inlining. I'm not sure that this would cause backtraces with just one stack frame, but in general the rule is to expect great debugging difficulty, since the code you are looking at in the core dump has been modified and so does not necessarily correspond to your source.

Resources