How to get line number for native crash in Android? - android-ndk

I have a native library and a backtrace from an internal crash reporting tool that gives me the exact addresses where the crash occurred.
I used the addr2line tool and found that the crash occurs in our own native library. However, I can only get the name of the function in which the crash occurs, not the exact line number, because we ship the library with its debug symbols stripped.
I have access to the unstripped .so of the same library, from which I can easily get line numbers. The only problem is that I have to map the addresses in the stripped library to the corresponding addresses in the unstripped library.
Is this possible to do?

Related

How addr2line can locate the source file and the line of code?

addr2line translates addresses into file names and line numbers. I am still a beginner in debugging, and have some questions about addr2line.
If I am debugging a certain .so (binary) file, how can the tool locate its source code file (where does it get it from?), and what if the source doesn't exist?
What is the relation between an address in a binary and the line number in its source, such that addr2line can do this kind of mapping?
In general, addr2line works best on ELF executables or shared libraries with debug information. That debug information is emitted by the compiler when you pass -g (or -g2, etc.) to GCC. It notably provides a mapping between source code locations (name of source file, line number, column number) and machine code addresses, and also describes functions, variable names, call stack frame organization, and so on. The debug information is nowadays in DWARF format (and is also processed by the gdb debugger, the libbacktrace library, etc.). Notice that the debug information contains source file paths (not the source files themselves), so addr2line can report a file name and line number even when the source is not present.
In practice, you can (and often should) pass the -g (or -g2) debugging option to GCC even with optimization flags like -O2. In that case, the debug information is slightly less precise but still practically usable. In some cases, stack frames may disappear (inlined function calls, tail call optimizations, etc.).
You could use the strip(1) utility to remove debug information (and other symbol tables, etc.) from an ELF executable.
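To make the lookup concrete, here is a minimal sketch of driving addr2line from a script. It assumes addr2line from binutils is on the PATH; the library name and address are hypothetical placeholders, and the helper names are mine. Since strip removes debug sections but does not move the code itself, an address taken from a stripped binary can normally be looked up against the unstripped build of the same binary.

```python
import shutil
import subprocess


def build_addr2line_cmd(library, addresses, tool="addr2line"):
    """Return the argv list for an addr2line lookup.

    -e selects the binary to inspect, -f also prints function names,
    -C demangles C++ symbol names.
    """
    # Accept ints or ready-made hex strings; addr2line reads hex addresses.
    hex_addrs = [hex(a) if isinstance(a, int) else str(a) for a in addresses]
    return [tool, "-f", "-C", "-e", library] + hex_addrs


def symbolize(library, addresses, tool="addr2line"):
    """Run addr2line and return its raw output, or None if the tool is missing."""
    if shutil.which(tool) is None:
        return None
    cmd = build_addr2line_cmd(library, addresses, tool)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # "libexample.so" and 0x1234 are hypothetical; substitute the unstripped
    # library and an address (relative to the library) from your backtrace.
    print(" ".join(build_addr2line_cmd("libexample.so", [0x1234])))
```

If the binary has no debug information, addr2line typically prints ?? for the parts it cannot resolve, so the output should be checked rather than trusted blindly.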

Loading and Unloading shared libraries in Mac OSX

I am sorry if this question has been asked before in this forum. I am having a problem where loading and unloading of dylibs isn't working as expected on Mac (especially the unloading part).
Suppose I have an executable that loads a shared library, say A.dylib, and A.dylib in turn loads another library, say B.dylib. When I try to unload B.dylib at a later stage, no error code is returned (the return value is 0; I am using the regular dlopen and dlclose functions to load and unload libraries, so 0 means unloaded successfully), but when I check with Activity Monitor or lsof, B.dylib is still in memory.
We are porting this code to Windows, Linux and Mac. Windows and Linux work as expected; only Mac is giving me problems.
I was reading the Mac developer library and found this: "There are a couple of cases in which a dynamic library will never be unloaded: 1) the main executable links against it, 2) An API that does not support unloading (e.g. NSAddImage()) was used to load it or some other dynamic library that depends on it, 3) the dynamic library is in dyld's shared cache."
In my case, neither of the first two cases applies, so I suspect case 3.
Here are my questions:
1. How can I confirm that I am in case 3?
2. If so, how do I fix it?
3. If not, how do I fix it?
4. Why is Mac so different?
Any help in this regard is appreciated!
Thanks,
Jan
When you load a shared library into an executable, all of the symbols exported by that library are candidates to resolve symbols required by the executable. If the dyld dynamic linker binds to an unintended symbol, the library remains loaded. You can list the symbols in a shared library by using nm, and you can set environment variables to enable debugging output for the dynamic linker (see the dyld man page). In particular, set the DYLD_PRINT_BINDINGS environment variable to have dyld report the symbol bindings it performs.
Most likely, you need to limit the exported symbols to the specific subset that is used by the executable, so that only the symbols you intend to use are bound. This can be done by placing the required symbols in a file and passing it to the linker via the -exported_symbols_list option. Otherwise, the executable can end up bound to a symbol in the dynamically loaded library; the library is then required to resolve that symbol and will not be unloaded when dlclose() is called.
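As a rough illustration of the exported-symbols file, the sketch below generates one from a list of C function names. The function names (plugin_init, plugin_shutdown) and the file name exports.txt are hypothetical. It assumes plain C symbols, which carry a leading underscore in Mach-O symbol tables; the linker expects one symbol name per line.

```python
import os
import tempfile


def write_exported_symbols(path, c_symbols):
    """Write one symbol per line, in the form ld's -exported_symbols_list reads."""
    with open(path, "w") as f:
        for name in c_symbols:
            # Assumption: these are C symbols, so the Mach-O name is "_" + name.
            f.write("_" + name + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "exports.txt")
        # Hypothetical entry points that the host executable looks up.
        write_exported_symbols(path, ["plugin_init", "plugin_shutdown"])
        with open(path) as f:
            print(f.read(), end="")
```

The resulting file would then be handed to the linker when building the dylib, for example with -Wl,-exported_symbols_list,exports.txt on the clang command line. Running nm on the built library afterwards is a reasonable way to check that only the intended symbols are still exported.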

.lib files and decompiling

I have a .exe which is compiled from a combination of .for (fortran), and .c source files.
It does not run on anything later than Win98, due to an error with the graphics server:
“access violation error in User 32.dll at 0x7e4467a9”
Unless there is some other way around the above error (?), I assume I have to recompile the .exe from source using a more modern graphics server. I have all the files to do this bar one .lib file!
Is it possible to pull any info on the missing lib file out of the current .exe I have?
It is possible to disassemble the .exe, but I don't think I would gain much from this?
You probably can't "cut" the lib file out of an executable. Even if you could somehow extract its code, standard compilers and linkers wouldn't know how to link against it, since the linking information they need is not included in the resulting binary.
However, if your problem is that your program works on Win98 but doesn't run on NT-based systems (XP, Vista, Win7), I think it would be easier to find out which incompatibility crashes the program. You mentioned that the access violation occurs in user32.dll. Start your program inside a debugger and look at which function the crash occurs in. Make sure you have the PDB symbols loaded (so you can see the names of internal non-public functions). Trace back which Win32 API is called and what its parameters are. Try to figure out what should be at the memory location that cannot be accessed.
Without more information, it's impossible to help you further with that.
Once a library (your .lib) has been statically linked into an image file (your exe) by the linker, it can no longer be separated or distinguished from your own code, so the library's code cannot be retrieved by decompiling the exe.

How to increase probability of Linux core dumps matching symbols?

I have a very complex cross-platform application. Recently my team and I have been running stress tests and have encountered several crashes (and core dumps accompanying them). Some of these core dumps are very precise, and show me the exact location where the crash occurred with around 10 or more stack frames. Others sometimes have just one stack frame with ?? being the only symbol!
What I'd like to know is:
Is there a way to increase the probability of core dumps pointing in the right direction?
Why isn't the number of stack frames reported consistent?
Any best-practice advice for managing core dumps?
Here's how I compile the binaries (in release mode):
Compiler and platform: g++ with glibc-2.3.2-95.50 on CentOS 3.6 x86_64 -- This helps me maintain compatibility with older versions of Linux.
All files are compiled with the -g flag.
Debug symbols are stripped from the final binary and saved in a separate file.
When I have a core dump, I use GDB with the executable which created the core, and the symbols file. GDB never complains that there's a mismatch between the core/binary/symbols.
Yet I sometimes get core dumps with no symbols at all! It's understandable, since I'm linking against non-debug versions of libstdc++ and libgcc, but it would be nice if the stack trace at least showed me where in my code the faulty call originated (although it may ultimately end in ??).
Others sometimes have just one stack frame with "??" being the only symbol!
There can be many reasons for that, among others:
the stack frame was trashed (overwritten)
EBP/RBP (on x86/x64) is currently not holding any meaningful value — this can happen e.g. in units compiled with -fomit-frame-pointer or asm units that do so
Note that the second point may occur simply because, for example, glibc was compiled that way. Having the debug info for such system libraries installed could mitigate this (something like what the glibc-debug{info,source} packages are on openSUSE).
gdb has more control over the program than glibc, so glibc's backtrace call would naturally be unable to print a backtrace if gdb cannot do so either.
But shipping the source would be much easier :-)
As an alternative, on a glibc system, you could use the backtrace function call (or backtrace_symbols or backtrace_symbols_fd) and filter out the results yourself, so only the symbols belonging to your own code are displayed. It's a bit more work, but then, you can really tailor it to your needs.
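The filtering step could look roughly like the sketch below, applied as post-processing to lines the program has already logged via backtrace_symbols. It assumes the usual glibc layout "module(function+offset) [address]"; the sample lines, module names, and addresses are made up for illustration.

```python
import os


def frames_from_modules(lines, own_modules):
    """Keep only backtrace lines whose module file name is in own_modules."""
    kept = []
    for line in lines:
        # Assumed layout: "module(function+offset) [address]".
        # Everything before the first "(" is taken as the module path.
        module = line.split("(", 1)[0].strip()
        if os.path.basename(module) in own_modules:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical log lines in the assumed backtrace_symbols format.
    sample = [
        "./myapp(process_request+0x2a) [0x401b3c]",
        "/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f3a2c021b15]",
        "/opt/app/libcore.so(_ZN4Core3runEv+0x11) [0x7f3a2d0041a1]",
    ]
    for frame in frames_from_modules(sample, {"myapp", "libcore.so"}):
        print(frame)
```

Matching on the module file name rather than the function name keeps frames from your own binaries even when the function part is empty, which can happen for symbols that are not exported.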
Have you tried installing debugging symbols of the various libraries that you are using? For example, my distribution (Ubuntu) provides libc6-dbg, libstdc++6-4.5-dbg, libgcc1-dbg etc.
If you're building with optimisation enabled (e.g. -O2), then the compiler can blur the boundary between stack frames, for example by inlining. I'm not sure that this would cause backtraces with just one stack frame, but in general the rule is to expect great debugging difficulty, since the code you are looking at in the core dump has been modified and so does not necessarily correspond to your source.

Why would the ELF header of a shared library specify Linux as the OSABI?

All the standard shared libraries on my Linux system (Fedora 9) specify ELFOSABI_NONE (0) as their OSABI.
This is fine - however I've received a shared library from a supplier where the OSABI given in the ELF header is ELFOSABI_LINUX (3).
This doesn't sound like an unreasonable value for a shared library intended for a Linux system, however it is a different value to that of all my other libraries - and so when I try to open this library, with dlopen(), from one of my other libraries this fails with the error "ELF file OS ABI invalid".
I compiled up the FreeBSD utility brandelf.c and used it to change the OSABI type to 0 and now the library seems to play fine with everything else.
I'm just wondering - why do you think this library is marked as ELFOSABI_LINUX? I'm guessing maybe they cross compiled on another system and specified some gcc flag that caused this value to be set into the ELF header? I tried to achieve something similar but couldn't determine the appropriate gcc flag or flags.
I'd like to know what the likely cause is, as this particular supplier won't do anything without a lot of hand holding, and I'd like to be able to say "you're probably doing X but this means we have to modify your libraries after we take delivery of them".
Possibly the vendor is cross compiling on FreeBSD or using a very recent Fedora system where anything using STT_GNU_IFUNC will be marked as ELFOSABI_LINUX. If you are trying to use it on Linux there should be no problems with changing it to ELFOSABI_NONE like you have done.
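For reference, the OSABI value is a single byte in the ELF identification block (index 7 of e_ident, EI_OSABI), so inspecting or changing it is simple. The sketch below is a minimal stand-in for what brandelf does, not its actual implementation; the helper names are mine, and the demo works on a fake 16-byte header rather than a real library.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"
EI_OSABI = 7        # index of the OS/ABI byte within e_ident
ELFOSABI_NONE = 0   # also called ELFOSABI_SYSV
ELFOSABI_LINUX = 3  # also called ELFOSABI_GNU


def read_osabi(path):
    """Return the EI_OSABI byte of an ELF file."""
    with open(path, "rb") as f:
        ident = f.read(16)  # e_ident is the first 16 bytes of the file
    if len(ident) < 16 or ident[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file: " + path)
    return ident[EI_OSABI]


def set_osabi(path, value):
    """Overwrite the EI_OSABI byte in place (work on a copy of the library)."""
    read_osabi(path)  # raises if the file is not ELF
    with open(path, "r+b") as f:
        f.seek(EI_OSABI)
        f.write(bytes([value]))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake.so")
        # Fake e_ident: magic, 64-bit class, little-endian, version 1,
        # OSABI = Linux, then 8 padding bytes.
        with open(path, "wb") as f:
            f.write(ELF_MAGIC + bytes([2, 1, 1, ELFOSABI_LINUX]) + bytes(8))
        print("before:", read_osabi(path))
        set_osabi(path, ELFOSABI_NONE)
        print("after:", read_osabi(path))
```

On a real system, readelf -h shows the same field as "OS/ABI", which is a quick way to check a delivered library before and after patching.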

Resources