How do you determine where a segfault occurred when ip is (null)? (Linux)

segfault at 0 ip (null) sp bf9ed55c error 4 in appname[8048000+252000]
If I don't have the instruction pointer, how do I determine where the crash occurred? Does the fact that it is (null) mean anything useful?
In appname[8048000+252000], 0x8048000 + 0x252000 = 0x829a000: is that supposed to give a clue? Is 0x829a000 the value I should try to use? Neither the nm output nor the map file gives much help with that.

Things that could set the instruction pointer to NULL:
branch to NULL
call NULL
return to NULL
In the first two cases, the stack is still in the state of the frame where the branch or call came from. In the last case, probably something has clobbered the return address on the stack of the previous function, and it may not be clear what the stack should have been, but usually it's still possible to find some earlier stack frame and try reconstructing what happened from there on.
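For illustration, a minimal C sketch (not the asker's program) that hits the "call NULL" case and produces this kind of log line, with the fault address at 0 and the instruction pointer reported as null (exact formatting varies by kernel version and architecture):

/* null_call.c - jump to address 0 through a NULL function pointer.
 * Build: gcc -g -o null_call null_call.c
 */
typedef void (*fn_t)(void);

int main(void)
{
    fn_t fn = 0;   /* NULL function pointer */
    fn();          /* "call NULL": the return address is pushed first,      */
                   /* so a debugger can usually recover the calling frame   */
    return 0;
}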
Binaries may be loaded at different addresses. appname[8048000+252000] describes a segment of the file appname which was mapped into memory at addresses 0x8048000-0x829a000; it doesn't pinpoint where the fault was.
You will have better luck debugging with a core dump, which would contain details about the state (such as the other registers) and what was mapped into memory where (including the contents of the stack). If you are using systemd, coredumps are stored in the journal and can be retrieved with coredumpctl (for example, start debugging the last crash with coredumpctl gdb).

Related

Stack location in Linux with ASLR

In Linux, with ASLR enabled, is there a range of addresses within which a user-space stack address lies? What about the heap and instruction addresses (the text section)?
In general, is it possible to look at an address and tell if it is for data or for code?
Edit:
I am trying to write a Pintool that looks at the EIP after a return and checks if the EIP points to a data area. Let's assume that NX is not enabled on this system.
For some reason, this was downvoted. Fortunately, the answer can be found here:
https://security.stackexchange.com/questions/185315/stack-location-range-on-linux-for-user-process/185330#185330
cat /proc/self/maps will show the initial location of the main thread's stack. This can be inaccurate for (at least) the following reasons:
you're not in the main thread
any part of the program was built with the -fsplit-stack option, or you call a library that does something similar
you're within a signal handler that requests the sigaltstack stack instead
you do weird alloca tricks like CHICKEN Scheme does to use the stack as a heap
...
Also note that the general areas are not fully random. See the AddressSanitizer project for something that takes advantage of this.
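For the Pintool use case, here is a minimal C sketch (illustrative only, and subject to the caveats above) of finding the [stack] range from /proc/self/maps and checking whether an address falls inside it:

/* stack_range.c - locate the main thread's [stack] mapping and test an address.
 * Build: gcc -o stack_range stack_range.c
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    int local = 0;                        /* lives on the main thread's stack */
    unsigned long start = 0, end = 0;
    char line[512];
    FILE *maps = fopen("/proc/self/maps", "r");

    if (!maps)
        return 1;
    while (fgets(line, sizeof line, maps)) {
        if (strstr(line, "[stack]")) {    /* main thread's stack only */
            sscanf(line, "%lx-%lx", &start, &end);
            break;
        }
    }
    fclose(maps);

    printf("[stack] range: %lx-%lx\n", start, end);
    printf("&local = %p, inside: %d\n", (void *)&local,
           (unsigned long)&local >= start && (unsigned long)&local < end);
    return 0;
}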

3.10 kernel crash BUG() in mark_bootmem()

I get a kernel crash at BUG() here - http://lxr.free-electrons.com/source/mm/bootmem.c?v=3.10#L385 with the following message
kernel BUG at /kernel/mm/bootmem.c:385!
What could be a possible reason for this?
Following is the function call trace
[<c0e165f8>] (mark_bootmem+0xd0/0xe0) from [<c0e05d64>] (bootmem_init+0x16c/0x264)
[<c0e05d64>] (bootmem_init+0x16c/0x264) from [<c0e07980>] (paging_init+0x734/0x7d4)
[<c0e07980>] (paging_init+0x734/0x7d4) from [<c0e03f20>] (setup_arch+0x3e8/0x69c)
[<c0e03f20>] (setup_arch+0x3e8/0x69c) from [<c0e007d8>] (start_kernel+0x78/0x370)
[<c0e007d8>] (start_kernel+0x78/0x370) from [<10008074>] (0x10008074)
Thanks
The mm/bootmem.c file implements the boot memory allocator. The function mark_bootmem marks memory pages between start and end addresses (start is rounded down and end is rounded up to page boundaries) as reserved for this allocator (or as not reserved, when used for freeing).
It iterates over bdata_list, trying to find a region containing the first page of the requested address range. If it can't find one, the BUG() you mentioned is triggered. The same BUG() is triggered if it does find one but the region is not large enough (end is outside the region). So this BUG() means that it wasn't able to find the requested memory region to mark.
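The relevant control flow, paraphrased and simplified from mark_bootmem() in the 3.10 source linked above (the error-rollback path is omitted here):

pos = start;
list_for_each_entry(bdata, &bdata_list, list) {
        if (pos < bdata->node_min_pfn || pos >= bdata->node_low_pfn) {
                BUG_ON(pos != start);    /* part of the range fell in a hole */
                continue;
        }
        max = min(bdata->node_low_pfn, end);
        mark_bootmem_node(bdata, pos, max, reserve, flags);
        if (max == end)
                return 0;                /* whole range covered */
        pos = max;                       /* keep going in the next node */
}
BUG();                                   /* no node covers the (rest of the) range */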
Now, if I understand the kernel code correctly, on normal UMA systems there will be only one entry in bdata_list, and it should describe the range of lowmem pages available in the system. Since you didn't provide much information about your system it's hard to guess the exact reason for the problem, but in general it seems that your memory setup is broken. This is very architecture-specific, so it's hard to tell what exactly is going on.

Relevant debug data for a Linux target

For an embedded ARM system running in the field, there is a need to retrieve relevant debug information when a user-space application crashes. That information will be stored in non-volatile memory so it can be retrieved at a later time. All of it must be collected at runtime, and third-party applications cannot be used due to memory consumption concerns.
So far I have thought of following:
The signal ID and the corresponding PC / memory addresses, in case a signal is raised by the kernel;
Process ID;
What other information do you think is relevant in order to identify the cause of the problem and be able to debug quickly afterwards?
Thank you!
Usually, to be able to understand an issue, you'll need every register (from r0 to r15), the CPSR, and the top of the stack (to be able to determine what happened before the crash). Please also note that when your program is interrupted for an invalid operation (a jump to an invalid address, ...), the processor enters an exception mode, whereas you need to dump the registers and the stack in the context of your process.
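As a sketch of how that dump can be captured in-process (assuming glibc on 32-bit ARM Linux; the arm_* field names of uc_mcontext are specific to that target, and the file and function names here are mine):

/* crash_report.c - SIGSEGV handler that records the faulting context.
 * Build: gcc -g -o crash_report crash_report.c (ARM Linux, glibc)
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>
#include <unistd.h>

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    ucontext_t *uc = ctx;

    /* In a real system, write this to non-volatile storage, and use only
     * async-signal-safe calls (write(), not fprintf()). */
    fprintf(stderr, "pid %d, signal %d, fault address %p\n",
            (int)getpid(), sig, si->si_addr);
    fprintf(stderr, "pc=%08lx lr=%08lx sp=%08lx cpsr=%08lx\n",
            uc->uc_mcontext.arm_pc, uc->uc_mcontext.arm_lr,
            uc->uc_mcontext.arm_sp, uc->uc_mcontext.arm_cpsr);
    /* Copying a few hundred bytes upward from arm_sp captures the stack top. */
    _exit(128 + sig);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 1;              /* force a crash for the demo */
    return 0;
}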
To be able to investigate, using those data, you also must keep the ELF files (with debug information, if possible) from your build, to be able to interpret the content of your registers and stack.
In the end, the more information you keep, the easier debugging is, but it may be expensive to keep every memory section used by your program at the time of the failure (as a matter of fact, I've never done this).
In postmortem analysis, you will face some limits:
Dynamically linked libraries: if your crash occurs in dynamically loaded and linked code, you will also need the library binaries you are using on your target.
Memory corruption: memory corruption usually results in random data being executed as code. On ARM with Linux this will probably lead to a segfault, as you can't jump into another process's memory area and your data will probably be marked as never-execute. Nevertheless, by the time the crash happens you may already have corrupted the data that could have allowed you to identify the source of the corruption. Postmortem analysis isn't always able to identify the failure cause.

Understanding GDB and Segfault Messages

I was recently debugging an application that was segfaulting on a regular basis--I solved the problem, which was relatively mundane (reading from a null pointer), but I have a few residual questions I've been unable to solve on my own.
The gdb stack trace began like this in most cases:
0x00007fdff330059f in __strlen_sse42 () from /lib64/libc.so.6
Using information from /proc/[my proc id]/maps to obtain the base address of the shared library, I could see that the problem occurred at the same instruction of the shared library each time: at offset 0x13259f, which is
pcmpeqb (%rdi),%xmm1
So far, so good. But then, the OS (Linux) would also write out an error message to /var/log/messages that looks like this
[3540502.783205] node[24638]: segfault at 0 ip 00007f8abbe6459f sp 00007fff7bf2f148 error 4 in libc-2.12.so[7f8abbd32000+189000]
This confuses me. On the one hand, the kernel correctly identifies the fault (a user-mode protection fault), and, by subtracting the base address of the shared library from the instruction pointer, we arrive at the same relative offset, 0x13259f, as we do in gdb. But the library the kernel identifies is different, the address of the instruction is different, and the function and instruction within that library are different. That is, the instruction within libc-2.12.so is
0x13259f <__memset_sse2+911>: movdqa %xmm0,-0x43(%edx)
So, my question is, how can gdb and the kernel message agree on the type of fault, and on the offset of the instruction relative to the base address of the shared library, but disagree on the address of the instruction pointer and the shared library being used?
But the library the kernel identifies is different,
No, it isn't. Do ls -l /lib64/libc.so.6, and you'll see that it's a symlink to libc-2.12.so.
the address of the instruction is different
The kernel message is for a different execution from the one you've observed in GDB, and address randomization caused libc-2.12.so to be loaded at a different base address.
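You can observe this directly with a small program (a sketch using glibc's dl_iterate_phdr, not taken from the question): print each object's load address and run it twice. With ASLR on, libc's base address changes between runs, so the absolute instruction pointer differs even though the offset 0x13259f stays the same.

/* show_bases.c - print the load (base) address of every mapped shared object.
 * Build: gcc -o show_bases show_bases.c
 */
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int print_base(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size; (void)data;
    printf("%#14lx  %s\n", (unsigned long)info->dlpi_addr,
           info->dlpi_name[0] ? info->dlpi_name : "(main executable)");
    return 0;   /* 0 = keep iterating */
}

int main(void)
{
    dl_iterate_phdr(print_base, NULL);
    return 0;
}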
and the function and instruction within that library are different. That is, the instruction within libc-2.12.so is 0x13259f <__memset_sse2+911>: movdqa %xmm0,-0x43(%edx)
It is likely that you looked at a different libc-2.12.so from the one that is actually used.

What is SEGV_MAPERR?

What is SEGV_MAPERR, why does it always come up with SIGSEGV?
There are two common kinds of SEGV, which is an error that results from an invalid memory access:
A page was accessed which had the wrong permissions. E.g., it was read-only but your code tried to write to it. This will be reported as SEGV_ACCERR.
A page was accessed that is not even mapped into the address space of the application at all. This will often result from dereferencing a null pointer or a pointer that was corrupted with a small integer value. This is reported as SEGV_MAPERR.
Documentation of a sort (indexed Linux source code) for SEGV_MAPERR is here: https://elixir.bootlin.com/linux/latest/A/ident/SEGV_MAPERR.
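A minimal sketch (not part of either answer) that distinguishes the two cases at runtime by inspecting si_code in a SIGSEGV handler:

/* segv_kind.c - report whether a SIGSEGV was SEGV_MAPERR or SEGV_ACCERR.
 * Build: gcc -o segv_kind segv_kind.c
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    const char *kind =
        si->si_code == SEGV_MAPERR ? "SEGV_MAPERR (address not mapped)" :
        si->si_code == SEGV_ACCERR ? "SEGV_ACCERR (permission violation)" :
                                     "other";
    /* printf() is not async-signal-safe; fine for a demo only. */
    printf("SIGSEGV at %p: %s\n", si->si_addr, kind);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 42;   /* NULL dereference: typically SEGV_MAPERR */
    return 0;
}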
It's a segmentation fault, most probably a dangling pointer issue or some sort of buffer overflow.
SIGSEGV is the signal that terminates the process when a segmentation fault occurs.
Check for dangling pointers as well as buffer overflows.
Enabling core dumps will help you determine the problem.
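Core dumps are usually enabled from the shell with ulimit -c unlimited; the equivalent from inside the program is a short sketch like the one below (hypothetical helper name; where the core file lands depends on /proc/sys/kernel/core_pattern).

/* Raise the core-file size limit so a crash leaves a core dump behind. */
#include <sys/resource.h>

int enable_core_dumps(void)
{
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
    return setrlimit(RLIMIT_CORE, &rl);
}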

Resources