Linux coredump backtrace missing frames

I got a core dump from a multi-threaded process that crashed with a segmentation fault. While inspecting the core file with GDB, I found that some threads (not all) have a backtrace like this:
Thread 4 (LWP 3344):
#0 0x405ced04 in select () from /lib/arm-linux-gnueabi/libc.so.6
#1 0x405cecf8 in select () from /lib/arm-linux-gnueabi/libc.so.6
#2 0x000007d0 in ?? ()
#3 0x000007d0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I checked our source code and found that those threads do eventually call select(), so I'd like to understand why/how the middle frames are omitted.
The same pattern also occurs with read() calls.
Any idea what is going on here? I'm afraid this indicates something wrong with our core dump configuration. Thanks in advance for the help!
Edit:
Thanks for all the responses. I apologize for not giving enough info. Here are more details:
The executable is built with -g and without optimizations (-O0).
We generally use less than half of our RAM (300-400 MB out of 1 GB).
I have also seen this backtrace pattern in other core files (dumped for different faults).
What makes this symptom really weird (different from ordinary stack corruption) is that more than one thread has this backtrace pattern, with frames #0 and #1 exactly the same as above, while the addresses in frames #2 and #3 may differ.

It looks like you might be compiling without debugging enabled.
Make sure you pass in -g when compiling.
Also, as Joachim Pileborg mentioned in his comment, the repeated stack frame implies that you probably corrupted your stack somewhere. That is usually the result of writing past the end of an array, into the stored frame pointer.
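For reference, here is a minimal sketch (hypothetical code, not from the question) of the kind of overflow that clobbers the saved frame pointer and return address, producing "?? ()" or repeated frames in a backtrace. Compiled with -fno-stack-protector, the overwrite goes unnoticed until the function returns:

#include <string.h>

/* Writing past the end of a stack array overwrites the saved frame
 * pointer and return address that sit above it on the stack. GDB can
 * then no longer unwind past this frame. */
static void overflow(const char *input)
{
    char buf[8];
    strcpy(buf, input);   /* no bounds check: anything longer than
                             7 chars + NUL spills into saved registers */
}

int main(void)
{
    overflow("this string is far longer than eight bytes");
    return 0;             /* likely never reached intact */
}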

If you want to investigate segmentation faults caused by memory-related problems, or check for memory leaks, it is better to use Valgrind, which reports detailed information about memory errors and leaks.
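A typical invocation (the program name is a placeholder; both flags are standard Memcheck options):

valgrind --leak-check=full --track-origins=yes ./your_program

--track-origins=yes makes Memcheck report where uninitialised values came from, which is often the first clue to the corruption.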


Two questions regarding threads

If there is a missing/corrupted library in the GDB core, how do I isolate it?
I also read that a thread could have overwritten its own stack. How do I detect that?
How do I isolate the above problems with the backtrace below?
/etc/gdb/gdbinit:105: Error in sourced command file:
Error while executing Python code.
Reading symbols from /opt/hsp/bin/addrman...done.
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
warning: Corrupted shared library list: 0x0 != 0x4ed700
warning: no loadable sections found in added symbol-file system-supplied DSO at
0x7ffd50ff6000
Core was generated by `addrman --notification-socket
/opt/hsp/sockets/memb_notify.socket'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
292 ./ps/ps_lock_track.h: No such file or directory.
(gdb) bt
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
#1 0x0000000000000000 in ?? ()
It looks like the core file is corrupt, likely due to heap or stack corruption. Corruption is oftentimes the result of a buffer overflow or other undefined behavior.
If you are running on Linux, I would try Valgrind. It can often spot corruption very quickly. Windows has some similar tools.
Yes, a multithreaded application can overflow the stack. Each thread is only allocated a limited amount of stack space. This usually only happens if you have a very deep function call stack or you are allocating large local objects on the stack.
Some interesting information here and here on setting the stack size for Linux applications.
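If the per-thread stack size turns out to be the culprit, here is a minimal sketch of raising it with the POSIX threads API (the 8 MiB figure and the worker function are illustrative, not from the question):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;   /* deep call chains and large locals live on this stack */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* Request an 8 MiB stack instead of the platform default. */
    if (pthread_attr_setstacksize(&attr, 8 * 1024 * 1024) != 0)
        fprintf(stderr, "could not set stack size\n");

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}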
Faced with your problem, I would:
Check all the callers of the lktrk_locker_set method. Carefully investigate each, if possible, to see if there is an obvious stack overflow or heap corruption
Try to use Valgrind or similar tools to spot the issue
Add debug logging to isolate the issue
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
The above error is usually a sign that you gave GDB different system libraries (or the main binary) from the ones used when the core dump was produced.
Either you are analyzing a "production" core dump on a development machine, or you've upgraded system libraries between the time the core dump was produced and the time you are analyzing it, or you've rebuilt the main binary.
See this answer for what to do if one of the above is correct.
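In short, the fix is to point GDB at copies of the binary and libraries that were actually in use when the core was produced, along these lines (the paths are placeholders):

(gdb) set sysroot /path/to/copy/of/production/root
(gdb) file /path/to/matching/addrman
(gdb) core-file core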

best approach to debug "corrupted double-linked list" crash

I am in the process of debugging a "corrupted double-linked list" crash. I have seen the source and understand the chunk struct and the fd/bk pointers, etc, so I think I know why this crash has occurred. I am now trying to fix it and I have a couple of questions.
Question #1: where (with respect to the pointer returned from malloc) is the malloc_chunks struct maintained? Is it before the memory block or after it?
Question #2: the malloc_chunks for allocated memory are different from the malloc_chunks for unallocated memory. It appears (??) that the allocated buffer case does not have the fd/bk pointers. Is this correct?
Question #3: what is the recommended approach to debug this type of error? I am assuming that I should set a watchpoint on the malloc_chunks so I can break when the struct is overwritten, but I am not sure how to access those malloc structs in order to set one in gdb.
Any suggestions on how to proceed would be very appreciated.
Thanks,
-Andres
what is the recommended approach to debug this type of error?
The usual way is not to peek into GLIBC internals, but to use a tool like Valgrind or AddressSanitizer, either of which is likely to point you straight at the problem.
Update:
Valgrind crashes ...
You should try building the latest Valgrind version from source, and if that still crashes, report the crash to Valgrind developers.
Chances are the Valgrind problem is already fixed, and building new Valgrind and testing your program with it will still be faster than trying to debug GLIBC internals (heap corruption bugs are notoriously difficult to find by program inspection or debugging).
AddressSanitizer: I thought it was a Clang-only tool; I do not think it is available for Linux.
Two points:
Clang works just fine on Linux, I use it almost every day,
Recent GCC versions have an equivalent -fsanitize=address option.
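For example (the file and program names are placeholders):

gcc -g -fsanitize=address -o myprog myprog.c
./myprog

When the corruption happens, AddressSanitizer aborts with a report pinpointing the faulting access along with the allocation and free stacks involved.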
There are ways to debug heap overruns without Valgrind.
One way is to use a malloc debug library such as Electric Fence. It will make your program crash exactly at the moment of accessing an illegal address in the heap.
The other way is to use built-in debug capabilities of GNU malloc. See man mcheck. If you call mcheck_pedantic before the first call to malloc, then every memory block is checked at every allocation. This is very slow but does allow you to isolate the fault.
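A minimal sketch of that approach (the deliberate one-byte overrun is only there to show what gets caught):

#include <mcheck.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Must run before the first malloc(); NULL installs the default
       handler, which prints a diagnostic and aborts on corruption. */
    mcheck_pedantic(NULL);

    char *p = malloc(8);
    memset(p, 0, 9);      /* one-byte heap overrun */
    free(p);              /* mcheck detects the clobbered guard here */
    return 0;
}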

How can I check for a malloc() failure within a CUDA kernel?

This is a fairly self-explanatory question. Some background info is appended.
How can I check for a malloc() failure within a CUDA kernel? I googled this and found nothing on what malloc() returns in a CUDA implementation.
In addition, I have no idea how to signal back to the host that there was an error within a CUDA kernel. How can I do this?
I thought one way would be to send an array of chars, one element per kernel thread, and have the kernel write 0x01 to signal an error and 0x00 for no error. Then the host could copy this memory back and check for any non-zero bytes?
But this seems like a waste of memory. Is there a better way? Something like cudaThrowError()? ... maybe? ...
Appended:
I am running into trouble with a cuda error: GPUassert: the launch timed out and was terminated main.cu
If you google this, you will find info for Linux users (who have hybrid graphics solutions) - the fix is sometimes to run with optirun --no-xorg.
However in my case this isn't working.
If I run my program for a small enough data set, I get no errors. For a large enough data set, but not too large, I have to prevent time out errors by passing the --no-xorg flag. For an even larger dataset I get timeout errors regardless of the --no-xorg flag.
This hints to me that perhaps something else is going wrong?
Perhaps a malloc() failure within my kernel if I run out of memory?
I have checked my code and estimated memory usage - I don't think this is the problem, but I would like to check anyway.
How can I check for a malloc() failure within a CUDA kernel?
The behavior is the same as malloc on the host. If a malloc failure occurs, the returned pointer will be NULL.
So check for NULL after a malloc, and do something to address it:
#include <assert.h>
...
/* inside the kernel: */
int *data;
data = (int *)malloc(dsize * sizeof(int));  /* device-side heap allocation */
assert(data != NULL);                        /* halts the kernel if it failed */
...rest of your code...
Notes:
It's legal to use assert in-kernel this way. If the assert is hit, your kernel will halt and return an error to the host, which you can observe with proper CUDA error checking or cuda-memcheck (see the host-side sketch after these notes). This isn't the only possible way to handle a malloc failure; it's just a suggestion.
This may or may not be the problem with your actual code. This is good practice, however.
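A sketch of that host-side check (the kernel launch itself is omitted; cudaDeviceSynchronize() waits for previously launched kernels and returns any error they raised, including a tripped device-side assert):

#include <stdio.h>
#include <cuda_runtime.h>

/* Call after launching the kernel; returns 0 on success. */
int check_kernel_status(void)
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}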

Heap Consistency Checking on Embedded System

I get a crash like this:
#0 0x2c58def0 in raise () from /lib/libpthread.so.0
#1 0x2d9b8958 in abort () from /lib/libc.so.0
#2 0x2d9b7e34 in __malloc_consolidate () from /lib/libc.so.0
#3 0x2d9b6dc8 in malloc () from /lib/libc.so.0
I guess it is a heap corruption issue. uClibc does not have mcheck/mprobe, Valgrind does not seem to support MIPS, and my app (which is multi-threaded) depends on hardware-specific drivers. Any suggestions for checking the consistency of the heap and detecting corruption?
I would use a replacement malloc() (see also this answer) that can easily be made to be more verbose. I'm not saying you need garbage collection, but you do seem to need the additional logging facilities that the link provides.
If it is heap corruption, the collector is going to choke on it as well and give you more meaningful messages. It should not be too difficult to start using it, get what you need, then stop using it (especially if you just let it intercept malloc()).
It's not going to zero in on the problem like Valgrind does, but at least it's an option :)
You could write stub drivers that pretend to be the hardware, which should let you build and test your program in a more full-featured environment.

Porting Unix Ada app to Linux: Seg fault before program begins

I am an intern who was offered the task of porting a test application from Solaris to Red Hat. The application is written in Ada. It works just fine on the Unix side. I compiled it on the Linux side, but now it is giving me a seg fault. I ran the debugger to see where the fault was and got this:
Warning: In non-Ada task, selecting an Ada task.
=> runtime tasking structures have not yet been initialized.
<non-Ada task> with thread id 0b7fe46c0
process received signal "Segmentation fault" [11]
task #1 stopped in _dl_allocate_tls
at 0870b71b: mov edx, [edi] ;edx := [edi]
This seg fault happens before any calls are made or anything is initialized. I have been told that 'tasks' in Ada get started before the rest of the program, and the problem could be with a task that is running.
But here is the kicker: this program just generates some code for another program to use. The OTHER program, when compiled under Linux, gives me the same kind of seg fault with the same kind of error message. This leads me to believe there might be some little tweak I can use to fix all of this, but I just don't have enough knowledge about Unix, Linux, and Ada to figure this one out all by myself.
This is a total shot in the dark, but you can have tasks blow up like this at startup if they are trying to allocate too much local memory on the stack. Your main program can safely use the system stack, but tasks have to have their stacks allocated at startup from dynamic memory, so typically your runtime has a default stack size for tasks. If your task tries to allocate a large array, it can easily blow past that limit. I've had it happen to me before.
There are multiple ways to fix this. One way is to move all your task-local data into package global areas. Another is to dynamically allocate it all.
If you can figure out how much memory would be enough, you have a couple more options. You can make the task a task type, and then use a
for My_Task_Type_Name'Storage_Size use Some_Huge_Number;
statement. You can also use a "pragma Storage_Size(My_Task_Type_Name)", but I think the "for" statement is preferred.
Lastly, with GNAT you can also change the default task stack size with the -d flag to gnatbind.
Off the top of my head, if the code was used on SPARC machines and you're now running on an x86 machine, you may be running into endianness problems.
It's not much help, but it is a common gotcha when going multi-platform.
Hunch: the linking step didn't go right. Perhaps the wrong run-time startup library got linked in?
(How likely to find out what the real trouble was, months after the question was asked?)
