Best approach to debug a "corrupted double-linked list" crash - Linux

I am in the process of debugging a "corrupted double-linked list" crash. I have read the source and understand the chunk struct, the fd/bk pointers, etc., so I think I know why this crash has occurred. I am now trying to fix it and I have a couple of questions.
Question #1: where (with respect to the pointer returned from malloc) is the malloc_chunk struct maintained? Is it before the memory block or after it?
Question #2: the malloc_chunk for allocated memory is different from the malloc_chunk for unallocated memory. It appears (??) that in the allocated case there are no fd/bk pointers. Is this correct?
Question #3: what is the recommended approach to debug this type of error? I am assuming that I should set a breakpoint (or watchpoint) on the malloc_chunk so I can stop when the struct is overwritten. But I am not sure how to access those malloc structs so I can set it in gdb.
Any suggestions on how to proceed would be very appreciated.
Thanks,
-Andres

what is the recommended approach to debug this type of error?
The usual way is not to peek into GLIBC internals, but to use a tool like Valgrind or AddressSanitizer, either of which is likely to point you straight at the problem.
Update:
Valgrind crashes ...
You should try building the latest Valgrind version from source, and if that still crashes, report the crash to Valgrind developers.
Chances are the Valgrind problem is already fixed, and building new Valgrind and testing your program with it will still be faster than trying to debug GLIBC internals (heap corruption bugs are notoriously difficult to find by program inspection or debugging).
AddressSanitizer: I thought it was a Clang-only tool; I do not think it is available for Linux.
Two points:
1. Clang works just fine on Linux; I use it almost every day.
2. Recent GCC versions (4.8 and later) have an equivalent -fsanitize=address option.

There are ways to debug heap overruns without valgrind.
One way is to use a malloc debug library such as Electric Fence. It makes your program crash at the exact moment it accesses an illegal address in the heap.
The other way is to use built-in debug capabilities of GNU malloc. See man mcheck. If you call mcheck_pedantic before the first call to malloc, then every memory block is checked at every allocation. This is very slow but does allow you to isolate the fault.

Related

How does one go about fixing a memory leak?

I've got this game server that leaks memory until I have to reboot or get a BSOD, whether I download the pre-compiled binary or compile the source code myself. I'm not super keen on C++; I'm just taking classes for my degree right now, but I can look at the code and understand what's going on. I'm just not 'fluent'.
Specifically, looking at Resource Monitor, the "Modified" memory type just fills and fills constantly, by about 3-5 MB every 5 seconds.
Is there anything I can do about this?
There is a tool that is helpful for finding memory leaks: http://valgrind.org/
If you have ever heard of a tool called Valgrind, you can run your C++ code under it to see exactly where the leaks are.
http://valgrind.org/

Under Linux, how do I track down a memory leak in pre-built software?

I have a new Ubuntu Linux Server 64-bit 10.04 LTS.
A default install of MySQL with replication turned on appears to be leaking memory.
We've tried going back to an earlier version, but memory is still leaking, and I can't tell where.
What tools/techniques can I use to pinpoint where memory is leaking so that I can rectify the problem?
Valgrind, http://valgrind.org/, can be very useful in these situations. It runs on unmodified executables, but it helps tremendously if you can install the debugging symbols. Be sure to use the --show-reachable=yes flag, as the leaked memory may still be reachable in some way, just not the way you want it. Also add --trace-children=yes in case the server forks. You'll likely have to find the place in the start-up script where the executable is launched and then change it to something like the following:
valgrind --show-reachable=yes --trace-children=yes --log-file=/path/to/log SQL-cmdline sqlargs
The man page has lots of other potentially useful options.
Have you tried the MySQL mailing list? Something like this would certainly be of interest to them if you can reproduce it in a straightforward manner.
You can use Valgrind as ninjalj suggests, but I doubt you'll get that close to anything useful. Even if you see a real leak (and they will be hard enough to validate), tracking down the root cause through the C call stacks will likely be very annoying (for example if the leak is triggered by a particular SQL pattern or stored procedure, you'll be looking at the call stack from the resultant optimized query, and not the original calls, which are likely in a different language).
Normally you might have no recourse, and have to resort to tracking it down through callstacks and iterative testing, but you have the source code to MySQL (including the source for the exact default package install), so you can use more advanced tools like MemoryScape (or at least build with symbols in order to provide Valgrind more food for thought).
Try using valgrind.
A very good and powerful tool, installed or available on most distributions, is Valgrind.
It has a plethora of options and is pretty much (as far as I've seen) the default memory debugger and profiler on Linux systems.

Can you recommend a good debugging malloc library for linux?

Can you recommend a good debugging malloc library for linux? I know there are a lot of options out there, I just need to know which libraries people are actually using to solve real-life problems.
Thanks!
EDIT: I know about Valgrind, but sometimes the performance is really too low.
Valgrind. :-) It's not a malloc library, but it's really good at finding memory management and memory usage bugs.
http://valgrind.org/ for finding memory leaks and heap corruption.
http://dmalloc.com/ for general purpose heap debugging.
GCC now comes with sanitizers, which are much faster than Valgrind. You can check the different compiler options under -fsanitize. More info here.
The GNU C library itself has some debugging features and hooks you can use to add your own.
For documentation on a Linux system, type info libc and then g Heap<TAB>. Another useful info node is "Hooks for Malloc"; you can get there with g Hooks<TAB>.
This might not be very useful to you, but you could write your own malloc wrapper. In our special "diagnostic" builds it keeps a table of all outstanding allocations (including the file name and line number where the allocation occurred) and prints out anything that was still outstanding at exit time. It also uses canary words (to check for buffer overflows) and a combination of memory re-writing and block checksumming after free and before reallocation (to check for use-after-free).
If your product is sufficiently large it might be annoying to have to find-replace your entire source, hoping for the best. Also, the development time for your own malloc wrapper is probably not negligible. Doing lots of heavyweight stuff like what I mentioned above probably won't help out your speed problem, either. Writing your own wrapper would allow the most flexibility, though.

Heap Consistency Checking on Embedded System

I get a crash like this:
#0 0x2c58def0 in raise () from /lib/libpthread.so.0
#1 0x2d9b8958 in abort () from /lib/libc.so.0
#2 0x2d9b7e34 in __malloc_consolidate () from /lib/libc.so.0
#3 0x2d9b6dc8 in malloc () from /lib/libc.so.0
I guess it is a heap corruption issue. uClibc does not have mcheck/mprobe. Valgrind does not seem to support MIPS, and my app (which is multi-threaded) depends on hardware-specific drivers. Any suggestions for checking the consistency of the heap and detecting corruption?
I would use a replacement malloc() (see also this answer) that can easily be made to be more verbose. I'm not saying you need garbage collection, but you do seem to need the additional logging facilities that the link provides.
If it is heap corruption, the collector is going to choke on it as well, and give you more meaningful messages. It should not be too difficult to use, get what you need, then stop using (especially if you just let it intercept malloc()).
It's not going to zero in on the problem like Valgrind does, but at least it's an option :)
You could write stub drivers that pretend to be the hardware, which should let you build and test your program in a more full-featured environment.

Porting Unix ada app to Linux: Seg fault before program begins

I am an intern who was offered the task of porting a test application from Solaris to Red Hat. The application is written in Ada. It works just fine on the Unix side. I compiled it on the linux side, but now it is giving me a seg fault. I ran the debugger to see where the fault was and got this:
Warning: In non-Ada task, selecting an Ada task.
=> runtime tasking structures have not yet been initialized.
<non-Ada task> with thread id 0b7fe46c0
process received signal "Segmentation fault" [11]
task #1 stopped in _dl_allocate_tls
at 0870b71b: mov edx, [edi] ;edx := [edi]
This seg fault happens before any calls are made or anything is initialized. I have been told that 'tasks' in Ada get started before the rest of the program, and the problem could be with a task that is running.
But here is the kicker. This program just generates some code for another program to use. The OTHER program, when compiled under linux gives me the same kind of seg fault with the same kind of error message. This leads me to believe there might be some little tweak I can use to fix all of this, but I just don't have enough knowledge about Unix, Linux, and Ada to figure this one out all by myself.
This is a total shot in the dark, but you can have tasks blow up like this at startup if they are trying to allocate too much local memory on the stack. Your main program can safely use the system stack, but tasks have to have their stack allocated at startup from dynamic memory, so typically your runtime has a default stack size for tasks. If your task tries to allocate a large array, it can easily blow past that limit. I've had it happen to me before.
There are multiple ways to fix this. One way is to move all your task-local data into package global areas. Another is to dynamically allocate it all.
If you can figure out how much memory would be enough, you have a couple more options. You can make the task a task type, and then use a
for My_Task_Type_Name'Storage_Size use Some_Huge_Number;
statement. You can also place a "pragma Storage_Size (Some_Huge_Number);" inside the task type's declaration, but I think the "for" statement is preferred.
Lastly, with GNAT you can also change the default task stack size with the -d flag to gnatbind.
Off the top of my head, if the code was used on SPARC machines and you're now running on an x86 machine, you may be running into endian problems.
It's not much help, but it is a common gotcha when going multi-platform.
Hunch: the linking step didn't go right. Perhaps the wrong run-time startup library got linked in?
(How likely are we to find out what the real trouble was, months after the question was asked?)
