Heap Consistency Checking on an Embedded Linux System

I get a crash like this:
#0 0x2c58def0 in raise () from /lib/libpthread.so.0
#1 0x2d9b8958 in abort () from /lib/libc.so.0
#2 0x2d9b7e34 in __malloc_consolidate () from /lib/libc.so.0
#3 0x2d9b6dc8 in malloc () from /lib/libc.so.0
I guess it is a heap corruption issue. uClibc does not have mcheck/mprobe. Valgrind does not seem to support MIPS, and my app (which is multi-threaded) depends on hardware-specific drivers. Any suggestions for checking the consistency of the heap and detecting corruption?

I would use a replacement malloc() (see also this answer) that can easily be made to be more verbose. I'm not saying you need garbage collection, but you do seem to need the additional logging facilities that the link provides.
If it is heap corruption, the collector is going to choke on it as well, and give you more meaningful messages. It should not be too difficult to use, get what you need, then stop using (especially if you just let it intercept malloc()).
It's not going to zero in on the problem like Valgrind does, but at least it's an option :)
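To make the replacement-malloc idea a bit more concrete, here is a minimal sketch of a wrapper that puts a guard word on each side of every block and verifies both on free. The names dbg_malloc, dbg_check, dbg_free and DBG_FREE, and the guard value, are made up for illustration; this is not the library the answer links to, just a sketch that assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DBG_CANARY 0xDEADBEEFu   /* arbitrary guard value (assumption) */

/* Header stored in front of the user data. The union pads it to the
 * strictest alignment so the user pointer stays suitably aligned. */
typedef union {
    struct { size_t size; uint32_t canary; } info;
    max_align_t align;
} dbg_header;

/* Allocate `size` bytes with a guard word before and after the block. */
void *dbg_malloc(size_t size)
{
    const uint32_t canary = DBG_CANARY;
    dbg_header *h;

    if (size > SIZE_MAX - sizeof *h - sizeof canary)
        return NULL;
    h = malloc(sizeof *h + size + sizeof canary);
    if (h == NULL)
        return NULL;
    h->info.size = size;
    h->info.canary = canary;
    unsigned char *user = (unsigned char *)(h + 1);
    memcpy(user + size, &canary, sizeof canary);   /* tail guard, may be unaligned */
    return user;
}

/* Returns 0 if both guards are intact, -1 if the head guard is damaged,
 * -2 if the tail guard is damaged. */
int dbg_check(const void *ptr)
{
    assert(ptr != NULL);
    const dbg_header *h = (const dbg_header *)ptr - 1;
    uint32_t tail;

    if (h->info.canary != DBG_CANARY)
        return -1;
    memcpy(&tail, (const unsigned char *)ptr + h->info.size, sizeof tail);
    return tail == DBG_CANARY ? 0 : -2;
}

/* Verify the guards, report the call site on damage, then release the block. */
void dbg_free(void *ptr, const char *file, int line)
{
    if (ptr == NULL)
        return;
    int rc = dbg_check(ptr);
    if (rc != 0) {
        fprintf(stderr, "%s:%d: heap corruption, %s guard of block %p damaged\n",
                file, line, rc == -1 ? "head" : "tail", ptr);
        abort();
    }
    free((dbg_header *)ptr - 1);
}

#define DBG_FREE(p) dbg_free((p), __FILE__, __LINE__)
```

This only catches the damage at the time of the free (or whenever dbg_check is called), not at the moment of the bad write, so it narrows the search rather than pinpointing the culprit. Calling dbg_check on suspicious blocks at a few checkpoints can tighten the window.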

You could write stub drivers that pretend to be the hardware, which should let you build and test your program in a more full-featured environment.

Related

best approach to debug "corrupted double-linked list" crash

I am in the process of debugging a "corrupted double-linked list" crash. I have seen the source and understand the chunk struct and the fd/bk pointers, etc, so I think I know why this crash has occurred. I am now trying to fix it and I have a couple of questions.
Question #1: where (with respect to the pointer returned from malloc) is the malloc_chunks struct maintained? Are they before the memory block or after it?
Question #2: the malloc_chunks for allocated memory are different from the malloc_chunks for unallocated memory. It appears (??) that the allocated buffer case does not have the fd/bk pointers. Is this correct?
Question #3: what is the recommended approach to debug this type of error? I am assuming that I should put a break point for the malloc_chunks so I can break on when the struct is overwritten. But I am not sure how to access those malloc structs so I can set a break point in gdb.
Any suggestions on how to proceed would be very appreciated.
Thanks,
-Andres
what is the recommended approach to debug this type of error?
The usual way is not to peek into GLIBC internals, but to use a tool like Valgrind or AddressSanitizer, either of which is likely to point you straight at the problem.
Update:
Valgrind crashes ...
You should try building the latest Valgrind version from source, and if that still crashes, report the crash to Valgrind developers.
Chances are the Valgrind problem is already fixed, and building new Valgrind and testing your program with it will still be faster than trying to debug GLIBC internals (heap corruption bugs are notoriously difficult to find by program inspection or debugging).
AddressSanitizer, I thought it was a clang only tool -- I do not think it is available for linux.
Two points:
Clang works just fine on Linux, I use it almost every day,
Recent GCC versions have an equivalent -fsanitize=address option.
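A tiny test case can help confirm that the sanitizer is actually wired into your build before you point it at the real program. The sketch below is hypothetical (copy_prefix and the DEMO_OVERFLOW macro are my own names): built normally it is correct, and built with -DDEMO_OVERFLOW it writes one byte past the end of a heap block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a NUL-terminated heap copy of the first `n` bytes of `src`.
 * The caller must guarantee that `src` holds at least `n` bytes. */
char *copy_prefix(const char *src, size_t n)
{
    assert(src != NULL);
#ifdef DEMO_OVERFLOW
    char *dst = malloc(n);        /* BUG on purpose: no room for the terminator */
#else
    char *dst = malloc(n + 1);    /* room for n bytes plus the terminator */
#endif
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, n);
    dst[n] = '\0';                /* with DEMO_OVERFLOW this lands one byte past the block */
    return dst;
}
```

Compiling a caller of this with something like gcc -g -fsanitize=address -DDEMO_OVERFLOW and running it should produce a heap-buffer-overflow report pointing at the dst[n] line, along with the stack of the allocation. If you see that, the same flags applied to your own code have a good chance of catching the write that corrupts the list.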
There are ways to debug heap overruns without valgrind.
One way is to use a malloc debug library such as Electric Fence. It will make your program crash exactly at the moment of accessing an illegal address in the heap.
The other way is to use built-in debug capabilities of GNU malloc. See man mcheck. If you call mcheck_pedantic before the first call to malloc, then every memory block is checked at every allocation. This is very slow but does allow you to isolate the fault.
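The reason a library like Electric Fence can crash at the exact moment of the bad access is the guard-page trick: each block is placed so that it ends right in front of a page with no access permissions. Here is a rough sketch of that idea, not Electric Fence's actual code; guard_alloc and guard_free are hypothetical names, and I'm assuming Linux mmap with MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round `size` up to a whole number of pages. */
static size_t round_up_to_page(size_t size, size_t page)
{
    return (size + page - 1) / page * page;
}

/* Returns `size` bytes whose last byte sits directly in front of an
 * inaccessible guard page, or NULL on failure. */
void *guard_alloc(size_t size)
{
    long page_l = sysconf(_SC_PAGESIZE);
    assert(page_l > 0);
    size_t page = (size_t)page_l;

    if (size == 0 || size > SIZE_MAX - 2 * page)
        return NULL;
    size_t data_len = round_up_to_page(size, page);

    /* Data pages followed by one extra page that becomes the guard. */
    unsigned char *base = mmap(NULL, data_len + page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base + data_len, page, PROT_NONE) != 0) {
        munmap(base, data_len + page);
        return NULL;
    }
    /* Push the block to the end of the data pages so that the first
     * byte past it is already inside the guard page. */
    return base + data_len - size;
}

/* Releases a block from guard_alloc; `size` must match the original request. */
void guard_free(void *ptr, size_t size)
{
    if (ptr == NULL)
        return;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_len = round_up_to_page(size, page);
    unsigned char *base = (unsigned char *)ptr + size - data_len;
    munmap(base, data_len + page);
}
```

Any read or write one byte past the block then raises SIGSEGV immediately, and the core dump or debugger shows the offending instruction. The cost is at least two pages per allocation, and as written the returned pointer is only suitable for byte buffers, because it is not aligned for wider types unless the size happens to be a multiple of the alignment.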

Linux coredump backtrace missing frames

I got a core dump from a segmentation fault in a multi-threaded process. While inspecting the core file using GDB, I found some threads (not all) with a backtrace like this:
Thread 4 (LWP 3344):
#0 0x405ced04 in select () from /lib/arm-linux-gnueabi/libc.so.6
#1 0x405cecf8 in select () from /lib/arm-linux-gnueabi/libc.so.6
#2 0x000007d0 in ?? ()
#3 0x000007d0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I checked our source code and found that those threads do eventually call select(). I'd like to understand why and how the intermediate frames are omitted.
The same pattern also occurs with read() calls.
Any idea what is going on here? I'm afraid that this indicates something wrong with our coredump configuration, or something. Thank you in advance for the help!!
Edit:
Thanks for all responses. I apologize I didn't give enough info. Here are more:
The executable is built with compiler -g and without any optimizations, using -O0.
We generally use less than half of our RAM: 300-400 MB out of 1 GB.
Actually, I also saw this pattern backtrace in different core files (dumped for different faults).
What makes this symptom really weird (different from ordinary stack corruption) is that more than one thread has this backtrace pattern, with frames #0 and #1 exactly the same as the ones above, although the addresses in #2 and #3 may differ.
It looks like you might be compiling without debugging enabled.
Make sure you pass in -g when compiling.
Also, as Joachim Pileborg mentioned in his comment, the repeated stack frame implies that you probably corrupted your stack somewhere. That is usually the result of writing past the end of an array, into the stored frame pointer.
If you want to track down segmentation faults caused by memory errors, or check for memory leaks, it is better to use Valgrind, which gives detailed information about memory leaks.

Under Linux, how do I track down a memory leak in pre-built software?

I have a new Ubuntu Linux Server 64bit 10.04 LTS.
A default install of MySQL with replication turned on appears to be leaking memory.
However, we've tried going back to an earlier version and memory is still leaking but I can't tell where.
What tools/techniques can I use to pinpoint where memory is leaking so that I can rectify the problem?
Valgrind, http://valgrind.org/, can be very useful in these situations. It runs on unmodified executables but it does help tremendously if you can install the debugging symbols. Be sure to use the --show-reachable=yes flag as the leaked memory may still be reachable in some way but just not the way you want it. Also --trace-children in case of a fork. You'll likely have to track down in the start-up script where the executable is called and then add something like the following:
valgrind --show-reachable=yes --trace-children=yes --log-file=/path/to/log SQL-cmdline sqlargs
The man page has lots of other potentially useful options.
Have you tried the MySQL mailing list? Something like this would certainly be of interest to them if you can reproduce it in a straightforward manner.
You can use Valgrind as ninjalj suggests, but I doubt you'll get that close to anything useful. Even if you see a real leak (and they will be hard enough to validate), tracking down the root cause through the C call stacks will likely be very annoying (for example if the leak is triggered by a particular SQL pattern or stored procedure, you'll be looking at the call stack from the resultant optimized query, and not the original calls, which are likely in a different language).
Normally you might have no recourse, and have to resort to tracking it down through callstacks and iterative testing, but you have the source code to MySQL (including the source for the exact default package install), so you can use more advanced tools like MemoryScape (or at least build with symbols in order to provide Valgrind more food for thought).
Try using valgrind.
A very good and powerful tool, which is installed/available for most distributions is Valgrind.
It has a plethora of different options and is pretty much (as far as I've seen) the default profiler under linux systems.

Does mprotect flush the instruction cache on ARM Linux?

I am writing a JIT on ARM Linux that executes an instruction set that contains self-modifying code. The instruction set does not have any cache flush instructions (similar to x86 in that respect).
If I write out some code to a page and then call mprotect on that page, is that sufficient to invalidate the instruction cache? Or do I also need to use the cacheflush syscall on those pages?
You'd expect that the mmap/mprotect syscalls would establish mappings that are updated immediately, and need no further interaction to use the memory ranges as specified. I see that the kernel does indeed flush caches on mprotect. In that case, no cache flush would be required.
However, I also see that some versions of libc do call cacheflush after mprotect, which would imply that some environments would need the caches flushed (or have previously). I'd take a guess that this is a workaround to a bug.
You could always add the call to cacheflush; although it's extra code, it shouldn't be too harmful - at worst, the caches will already be flushed. You could always write a quick test and see what happens...
In Linux specifically, mprotect DOES cacheflush all caches since at least version 2.6.39 (and even before that for sure). You can see that in the code:
https://elixir.bootlin.com/linux/v2.6.39.4/source/mm/mprotect.c#L122 .
If you are writing portable POSIX code, I would call cacheflush anyway, as the standard C library does not demand such behavior from the kernel, nor from the implementation.
Edit: You should also be careful and check what flush_cache_range does on the specific architecture you are implementing for, as on some architectures (like ARM64) this function does nothing...
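Since the kernel behavior differs between architectures, a defensive sketch is to flush explicitly after flipping the page to executable. The helper name jit_publish below is hypothetical, and instead of the raw cacheflush syscall from the question I'm using the GCC/Clang builtin __builtin___clear_cache; my understanding is that on ARM Linux it ends up performing the same flush, but treat that as an assumption and check your toolchain.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copies `len` bytes of generated machine code into a fresh mapping,
 * makes it read+execute, flushes the instruction cache for that range,
 * and returns the mapping (NULL on failure). */
void *jit_publish(const void *code, size_t len)
{
    assert(code != NULL);
    if (len == 0)
        return NULL;

    /* Start writable but not executable. */
    char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memcpy(mem, code, len);

    /* Flip to executable; the kernel may or may not flush caches here. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }

    /* Explicit flush, so correctness does not depend on what mprotect
     * does on this particular architecture or kernel version. */
    __builtin___clear_cache(mem, mem + len);
    return mem;
}
```

For self-modifying guest code you would repeat the last two steps (or just the flush, if the page stays writable and executable) every time the generated code for a range is rewritten. The extra flush is cheap compared to chasing a stale-icache bug that only shows up on some cores.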
I believe you do not have to explicitly flush the cache.
Which processor is this? ARMv5? ARMv7?

Can you recommend a good debugging malloc library for linux?

Can you recommend a good debugging malloc library for linux? I know there are a lot of options out there, I just need to know which libraries people are actually using to solve real-life problems.
Thanks!
EDIT: I know about Valgrind, but sometimes the performance is really too low.
Valgrind. :-) It's not a malloc library, but, it's really good at finding memory management and memory usage bugs.
http://valgrind.org/ for finding memory leaks and heap corruption.
http://dmalloc.com/ for general purpose heap debugging.
GCC now comes with sanitizers, which are much faster than Valgrind. You can check the different compiler options under -fsanitize.
The GNU C library itself has some debugging features and hooks you can use to add your own.
For documentation on a Linux system type info libc and then g Heap<TAB>. Another useful info node is "Hooks for Malloc", you can get there with g Hooks<TAB>
This might not be very useful to you, but you could write your own malloc wrapper. In our special "diagnostic" builds it keeps a table of all outstanding allocations (including the file name and line number where the allocation occurred) and prints out anything that was still outstanding at exit time. It also uses canary words (to check for buffer overflows) and a combination of memory re-writing and block checksumming after free and before reallocation (to check for use-after-free).
If your product is sufficiently large it might be annoying to have to find-replace your entire source, hoping for the best. Also, the development time for your own malloc wrapper is probably not negligible. Doing lots of heavyweight stuff like what I mentioned above probably won't help out your speed problem, either. Writing your own wrapper would allow the most flexibility, though.
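For what it's worth, the bookkeeping part of such a wrapper can be quite small. The following is only a sketch of the outstanding-allocations table described above, not our actual diagnostic build; the names track_malloc, track_free, track_report and TRACK_MALLOC, and the fixed table size, are invented for the example. It is not thread-safe as written, so a multi-threaded program would need a mutex around the table.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define TRACK_MAX 1024   /* assumed fixed table size, for simplicity */

typedef struct {
    void       *ptr;     /* NULL marks a free slot */
    size_t      size;
    const char *file;    /* call site that allocated the block */
    int         line;
} track_entry;

static track_entry track_table[TRACK_MAX];

/* malloc() that records the call site; fails if the table is full so
 * that every live block is guaranteed to be tracked. */
void *track_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].ptr == NULL) {
            track_table[i] = (track_entry){ p, size, file, line };
            return p;
        }
    }
    fprintf(stderr, "%s:%d: allocation table full\n", file, line);
    free(p);
    return NULL;
}

/* free() that clears the record; an unknown pointer is reported and
 * left alone, since it is a double free or was not allocated here. */
void track_free(void *p)
{
    if (p == NULL)
        return;
    for (size_t i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].ptr == p) {
            track_table[i].ptr = NULL;
            free(p);
            return;
        }
    }
    fprintf(stderr, "free of untracked pointer %p\n", p);
}

/* Prints every outstanding allocation and returns how many there are. */
size_t track_report(FILE *out)
{
    size_t count = 0;
    for (size_t i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].ptr != NULL) {
            fprintf(out, "outstanding: %zu bytes at %p from %s:%d\n",
                    track_table[i].size, track_table[i].ptr,
                    track_table[i].file, track_table[i].line);
            count++;
        }
    }
    return count;
}

#define TRACK_MALLOC(sz) track_malloc((sz), __FILE__, __LINE__)
```

Calling track_report(stderr) from a small function registered with atexit() gives the "anything still outstanding at exit" listing. The linear scan is slow for large tables, which is part of why this kind of wrapper does not help with the speed problem.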

Resources