Finding allocation site for double-free errors (with valgrind) - linux

Given a double-free error (reported by valgrind), is there a way to find out where the memory was allocated? Valgrind only tells me the location of the deallocation site (i.e. the call to free()), but I would like to know where the memory was allocated.

To get Valgrind to keep track of allocation stack traces, you have to use the options:
--track-origins=yes --keep-stacktraces=alloc-and-free
Valgrind will then report the allocation stack trace in a "Block was alloc'd at" section, just after the "Address ... inside a block of size x free'd" alert.
If your application is large, the --error-limit=no --num-callers=40 options may be useful too.
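As a minimal illustration (a hypothetical program, not taken from the question), the double free below produces both stack traces when run with those options:

/* double_free.c - hypothetical example.
   Build: gcc -g double_free.c -o double_free
   Run:   valgrind --keep-stacktraces=alloc-and-free ./double_free */
#include <stdlib.h>

int main(void)
{
    char *p = malloc(32);   /* allocation site, reported under "Block was alloc'd at" */
    free(p);                /* first, legitimate free */
    free(p);                /* double free: reported as "Invalid free()" with both stacks */
    return 0;
}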

The first check I would do is to verify that the error really is a double free. Sometimes running a program (including under valgrind) reports a double-free error when the real problem is memory corruption (for example a heap buffer overflow).
The best way to check is to apply the advice detailed in the answers to: How to track down a double free or corruption error in C++ with gdb.
First of all, you can try to compile your program with the flags -fsanitize=address -g. This instruments the program at runtime to keep track of all allocations, detect overflows, etc.
In any case, if the problem is indeed a double-free, the error message should contain all the necessary information for you to debug the problem.
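For illustration, a heap buffer overflow like the hypothetical one below is the kind of corruption that can masquerade as a double free; built with -fsanitize=address -g, the report points at the overflowing write and includes the allocation stack:

/* overflow.c - hypothetical example.
   Build: gcc -fsanitize=address -g overflow.c -o overflow */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(16);
    memset(buf, 'A', 24);   /* writes 8 bytes past the block: heap-buffer-overflow,
                               and possibly corrupted allocator metadata */
    free(buf);
    return 0;
}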

Related

Valgrind shows memory leak but no memory allocation took place

This is a rather simple question.
In my school we use a remote CentOS server to compile and test our programs. For some reason, valgrind always shows a 4096-byte leak even though no malloc was used. Does anyone here have any idea where this problem might stem from?
Your program makes a call to printf. The C library might allocate memory for its own usage. More generally, depending on the OS/libc/..., various allocations might be done just to start a program.
Note also that in this case, you can see that there is one block still allocated at exit, and that this block is part of the suppressed count. That means that valgrind's default suppression file already ensures that this memory does not appear in the list of leaks to be examined.
In summary: no problem.
In any case, when you suspect you have a leak, you can look at the details of the leaks, e.g. their allocation stack traces, to see whether they are triggered by your application.
In addition to #phd's answer, there are a few things you can do to see more clearly what is going on.
If you run Valgrind with -s or -v it will show details of the suppressions used.
You can use --trace-malloc=yes to see all calls to allocation functions (only do that for small applications). Similarly, you can run with --default-suppressions=no, and then you will see the details of the memory (with --leak-check=full --show-reachable=yes in this case).
Finally, are you using an old CentOS / GNU libc? A few years ago Valgrind gained a mechanism to clean up things like I/O buffers, so you shouldn't see this sort of message with a recent Valgrind and a recent Linux + libc.
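As a hypothetical illustration, even a trivial program like the one below can leave a block allocated at exit that comes from the C library rather than from your code; the flags mentioned above make the suppression and allocation details visible:

/* hello.c - hypothetical example; any block reported here comes from libc, not from this code.
   Build: gcc -g hello.c -o hello
   Run:   valgrind -s --leak-check=full --show-reachable=yes ./hello */
#include <stdio.h>

int main(void)
{
    printf("hello\n");   /* stdio may allocate an internal buffer that is never freed */
    return 0;
}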

Does a stack overflow always lead to a segmentation fault error?

In Linux, when a running program attempts to use more stack space than the limit (a stack overflow), that usually results in a "segmentation fault" error and the execution is aborted.
Is it guaranteed that exceeding the stack space limit will always cause a segmentation fault error? Or could it happen that the program continues to run, possibly with some erroneous behavior due to data having been corrupted?
Another way of putting this: if a program misbehaves by producing wrong results but without a crash, can the cause still be a stack overflow?
Edit: to clarify, this question is not about "stack buffer overflow"; it is about stack overflow, when the stack space used by the program exceeds the stack size limit (the limit that, on Linux, is given by ulimit -s).
A stack overflow turning into an access violation requires memory management hardware of some sort. Without hardware-assisted memory protection, an overgrown stack will collide with some other memory allocation, causing mutual corruption.
On demand-paged virtual memory operating systems, the upper limit of the stack is protected by a guard page: a page of virtual memory which is reserved (will not be allocated to anything) and marked "not present" so that accessing it generates a violation. A guard page is only so many bytes wide; a stack pointer can still accidentally increment over the guard page and land in some unrelated writable memory (such as a mapping belonging to a heap allocation) where havoc will be wreaked without necessarily triggering any memory access violation.
In the C language we can easily cause large stack increments by declaring large, uninitialized, non-static block-scoped arrays, like char array[8192]; (twice as large as a 4096-byte guard page). Using features like alloca or C99 variable-length arrays, we can do this dynamically: we can write a program which reads an integer value as a run-time input and grows the stack by that much.
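A hedged sketch of that idea: the run-time-sized array below can move the stack pointer far past a single guard page in one step, so the first write may land in unrelated writable memory instead of faulting (whether it faults depends on the guard-page size, the stack limit, and what happens to be mapped nearby):

/* stack_jump.c - illustrative only. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    size_t n = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1 << 20;
    char vla[n];          /* C99 VLA: grows the stack by a run-time amount */
    vla[0] = 'x';         /* this store may skip straight over the guard page */
    printf("%c\n", vla[0]);
    return 0;
}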
I debugged a problem many years ago whereby third-party code had debug logging macros, inside whose expansions there was a temporary array like char print_buf[8192] that was used for formatting messages. This was used in a multi-threaded application with many threads, whose stacks were reduced to just 64 kilobytes in size. Thanks to this print_buf, a thread's overflown stack leaped right past the guard page and landed in another thread's stack, corrupting its local variables and causing the proverbial "hilarity to ensue".

Following memory allocation in gdb

Why is memory consumption jumping unpredictably as I step through a program in the gdb debugger? I'm trying to use gdb to find out why a program is using far more memory than it should, and it's not cooperating.
I step through the source code while monitoring process memory usage, but I can't find what line(s) allocate the memory for two reasons:
Reported memory usage only jumps up in increments of (usually, but not always exactly) 64 MB. I suspect I'm seeing the effects of some memory manager I don't know about which reserves 64 MB at a time and masks multiple smaller allocations.
The jump doesn't happen at a consistent location in code. Not only does it occur on different lines during different gdb runs; it also sometimes happens in illogical places like the closing bracket of a (c++) function. Is it possible that gdb itself is affecting memory allocations?
Any ideas/suggestions for more effective tools to help me drill down to the code lines that are really responsible for these memory allocations?
Here's some relevant system info: I'm running x86_64-redhat-linux-gnu version 7.2-64.el6-5.2 on a virtual CentOS Linux machine under Windows. The program is built on a remote server via a complicated build script, so tracking down exactly what options were used at any point is itself a bit of a chore. I'm monitoring memory usage both with the top utility ("virt" or virtual memory column) and by reading the real-time monitoring file /proc/<pid>/status, and they agree. Since this program uses a large suite of third-party libraries, there may be one or more overridden malloc() functions involved somewhere that I don't know about--hunting them down is part of this task.
gdb, left to its own devices, will not affect the memory use of your program, though a run under gdb may differ from a standalone run for other reasons.
However, this also depends on the way you use gdb. If you are just setting simple breakpoints, stepping, and printing things, then you are ok. But sometimes, to evaluate an expression, gdb will allocate memory in the inferior. For example, if you have a breakpoint condition like strcmp(arg, "string") == 0, then gdb will allocate memory for that string constant. There are other cases like this as well.
This answer is in several parts because there were several things going on:
Valgrind with the Massif tool (a memory profiler) was much more helpful than gdb for this problem. Sometimes a quick look with the debugger works, sometimes it doesn't. See http://valgrind.org/docs/manual/ms-manual.html (a minimal usage sketch follows after these points).
top is a poor tool for profiling memory usage because it only reports virtual memory allocations, which in this case were about 3x the actual heap memory usage. Virtual memory is mapped and made available by the Unix kernel when a process asks for a memory block, but it's not necessarily used. The underlying system call is mmap(). I still don't know how to check the block size. top can only tell you what the Unix kernel knows about your memory consumption, which isn't enough to be helpful. Don't use it (or the memory files under /proc/) to do detailed memory profiling.
Memory allocation when stepping out of a function was caused by autolocks: a thread-lock class whose destructor releases the lock when it goes out of scope. A different thread then goes into action and allocates some memory, leaving the operator (me) mystified. The non-repeatability is probably because some threads were waiting for external resources like Internet connections.
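As a minimal Massif usage sketch (a hypothetical allocation-heavy program standing in for the real one):

/* massif_demo.c - hypothetical example.
   Build: gcc -g massif_demo.c -o massif_demo
   Run:   valgrind --tool=massif ./massif_demo
   View:  ms_print massif.out.<pid>   (heap profile with the allocating stack traces) */
#include <stdlib.h>

int main(void)
{
    char *blocks[100];
    for (int i = 0; i < 100; i++)
        blocks[i] = malloc(1 << 20);   /* ~100 MB of heap, attributed to this line by Massif */
    for (int i = 0; i < 100; i++)
        free(blocks[i]);
    return 0;
}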

clGetPlatformIDs Memory Leak

I'm testing my code on Ubuntu 12.04 with NVIDIA hardware.
No actual OpenCL processing takes place, but my initialization code still runs. This code calls clGetPlatformIDs, and Valgrind is reporting a memory leak:
==2718== 8 bytes in 1 blocks are definitely lost in loss record 4 of 74
==2718== at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2718== by 0x509ECB6: ??? (in /usr/lib/nvidia-current/libOpenCL.so.1.0.0)
==2718== by 0x50A04E1: ??? (in /usr/lib/nvidia-current/libOpenCL.so.1.0.0)
==2718== by 0x509FE9F: clGetPlatformIDs (in /usr/lib/nvidia-current/libOpenCL.so.1.0.0)
I was unaware this was even possible. Can this be fixed? Note that no special deinitialization is currently taking place; do I need to call something after this? The docs don't mention anything about having to deallocate anything.
Regarding the comment: "Check this out: devgurus.amd.com/thread/136242. valgrind cannot deal with custom memory allocators by design, which OpenCL is likely using"
to quote from the link given: "The behaviour not to free pools at the exit could be called a bug of the library though."
If you want to create a pool of memory and allocate from that, go ahead; but you should still properly deallocate it. A memory pool as a whole is no less complex than a regular memory reference and deserves at least the same attention, if not more. Also, an 8-byte structure is highly unlikely to be a memory pool.
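To make the point concrete (a deliberately tiny, hypothetical pool, nothing to do with OpenCL's internals), the backing block still has to be released at shutdown; that is what keeps a tool like valgrind from flagging it:

/* pool.c - hypothetical bump-allocator pool. */
#include <stdlib.h>

struct pool { char *base; size_t size; size_t used; };

static struct pool *pool_create(size_t size)
{
    struct pool *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->base = malloc(size);
    if (!p->base) { free(p); return NULL; }
    p->size = size;
    p->used = 0;
    return p;
}

static void *pool_alloc(struct pool *p, size_t n)
{
    if (p->used + n > p->size) return NULL;
    void *ptr = p->base + p->used;
    p->used += n;
    return ptr;                /* individual allocations are never freed... */
}

static void pool_destroy(struct pool *p)
{
    if (!p) return;
    free(p->base);             /* ...but the pool as a whole must be released */
    free(p);
}

int main(void)
{
    struct pool *p = pool_create(1024);
    char *msg = p ? pool_alloc(p, 16) : NULL;
    (void)msg;
    pool_destroy(p);
    return 0;
}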
Tim Child would have a point about how you use clGetPlatformIDs if it were designed to return allocated memory. However, reading http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetPlatformIDs.html, I am not sufficiently convinced that this is the case.
The leak in question may or may not be serious, and may or may not accumulate over successive calls, but you might be left only with the option to report the bug to NVIDIA in the hope that they fix it, or to find a different OpenCL implementation for development. Still, there might be reasons for an OpenCL library to create references to data which, from the viewpoint of valgrind, are not in use.
Sadly, this still leaves us with a memory leak caused by an external factor we cannot control, and it still leaves us with excess valgrind output.
Say you are sufficiently sure you are not responsible for this leak (say, we know for a fact that an NVIDIA engineer allocated a random value in OpenCL.so which he didn't deallocate just to spite you). Valgrind has a flag, --gen-suppressions=yes, with which it can print a ready-made suppression entry for each error it reports; you can collect these entries in a file and feed them back to valgrind with --suppressions=$filename. Read the valgrind manual for more details about how it works.
Be very wary of using suppressions, though. Obviously, suppressing errors does not fix them, and liberal use of the mechanism will lead to situations where you suppress errors made by your code rather than by nvidia or valgrind. Do not suppress warnings unless you are absolutely sure where they come from, and regularly reassess your suppressions.
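For reference, a suppression entry for the stack shown above would look roughly like the hand-written sketch below (in practice, paste the entry that --gen-suppressions=yes itself prints rather than writing one by hand; the name on the first line is arbitrary):

{
   nvidia_libOpenCL_clGetPlatformIDs_alloc
   Memcheck:Leak
   fun:malloc
   ...
   fun:clGetPlatformIDs
}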

kernel BUG at page_alloc.c

I know it can be anything, but what, in general, might the following kernel message indicate:
<2>kernel BUG at page_alloc.c:116!
This architecture does not implement dump_stack()
Kernel panic: Kernel Bug
In interrupt handler - not syncing
<0>Rebooting in 5 seconds..
This happens on a 2.4.20 uClinux-based system (ARM9, MMU-less CPU). It seems like something bad happened during interrupt handling: faulty RAM, the kernel failing to allocate memory, or something else?
I would be thankful for any hints.
You should probably check line 116 of page_alloc.c in your kernel sources to see what condition triggers this particular BUG message.
Though the fact that you're running on an MMU-less system leads me to suspect that a buggy user process has stomped on part of the kernel's memory.
This clearly looks like heap or stack corruption. Try putting printk calls in page_alloc.c, and print the address of the variable that is accessed just before the panicking line (116); this will give you some hint as to whether heap or stack corruption has happened.
If you find that it is stack corruption, look at which variable is declared before the corrupted variable, because that might be the culprit; this might help you debug.
If it is heap corruption, it is harder to debug: you need to find out whether some code allocates a block but writes more data than the allocated number of bytes.
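A hedged sketch of the kind of instrumentation meant here (the actual check at page_alloc.c:116 depends on your 2.4.20 tree, so treat the field names as illustrative):

/* Illustrative only: just before the BUG() that fires at page_alloc.c:116,
 * dump the page that fails the check so the corrupted values appear in the
 * console log ahead of the panic. Field names follow the 2.4 struct page. */
printk(KERN_ERR "page_alloc: bad page %p flags=%08lx count=%d\n",
       page, page->flags, atomic_read(&page->count));
BUG();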

Resources