kernel BUG at page_alloc.c - linux

I know it can be anything, but what, in general, might the following kernel message indicate:
<2>kernel BUG at page_alloc.c:116!
This architecture does not implement dump_stack()
Kernel panic: Kernel Bug
In interrupt handler - not syncing
<0>Rebooting in 5 seconds..
This happens on a 2.4.20 uClinux-based system (an MMU-less ARM9 CPU). It seems like something bad happened during interrupt handling: faulty RAM so the kernel could not allocate memory, or something else?
Would be thankful for any hints.

You should probably check line 116 of page_alloc.c in your kernel sources to see what condition triggers this particular BUG message.
Though the fact that you're running on an MMU-less system leads me to suspect that a buggy user process has stomped on part of the kernel's memory.

This clearly looks like heap or stack corruption. Try adding prints in page_alloc.c: print the address of the variable that is accessed just before the panic at line 116. That will give you a hint as to whether heap or stack corruption has happened.
If it is stack corruption, look at which variable is declared next to the corrupted variable, because that neighbouring variable might be the culprit; this can help you debug.
If it is heap corruption, it is harder to debug: you need to find out whether some code allocates a buffer but writes more bytes than were allocated.


Valgrind shows memory leak but no memory allocation took place

This is a rather simple question.
At my school we use a remote CentOS server to compile and test our programs. For some reason, valgrind always shows a 4096-byte leak even though no malloc was called. Does anyone here have any idea where this problem might stem from?
Your program makes a call to printf. The library might allocate memory for its own usage. More generally, depending on the OS/libc/..., various allocations might be done just to start a program.
Note also that in this case, you can see that there is one block still allocated at exit, and that this block is part of the suppressed count. That means valgrind's suppression file already ensures that this memory does not appear in the list of leaks to be examined.
In summary: no problem.
In any case, when you suspect you have a leak, you can look at the details of the leaks e.g. their allocation stack trace to see if these are triggered by your application.
In addition to #phd's answer, there are a few things you can do to see more clearly what is going on.
If you run Valgrind with -s or -v it will show details of the suppressions used.
You can use --trace-malloc=yes to see all calls to allocation functions (only do that for small applications). Similarly, you can run with --default-suppressions=no and then you will see the details of the memory (with --leak-check=full --show-reachable=yes in this case).
Finally, are you using an old CentOS / GNU libc? A few years ago Valgrind gained a mechanism to clean up things like I/O buffers, so you shouldn't get this sort of message with a recent Valgrind and a recent Linux + libc.

clGetPlatformIDs Memory Leak

I'm testing my code on Ubuntu 12.04 with NVIDIA hardware.
No actual OpenCL processing takes place; but my initialization code is still running. This code calls clGetPlatformIDs. However, Valgrind is reporting a memory leak:
==2718== 8 bytes in 1 blocks are definitely lost in loss record 4 of 74
==2718== at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2718== by 0x509ECB6: ??? (in /usr/lib/nvidia-current/libOpenCL.so.1.0.0)
==2718== by 0x50A04E1: ??? (in /usr/lib/nvidia-current/libOpenCL.so.1.0.0)
==2718== by 0x509FE9F: clGetPlatformIDs (in /usr/lib/nvidia-current/libOpenCL.so.1.0.0)
I was unaware this was even possible. Can this be fixed? Note that no special deinitialization is currently taking place--do I need to call something after this? The docs don't mention anything about having to deallocate anything.
Regarding: "Check this out: devgurus.amd.com/thread/136242. valgrind cannot deal with custom memory allocators by design, which OpenCL is likely using"
to quote from the link given: "The behaviour not to free pools at the exit could be called a bug of the library though."
If you want to create a pool of memory and allocate from that, go ahead; but you still should properly deallocate it. A memory pool as a whole is no less complex than a regular memory reference and deserves at least the same attention, if not more. Also, an 8-byte structure is highly unlikely to be a memory pool.
Tim Child would have a point about how you use clGetPlatformIDs if it were designed to return allocated memory. However, reading http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetPlatformIDs.html I am not sufficiently convinced this is the case.
The leak in question may or may not be serious, and may or may not accumulate over successive calls, but you might be left only with the option of reporting the bug to nvidia in hopes they fix it, or finding a different OpenCL implementation for development. Still, there might be reasons for an OpenCL library to create references to data which, from the viewpoint of valgrind, are not in use.
Sadly, this still leaves us with a memory leak caused by an external factor we cannot control, and it still leaves us with excess valgrind output.
Say you are sufficiently sure you are not responsible for this leak (say, we know for a fact that an nvidia engineer allocated a random value in OpenCL.so which he didn't deallocate just to spite you). Valgrind has a flag --gen-suppressions=yes, which generates suppression entries for particular warnings; you can feed these back to valgrind using --suppressions=$filename. Read the valgrind manual for more details about how it works.
Be very wary of using suppressions, though. Obviously, suppressing errors does not fix them, and liberal use of the mechanism will lead to situations where you suppress errors made by your own code rather than by nvidia or valgrind. Do not suppress warnings whose origin you are not absolutely sure of, and regularly re-examine your suppressions.
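As a rough sketch, a suppression generated with --gen-suppressions=yes for the stack trace reported above might look approximately like this (the entry name is arbitrary, and the exact frame list depends on what valgrind prints on your system):

```
{
   nvidia-libopencl-clGetPlatformIDs
   Memcheck:Leak
   fun:malloc
   obj:/usr/lib/nvidia-current/libOpenCL.so.1.0.0
   obj:/usr/lib/nvidia-current/libOpenCL.so.1.0.0
   fun:clGetPlatformIDs
}
```

Save it to a file and pass --suppressions=that-file on later runs to hide this specific report.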

how to force linux to allocate memory in high (64bit) address space

I'm trying to track down a segfault problem in an old C code (not written by me). The segfaults occur only, if the addresses of certain variables in that code exceed the 32bit integer limit. (So I've got a pretty good idea what's going wrong, but I don't know where.)
So, my question is: is there any way to force linux to allocate memory for a process in the high address space? At the moment it's pretty much down to chance whether the segfault happens, which makes debugging a bit difficult.
I'm running Ubuntu 10.04, Kernel 2.6.31-23-generic on a Dell inspiron 1525 laptop with 2GB ram, if that's any help.
Thanks in advance,
Martin.
You can allocate an anonymous block of memory with the mmap() system call, passing the address where you want it mapped as the first argument.
I would turn on the -Wpointer-to-int-cast and -Wint-to-pointer-cast warning options and check out any warnings they turn up (I believe these are included in -Wall on 64-bit targets). The cause is very likely something related to this, and simply auditing the warnings the compiler turns up may be a better approach than using a debugger.

Finding allocation site for double-free errors (with valgrind)

Given a double-free error (reported by valgrind), is there a way to find out where the memory was allocated? Valgrind only tells me the location of the deallocation site (i.e. the call to free()), but I would like to know where the memory was allocated.
To have Valgrind keep track of allocation stack traces, use the options:
--track-origins=yes --keep-stacktraces=alloc-and-free
Valgrind will then report the allocation stack under a Block was alloc'd at section, just after the Address ... inside a block of size x free'd alert.
In case your application is large, --error-limit=no --num-callers=40 options may be useful too.
The first check I would do is verifying that the error is indeed due to a double-free error. Sometimes, running a program (including with valgrind) can show a double-free error while in reality, it's a memory corruption problem (for example a memory overflow).
The best way to check is to apply the advice detailed in the answers to: How to track down a double free or corruption error in C++ with gdb.
First of all, you can try to compile your program with the flags -fsanitize=address -g. This will instrument the program's memory at runtime to keep track of all allocations, detect overflows, etc.
In any case, if the problem is indeed a double-free, the error message should contain all the necessary information for you to debug the problem.

Debugging SIGBUS on x86 Linux

What can cause SIGBUS (bus error) on a generic x86 userland application in Linux? All of the discussion I've been able to find online is regarding memory alignment errors, which from what I understand doesn't really apply to x86.
(My code is running on a Geode, in case there are any relevant processor-specific quirks there.)
SIGBUS can happen in Linux for quite a few reasons other than memory alignment faults - for example, if you attempt to access an mmap region beyond the end of the mapped file.
Are you using anything like mmap, shared memory regions, or similar?
You can get a SIGBUS from an unaligned access if you turn on the unaligned access trap, but normally that's off on an x86. You can also get it from accessing a memory mapped device if there's an error of some kind.
Your best bet is using a debugger to identify the faulting instruction (SIGBUS is synchronous), and trying to see what it was trying to do.
SIGBUS on x86 (including x86_64) Linux is a rare beast. It may appear from attempt to access past the end of mmaped file, or some other situations described by POSIX.
But from hardware faults it's not easy to get SIGBUS. Namely, unaligned access from any instruction — be it SIMD or not — usually results in SIGSEGV. Stack overflows result in SIGSEGV. Even accesses to addresses not in canonical form result in SIGSEGV. All this is due to #GP being raised, which almost always maps to SIGSEGV.
Now, here are some ways to get SIGBUS due to a CPU exception:
Enable AC bit in EFLAGS, then do unaligned access by any memory read or write instruction. See this discussion for details.
Trigger a canonicality violation via a stack-pointer register (rsp or rbp), generating #SS. Here's an example for GCC (compile with gcc test.c -o test -masm=intel):
int main()
{
    __asm__("mov rbp, 0x400000000000000\n"
            "mov rax, [rbp]\n"
            "ud2\n");
}
Oh yes, there's one more weird way to get SIGBUS: if the kernel fails to page in a code page due to memory pressure (the OOM killer must be disabled) or a failed IO request, you get SIGBUS.
This was briefly mentioned above as a "failed IO request", but I'll expand upon it a bit.
A frequent case is when you lazily grow a file using ftruncate, map it into memory, start writing data and then run out of space in your filesystem. Physical space for the mapped file is allocated on page faults; if there is none left, the process receives a SIGBUS.
If you need your application to correctly recover from this error, it makes sense to explicitly reserve space prior to mmap using fallocate. Handling ENOSPC in errno after fallocate call is much simpler than dealing with signals, especially in a multi-threaded application.
You may see SIGBUS when you're running the binary off NFS (network file system) and the file is changed. See https://rachelbythebay.com/w/2018/03/15/core/.
If you request a mapping backed by hugepages with mmap and the MAP_HUGETLB flag, you can get SIGBUS if the kernel runs out of allocated huge pages and thus cannot handle a page fault.
In this case, you'll need to raise the number of allocated huge pages via
/sys/kernel/mm/hugepages/hugepages-<size>/nr_hugepages or
/sys/devices/system/node/nodeX/hugepages/hugepages-<size>/nr_hugepages on NUMA systems.
A common cause of a bus error on x86 Linux is attempting to dereference something that is not really a pointer, or is a wild pointer. For example, failing to initialize a pointer, or assigning an arbitrary integer to a pointer and then attempting to dereference it will normally produce either a segmentation fault or a bus error.
Alignment does apply to x86. Even though memory on an x86 is byte-addressable (so you can have a char pointer to any address), if you have, for example, a pointer to a 4-byte integer, that pointer is expected to be aligned.
You should run your program in gdb and determine which pointer access is generating the bus error to diagnose the issue.
It's a bit off the beaten path, but you can get SIGBUS from an unaligned SSE2 (__m128) load.
