Memory leak in constant memory of CUDA-capable GPUs? - memory-leaks

In CUDA C progamming guide it is noted:
global function parameters are passed to the device:
via constant memory and are limited to 4 KB on devices of compute capability 2.x and higher.
Considering that the constant memory has a lifetime of an application, in a case that a kernel is called thousands of times within an application, I'd like to know if the function parameters are freed automatically after each time the kernel is completed?

Constant memory has a lifetime of an application, but it can be changed (asynchronously) from host code. Since there is a cache involved, there may be nuances around cache invalidation, but that is not germane to your question, I don't think.
Yes, the constant memory used for kernel call parameters is freed at the end of the kernel call, and can be reused for subsequent kernel calls,

Related

Does kmalloc() reserve Copy-On-Write (COW) mappings?

My understanding is that kmalloc() allocates from anonymous memory. Does this actually reserve the physical memory immediately or is that going to happen only when a write page fault happens?
kmalloc() does not usually allocate memory pages(1), it's more complicated than that. kmalloc() is used to request memory for small objects (smaller than the page size), and manages those requests using already existing memory pages, similarly to what libc's malloc() does to manage the heap in userspace programs.
There are different allocators that can be used in the Linux kernel: SLAB, SLOB and SLUB. Those use different approaches for the same goal: managing allocation and deallocation of kernel memory for small objects at runtime. A call to kmalloc() may use any of the three depending on which one was configured into the kernel.
A call to kmalloc() does not usually reserve memory at all, but rather just manages already reserved memory. Therefore memory returned by kmalloc() does not require a subsequent page fault like a page requested through mmap() normally would.
Copy on Write (CoW) is a different concept. Although it's still triggered through page faults, CoW is a mechanism used by the kernel to save space sharing existing mappings until they are modified. It is not what happens when a fault is triggered on a newly allocated memory page. A great example of CoW is what happens when the fork syscall is invoked: the process memory of the child process is not immediately duplicated, instead existing pages are marked as CoW, and the duplication only happens at the first attempt to write.
I believe this should clear any doubt you have. The short answer is that kmalloc() does not usually "reserve the physical memory immediately" because it merely allocates an object from already reserved memory.
(1) Unless the request is very large, in which case it falls back to actually allocating new memory through alloc_pages().

Replace/Modify kzalloc_node in kernel code so that the function allocates more than 4MB of memory

Architecture : x86-64
Linux Version : 4.11.3
This is in reference to the below Stack Overflow post :-
Allocating more than 4 MB of pinned contiguous memory in the Linux Kernel
I see that the question was asked for a PCI driver, which requested for more than 4 MB of contiguous memory in the kernel. However, my intention was to use another function in place of kzalloc_node function (or modify it!). I want to modify the kernel code (if feasible) so that somehow I can allocate more than 4 MB of contiguous memory, which kzalloc_node does not allow me to do. Of course, it will be difficult to modify MAX_ORDER as it may give rise to compiler errors. Also here the kzalloc_node function is computing the node corresponding to the CPU - so the allocation of memory happens at the node level.
Background
Basically I am trying to increase the size of a sampling buffer so as to reduce the overhead that it incurs when it gets full and interrupts need to be raised to read in the data from the buffer. So I am trying to reduce the number of interrupts and thereby, need to increase the size of the buffer. The kernel code is using kzalloc_node to allocate memory, and hence it cannot get more than 4 MB contiguous memory. I want to know what mechanisms I have to either replace this function/allocate more memory ?
Can I replace this function ? Since I am trying to modify the kernel code, do the same boot-time allocation methods apply here ? I read that this mechanism applies for device drivers, can I also use it ?

libc memory management

How does libc communicate with the OS (e.g., a Linux kernel) to manage memory? Specifically, how does it allocate memory, and how does it release memory? Also, in what cases can it fail to allocate and deallocate, respectively?
That is very general question, but I want to speak to the failure to allocate. It's important to realize that memory is actually allocated by kernel upon first access. What you are doing when calling malloc/calloc/realloc is reserving some addresses inside the virtual address space of a process (via syscalls brk, mmap, etc. libc does that).
When I get malloc or similar to fail (or when libc get brk or mmap to fail), it's usually because I exhausted the virtual address space of a process. This happens when there is no continuous block of free address, an no room to expand an existing one. You can either exhaust all space available or hit a limit RLIMIT_AS. It's pretty common especially on 32bit systems when using multiple threads, because people sometimes forget that each thread needs it's own stack. Stacks usually consume several megabytes, which means you can create only few hundreds threads before you have no more free address space. Maybe an even more common reason for exhausted address space are memory leaks. Libc of course tries to reuse space on the heap (space obtained by a brk syscall) and tries to munmmap unneeded mappings. However, it can't reuse something that is not "deallocated".
The shortage of physical memory is not detectable from within a process (or libc which is part of the process) by failure to allocate. Yeah, you can hit "overcommitting limit", but that doesn't mean the physical memory is all taken. When free physical memory is low, kernel invokes special task called OOM killer (Out Of Memory Killer) which terminates some processes in order to free memory.
Regarding failure to deallocate, my guess is it doesn't happen unless you do something silly. I can imagine setting program break (end of heap) below it's original position (by a brk syscall). That is, of course, recipe for a disaster. Hopefully libc won't do that and it doesn't make much sense either. But it can be seen as failed deallocation. munmap can also fail if you supply some silly argument, but I can't think of regular reason for it to fail. That doesn't mean it doesn't exists. We would have to dig deep within source code of glibc/kernel to find out.
1) how does it allocate memory
libc provides malloc() to C programs.
Normally, malloc allocates memory from the heap, and adjusts the
size of the heap as required, using sbrk(2). When allocating blocks of
memory larger than MMAP_THRESHOLD bytes, the glibc malloc()
implementation allocates the memory as a private anonymous mapping
using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable
using mallopt(3). Allocations performed using mmap(2) are unaffected
by the RLIMIT_DATA resource limit (see getrlimit(2)).
And this is about sbrk.
sbrk - change data segment size
2) in what cases can it fail to allocate
Also from malloc
By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available.
And from proc
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
Mostly it uses the sbrk system call to adjust the size of the data segment, thereby reserving more memory for it to parcel out. Memory allocated in that way is generally not released back to the operating system because it is only possible to do it when the blocks available to be released are at the end of the data segment.
Larger blocks are sometime done by using mmap to allocate memory, and that memory can be released again with an munmap call.
How does libc communicate with the OS (e.g., a Linux kernel) to manage memory?
Through system calls - this is a low-level API that the kernel provides.
Specifically, how does it allocate memory, and how does it release memory?
Unix-like systems provide the "sbrk" syscall.
Also, in what cases can it fail to allocate and deallocate, respectively?
Allocation can fail, for example, when there's no enough available memory. Deallocation shall not fail.

CUDA Process Memory Usage [duplicate]

When I run my CUDA program which allocates only a small amount of global memory (below 20 M), I got a "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation) I try to understand this problem, and realize I have a couple of questions related to CUDA memory management.
Is there a virtual memory concept in CUDA?
If only one kernel is allowed to run on CUDA simultaneously, after its termination, will all of the memory it used or allocated released? If not, when these memory got free released?
If more than one kernel are allowed to run on CUDA, how can they make sure the memory they use do not overlap?
Can anyone help me answer these questions? Thanks
Edit 1: operating system: x86_64 GNU/Linux
CUDA version: 4.0
Device: Geforce 200, It is one of the GPUS attached to the machine, and I don't think it is a display device.
Edit 2: The following is what I got after doing some research. Feel free to correct me.
CUDA will create one context for each host thread. This context will keep information such as what portion of memory (pre allocated memory or dynamically allocated memory) has been reserved for this application so that other application can not write to it. When this application terminates (not kernel) , this portion of memory will be released.
CUDA memory is maintained by a link list. When an application needs to allocate memory, it will go through this link list to see if there is continuous memory chunk available for allocation. If it fails to find such a chunk, a "out of memory" error will report to the users even though the total available memory size is greater than the requested memory. And that is the problem related to memory fragmentation.
cuMemGetInfo will tell you how much memory is free, but not necessarily how much memory you can allocate in a maximum allocation due to memory fragmentation.
On Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory and WDDM will manage swapping data back to main memory.
New questions:
1. If the memory reserved in the context will be fully released after the application has been terminated, memory fragmentation should not exist. There must be some kind of data left in the memory.
2. Is there any way to restructure the GPU memory ?
The device memory available to your code at runtime is basically calculated as
Free memory = total memory
- display driver reservations
- CUDA driver reservations
- CUDA context static allocations (local memory, constant memory, device code)
- CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
- CUDA context user allocations (global memory, textures)
if you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory in the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocess on the device. This can run into hundreds of Mb of memory if a local memory heavy kernel is loaded on a device with a lot of multiprocessors.
The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run you problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call that will then give you the amount of memory your context is using. That might let you get a handle of where the memory is going. It is very unlikely that fragmentation is the problem if you are getting failure on the first cudaMalloc call.
GPU off-chip memory is separated in global, local and constant memory. This three memory types are a virtual memory concept. Global memory is free for all threads, local is just for one thread only (mostly used for register spilling) and constant memory is cached global memory (writable only from host code). Have a look at 5.3.2 from the CUDA C Programming Guide.
EDIT: removed
Memory allocated via cudaMalloc does never overlap. For the memory a kernel allocates during runtime should be enough memory available. If you are out of memory and try to start a kernel (only a guess from me) you should get the "unknown error" error message. The driver than was unable to start and/or executes the kernel.

How is memory allocated on heap without a system call?

I was wondering that if the space required on heap is not large enough
such that there is no need for a brk/sbrk system all (to shift the break pointer (brk) of data segment), how does a library function (such as malloc) allocates space on heap.
I am not asking about the data-structures and algorithms for heap management. I am just asking how does malloc get the address of the first location of the heap if it doesn't invoke a system call. I am asking this because I have heard that it is not always necessary to invoke a system call (brk/sbrk) as these are only required to expand the space.Please correct me if I am wrong.
The basic idea is that when your program starts, the heap is very small, but not necessarily zero. If you only allocate (malloc) a small amount of memory, the library is able to handle it within the small amount of space it has when it is loaded. However, when malloc runs out of that space, it needs to make a system call to get more memory.
That system call is often sbrk(), which moves the top of the heap's memory region up by a certain amount. Usually, the malloc library routine increases the heap by larger than what is needed for the current allocation, with the hope that future allocations can be performed w/o making a system call.
Other implementations of malloc use mmap() instead -- this allows the program to create a sparse virtual memory mapping. However, mmap() based malloc implementations do the same thing as the sbrk()-based ones: each system call reserves more memory than what is necessarily needed for the current call.
One way to look at this is to trace a program that uses malloc: you'll see that for N calls to malloc, you will see M system calls (where M is much smaller than N).
The short answer is that it uses sbrk() to allocate a big hunk, which at that point belongs to your app process. It can then further parcel out subsections of that as individual malloc calls without needing to ask the system for anything, until it exhausts that space and needs to resort sbrk() again.
You said you didn't want the details on the data structures, but suffice it to say that the implementation of malloc (i.e. your own process, not the OS kernel) is keeping track of which space in the region it got from the system is spoken for and which is still available to dole out as individual mallocs. It's like buying a big tract of land, then subdividing it into lots for individual houses.
Use sbrk() or mmap() — http://linux.die.net/man/2/sbrk, http://linux.die.net/man/2/mmap

Resources