Device memory for CUDA kernel code: Is it explicitly manageable? - memory-leaks

Context:
CUDA 4.0, Linux 64bit, NVIDIA UNIX x86_64 Kernel Module 270.41.19, on a GeForce GTX 480.
I am trying to find a (device) memory leak in my program. I use the runtime API and cudaMemGetInfo(&free, &total) to measure device memory usage. I notice a significant loss (in this case 31 MB) after kernel execution. The kernel code itself does not allocate any device memory, so I guess it is the kernel code itself that remains in device memory, even though I would not have thought the kernel was that big. (Is there a way to determine the size of a kernel?)
When is the kernel code loaded into device memory? I guess at execution of the host code line:
kernel<<<geom>>>(params);
Right?
And does the code remain in device memory after the call? If so, can I explicitly unload the code?
What concerns me is device memory fragmentation. Think of a long sequence of alternating device memory allocations and kernel executions (different kernels). Then after a while device memory gets quite scarce. Even if you free some memory, the kernel code remains, leaving only the space between the kernels free for new allocations. This would result in heavy memory fragmentation after a while. Is this the way CUDA was designed?

The memory allocation you are observing is used by the CUDA context. It doesn't only hold kernel code; it also holds any other static-scope device symbols, textures, per-thread scratch space for local memory, printf and heap, constant memory, as well as GPU memory required by the driver and CUDA runtime itself. Most of this memory is only ever allocated once, when a binary module is loaded or PTX code is JIT compiled by the driver. It is probably best to think of it as a fixed overhead rather than a leak. There is a 2 million instruction limit in PTX code, and current hardware uses 32-bit words for instructions, so the memory footprint of even the largest permissible kernel code is small compared to the other global memory overheads it requires.
In recent versions of CUDA there is a runtime API call cudaDeviceSetLimit which permits some control over the amount of scratch space a given context can consume. Be aware that it is possible to set the limits to values which are lower than the device code requires, in which case runtime execution failures can result.
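For illustration only (the limit values below are arbitrary examples, not recommendations), the calls look like this with the runtime API; cudaMemGetInfo is included just to see the effect:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_mem, total_mem;

    /* Example values only: shrink the device printf FIFO, the device malloc
       heap and the per-thread stack. Setting these lower than the device
       code actually needs can cause runtime execution failures. */
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 * 1024);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024);
    cudaDeviceSetLimit(cudaLimitStackSize, 2 * 1024);

    cudaMemGetInfo(&free_mem, &total_mem);
    printf("free: %zu MB, total: %zu MB\n", free_mem >> 20, total_mem >> 20);
    return 0;
}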

Related

Replace/Modify kzalloc_node in kernel code so that the function allocates more than 4MB of memory

Architecture : x86-64
Linux Version : 4.11.3
This is in reference to the following Stack Overflow post:
Allocating more than 4 MB of pinned contiguous memory in the Linux Kernel
I see that the question was asked for a PCI driver, which requested more than 4 MB of contiguous memory in the kernel. However, my intention is to use another function in place of kzalloc_node (or to modify it!). I want to modify the kernel code (if feasible) so that I can somehow allocate more than 4 MB of contiguous memory, which kzalloc_node does not allow me to do. Of course, it will be difficult to modify MAX_ORDER, as that may give rise to compiler errors. Also, here the kzalloc_node function computes the node corresponding to the CPU, so the allocation of memory happens at the node level.
Background
Basically I am trying to increase the size of a sampling buffer so as to reduce the overhead it incurs when the buffer gets full and interrupts need to be raised to read the data out of it. So I am trying to reduce the number of interrupts and thereby need to increase the size of the buffer. The kernel code uses kzalloc_node to allocate memory, and hence it cannot get more than 4 MB of contiguous memory. I want to know what mechanisms I have to either replace this function or allocate more memory.
Can I replace this function? Since I am trying to modify the kernel code, do the same boot-time allocation methods apply here? I read that this mechanism applies to device drivers; can I also use it?
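A minimal sketch of the constraint being described (the function and variable names here are hypothetical, not from the kernel code in question): alloc_pages_node hits the same MAX_ORDER ceiling as kzalloc_node, which is why anything larger than about 4 MB has to come from a different mechanism, such as the DMA API or a boot-time reservation.

#include <linux/mm.h>
#include <linux/gfp.h>

/* Hypothetical helper: 'node' would come from the sampling driver, as in
   the original kzalloc_node() call. With 4 KiB pages, MAX_ORDER caps a
   single buddy allocation at about 4 MB, so requests above that fail here
   just as they do for kzalloc_node(). */
static void *alloc_node_buffer(int node, size_t size)
{
    struct page *pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
                                          get_order(size));

    return pages ? page_address(pages) : NULL;
}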

mmap() device memory into user space

Say we do an mmap() system call and map some PCIe device memory (like GPU memory) into user space; then the application can access that memory region on the device without any OS overhead. Data can be copied from the file system buffer directly to device memory without any intermediate copy.
The above statement must be wrong... Can anyone tell me where the flaw is? Thanks!
For a normal device, what you have said is correct. If the GPU memory behaves differently for reads/writes, they might do this. We should look at some documentation of cudaMemcpy().
From Nvidia's basics of CUDA page 22,
direction specifies locations (host or device) of src and dst
Blocks CPU thread: returns after the copy is complete.
Doesn't start copying until previous CUDA calls complete
It seems pretty clear that cudaMemcpy() is synchronized with respect to prior GPU register writes, which may have caused the mmap()ed memory to be updated. Since the GPU is a pipeline, previously issued commands may not have completed when cudaMemcpy() is issued from the CPU.
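A small sketch of the semantics quoted above (buffer names and sizes are made up): a plain cudaMemcpy() does not start until previously issued CUDA work has completed, and it blocks the CPU thread until the copy is done.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    float host_buf[256];
    float *dev_buf;

    cudaMalloc((void **)&dev_buf, sizeof(host_buf));
    cudaMemset(dev_buf, 0, sizeof(host_buf));

    /* cudaMemcpy() waits for previously issued CUDA calls to complete and
       returns only after the copy has finished, so host_buf is safe to
       read as soon as the call returns. */
    cudaMemcpy(host_buf, dev_buf, sizeof(host_buf), cudaMemcpyDeviceToHost);

    printf("first element: %f\n", host_buf[0]);
    cudaFree(dev_buf);
    return 0;
}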

Can I allocate one large and guaranteed contiguous range of physical memory (100 MB)?

Can I allocate one large and guaranteed contiguous range of physical memory (100 MB consecutive, without breaks) on Linux, and if I can, then how can I do this?
I need to map this contiguous block of memory through a PCI Express BAR from one CPU (CPU1) to another CPU (CPU2) located behind a PCIe Non-Transparent Bridge.
You don't allocate physical memory in user applications (physical memory only makes sense inside the kernel).
I don't understand whether you are coding a kernel module or some Linux application (e.g. a numerical finite-element code).
Inside applications, you can allocate virtual memory with e.g. mmap(2) (and then you can allocate a big contiguous segment of address space).
I guess that some GPU cards give access to a large amount of GPU memory through mmap, so I believe it is possible to do what you want.
You might be interested in the numa(7) man page. The numa(3) library should probably give you what you want. Did you also consider Open MPI? See also msync(2) and mlock(2).
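To illustrate the user-space side of that answer (sizes are arbitrary): mmap(2) plus mlock(2) gives a large, pinned, virtually contiguous buffer, but with no guarantee of physical contiguity.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 100UL * 1024 * 1024;

    /* Virtually contiguous, demand-paged anonymous memory. */
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* mlock() keeps the pages resident, but they are still not guaranteed
       to be physically contiguous. */
    if (mlock(buf, size) != 0)
        perror("mlock");

    memset(buf, 0, size);
    munmap(buf, size);
    return 0;
}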
From user space there is no guarantee; it depends on your luck.
If you compile your driver into the kernel, you can use mmap and allocate the required amount of memory.
If it is needed as storage or for some other purpose not specifically tied to a driver, then you should set the memmap parameter on the boot command line, e.g.:
memmap=200M$1700M
This reserves 200 MB of memory starting at physical address 1700 MB, so the kernel will not use it.
Later it can be used as a filesystem as well ;)
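As a rough sketch of how such a reserved region could later be used from user space (this assumes access to /dev/mem is allowed, which CONFIG_STRICT_DEVMEM may forbid):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* The region reserved by memmap=200M$1700M starts at physical 1700 MB. */
    off_t phys_base = 1700UL * 1024 * 1024;
    size_t size = 200UL * 1024 * 1024;

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, phys_base);
    if (buf == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* buf now maps the physically contiguous 200 MB window. */
    munmap(buf, size);
    close(fd);
    return 0;
}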

Problems allocating memory with Contiguous Memory Allocator (CMA) Linux device driver development

I am trying to test Contiguous Memory Allocator for DMA mapping framework. I have compiled kernel 3.5.7 with CMA support, I know that it is experimental but it should work.
My goal is to allocate several 32MB physically contiguous memory chunks in kernel module for device without scatter/gather capability.
I am testing my system with test patch from Barry Song: http://thread.gmane.org/gmane.linux.kernel/1263136
But when I try to allocate memory with echo 1024 > /dev/cma_test, I get bash: echo: write error: No space left on device, and in dmesg: misc cma_test: no mem in CMA area.
What could be the problem? What am I missing? The system is freshly rebooted and there should be at least 350 MB of free contiguous memory, because the bigphysarea patch on kernel 3.2 was able to allocate that amount on a similar system.
Thank you for your time!
In the end I decided to use kernel 3.5 and the bigphysarea patch (from 3.2). It is easy and works like a charm.
CMA is a great option as well, but it is a bit harder to use and debug (CMA needs an actual device). I used up all my skills trying to find out what the problem was; printk inside the kernel code was the only way to debug this one.
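For reference, a minimal sketch of what the CMA-backed allocation itself looks like once a real struct device is available ('dev' is an assumption here, not taken from the test patch):

#include <linux/gfp.h>
#include <linux/dma-mapping.h>

/* Hypothetical helper: 'dev' is the DMA-capable device the buffer is for. */
static void *alloc_cma_chunk(struct device *dev, dma_addr_t *dma_handle)
{
    /* With CMA configured and a large enough reserved area, this returns
       32 MB of physically contiguous memory suitable for DMA. */
    return dma_alloc_coherent(dev, 32 * 1024 * 1024, dma_handle, GFP_KERNEL);
}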

CUDA Process Memory Usage [duplicate]

When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) While trying to understand this problem, I realized I have a couple of questions related to CUDA memory management.
Is there a virtual memory concept in CUDA?
If only one kernel is allowed to run on CUDA at a time, after its termination, will all of the memory it used or allocated be released? If not, when does this memory get freed?
If more than one kernel is allowed to run on CUDA, how can they make sure the memory they use does not overlap?
Can anyone help me answer these questions? Thanks
Edit 1: operating system: x86_64 GNU/Linux
CUDA version: 4.0
Device: GeForce 200. It is one of the GPUs attached to the machine, and I don't think it is a display device.
Edit 2: The following is what I got after doing some research. Feel free to correct me.
CUDA will create one context for each host thread. This context keeps information such as what portion of memory (pre-allocated memory or dynamically allocated memory) has been reserved for this application, so that other applications cannot write to it. When this application (not a kernel) terminates, this portion of memory will be released.
CUDA memory is maintained by a linked list. When an application needs to allocate memory, it goes through this linked list to see if there is a contiguous memory chunk available for allocation. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total available memory is greater than the requested amount. That is the problem related to memory fragmentation.
cuMemGetInfo tells you how much memory is free, but not necessarily how much you can obtain in a single maximal allocation, due to memory fragmentation.
On Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory and WDDM will manage swapping data back to main memory.
New questions:
1. If the memory reserved in the context is fully released after the application has terminated, memory fragmentation should not exist. There must be some kind of data left in memory.
2. Is there any way to restructure the GPU memory?
The device memory available to your code at runtime is basically calculated as
Free memory = total memory
- display driver reservations
- CUDA driver reservations
- CUDA context static allocations (local memory, constant memory, device code)
- CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
- CUDA context user allocations (global memory, textures)
If you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocessor on the device. This can run into hundreds of MB of memory if a local-memory-heavy kernel is loaded on a device with a lot of multiprocessors.
The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call; that will give you the amount of memory your context is using. That might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting a failure on the first cudaMalloc call.
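A minimal version of the probe program described above might look like this (cudaFree(0) is only there to force context creation before querying):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_mem, total_mem;

    /* Force context creation with no user allocations, then report how
       much memory the bare context leaves available. */
    cudaFree(0);
    cudaMemGetInfo(&free_mem, &total_mem);
    printf("total: %zu MB, free with empty context: %zu MB\n",
           total_mem >> 20, free_mem >> 20);
    return 0;
}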
GPU off-chip memory is separated into global, local and constant memory. These three memory types are a virtual memory concept. Global memory is free for all threads, local memory is for one thread only (mostly used for register spilling), and constant memory is cached global memory (writable only from host code). Have a look at section 5.3.2 of the CUDA C Programming Guide.
Memory allocated via cudaMalloc never overlaps. For the memory a kernel allocates at runtime, enough memory should be available. If you are out of memory and try to launch a kernel (only a guess on my part), you should get the "unknown error" error message, because the driver was unable to start and/or execute the kernel.
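As a small illustration of checking for that out-of-memory condition (the request size is deliberately oversized to provoke a failure; this is a sketch, not the poster's code):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *dev_ptr = NULL;

    /* Deliberately huge request so the allocation fails on most devices. */
    cudaError_t err = cudaMalloc(&dev_ptr, 64ULL * 1024 * 1024 * 1024);

    if (err != cudaSuccess)
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    else
        cudaFree(dev_ptr);
    return 0;
}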
