Problems allocating memory with the Contiguous Memory Allocator (CMA)

I am trying to test the Contiguous Memory Allocator for the DMA mapping framework. I have compiled kernel 3.5.7 with CMA support; I know it is experimental, but it should work.
My goal is to allocate several 32MB physically contiguous memory chunks in kernel module for device without scatter/gather capability.
I am testing my system with test patch from Barry Song: http://thread.gmane.org/gmane.linux.kernel/1263136
But when I try to allocate memory with echo 1024 > /dev/cma_test, I get: bash: echo: write error: No space left on device. And in dmesg: misc cma_test: no mem in CMA area.
What could be the problem? What am I missing? The system is freshly rebooted, and there should be at least 350 MB of free contiguous memory, because the bigphysarea patch on kernel 3.2 was able to allocate that amount on a similar system.
Thank you for your time!

In the end I decided to use kernel 3.5 with the bigphysarea patch (ported from 3.2). It is easy and works like a charm.
CMA is a great option as well, but it is a bit harder to use and debug (CMA needs an actual device). I used up all my skills trying to find the problem; printk inside the kernel code was the only way to debug this one.
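For anyone hitting the same "no mem in CMA area" error: one common cause (an assumption here, since the original problem was never pinned down) is that the default CMA region is simply too small for several 32 MB allocations. On kernels of that era the region size set at build time by CONFIG_CMA_SIZE_MBYTES could be overridden on the kernel command line:

```shell
# Kernel boot parameter: reserve a 384 MB CMA region instead of the
# (typically much smaller) compile-time default. The region must fit
# in memory the kernel can migrate, so very large values may fail.
cma=384M
```

The reserved size can then be confirmed at boot in dmesg, which logs the CMA region when it is set up.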

Related

At boot, how does the Linux kernel allocate memory for its own memory allocator?

I'm building a small x86-64 kernel that I write completely from scratch. I'm starting to write a very simple memory allocator where I simply iterate on page structs to find a free page and break the loop when I find one. When I began writing the allocator, I stumbled upon an issue.
In my kernel, I managed to get a memory map of RAM using a UEFI call (GetMemoryMap()). By iterating on the memory map, I managed to find that I would have around 1021MB of usable memory (out of 1024MB). I test my kernel on QEMU.
I read here and there that the Linux kernel holds a page struct for every page in the system. I'm guessing that the memory allocator of the Linux kernel uses the page structs to determine what page is free and which isn't. By using a proper binary structure and an efficient algorithm, it attempts to find a free page as fast as possible to use for itself or to provide to a user mode process.
If my assumption is correct, the issue arises here. If the Linux kernel's memory allocator relies on the page struct to work, how does it allocate memory for the page structs themselves?
I thought of a simple algorithm where I simply start at the physical address of the first usable memory region. If this memory region has enough room for all the page structs (representing the whole memory), I stop there. Otherwise, I use the second memory region and so on. This seems quite simple, but I wanted to know how the Linux kernel handles that issue.
Even if my above assumption is wrong, the memory allocator probably requires some memory to work. At boot, how does the Linux kernel allocate memory for its own memory allocator?
You can look up the memblock allocator, which manages physical memory during the kernel boot phase, before the regular page allocator is up.

Mmap DMA Coherent Memory to User Space

I am trying to map DMA coherent memory, which I allocated in my kernel driver, to user space. In user space I call mmap(), and in the kernel driver I use dma_alloc_coherent() followed by remap_pfn_range() to remap the pages.
The purpose of mapping the DMA memory to user space is to minimize ioctl access to the kernel. The host must perform quite a high number of DMA coherent memory accesses, and I want to access the memory directly in user space instead of wasting time on countless ioctl() calls.
mmap() returns EPERM (1) - Operation not permitted.
I found this post: mmap: Operation not permitted
Answer:
It sounds like the kernel has been compiled with CONFIG_STRICT_DEVMEM
enabled. This is a security feature to prevent user space access to
(possibly sensitive) physical memory above 1MB (IIRC). You might be
able to disable this with sysctl dev.mem.restricted.
That is the only useful info I've found. However, I see 2 issues:
1) For test purposes I allocated only 4 KB. According to the statement above, only physical memory above 1 MB should be a problem, yet I still can't mmap. (For the final driver I will need much more DMA memory anyway, but recompiling the kernel can't be the solution to my problem.) Which leads me to 2)
2) Furthermore, re-compiling the kernel is not an option as the driver should work without tweaking the kernel in a specific way.
Any ideas on this one? I appreciate the help.
I am using Ubuntu 16.04.1, Kernel: 4.10.0-40-generic
EDIT: SOLVED
I made a copy-paste mistake which resulted in ret = -1: the .mmap function in the kernel driver, which calls remap_pfn_range, returned -1 instead of 0. This caused mmap() to fail in user space.

Can I allocate one large, guaranteed contiguous range of physical memory (100 MB)?

Can I allocate one large, guaranteed contiguous range of physical memory (100 MB consecutive, without breaks) on Linux, and if so, how?
I need to map this contiguous block of memory through a PCI Express BAR from one CPU1 to another CPU2 located behind a PCIe Non-Transparent Bridge.
You don't allocate physical memory in user applications (physical memory only makes sense inside the kernel).
I don't understand whether you are coding a kernel module or a Linux application (e.g. a numerical finite-element code).
Inside applications, you can allocate virtual memory with e.g. mmap(2) (and then you can allocate a big contiguous segment of address space)
I guess that some GPU cards give access to a large amount of GPU memory thru mmap so I believe it is possible to do what you want.
You might be interested in the numa(7) man page. Probably the numa(3) library gives you what you want. Did you also consider Open MPI? See also msync(2) and mlock(2).
From user space there is no guarantee -- it depends on your luck.
If you compile your driver into the kernel, you can use mmap and allocate the required amount of memory.
If the memory is needed as storage, or for some other use not specifically tied to a driver, you should set the memmap parameter on the boot command line,
e.g. memmap=200M$1700M
This reserves 200 MB of memory starting at the 1700 MB physical address.
Later it can even be used as a filesystem ;)
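One practical gotcha worth noting (a commonly reported pitfall, not something from the answer above): when the memmap=...$... form goes into GRUB 2's config file rather than being typed at a boot prompt, the $ must be escaped, or GRUB expands it as a variable and the kernel sees a mangled parameter:

```shell
# /etc/default/grub -- escape the $ so GRUB does not expand it
GRUB_CMDLINE_LINUX_DEFAULT="memmap=200M\$1700M"
# then regenerate the config, e.g. with: sudo update-grub
```

After rebooting, the reserved range should appear as "reserved" in /proc/iomem.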

Device memory for CUDA kernel code: Is it explicitly manageable?

Context:
CUDA 4.0, Linux 64bit, NVIDIA UNIX x86_64 Kernel Module 270.41.19, on a GeForce GTX 480.
I am trying to find a (device) memory leak in my program. I use the runtime API and cudaMemGetInfo(free,total) to measure device memory usage. I notice a significant loss (in this case 31 MB) after kernel execution. The kernel code itself does not allocate any device memory, so I guess it's the kernel code that remains in device memory, even though I would have thought the kernel isn't that big. (Is there a way to determine the size of a kernel?)
When is the kernel code loaded into device memory? I guess at execution of the host code line:
kernel<<<geom>>>(params);
Right?
And does the code remain in device memory after the call? If so, can I explicitly unload the code?
What concerns me is device memory fragmentation. Think of a large sequence of alternating device memory allocation and kernel executions (different kernels). Then after a while device memory gets quite scarce. Even if you free some memory the kernel code remains leaving only the space between the kernels free for new allocation. This would result in a huge memory fragmentation after a while. Is this the way CUDA was designed?
The memory allocation you are observing is used by the CUDA context. It doesn't only hold kernel code, it holds any other static scope device symbols, textures, per-thread scratch space for local memory, printf and heap, constant memory, as well as gpu memory required by the driver and CUDA runtime itself. Most of this memory is only ever allocated once, when a binary module is loaded, or PTX code is JIT compiled by the driver. It is probably best to think of it as a fixed overhead, rather than a leak. There is a 2 million instruction limit in PTX code, and current hardware uses 32 bit words for instructions, so the memory footprint of even the largest permissible kernel code is small compared to the other global memory overheads it requires.
In recent versions of CUDA there is a runtime API call cudaDeviceSetLimit which permits some control over the amount of scratch space a given context can consume. Be aware that it is possible to set the limits to values which are lower than the device code requires, in which case runtime execution failures can result.
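A sketch of the cudaDeviceSetLimit call mentioned above (the limit values are arbitrary examples, and on a real device the observed change in free memory depends on the driver's allocation granularity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_before, free_after, total;
    cudaMemGetInfo(&free_before, &total);

    // Cap the per-context printf FIFO and device-side malloc heap;
    // the context's scratch allocations track these limits.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);  // 1 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 4 << 20);  // 4 MB

    cudaMemGetInfo(&free_after, &total);
    printf("free before: %zu MB, after: %zu MB\n",
           free_before >> 20, free_after >> 20);
    return 0;
}
```

As the answer warns, setting a limit below what a kernel actually needs (e.g. a heap smaller than its in-kernel malloc calls) turns into a launch-time or run-time failure rather than an allocation error.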

Force Linux to use only memory over 4G?

I have a Linux device driver that interfaces to a device that, in theory, can perform DMA using 64-bit addresses. I'd like to test to see that this actually works.
Is there a simple way that I can force a Linux machine not to use any memory below physical address 4G? It's OK if the kernel image is in low memory; I just want to be able to force a situation where I know all my dynamically allocated buffers, and any kernel or user buffers allocated for me are not addressable in 32 bits. This is a little brute force, but would be more comprehensive than anything else I can think of.
This should help me catch (1) hardware that wasn't configured correctly or loaded with the full address (or is just plain broken) as well as (2) accidental and unnecessary use of bounce buffers (because there's nowhere to bounce to).
clarification: I'm running x86_64, so I don't care about most of the old 32-bit addressing issues. I just want to test that a driver can correctly interface with multitudes of buffers using 64-bit physical addresses.
/usr/src/linux/Documentation/kernel-parameters.txt
memmap=exactmap [KNL,X86] Enable setting of an exact
E820 memory map, as specified by the user.
Such memmap=exactmap lines can be constructed based on
BIOS output or other requirements. See the memmap=nn@ss
option description.
memmap=nn[KMG]@ss[KMG]
[KNL] Force usage of a specific region of memory
Region of memory to be used, from ss to ss+nn.
memmap=nn[KMG]#ss[KMG]
[KNL,ACPI] Mark specific memory as ACPI data.
Region of memory to be used, from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
[KNL,ACPI] Mark specific memory as reserved.
Region of memory to be used, from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
or
memmap=0x10000$0x18690000
If you add memmap=4G$0 to the kernel's boot parameters, the lower 4GB of physical memory will no longer be accessible. Also, your system will no longer boot... but some variation hereof (memmap=3584M$512M?) may allow for enough memory below 4GB for the system to boot but not enough that your driver's DMA buffers will be allocated there.
IIRC there's an option in the kernel configuration to enable PAE extensions, which lets a 32-bit kernel address more than 4 GB (I am a bit rusty on kernel config -- the last kernel I recompiled was 2.6.4 -- so please excuse my lack of recall). You do know how to trigger a kernel config:
make clean && make menuconfig
Hope this helps,
Best regards,
Tom.
