Can malloc be used for DMA?

I have some sample code which uses two test applications:
a test application in the kernel, which allocates a buffer using kmalloc (GFP_KERNEL)
a test application in userspace, which allocates a buffer using malloc; in this case the kernel driver uses copy_from_user before the DMA transfer
In both cases the buffers are DMA'd over PCI.
I don't understand how buffers allocated with malloc/kmalloc can be used for DMA.
Does copy_from_user from an area that was allocated using malloc result in physically contiguous memory suitable for DMA?
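For concreteness, here is a minimal sketch of the second case as I understand it, with the user buffer copied into a kmalloc'ed bounce buffer that is then mapped for DMA; the struct my_dev, the function name, and the error handling are hypothetical placeholders, not the actual sample code:

#include <linux/dma-mapping.h>
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

/* Hypothetical per-device structure, just enough for the sketch. */
struct my_dev {
        struct pci_dev *pdev;
};

/* Copy a user (malloc'ed) buffer into a kmalloc'ed bounce buffer,
 * then map that physically contiguous buffer for a memory-to-device
 * DMA transfer. */
static int start_dma_from_user(struct my_dev *dev, const void __user *ubuf,
                               size_t len)
{
        void *bounce;
        dma_addr_t handle;

        bounce = kmalloc(len, GFP_KERNEL);      /* physically contiguous */
        if (!bounce)
                return -ENOMEM;

        if (copy_from_user(bounce, ubuf, len)) {
                kfree(bounce);
                return -EFAULT;
        }

        handle = dma_map_single(&dev->pdev->dev, bounce, len, DMA_TO_DEVICE);
        if (dma_mapping_error(&dev->pdev->dev, handle)) {
                kfree(bounce);
                return -EIO;
        }

        /* Program the device with 'handle' and 'len' (device specific);
         * dma_unmap_single() and kfree() once the transfer completes. */
        return 0;
}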

Memory-mapped IO vs DMA?

Based on what I have learned from the comments and answers (thanks everyone!), I edited the question to be more targeted:
DMA:
Before the first DMA, the CPU has to set up things like the RAM address range reserved for the device to use for DMA. Once that setup work is done, can the device initiate transfers at will, basically owning that part of RAM, or does it still have to get some sort of permission from the CPU before every single DMA transfer?
MMIO:
CPU access to device memory via MMIO is more expensive than CPU access to RAM, yet I can see on my desktop that PCI devices reserve hundreds of megabytes for MMIO. What is an example where this can be used efficiently (as opposed to copying the data back to RAM using DMA and then accessing it there)?
Look at it from the device's perspective. The device can:
directly access memory itself (using DMA itself)
wait for the CPU to transfer data to it (by providing memory mapped IO for CPU to use)
So the question is, if the CPU can access the PCIe memory by memory-maps, why does it have to do DMAs?
The CPU doesn't use DMA at all. The entire point of DMA is to allow the CPU to do other things (or nothing) while the device does the DMA. The end result is a significant performance increase for the system as a whole - e.g. the CPU(s) doing lots of other work, while lots of devices (hard drive controller, video card, audio card, network card, ...) are also using DMA to transfer data around.
CPU can access this memory as if it is DRAM, by memory mapped IO.
You're misusing terminology.
Instead of "DRAM" you should be using the term "main memory", aka system memory or RAM.
On modern computers main/system memory is implemented by some type of SDRAM (synchronous dynamic RAM).
Conflating the functional term (e.g. main memory) with the hardware implementation (e.g. DDR3 SDRAM) seems harmless, but can lead to the false syllogism that "RAM is volatile" or other misunderstandings.
Memory mapping can put the memory/memories of a PCIe device in the same address space as main memory.
CPU can transfer a chunk of data from this PCIe device's memory into real physical memory, via DMA. And then CPU can access the physical memory freely.
"Real physical memory" is redundant. What other types of "physical memory" are there?
There's no "fake physical memory".
You seem to be referring to the use of a buffer in main memory as "DMA".
That is misguided.
DMA is not required in order to use a buffer in main memory or to copy data into one.
So the question is, if the CPU can access the PCIe memory by memory-maps, why does it have to do DMAs?
You seem to be misusing terminology.
You might want to study this article on PCIe.
Is it because PCIe bus is slow for random access?
Accessing data from a PCIe device is very slow compared to main/system memory.
This has nothing to do with "random access".
Information transfer (e.g. data retrieval) over the PCIe bus is accomplished with (high-speed) packets, even when the PCIe memory is mapped into the processor's address space.
And if so, is DMA basically a single dump to speed up frequent random access, while memory-mapped IO is for occasional access?
You're misusing terminology.
If the software is written inefficiently or only needs to use the data just once, then it might access the PCIe memory.
But if the software is going to access the data more than once or deems a "local" copy to be more efficient, then the software could allocate a buffer in main/system memory and copy the data from PCIe memory to main/system memory using either PIO (programmed I/O by the CPU) or DMA (direct memory access by a PCIe bus master or system DMA controller).
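To make the PIO half of that concrete, here is a hedged kernel-side sketch that maps a device BAR and copies its contents into a main-memory buffer with the CPU; the BAR index (0) and the idea that the whole region is worth copying are assumptions for illustration:

#include <linux/io.h>
#include <linux/pci.h>
#include <linux/slab.h>

/* Map BAR 0 of a PCIe device and copy 'len' bytes of its memory into a
 * buffer in main memory using the CPU (PIO). Caller frees the buffer. */
static void *copy_bar_to_ram(struct pci_dev *pdev, size_t len)
{
        void __iomem *bar = pci_iomap(pdev, 0, len);
        void *buf;

        if (!bar)
                return NULL;

        buf = kmalloc(len, GFP_KERNEL);
        if (buf)
                memcpy_fromio(buf, bar, len);   /* CPU performs every access */

        pci_iounmap(pdev, bar);
        return buf;
}

With DMA, the same copy would instead be performed by the device or a DMA engine against the bus address of the buffer, leaving the CPU free while it happens.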
The use of buffers is widespread in computers.
A large part of "computing time" is spent on buffering and copying and moving data around.
I/O is almost always performed between a device and a buffer in main memory, even if direct device-to-device transfer is possible.
Do not mislabel the use of a buffer as "DMA".
For some info on DMA, see Why driver need to map DMA buffers when dma-engine is in device?
and
dma vs interrupt-driven i/o.
DMA is usually set up by the CPU programming registers on the device that are mapped into MMIO regions. It wouldn't make sense to map an entire hard drive into the physical address space, and doing so would quickly use up the available physical address space on the chipset, which is often limited to as little as 39 bits on modern chipsets; instead, only the host controller (xHCI, AHCI etc.) registers are mapped into MMIO space. It would also mean that the CPU would be using mov instructions to copy the data to the hard drive for the entire transfer, which occupies CPU bandwidth. Instead, DMA is asynchronous: the CPU issues a command to the device, and the device, PCIe bus and DRAM controller get on with it while the CPU is free.
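As an illustration of "the CPU programming registers on the device", here is a hedged sketch of kicking off a transfer on an imaginary device; the register offsets and the start bit are invented for the example, since every device defines its own layout:

#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/types.h>

/* Invented register layout for an imaginary DMA-capable device. */
#define REG_DMA_ADDR_LO  0x00
#define REG_DMA_ADDR_HI  0x04
#define REG_DMA_LEN      0x08
#define REG_DMA_CTRL     0x0c
#define DMA_CTRL_START   0x01

/* Program the device's DMA engine through its MMIO registers and return;
 * the device fetches the data itself and typically raises an interrupt
 * when the transfer is done, leaving the CPU free in the meantime. */
static void kick_dma(void __iomem *regs, dma_addr_t addr, u32 len)
{
        writel(lower_32_bits(addr), regs + REG_DMA_ADDR_LO);
        writel(upper_32_bits(addr), regs + REG_DMA_ADDR_HI);
        writel(len, regs + REG_DMA_LEN);
        writel(DMA_CTRL_START, regs + REG_DMA_CTRL);
}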
With an IGPU without dedicated VRAM, you have VRAM in DRAM (GFX stolen memory), which is reserved for the IGPU and is of course accessible by both the IGPU and the CPU. You also have a GTT page table in DRAM that the IGPU uses to translate its internal virtual addresses to physical pages, which it then accesses via DMA over the ring bus. The CPU renders there and programs the IGPU to perform DMA to read the data in.
On a discrete GPU with VRAM, the CPU writes to DRAM, inserts the address of the allocation into the GTT table in VRAM via the VRAM aperture, and then programs the GPU to copy from the equivalent GART aperture address that corresponds mathematically to that GTT entry - the aperture is a contiguous GPU device-local address space separate from VRAM. The GPU then reads from the aperture space, which results in it indexing into the GTT, acquiring the real system address of the data, and initiating a DMA transfer from that system memory address to an arbitrary address in the 256 MiB VRAM aperture. There is also the option of using PCIe BARs or resizable BARs to expose a VRAM aperture to which the CPU can write directly, without the need for a copy. Another advantage of this is that a CPU core could interleave several transfers, or several cores could work on different transfers, whereas with DMA the GPU can likely only perform one DMA transfer at a time, sequentially/synchronously, with no concurrency or parallelism.

Linux, mmap'ing IOMMU/SMMU registers to userspace

I am working on a register dump utility for debugging.
Just out of curiosity: in Linux, can we mmap the SMMU/IOMMU registers to userspace?
I get the error below when I try to mmap my SMMU/IOMMU address space:
"Error mapping physical memory: Cannot allocate memory"
EDIT :
The error was related to mmap'ing more than 4 GB of memory on a 32-bit machine, which obviously fails. But the question still stands:
Can I map SMMU/IOMMU controller registers using mmap?
I read from here
The mmap() call allows the user application to map a physical device
address range one page at a time or a contiguous range of physical
memory in multiples of page size.
So SMMU/IOMMU controller registers can be mapped using mmap.
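For reference, a minimal userspace sketch of that kind of mapping through /dev/mem; SMMU_BASE is a placeholder for your controller's physical base address, and the kernel must allow the access (CONFIG_STRICT_DEVMEM, for example, can forbid it):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SMMU_BASE 0x12000000UL   /* placeholder physical base address */
#define MAP_LEN   0x1000UL       /* one page of registers */

int main(void)
{
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) {
                perror("open /dev/mem");
                return 1;
        }

        volatile uint32_t *regs = mmap(NULL, MAP_LEN, PROT_READ, MAP_SHARED,
                                       fd, SMMU_BASE);
        if (regs == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return 1;
        }

        /* Dump the first 32-bit register of the mapped page. */
        printf("reg[0] = 0x%08x\n", (unsigned)regs[0]);

        munmap((void *)regs, MAP_LEN);
        close(fd);
        return 0;
}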

How to allocate 4k aligned Memory

malloc() allocates a memory chunk which is virtually contiguous inside the process's address space. malloc() takes a size in bytes as a parameter and returns a pointer to the allocated memory, but what if the requirement is to allocate memory that is 4K-aligned?
That would almost certainly be achieved using something like posix_memalign.
Since 4 Kbytes is often the size of a page (see sysconf(3) with _SC_PAGESIZE, or the old getpagesize(2) syscall), you could use the mmap(2) syscall (which is used by malloc and posix_memalign) to get 4K-aligned memory.
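A minimal userspace sketch of the posix_memalign route, with an arbitrary 64 KiB buffer size:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);   /* typically 4096 */
        void *buf = NULL;
        int rc;

        /* Alignment must be a power of two and a multiple of sizeof(void *);
         * a page-sized alignment satisfies both. */
        rc = posix_memalign(&buf, (size_t)page, 64 * 1024);
        if (rc != 0) {
                fprintf(stderr, "posix_memalign: %s\n", strerror(rc));
                return 1;
        }

        printf("page size %ld, 4K-aligned buffer at %p\n", page, buf);
        free(buf);
        return 0;
}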
You cannot allocate guaranteed physically contiguous memory in user space, because user-space allocations are backed by whatever physical pages the kernel chooses (often from the highmem zone). But if you are writing a kernel module or other kernel-space code, then you can use __get_free_page() or __get_free_pages().
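And a hedged kernel-space sketch of the __get_free_pages route; the order of 2 (four pages) is an arbitrary example:

#include <linux/gfp.h>

/* __get_free_pages() returns 2^order physically contiguous, page-aligned
 * pages (order 2 = four pages = 16 KiB with 4 KiB pages); release them
 * later with free_pages(addr, 2). Returns 0 on failure. */
static unsigned long alloc_contig_buffer(void)
{
        return __get_free_pages(GFP_KERNEL, 2);
}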

Why am I getting a high address when I use kmalloc with GFP_DMA in Linux?

I am writing a device driver for a DMA device in Linux. In Linux Device Drivers, Chapter 15, it says:
For devices with this kind of limitation, memory should be allocated
from the DMA zone by adding the GFP_DMA flag to the kmalloc or
get_free_pages call. When this flag is present, only memory that can
be addressed with 24 bits is allocated. Alternatively, you can use the
generic DMA layer (which we discuss shortly) to allocate buffers that
work around your device’s limitations
I am calling kmalloc like this:
physical_pointer0 = kmalloc(number_of_bytes, GFP_DMA);
and printing the result like this:
printk(KERN_INFO "pinmem:pinmen_write kmalloc succeeded. pointer is %p, buffer size is %d\n", physical_pointer0, (unsigned)number_of_bytes);
And this is what I see:
Sep 9 00:29:45 nfellman_lnx kernel: [ 112.161744] pinmem:pinmen_write kmalloc succeeded. pointer is ffff880000180000, buffer size is 320800
How can I be getting a pointer to 0xffff880000180000, which doesn't fit in 24 bits, if I used GFP_DMA?
Could it be that this is not the physical address of my block of memory? If not (which would mean I'm completely misunderstanding kmalloc), how can I get its physical address?
I am working in OpenSuse 12.
The answer to this appears to be that kmalloc doesn't in fact return a physical address, but rather a kernel virtual (linearly mapped) address, which I must convert to a physical address with virt_to_phys.
Thanks to Alex Brown for providing the answer here.
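A hedged sketch of what that looks like, reusing the 320800-byte size from the question; the function name is made up:

#include <linux/io.h>     /* virt_to_phys() */
#include <linux/kernel.h>
#include <linux/slab.h>

/* kmalloc returns a kernel virtual (linearly mapped) address; print the
 * physical address it corresponds to. */
static void show_phys_addr(void)
{
        void *vaddr = kmalloc(320800, GFP_DMA);

        if (!vaddr)
                return;

        printk(KERN_INFO "pinmem: virtual %p, physical 0x%llx\n",
               vaddr, (unsigned long long)virt_to_phys(vaddr));
        kfree(vaddr);
}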

Linux kernel device driver to DMA into kernel space

LDD3 (p:453) demos dma_map_single using a buffer passed in as a parameter.
bus_addr = dma_map_single(&dev->pci_dev->dev, buffer, count, dev->dma_dir);
Q1: What/where does this buffer come from?
kmalloc?
Q2: Why does DMA-API-HOWTO.txt state I can use raw kmalloc to DMA into?
From http://www.mjmwired.net/kernel/Documentation/DMA-API-HOWTO.txt
L:51 If you acquired your memory via the page allocator or the generic memory allocators (e.g. kmalloc()), then you may DMA to/from that memory using the addresses returned from those routines.
L:74 you cannot take the return of a kmap() call and DMA to/from that.
So I can pass the address returned from kmalloc to my hardware device?
Or should I run virt_to_bus on it first?
Or should I pass this into dma_map_single?
Q3: When the DMA transfer is complete, can I read the data in the kernel driver via the kmalloc address?
addr = kmalloc(...);
...
printk("test result : 0x%08x\n", addr[0]);
Q4: Whats the best way to get this to user-space?
copy_to_user?
mmap the kmalloc memory?
others?
kmalloc is indeed one source to get the buffer. Another can be alloc_page with the GFP_DMA flag.
The point is that the memory kmalloc returns is guaranteed to be contiguous in physical memory, not just virtual memory, so you can give the bus address of that memory to your hardware. You do need to use dma_map_single() on the address returned, which, depending on the exact platform, might be no more than a wrapper around virt_to_bus, or might do more than that (set up IOMMU or GART tables).
Correct, just make sure to follow cache coherency guidelines as the DMA guide explains.
copy_to_user will work fine and is the easiest answer. Depending on your specific case it might be enough, or you might need something with better performance. You cannot normally map kmalloc'ed addresses to user space, but you can DMA into a user-provided address (some caveats apply) or allocate user pages (alloc_page with GFP_USER).
Good luck!
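To tie Q1-Q4 together, a hedged sketch of the whole flow for a device-to-memory transfer; the device-programming step is left as a comment since it is device specific:

#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* Q1: the buffer can come from kmalloc. Q2: dma_map_single() turns its
 * virtual address into a bus/DMA address for the hardware. Q3: after
 * unmapping, the CPU can read the data through the kmalloc pointer.
 * Q4: copy_to_user() hands the result to userspace. */
static int dma_read_to_user(struct device *dev, void __user *ubuf, size_t len)
{
        void *buf = kmalloc(len, GFP_KERNEL);
        dma_addr_t handle;
        int ret = 0;

        if (!buf)
                return -ENOMEM;

        handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle)) {
                kfree(buf);
                return -EIO;
        }

        /* ... program the device with 'handle', wait for completion ... */

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);

        pr_info("test result : 0x%08x\n", *(u32 *)buf);

        if (copy_to_user(ubuf, buf, len))
                ret = -EFAULT;

        kfree(buf);
        return ret;
}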

Resources