Following the text at https://www.kernel.org/doc/Documentation/DMA-API.txt, a few inlined questions:
Part Ia - Using large dma-coherent buffers
------------------------------------------
void *
dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flag)
Consistent memory is memory for which a write by either the device or
the processor can immediately be read by the processor or device
without having to worry about caching effects. (You may however need
to make sure to flush the processor's write buffers before telling
devices to read that memory.)
Q1. Is it safe to assume that the allocated area is cacheable, given that the last line states that flushing is required?
Q1a. Does this API allocate memory from the lower 16 MB, which is considered DMA-safe?
dma_addr_t
dma_map_single(struct device *dev, void *cpu_addr, size_t size,
enum dma_data_direction direction)
Maps a piece of processor virtual memory so it can be accessed by the
device and returns the physical handle of the memory.
The direction for both api's may be converted freely by casting.
However the dma_ API uses a strongly typed enumerator for its
direction:
DMA_NONE no direction (used for debugging)
DMA_TO_DEVICE data is going from the memory to the device
DMA_FROM_DEVICE data is coming from the device to the memory
DMA_BIDIRECTIONAL direction isn't known
Q2. Do the DMA_XXX options direct a change of page attributes for the VA=>PA mapping? Say, would DMA_TO_DEVICE mark the area as non-cacheable?
It says "without having to worry about caching effects". That means dma_alloc_coherent() returns uncacheable memory unless the architecture has cache coherent DMA hardware so the caching makes no difference. However being uncached does not mean that writes do not go through the CPU write buffers (i.e. not every memory access is immediately executed or executed in the same order as they appear in the code). To be sure that everything you write into memory is really there when you tell the device to read it, you will have to execute a wmb() at least. See Documentation/memory-barriers.txt for more information.
dma_alloc_coherent() does not return memory from the lower 16 MB, it returns memory that is accessible by the device inside the addressable area specified by dma_set_coherent_mask(). You have to call that as part of the device initialization.
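To make the two points above concrete, here is a minimal sketch of a driver setting the coherent mask, allocating a coherent buffer, and draining the write buffers before kicking the device. struct my_dev, MY_BUF_SIZE and the doorbell step are made up for illustration; error handling is reduced to the essentials.

#include <linux/types.h>
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <asm/barrier.h>

#define MY_BUF_SIZE 4096                 /* hypothetical buffer size */

struct my_dev {                          /* hypothetical driver state */
    struct device *dev;
    void *buf;                           /* CPU address of the coherent buffer */
    dma_addr_t buf_dma;                  /* device-visible address of the same buffer */
};

static int my_dev_setup(struct my_dev *md)
{
    /* Declare which addresses the device can reach; 32 bits is only an example. */
    if (dma_set_coherent_mask(md->dev, DMA_BIT_MASK(32)))
        return -EIO;

    md->buf = dma_alloc_coherent(md->dev, MY_BUF_SIZE,
                                 &md->buf_dma, GFP_KERNEL);
    if (!md->buf)
        return -ENOMEM;
    return 0;
}

static void my_dev_kick(struct my_dev *md, u32 descriptor)
{
    *(u32 *)md->buf = descriptor;        /* write into the coherent buffer */
    wmb();                               /* drain the CPU write buffers first */
    /* ...now tell the device (e.g. via an MMIO doorbell) to read from buf_dma... */
}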
Cacheability is irrelevant to dma_map_*() functions. They make sure that the given memory region is accessible to the device at the DMA address they return. After the DMA is finished dma_unmap_*() is called. For DMA_TO_DEVICE the sequence is "write data to memory, map(), start DMA, unmap() when finished", for DMA_FROM_DEVICE "map(), start DMA, unmap() when finished, read data from memory".
Cache makes no difference because usually you are not writing or reading the buffer while it is mapped. If you really have to do that you have to explicitly dma_sync_*() the memory before reading or after writing the buffer.
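A corresponding sketch of the streaming API, reusing the hypothetical struct my_dev from the sketch above; the device-programming step is only hinted at in comments.

static int my_dev_send(struct my_dev *md, void *data, size_t len)
{
    dma_addr_t handle;

    /* DMA_TO_DEVICE: write the data into 'data' first, then map. */
    handle = dma_map_single(md->dev, data, len, DMA_TO_DEVICE);
    if (dma_mapping_error(md->dev, handle))
        return -ENOMEM;

    /* ...program the device with 'handle', start the DMA, wait for completion... */

    dma_unmap_single(md->dev, handle, len, DMA_TO_DEVICE);
    return 0;
}

/*
 * If the CPU really must look at a DMA_FROM_DEVICE buffer while it is still
 * mapped, hand ownership back to the CPU first and return it to the device
 * afterwards, as described above.
 */
static void my_dev_peek(struct my_dev *md, dma_addr_t handle, size_t len)
{
    dma_sync_single_for_cpu(md->dev, handle, len, DMA_FROM_DEVICE);
    /* ...read the buffer contents through its CPU address... */
    dma_sync_single_for_device(md->dev, handle, len, DMA_FROM_DEVICE);
}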
Related question:
I'm not sure I understand the full flow of direct CPU access to memory on ARM processors.
I am interested to know in which parts of a memory access the caches (L1 and L2), the DMA and the MMU (or secure MMU) participate.
I'm also not sure I understand the process of sending data from the non-secure OS to the secure OS: does it start with allocating a shared buffer via DMA, writing data to the shared buffer (shared between the secure OS and the non-secure OS), and then sending it?
Additional questions:
Why is DMA needed to communicate between the secure and non-secure worlds? Why is it not possible to do it via a kernel buffer (kmalloc(), kzalloc(), get_page(), etc.)?
Generally, is it possible for the CPU to access memory without DMA? Must DMA participate?
Is non-coherency possible between the CPU (L1 or L2 cache) and DMA?
For example:
The non-secure OS writes its own data to a DMA buffer and sends it to the secure OS.
The secure OS receives the buffer, the non-secure OS changes the buffer again without flushing (I think the changes stay in the cache), and finally the secure OS reads stale, fake data from the cache.
Everything with TrustZone is accomplished with the 'NS' bit that augments the BUS.
For a TrustZone CPU, L1/L2/TLB (via MMU) need to be aware of the 'NS' bit. Caches and TLB are augmented with a 'NS' bit and are not accessible from the normal world if the 'NS' is clear.
I'm not sure I understand the process of sending data from the non-secure OS to the secure OS: does it start with allocating a shared buffer via DMA, writing data to the shared buffer (shared between the secure OS and the non-secure OS), and then sending it?
The secure/non-secure OS have several means to communicate. A DMA buffer is one way, but it is probably complex and would not be a normal mode. The most basic mechanism is the SMC instruction. This is trapped by monitor mode and accomplishes the same thing as a 'syscall'.
Interpret ARM SMC calls
ARM SMC Calling Convention
SMC on StackOverflow
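As a rough illustration of the SMC mechanism (not any particular secure OS interface), here is a minimal sketch of a call from non-secure privileged code on 32-bit ARMv7 with the Security Extensions. The function ID passed in is invented; register usage loosely follows the SMC Calling Convention linked above, and inside Linux you would normally use the kernel's arm_smccc_smc() helper rather than raw inline assembly.

#include <stdint.h>

static inline uint32_t smc_call(uint32_t fid, uint32_t a1,
                                uint32_t a2, uint32_t a3)
{
    register uint32_t r0 asm("r0") = fid;   /* function ID (hypothetical value chosen by the secure OS) */
    register uint32_t r1 asm("r1") = a1;
    register uint32_t r2 asm("r2") = a2;
    register uint32_t r3 asm("r3") = a3;

    /* Trapped by monitor mode, which switches to the secure world,
     * much like a syscall traps into the kernel. r0-r3 carry the result. */
    asm volatile(".arch_extension sec\n\t"
                 "smc #0"
                 : "+r"(r0), "+r"(r1), "+r"(r2), "+r"(r3)
                 :
                 : "memory");
    return r0;
}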
Another way is to map RAM as world shareable. Typically, this is done with a TZASC, but other TrustZone memory controllers may exist on a system. This is probably best 'bootstrapped' via the SMC mechanics.
The use of a DMA controller could extend the world-shareable memory buffer to offload CPU workload. However, I think this case is a little pathological and would never be done. Even faster than copying the memory via DMA is simply updating the TZASC to make a buffer shareable. There is no copying.
Normal world reads 'secure memory' -> faults.
Normal world reads 'world shared memory' -> access as per normal.
The secure OS can flip the TZASC permissions during run time, if the device is not boot locked.
Why is DMA needed to communicate between the secure and non-secure worlds? Why is it not possible to do it via a kernel buffer (kmalloc(), kzalloc(), get_page(), etc.)?
It is as detailed above. It requires world shareable memory.
Generally, is it possible for the CPU to access memory without DMA? Must DMA participate?
No, DMA does not need to be involved at all. In fact, I wonder what made you think this was the case.
Is non-coherency possible between the CPU (L1 or L2 cache) and DMA? For example: the non-secure OS writes its own data to a DMA buffer and sends it to the secure OS. The secure OS receives the buffer, the non-secure OS changes the buffer again without flushing (I think the changes stay in the cache), and finally the secure OS reads stale, fake data from the cache.
DMA and caches always have coherency issues. TrustZone doesn't add anything new. If you are using DMA, you need to have the MMU map that region as device memory, and then it will not be cached.
Also, the DMA devices themselves are considered bus masters. They can be TrustZone aware, or some front-end logic can be placed in front of them. In the first case, the controller will flip the 'NS' bit based on documented use patterns. For example, a crypto device may present banked registers to the normal/secure worlds. Depending on who accessed the device, the DMA will be performed with NS set or clear. In the second case, another device/gasket sets up fixed access for the DMA; it is always either normal or secure access. This is often boot locked.
The DMA controller (and all hardware besides the CPU) is outside the scope of the CPU. The SoC designer and OEM have to configure the system to match the security requirements of the application. So different devices should map to normal/secure (or dynamic if required). The safest case is to fix these mappings and lock them at boot time. Otherwise, your attack surface grows in attacks against TrustZone.
I have a buffer coming in from user space which needs to be filled with device register values as a debugging mechanism. Is it safe to use copy_to_user() / copy_from_user() for device memory? If not, what's the best alternative, given that the device driver lives in kernel space?
All the comments are wrong.
For any data moves between user and kernel spaces, you have to use copy_from/to_user
memcpy_from/toio() are reserved for addresses IN the kernel space and MMIO. It's unsafe to use those functions with user-space addresses.
Answer:
You can simply use copy_from/to_user() directly with the mapped MMIO address as the void *to or void *from argument, so you don't need a useless intermediate buffer.
This should be used only with prefetchable memory, since these functions may read/write the same memory several times and/or in an unordered way.
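A minimal sketch of that approach in a debugfs-style read handler, assuming the driver has already ioremap()'ed its register window. struct my_mmio_dev, its regs/regs_len fields and the file wiring are hypothetical, and, as stated above, this is only reasonable for prefetchable memory.

#include <linux/fs.h>
#include <linux/io.h>
#include <linux/uaccess.h>

struct my_mmio_dev {                     /* hypothetical driver state */
    void __iomem *regs;                  /* from ioremap() of the register window */
    size_t regs_len;
};

static ssize_t my_regs_read(struct file *file, char __user *buf,
                            size_t count, loff_t *ppos)
{
    struct my_mmio_dev *mdev = file->private_data;

    if (*ppos >= mdev->regs_len)
        return 0;
    if (count > mdev->regs_len - *ppos)
        count = mdev->regs_len - *ppos;

    /* copy_to_user() straight from the mapped registers, no bounce buffer;
     * the __force cast only silences sparse's __iomem address-space check. */
    if (copy_to_user(buf, (const void __force *)(mdev->regs + *ppos), count))
        return -EFAULT;

    *ppos += count;
    return count;
}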
I have a custom device driver that implements an mmap operation to map a shared RAM buffer (outside of the OS) to userspace. The buffer is reserved by passing mem=32M as a boot argument for the OS, leaving the rest of the 512MB available as a buffer. I would like to perform zero-copy operations from the mapped memory, which is not possible if the vm_flags include VM_PFNMAP and VM_IO.
My driver currently performs the mapping by calling vm_iomap_memory(vma, start, size), which in turn calls io_remap_pfn_range and remap_pfn_range, which sets up the vma with the VM_PFNMAP and VM_IO set. This works to map the memory to userspace, but zero-copy socket operations fail at get_user_pages due either to the VM_PFNMAP flags being set or the struct page being missing. The comments for remap_pfn_range show this is intended behavior, as pfn-mapped memory should not be treated as 'normal'. However, for my case it is just a block of reserved RAM, so I don't see why it should not be treated as normal. I have set up cache invalidation/flushing to manually manage the memory.
I have tried unsetting the VM_PFNMAP and VM_IO flags on the vm_area_struct both during and after the mapping, but get_user_pages still fails. I have also looked at the dma libraries but it looks like they rely on a call to remap_pfn_range behind the scenes.
My question is how do I map physical memory as a normal, non-pfn, struct page-backed userspace address? Or is there some other way I should be looking at it? Thanks!
I've found the solution to mapping a memory buffer outside the Kernel that requires a correction to several wrong starting points that I mentioned above. It's not possible to post full source code here, but the steps to get it working are:
Device tree: define a reserved memory region for the buffer, with no associated driver. Do not use the mem or memmap bootargs. The kernel will confine itself to memory outside of this reserved space, but will now be able to create struct pages for the reserved memory.
In a device driver (an LKM in my case), mapping the physical address to a kernel virtual address requires using memremap instead of ioremap, as it is real memory we are mapping.
In the device driver's mmap routine, do not use any variant of remap_pfn_range to set up the vma for userspace; instead assign a custom fault (nopage) routine to vma->vm_ops.fault to look up the page when the userspace virtual address is used. This approach is described in LDD3 chapter 15.
The nopage function in the driver should use the vm_fault structure argument that is passed to it to calculate the offset into the vma for the address that needs a page. Then use that offset to calculate a kernel virtual address (against the memremap'd address), get the page with page = virt_to_page(pageptr), take a reference with get_page(page), and assign it to the vm_fault structure with vmf->page = page. The latter part of this is illustrated in LDD3 chapter 15 as well.
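A minimal sketch of the driver side of these steps, assuming the reserved-memory region lives at RESERVED_PHYS/RESERVED_SIZE (both made up, and they must match the device tree node), that the mmap offset is zero, and a kernel recent enough for the vm_fault_t fault signature. The char/misc device registration that would hook up resbuf_mmap is omitted.

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>

#define RESERVED_PHYS 0x20000000UL       /* hypothetical, must match the reserved-memory node */
#define RESERVED_SIZE (480UL << 20)      /* hypothetical */

static void *resbuf;                     /* kernel VA of the reserved buffer */

static vm_fault_t resbuf_fault(struct vm_fault *vmf)
{
    unsigned long offset = vmf->pgoff << PAGE_SHIFT;
    struct page *page;

    if (offset >= RESERVED_SIZE)
        return VM_FAULT_SIGBUS;

    /* Offset into the vma -> kernel virtual address -> struct page. */
    page = virt_to_page(resbuf + offset);
    get_page(page);
    vmf->page = page;
    return 0;
}

static const struct vm_operations_struct resbuf_vm_ops = {
    .fault = resbuf_fault,
};

/* Wired into the driver's file_operations.mmap (registration not shown). */
static int resbuf_mmap(struct file *file, struct vm_area_struct *vma)
{
    /* No remap_pfn_range(): pages are supplied lazily by the fault handler,
     * so the vma never gets VM_PFNMAP/VM_IO. */
    vma->vm_ops = &resbuf_vm_ops;
    return 0;
}

static int __init resbuf_init(void)
{
    /* Real memory, so memremap() rather than ioremap(). */
    resbuf = memremap(RESERVED_PHYS, RESERVED_SIZE, MEMREMAP_WB);
    return resbuf ? 0 : -ENOMEM;
}
module_init(resbuf_init);
MODULE_LICENSE("GPL");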
The memory mapped in this fashion using mmap against the custom device driver can be used just like normal malloc'd memory as far as I can tell. There are probably ways to achieve a similar result with the DMA libraries, but I had constraints preventing that route, or associating the device tree node with the driver.
How does an IO device know that a value in memory pertaining to it has changed in memory mapped IO?
For example, let's say memory address 0 has been dedicated to hold the background color for a VGA device. How does the VGA device know when we change the value in memory[0]? Is the VGA device constantly polling the memory location? Or does the CPU somehow notify the device when it changes the value (and if so how?)?
An example architecture is MIPS. Given that the MIPS instruction set does not have in or out instructions, I don't understand how it could possibly communicate (on change) with the VGA device in the example. Another example is the ARM architecture.
In memory-mapped I/O, performing a memory read/write to the device's memory region will cause the CPU to perform a transaction with the device to fetch/store that value -- either directly through the CPU's memory bus, or through a secondary bus (such as AHB/APB on ARM systems). This memory transaction directly notifies the device that a value is being changed; no separate notification is necessary.
You're assuming that memory-mapped I/O is mapped by normal RAM. This is not the case. Indeed, these devices may behave in ways which are entirely unlike real memory! For instance, a typical UART or SPI device implementation may have a single data register which can be written to to transmit data, or read from to retrieve received data. Similarly, it's not uncommon for interrupt registers to have "clear on read" or "write 1 to clear" semantics.
For what it's worth: in practice, many framebuffer graphics implementations do actually behave as normal memory. What's different is that the memory is stored in a dual-ported RAM (or a time-multiplexed bus), and the video RAMDAC continuously reads through that memory to transmit its contents to an attached display.
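To make that concrete, a tiny bare-metal style sketch: the store below is itself the bus transaction that reaches the device, so nothing else has to "notify" it. The address and register layout are invented for the VGA background-colour example from the question.

#include <stdint.h>

/* Hypothetical memory-mapped "background colour" register of a VGA device. */
#define VGA_BG_COLOUR ((volatile uint32_t *)0x10000000u)

static void set_background(uint32_t rgb)
{
    /* A plain store: the interconnect routes it to the device instead of RAM,
     * and the device reacts (latches the new colour) as the write arrives. */
    *VGA_BG_COLOUR = rgb;
}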
A region of the physical address space that is designated as memory-mapped I/O (MMIO) is not mapped to main memory (system memory); it's mapped to I/O registers which are physically part of the I/O device.
To determine how to handle a memory access (read or write), the processor checks first the type of the region to which the target memory address belongs. In any MIPS processor, there are at least two types: Uncached and Cached. MMIO regions are always Uncached. An Uncached memory access request is directly sent to the main memory controller without examining or affecting any of the caches. However, an I/O Uncached memory access request is sent to an I/O controller, and eventually the request will reach the destination I/O device.
Now exactly how the CPU and the I/O device communicate with each other is completely specified by the I/O device itself. So an I/O device would have a specification that discusses how many I/O registers there are and how each of them should be used. An I/O register could be used to hold status flags, control flags, data to be read or written by the CPU, or some combination thereof. Note that since the I/O registers are physically part of the I/O device, then the I/O device can be designed so that it can detect when any of its registers are being read from or written to and take an action accordingly if required.
An I/O device can send an interrupt to the CPU to inform it that some data is available or maybe it wants attention for whatever reason. The CPU can also frequently poll the I/O device by checking some status flag(s) and then take some action accordingly.
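A sketch of the polling alternative mentioned above, again with invented addresses and bit layout: each read of the status register is an uncached access that reaches the device, so the device "sees" that it is being polled.

#include <stdint.h>

#define DEV_STATUS ((volatile uint32_t *)0x10000004u)   /* hypothetical status register */
#define DEV_DATA   ((volatile uint32_t *)0x10000008u)   /* hypothetical data register */
#define STATUS_DATA_READY (1u << 0)

static uint32_t read_when_ready(void)
{
    /* Spin until the device sets the "data ready" flag, then fetch the data. */
    while ((*DEV_STATUS & STATUS_DATA_READY) == 0)
        ;
    return *DEV_DATA;
}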
What is the difference between DMA and memory-mapped IO? They both look similar to me.
Memory-mapped I/O allows the CPU to control hardware by reading and writing specific memory addresses. Usually, this would be used for low-bandwidth operations such as changing control bits.
DMA allows hardware to directly read and write memory without involving the CPU. Usually, this would be used for high-bandwidth operations such as disk I/O or camera video input.
Here is a paper that has a thorough comparison between MMIO and DMA:
Design Guidelines for High Performance RDMA Systems
Since others have already answered the question, I'll just add a little bit of history.
Back in the old days, on x86 (PC) hardware, there was only I/O space and memory space. These were two different address spaces, accessed with different bus protocol and different CPU instructions, but able to talk over the same plug-in card slot.
Most devices used I/O space for both the control interface and the bulk data-transfer interface. The simple way to access data was to execute lots of CPU instructions to transfer data one word at a time from an I/O address to a memory address (sometimes known as "bit-banging.")
There was no support in the ISA bus protocol for devices to initiate transfers, so data could not move from devices to host memory autonomously. A compromise solution was invented: the DMA controller. This was a piece of hardware that sat up by the CPU and initiated transfers to move data from a device's I/O address to memory, or vice versa. Because the I/O address is the same, the DMA controller is doing the exact same operations as a CPU would, but a little more efficiently and allowing the CPU some freedom to keep running in the background (though possibly not for long, as it cannot talk to memory while the transfer is in progress).
Fast-forward to the days of PCI, and the bus protocols got a lot smarter: any device can initiate a transfer. So it's possible for, say, a RAID controller card to move any data it likes to or from the host at any time it likes. This is called "bus master" mode, but for no particular reason people continue to refer to this mode as "DMA" even though the old DMA controller is long gone. Unlike old DMA transfers, there is frequently no corresponding I/O address at all, and the bus master mode is frequently the only interface present on the device, with no CPU "bit-banging" mode at all.
Memory-mapped IO means that the device registers are mapped into the machine's memory space - when those memory regions are read or written by the CPU, it's reading from or writing to the device, rather than real memory. To transfer data from the device to an actual memory buffer, the CPU has to read the data from the memory-mapped device registers and write it to the buffer (and the converse for transferring data to the device).
With a DMA transfer, the device is able to directly transfer data to or from a real memory buffer itself. The CPU tells the device the location of the buffer, and then can perform other work while the device is directly accessing memory.
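A compact sketch of the contrast, with made-up registers: in the memory-mapped-only case the CPU moves every word itself, while in the DMA case it merely programs the device with a buffer address and length and is then free to do other work.

#include <stdint.h>
#include <stddef.h>

#define DEV_FIFO      ((volatile uint32_t *)0x10000008u)   /* hypothetical data FIFO */
#define DEV_DMA_ADDR  ((volatile uint32_t *)0x1000000cu)   /* hypothetical DMA registers */
#define DEV_DMA_LEN   ((volatile uint32_t *)0x10000010u)
#define DEV_DMA_START ((volatile uint32_t *)0x10000014u)

/* Memory-mapped I/O only: the CPU copies every word from the device FIFO. */
static void pio_read(uint32_t *buf, size_t words)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = *DEV_FIFO;
}

/* DMA: hand the device a (physical) buffer address and length, then let it
 * write memory directly while the CPU does other work. */
static void dma_read(uint32_t buf_phys, uint32_t len_bytes)
{
    *DEV_DMA_ADDR  = buf_phys;
    *DEV_DMA_LEN   = len_bytes;
    *DEV_DMA_START = 1;
}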
Direct Memory Access (DMA) is a technique to transfer data from I/O to memory and from memory to I/O without the intervention of the CPU. For this purpose, a special chip, called the DMA controller, is used to control all activities and synchronization of the data. As a result, compared to other data transfer techniques, DMA is much faster.
On the other hand, virtual memory acts as a cache between main memory and secondary memory. Data is fetched in advance from secondary memory (the hard disk) into main memory so that it is already available in main memory when needed. It allows us to run more applications on the system than the physical memory alone could support.