Mapping device memory for Linux 2.6.30 DMA API - linux

I've been struggling with this one and would really appreciate some help. I want to use the internal SRAM (the boot "stepping stone", unused after boot) of my AT91SAM9G45 to speed up some intensive computations, and am having trouble meeting both of the following conditions:
Memory is accessible from user space. This was easy using the user space mmap() and then kernel remap_pfn_range(). Using the pointer returned, my user space programs can read/write to the SRAM.
Using the kernel DMA API call dma_async_memcpy_buf_to_buf() to do a memcpy using DMA. Within my basic driver, I want to call this operation to copy data from DDR (allocated with kmalloc()) into the SRAM buffer.
So my problem is that I have the user space and physical addresses, but no kernel-space DMA API friendly mapping.
I've tried using ioremap() and using the fixed virtual address provided to iotable_init(). Neither of these seems to result in a kernel virtual address that can be used with something like virt_to_bus() (which works for the kmalloc() addresses and, I think, is used within the DMA API).
There's a way around this, and that's just triggering the DMA manually using the physical addresses, but I'd like to try and figure this out. I've been reading through LDD3 and googling, but I can't find any examples of using non-kmalloc memory with the DMA API (except for PCI buses).
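The user-space side of the mapping mentioned above is the straightforward part. As a reference point, an mmap() file operation that hands the SRAM to user space might look roughly like this; SRAM_PHYS_BASE and SRAM_SIZE are hypothetical placeholders for the board-specific values, not taken from the original post:

```c
/* Sketch of the mmap() handler mentioned above: maps the on-chip SRAM
 * into user space with remap_pfn_range(). SRAM_PHYS_BASE and SRAM_SIZE
 * are assumptions; use the AT91SAM9G45 datasheet values. */
#include <linux/mm.h>
#include <linux/fs.h>

#define SRAM_PHYS_BASE 0x00300000UL  /* assumption: board-specific */
#define SRAM_SIZE      (64 * 1024)   /* assumption */

static int sram_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > SRAM_SIZE)
        return -EINVAL;

    /* Device-like memory is usually mapped uncached. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           SRAM_PHYS_BASE >> PAGE_SHIFT,
                           len, vma->vm_page_prot);
}
```

The kernel-side mapping is the harder part, as the question says: an ioremap() cookie is not a linear-map address, so virt_to_bus()-style helpers cannot be applied to it.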

How can I force a cache flush for a process from a Linux device driver?

I'm working on a research project that requires me to perform a memory capture from custom hardware. I am working with a Zedboard SoC (dual-core ARM Cortex-A9 with FPGA fabric attached). I have designed a device driver that allows me to perform virtual memory captures and physical memory captures (using an AXI4-Lite peripheral that controls the Xilinx AXI DMA IP).
My goal is to capture all mapped pages, so I check /proc/pid/maps for mapped regions, then obtain PFNs from /proc/pid/pagemaps, pass the physical addresses into my device driver, and then pass them to my custom hardware (which invokes the Xilinx AXI DMA to obtain the contents from physical memory).
NOTE: I am using Xilinx's PetaLinux distribution, which is built on Linux version 4.14.
My device driver implements the following procedure through a series of IOCTL calls:
Stop the target process.
Perform virtual memory capture (using the access_process_vm() function).
Flush the cache (using the flush_user_range() function).
Perform physical memory capture.
Resume the target process.
What I'm noticing, however, is that the virtual memory capture and the physical memory capture differ in the [heap] section (which is the first section that extends past one page). The first page matches, but none of the other pages are even close. The [stack] section does not match at all. I should note that for the first two memory sections, .text and .rodata, the captures match exactly. The conclusion for now is that data that does not change during runtime matches between virtual and physical captures while data that does change during runtime does not match.
So this leaves me wondering: am I using the correct function to ensure coherency between the cache and the RAM? If not, what is the proper function to use to force a cache flush to RAM? It is necessary that the data in RAM is up-to-date to the point when the target process is stopped because I cannot access the caches from the custom hardware.
Edit 1:
In regards to this question being marked as a possible duplicate of this question, I am using a function from the accepted answer to initiate a cache flush. However, from my perspective, it doesn't seem to be working as the physical memory does not match the virtual memory as I would expect if a cache flush were occurring.
For anyone coming across this question in the future, the problem was not what I thought it was. The flush_user_range() function I mentioned is the correct function to use to push pages back to main memory from cache.
However, what I didn't think about at the time was that pages that are virtually contiguous are not necessarily (and are very often not) physically contiguous. In my kernel code, I passed the length of the mapped region to my hardware, which requested that length of data from the AXI DMA. What I should have done was a virtual-to-physical translation to get the physical address of each page in each region, then requested one page length of data from main memory, repeating that process for each page in each mapped region.
I realize this is a pretty specific problem that likely won't help anyone else doing the same thing I was doing, but hopefully the learned lesson can help someone in the future: Linux allocates pages in physical memory (that are usually 4kB in size, though you shouldn't assume that's the case), and a collection of physical pages is contained in one mapped region. If you're working on any code that requires you to check physical memory, be sure to be wary of where your data may cross physical page boundaries and act accordingly.

Is msync always really needed when writing to /dev/mem?

I am using mmap to open /dev/mem for read/write into UART registers.
It works well, but my question is:
After a write, is the msync() system call with the MS_SYNC flag really needed?
From my understanding, /dev/mem is a virtual device that provides access to physical memory zones (UART registers in my case) by translating virtual memory addresses, and so gives access to some physical memory from user space.
This is not a regular file, and I guess that modifications of registers are not buffered/cached. I would actually like to avoid this system call for performance reasons.
Thanks
My understanding is that msync() is needed to update the data in a normal file that is modified through a mapping created with mmap().
But when you use mmap() on /dev/mem you are not mapping a normal file on disk; you are just mapping the desired hardware memory range directly into your process's virtual address space, so msync() is beside the point and will do nothing.
The only thing that lies between your writes into your mmap'ed virtual space and the hardware device is the CPU cache. To deal with that you can force a cache flush (__clear_cache(), maybe?), but that is usually unnecessary because the kernel identifies memory-mapped device registers and disables caching for that range. On x86 CPUs that is usually done with MTRRs, but with ARM I don't know the details...
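For reference, the usual /dev/mem pattern looks roughly like this; UART_PHYS is a hypothetical register base, the program needs root, and the kernel must permit access to the range (CONFIG_STRICT_DEVMEM can block it). Note that opening /dev/mem with O_SYNC requests an uncached mapping, which sidesteps the cache question entirely:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define UART_PHYS 0x101f1000UL   /* hypothetical UART register base */
#define MAP_LEN   0x1000UL

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);  /* O_SYNC: ask for an uncached mapping */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, UART_PHYS);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    regs[0] = 'A';   /* write reaches the device directly; no msync() needed */

    munmap((void *)regs, MAP_LEN);
    close(fd);
    return 0;
}
```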

Using a large buffer for DMA on a linux x86_64 system

I am writing a device driver for a Fibre Channel card that will move large amounts of data. The card can act as PCI master and will DMA the data into system memory. This is on an x86_64 linux system running kernel 3.0.35.
I first tried allocating the buffers using kmalloc(), but found that I could not allocate buffers large enough. However, as a learning exercise, I continued the development using small buffers allocated with kmalloc(). I got the driver to work in this case: I allocated small buffers with kmalloc() (and the GFP_DMA flag), passed the address returned by kmalloc() to dma_map_single(), and wrote the DMA address it returned to the registers on the card.
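The working small-buffer flow described above can be sketched as follows; write_card_reg() and DMA_ADDR_REG are hypothetical stand-ins for the card's register interface, and error handling is trimmed:

```c
/* Sketch of the small-buffer flow: allocate with kmalloc(), map for
 * device access, hand the DMA address to the card. */
#include <linux/slab.h>
#include <linux/dma-mapping.h>

static int setup_dma_buffer(struct device *dev, size_t count)
{
    void *buf = kmalloc(count, GFP_KERNEL | GFP_DMA);
    dma_addr_t bus_addr;

    if (!buf)
        return -ENOMEM;

    bus_addr = dma_map_single(dev, buf, count, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, bus_addr)) {
        kfree(buf);
        return -EIO;
    }

    write_card_reg(DMA_ADDR_REG, bus_addr);  /* hypothetical register write */
    return 0;
}
```

The difficulty in the question is that this flow assumes the buffer came from the page allocator; memory stolen with mem= and mapped via ioremap() is outside what dma_map_single() is documented to accept.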
I am now trying to modify the driver to use large buffers - on the order of 200MB. I have reserved a block of memory at boot time using the mem= kernel parameter. I map the reserved memory using ioremap(). I have mmap-ed the memory to user space and I have verified that the user application can write to the mmap-ed region and that the driver can read the data using the virtual address returned by ioremap(). But I have not been able to get the board to DMA data to my buffers.
I have tried to map the region that I allocated with dma_map_single(). I first passed the virtual address returned from ioremap(). I thought this was the correct address because the DMA-API-HOWTO file says:
"the driver can give a virtual address X to
an interface like dma_map_single(), which sets up any required IOMMU
mapping and returns the DMA address Z. The driver then tells the device to
do DMA to Z, and the IOMMU maps it to the buffer at address Y in system
RAM."
But I got no data from the card. I have also tried passing dma_map_single() the physical address of the region, but that didn't work. I have tried writing the address returned by dma_map_single() to the card's registers, and I have tried writing the physical address to the registers. Neither of those efforts worked, either.
Is there a step that I'm missing? Is my procedure in this case completely wrong? Can a card even perform DMAs to a memory region like the one I have reserved on an x86_64?
There is a possibility that I could upgrade the system to a 3.12.28 kernel. I understand that that kernel supports CMA. Would CMA work in this case?
Any help is appreciated.
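Regarding the CMA question: on a CMA-enabled kernel (the 3.12 upgrade mentioned, with a region reserved via e.g. cma=256M on the command line), a large physically-contiguous buffer can come from dma_alloc_coherent(), which returns both a CPU address and a DMA address already suitable for the card. A rough sketch, with error handling trimmed:

```c
#include <linux/dma-mapping.h>

/* Sketch: with CMA configured, dma_alloc_coherent() can satisfy very
 * large allocations from the reserved contiguous region. The returned
 * pointer is CPU-usable and *bus_addr can be programmed into the card. */
static void *alloc_big_dma_buffer(struct device *dev, size_t size,
                                  dma_addr_t *bus_addr)
{
    return dma_alloc_coherent(dev, size, bus_addr, GFP_KERNEL);
}

/* Release with dma_free_coherent(dev, size, cpu_addr, bus_addr). */
```

This avoids both the mem= reservation and the ioremap()-into-dma_map_single() problem, since the API hands back addresses that are valid for DMA by construction.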

Write to a cacheable physical address in linux kernel without using ioremap or mmap

I am changing the Linux kernel scheduler to print the PID of the next process at a known physical memory location. mmap() is for user-space programs, while I have read that ioremap() marks the page as non-cacheable, which would slow down the execution of the program. I would like a fast way to write to a known physical address; phys_to_virt() is the option I think is feasible. Any ideas for a different technique?
PS: I am running this Linux kernel on top of QEMU. The physical address will be used by QEMU to read information sent by the guest kernel. Writing to a known I/O port is not feasible, since the device code backing this I/O device would be called every time there is an access to the device.
EDIT: I want the physical address location of the PID to be safe. How can I make sure that a physical address the kernel is using is not being assigned to any process? As far as my knowledge goes, ioremap() would mark the page as non-cacheable and would hence not be of great use.
The simplest way to do this would be to use kmalloc() to get some memory in the kernel. Then you can get the physical address of the pointer it returns by passing it to virt_to_phys(). This is a total hack, but for your case of debugging / tracing under QEMU, it should work fine.
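That hack can be sketched in a couple of lines; the pid_slot name is made up for illustration, and the scheduler-side store is just *pid_slot = next->pid through the normal cacheable linear mapping:

```c
#include <linux/slab.h>
#include <linux/printk.h>
#include <asm/io.h>

/* Sketch: grab a kernel word with kmalloc() and report its physical
 * address so the QEMU side knows where to look. pid_slot is hypothetical. */
static u32 *pid_slot;

static int __init pid_slot_init(void)
{
    phys_addr_t phys;

    pid_slot = kmalloc(sizeof(*pid_slot), GFP_KERNEL);
    if (!pid_slot)
        return -ENOMEM;

    phys = virt_to_phys(pid_slot);
    pr_info("pid slot at phys 0x%llx\n", (unsigned long long)phys);
    return 0;
}
```

Writes through pid_slot go through the ordinary cacheable mapping, so they are as fast as any other kernel store; QEMU reading guest RAM sees the same memory the guest caches are backed by.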
EDIT: I misunderstood the question. If you want to use a specific physical address, there are a couple of things you could do. Maybe the cleanest thing would be to modify the e820 map that qemu passes in to mark the RAM page as reserved, so the kernel won't use it (i.e. the same way that ACPI tables are passed in).
If you don't want to modify qemu, you could also modify the early kernel startup (around arch/x86/kernel/setup.c probably) to do reserve_bootmem() on the specific physical page you want to protect from being used.
To actually use the specified physical address, you can just use ioremap_cache() the same way the ACPI drivers access their tables.
It seems I misunderstood the cache coherency between the VM and host part; here is an updated answer.
What you want is a "virtual address in the VM" <-> "virtual or physical address in the QEMU address space" mapping.
You can either kmalloc() the variable, though its address may vary from instance to instance, or simply declare a global variable in the kernel.
Then virt_to_phys() would give you the physical address in VM space, and I suppose you can translate this into a QEMU address space. What do you mean by "a physical address that the kernel is using is not assigned to any process"? Are you afraid the page containing your variable might be swapped out? kmalloc'ed memory is not swappable.
Original (and wrong) answer
If the address where you want to write is in its own page, I can't see how an ioremap() of this page would slow down code executing in a different page.
You need a cache flush anyway, and without SSE, I can't see how you can bypass the cache while the MMU and cache are on. I can see only these two options:
ioremap and declare a particular page non cacheable
use a "normal" address, and manually do a cache flush each time you write.

Linux kernel device driver to DMA into kernel space

LDD3 (p:453) demos dma_map_single using a buffer passed in as a parameter.
bus_addr = dma_map_single(&dev->pci_dev->dev, buffer, count, dev->dma_dir);
Q1: What/where does this buffer come from?
kmalloc?
Q2: Why does DMA-API-HOWTO.txt state I can use raw kmalloc to DMA into?
From http://www.mjmwired.net/kernel/Documentation/DMA-API-HOWTO.txt
L:51 If you acquired your memory via the page allocator kmalloc() then you may DMA to/from that memory using the addresses returned from those routines.
L:74 you cannot take the return of a kmap() call and DMA to/from that.
So I can pass the address returned from kmalloc to my hardware device?
Or should I run virt_to_bus on it first?
Or should I pass this into dma_map_single?
Q3: When the DMA transfer is complete, can I read the data in the kernel driver via the kmalloc address?
u32 *addr = kmalloc(count, GFP_KERNEL);
/* ... DMA transfer completes ... */
printk("test result: 0x%08x\n", addr[0]);
Q4: What's the best way to get this to user-space?
copy_to_user?
mmap the kmalloc memory?
others?
kmalloc is indeed one source to get the buffer. Another can be alloc_page with the GFP_DMA flag.
The point is that the memory kmalloc() returns is guaranteed to be contiguous in physical memory, not just virtual memory, so you can give the bus address of that pointer to your hardware. You do need to call dma_map_single() on the returned address, which, depending on the exact platform, might be no more than a wrapper around virt_to_bus() or might do more (set up IOMMU or GART tables).
Correct, just make sure to follow cache coherency guidelines as the DMA guide explains.
copy_to_user() will work fine and is the easiest answer. Depending on your specific case it might be enough, or you might need something with better performance. You cannot normally map kmalloc'ed addresses to user space, but you can DMA into a user-provided address (some caveats apply) or allocate user pages (alloc_page() with GFP_USER).
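The copy_to_user() route can be sketched as a read() handler over the kmalloc'ed DMA buffer; dma_buf and dma_buf_len are assumptions standing in for state filled by the driver's DMA completion path:

```c
#include <linux/fs.h>
#include <linux/uaccess.h>

static u8 *dma_buf;        /* assumption: set up by the DMA completion path */
static size_t dma_buf_len; /* assumption: valid bytes in dma_buf */

/* Sketch: expose the DMA buffer through the driver's read() handler. */
static ssize_t mydev_read(struct file *filp, char __user *ubuf,
                          size_t count, loff_t *ppos)
{
    if (*ppos >= dma_buf_len)
        return 0;
    if (count > dma_buf_len - *ppos)
        count = dma_buf_len - *ppos;

    if (copy_to_user(ubuf, dma_buf + *ppos, count))
        return -EFAULT;

    *ppos += count;
    return count;
}
```

For large or frequent transfers, an mmap()-based path over page-allocator memory avoids the extra copy, at the cost of the mapping bookkeeping.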
Good luck!
