I want to write a kernel module that can transfer data via DMA from RAM to RAM. There are some posts that discuss this, but I don't really get it: some say it is possible, others say it isn't.
If I understood LDD3 correctly, RAM-to-RAM copying isn't possible with the Linux DMA API, but drivers/dma/dmaengine.c provides a DMA_MEMCPY flag as a "DMA transfer type", so there should be a way.
Is this correct? Can I use a DMA engine to transfer data from one RAM address to another?
If it is hardware-dependent, how can I determine whether my system supports DMA memcpy?
As you correctly pointed out, DMA_MEMCPY should be used to perform a RAM-to-RAM copy. It is described in Documentation/dmaengine/provider.txt. Here is a related excerpt; please see the whole file for more details:
Supported transaction types
The next thing you need is to set which transaction types your device
(and driver) supports.
Our dma_device structure has a field called cap_mask that holds the
various types of transaction supported, and you need to modify this
mask using the dma_cap_set function, with various flags depending on
transaction types you support as an argument.
All those capabilities are defined in the dma_transaction_type enum,
in include/linux/dmaengine.h
Currently, the types available are:
DMA_MEMCPY
The device is able to do memory to memory copies
Just to summarize:
It depends on your DMA controller. Some are able to do RAM-to-RAM transactions, some aren't.
For example, the DMA controller on OMAP-based SoCs does (see the drivers/dma/omap-dma.c file, in the omap_dma_probe() function):
dma_cap_set(DMA_MEMCPY, od->ddev.cap_mask);
This way you can later check in your driver whether your DMA controller is capable of RAM-to-RAM transactions. See how it's done in drivers/dma/dmatest.c, in the dmatest_add_channel() function:
if (dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask)) {
If you need an example of how to use the DMA API to perform a RAM-to-RAM transaction, please see drivers/dma/dmatest.c.
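For a rough idea of what the client side looks like, here is a minimal sketch modelled on dmatest.c: it requests any channel advertising DMA_MEMCPY and submits one copy. It assumes the caller has already DMA-mapped both buffers (e.g. with dma_map_single()), error handling is trimmed, and note that DMA_COMPLETE was named DMA_SUCCESS before kernel 3.13.

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

/* Sketch: copy 'len' bytes between two already-DMA-mapped buffers
 * using any channel that advertises DMA_MEMCPY. */
static int ram_to_ram_copy(dma_addr_t dst, dma_addr_t src, size_t len)
{
	dma_cap_mask_t mask;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *tx;
	dma_cookie_t cookie;

	/* Ask the dmaengine core for a memcpy-capable channel */
	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);
	chan = dma_request_channel(mask, NULL, NULL);
	if (!chan)
		return -ENODEV;	/* no controller supports RAM-to-RAM */

	/* Prepare one transfer, the same way dmatest.c does */
	tx = chan->device->device_prep_dma_memcpy(chan, dst, src, len,
						  DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
	if (!tx) {
		dma_release_channel(chan);
		return -ENOMEM;
	}

	cookie = dmaengine_submit(tx);
	dma_async_issue_pending(chan);

	/* Poll until done; fine for a demo, use a callback in real code */
	if (dma_sync_wait(chan, cookie) != DMA_COMPLETE) {
		dma_release_channel(chan);
		return -EIO;
	}

	dma_release_channel(chan);
	return 0;
}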
On the ARM Cortex-A9 that forms part of the Zynq SoC I'm using, regions of memory are labelled as "normal", "device" or "strongly ordered". This is described in the Zynq technical reference manual, but I understand it is a property of ARM processors more generally. Obviously, the ability to have strongly ordered memory accesses for memory-mapped devices (which includes many in FPGA fabric) should simplify the software somewhat, so it is desirable to set up.
I'm using the UIO driver for mapping the device memory into userspace, in which the bulk of the driver runs. According to this reference the UIO driver sets up its mapped memory as "device/strongly ordered". Unfortunately, this is the only reference I can find to this, and before I start ripping out memory fences from my code, I'd like to have a little more confidence about what is going on.
It's not clear to me currently how the Linux kernel denotes memory regions of a particular type. It seems to me that the MT_* properties denote something along these lines, but I can't find the definitions of each type. Nor can I work out how the UIO driver specifies the particular memory.
Any pointers about how the memory properties are set in Linux, either in general terms or ideally with reference to UIO would be exceptionally helpful. I'm happy to have that in the form of a pointer to documentation.
There are a few parts to this.
On ARMv7, which includes the Zynq-7000, memory is denoted as a given type by the memory region attributes configured through the translation table descriptors. There are various ways to configure these and the mechanisms are described in section B3.8 of the ARM Architecture Reference Manual ARMv7-A. Also useful is the Zynq technical reference manual, which is less complete as regards the ARM, but easier to process.
Broadly, the bits of interest are the B (bufferable) bit, the C (cacheable) bit and the 3 TEX (type extension) bits. These may be set directly, or through a redirection when SCTLR.TRE bit is set (which effectively allows a custom remapping using the PRRR and NMRR registers - there may be more to it than that, but I can't immediately see what).
The translation table descriptors are set up in Linux in the memory-management unit (MMU) subsystem. This is obviously very architecture specific, and the relevant ARM bits are found in arch/arm/mm. It's interesting to look through mmu.c to see how different memory attributes are configured on different types.
What follows is a little more speculative, but I think is accurate.
The core UIO driver sets up its relevant memory protections on a physical device by calling pgprot_noncached(). Now, I think this is delegated to the architecture-specific implementation, which in the case of ARMv7 is a macro defined in arch/arm/include/asm/pgtable.h, and resolves to setting the L_PTE_MT_UNCACHED flag.
The L_PTE_MT_UNCACHED constant is in turn set in arch/arm/include/asm/pgtable-2level.h. There is a nice bit of documentation in that file that describes what the various constants represent. The value for each type is remapped to the B, C and TEX bits, either through the TRE redirection or through a look-up table configured in arch/arm/mm/proc-macros.S. The TRE redirection registers (PRRR and NMRR) are, I think, configured in arch/arm/mm/proc-v7-2level.S. If you track those through, you get the same values as the look-up table (which references constants defined in arch/arm/include/asm/pgtable-2level-hwdef.h - note that those constants are for a small page table descriptor, distinct from the ones used in mmu.c).
Where does this leave us? The UIO driver configuring a piece of memory as pgprot_noncached() implies it is L_PTE_MT_UNCACHED, which in turn implies TEX = 000, B = 0 and C = 0. Looking those settings up in the reference manual, we see this corresponds to an unbuffered, shareable, strongly ordered memory region (denoted "Strongly-ordered").
It's apparent that to change this (perhaps to allow buffered writes), we'd need to modify the driver's configuration of the memory to use e.g. pgprot_writecombine(), which would set the memory type to normal but with the cache off. There is also a pgprot_device() macro, which also sets up the device memory type (bufferable) with the cache off, but with some additional flags I haven't understood properly yet (I think they are for configuring a software "Linux" version of the page table entry for the case where the hardware doesn't support it, so they aren't relevant when it does).
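To make the hook point concrete, here is an illustrative mmap() handler showing where each pgprot_*() helper would slot in; it mirrors what the UIO core does for physical memory, but the function and the MY_DEVICE_PHYS base address are placeholders, not the actual UIO code.

#include <linux/mm.h>

#define MY_DEVICE_PHYS 0x43c00000UL	/* placeholder physical base */

/* Illustrative mmap(): the memory type seen by user space is chosen
 * by which pgprot_*() helper is applied before remapping. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* Strongly ordered (TEX=000, C=0, B=0) -- what UIO uses: */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	/* Alternatives discussed above:
	 *   vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
	 *   vma->vm_page_prot = pgprot_device(vma->vm_page_prot);
	 */

	return remap_pfn_range(vma, vma->vm_start,
			       MY_DEVICE_PHYS >> PAGE_SHIFT,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}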
I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are fairly random events (i.e., there isn't a continuous flow of data out of the device; instead, the userspace application simply requests data when it needs it). As such, the DMA transfer is built up, the data is transferred, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer into which the data is meant to flow continuously. This buffer is generally sized to be somewhat large (mine is on the order of 32 MB), since you want to be able to ride out transient periods where the userspace application neglects the driver and can later work off the backlog of incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I cannot keep up with the device!), since these work on the entire buffer. Funnily enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to use my own buffer allocated with pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected around scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be made aware of? Can I use the "single" variants (i.e., pci_dma_sync_single_for_cpu()) via some translation mechanism to avoid syncing the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?
Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.
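To make the second point concrete, here is a sketch of what the reworked read() path boils down to; the struct and names are illustrative stand-ins for the XDMA driver's state, not its literal code.

#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/uaccess.h>

struct ring {				/* illustrative driver state */
	struct pci_dev *pdev;
	struct scatterlist *sgl;	/* maps the vmalloc'd ring, one page per entry */
	void *vaddr;			/* kernel virtual base of the ring */
};

static ssize_t ring_read(struct ring *r, char __user *buf,
			 size_t count, loff_t off)
{
	size_t first_page = off >> PAGE_SHIFT;
	size_t pages = DIV_ROUND_UP((off & ~PAGE_MASK) + count, PAGE_SIZE);

	/* Sync only the pages copy_to_user() will actually touch */
	pci_dma_sync_sg_for_cpu(r->pdev, &r->sgl[first_page],
				pages, DMA_FROM_DEVICE);

	if (copy_to_user(buf, r->vaddr + off, count))
		return -EFAULT;

	/* No sync_for_device: the CPU never writes into this buffer */
	return count;
}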
I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
Pete, I am the original developer of the driver code (before the X of XDMA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems and disabled by default. Its initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?
I have a kernel driver which currently reads data from a sensor.
Now I have to write a user space application which will call the kernel's sensor_read() API and send the data to the cloud.
How can I expose the kernel's sensor_read() call to user space and read this data from there? The data is about 10 bytes.
How can I expose the kernel's sensor_read() call to user space and read this data from there?
Most likely you should use the IIO kernel framework, as it's specifically designed for writing sensor drivers. IIO exposes the necessary files for your driver (in /sys/bus/iio/ and /dev/iio*). You can read() those files, or poll() them (to handle interrupts).
Official documentation is available here. You can also use some existing drivers as a reference; look here: drivers/iio/.
Before the IIO framework was introduced, it was common to provide sysfs files for drivers manually. So if you use an old enough kernel, that would be the way to write the driver: handle your bus (like I2C) and the sysfs files manually. But still, the best way is to use a recent kernel and IIO.
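From user space, reading an IIO channel is then just reading a file; a tiny example (the device index and channel name are examples, check /sys/bus/iio/devices/ on your system):

#include <stdio.h>

int main(void)
{
	/* Path is an example; look under /sys/bus/iio/devices/ */
	FILE *f = fopen("/sys/bus/iio/devices/iio:device0/in_accel_x_raw", "r");
	long raw;

	if (!f)
		return 1;
	if (fscanf(f, "%ld", &raw) == 1)
		printf("raw sample: %ld\n", raw);
	fclose(f);
	return 0;
}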
I am working on a gyro + accel sensor. The Linux driver will be sending events of type EV_MSC for both of them.
It's not unusual for a chip to have more than one sensor on it. In that case you should create two different drivers: one for the accelerometer, and one for the gyro. This way you will have two different files, one file per sensor.
For example, look how it's done for LSM330DLC chip (accelerometer + gyro):
accel driver: drivers/iio/accel/st_accel_core.c
gyro driver: drivers/iio/gyro/st_gyro_core.c
Both drivers call the iio_device_register() function from their probe functions, which creates the corresponding files (which you can read/poll). Refer to the documentation for further details.
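Schematically, such a driver's probe boils down to allocating an iio_dev, filling in its channels and ops, and registering it. A condensed, hypothetical sketch (all my_* names are placeholders; see st_accel_core.c for the real thing):

#include <linux/module.h>
#include <linux/i2c.h>
#include <linux/iio/iio.h>

/* One raw acceleration channel, readable via sysfs */
static const struct iio_chan_spec my_channels[] = {
	{
		.type = IIO_ACCEL,
		.channel = 0,
		.info_mask_separate = BIT(IIO_CHAN_INFO_RAW),
	},
};

static int my_read_raw(struct iio_dev *indio_dev,
		       struct iio_chan_spec const *chan,
		       int *val, int *val2, long mask)
{
	*val = 0;	/* placeholder: fetch the sample from hardware here */
	return IIO_VAL_INT;
}

static const struct iio_info my_info = {
	.read_raw = my_read_raw,	/* serves sysfs reads */
};

static int my_probe(struct i2c_client *client,
		    const struct i2c_device_id *id)
{
	struct iio_dev *indio_dev;

	indio_dev = devm_iio_device_alloc(&client->dev, 0 /* no priv state */);
	if (!indio_dev)
		return -ENOMEM;

	indio_dev->name = "my-accel";
	indio_dev->info = &my_info;
	indio_dev->channels = my_channels;
	indio_dev->num_channels = ARRAY_SIZE(my_channels);
	indio_dev->modes = INDIO_DIRECT_MODE;

	/* Creates /sys/bus/iio/devices/iio:deviceN for this sensor */
	return devm_iio_device_register(&client->dev, indio_dev);
}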
As per my understanding, I will open both input devices from user space and add them to the list of FDs that we want to poll. So when there is a new event, how can I determine whether it came from the gyro or the accel?
If you are curious about how to handle two /dev/input/event* files in user space, you basically have two choices:
using blocking I/O: you can read/poll them in different threads
using non-blocking I/O: you can open those files with O_NONBLOCK and just read() them in a single thread; if new data is not available yet, read() will return -1 and errno will be set to EAGAIN; you can do those reads in an infinite loop, for instance
This answer contains an example of how to handle an input file in C. And here you can read about blocking/non-blocking I/O.
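For the poll()-based approach you describe, telling the sensors apart is just a matter of which file descriptor became readable; a small sketch (the event node paths are examples):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/input.h>

int main(void)
{
	struct pollfd fds[2];
	struct input_event ev;
	int i;

	fds[0].fd = open("/dev/input/event3", O_RDONLY);	/* accel (example) */
	fds[1].fd = open("/dev/input/event4", O_RDONLY);	/* gyro (example) */
	fds[0].events = fds[1].events = POLLIN;

	for (;;) {
		if (poll(fds, 2, -1) <= 0)
			break;
		for (i = 0; i < 2; i++) {
			/* The ready fd identifies the source sensor */
			if ((fds[i].revents & POLLIN) &&
			    read(fds[i].fd, &ev, sizeof(ev)) == sizeof(ev))
				printf("%s: type %u code %u value %d\n",
				       i ? "gyro" : "accel",
				       ev.type, ev.code, ev.value);
		}
	}
	return 0;
}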
There are many ways to access sensor data from user space:
Check the relevant driver for the sensor you're using, and whether it provides sysfs support.
If so, you can read the data from the /sys/class/ interfaces. You need to ensure that the relevant parameters are exported to sysfs.
For example, temperature sensors should export temperature values (or equivalent factors) as sysfs entries.
Examples (the examples below are fictitious):
cat /sys/class/hwmon/tempsensor/value
cat /sys/class/hwmon/tempsensor/min_value
cat /sys/class/hwmon/tempsensor/max_value
In some drivers, you can instead read/write sensor data through the ioctl/read/write APIs.
You can use ioctl (or read) calls to access kernel functions from user space.
Refer to the link below for a sample:
http://www.tldp.org/LDP/lkmpg/2.4/html/x856.html
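As a sketch of that route: a misc character device whose read() wraps the existing call. sensor_read() here stands for the in-kernel function from the question, and its exact signature is assumed; everything else is placeholder scaffolding.

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>

extern int sensor_read(u8 *buf, size_t len);	/* assumed signature */

static ssize_t sensor_dev_read(struct file *filp, char __user *buf,
			       size_t count, loff_t *ppos)
{
	u8 data[10];
	int len = sensor_read(data, sizeof(data));

	if (len < 0)
		return len;
	if (count > (size_t)len)
		count = len;
	if (copy_to_user(buf, data, count))
		return -EFAULT;
	return count;
}

static const struct file_operations sensor_fops = {
	.owner = THIS_MODULE,
	.read  = sensor_dev_read,
};

static struct miscdevice sensor_miscdev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "mysensor",		/* appears as /dev/mysensor */
	.fops  = &sensor_fops,
};
/* call misc_register(&sensor_miscdev) from your module init */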
I have a PCIe model written in SystemVerilog, although I think this question is language-agnostic. The model performs PCIe configuration reads and writes and memory reads and writes perfectly in simulation. However, what I need to do is "discover" my PCIe device and configure my config space registers in simulation. Is there a boilerplate chunk of pseudocode that represents the Linux PCIe enumeration process, to which I can just add my own model's transaction functions, so that I get a "bus walk", followed by BAR programming, SR-IOV enable if discovered, and MSI-X config? It seems like this would be a common exercise for a PCIe device, so maybe there is a model.
It isn't terribly difficult to do. Basically you loop through the config space, checking for each possible device on root bus 0. When a device is found, you allocate a memory space for it based on its requested size and program the BARs accordingly. If you find any bridges, you also configure and enable them - the basic bridge registers for this are standard. This includes assigning the upstream and downstream bus numbers, which then allows you to enumerate the new downstream bus, and so on.
I had to do this once to access a PCI I/O card on a system that had no OS or other software environment. It wasn't too bad and that was across two bridges from two vendors, as well as the I/O card registers and the CPU bus root bridge setup. This was PCI, not PCIe, but it would be very much the same. You could even do it with completely hard-coded numbers if the hardware never changed, but in my case there were a couple variants so I actually had to do some simple enumeration to find the device numbers dynamically. One gotcha is that you may have to delay a bit, or retry, to give all the devices time to come online before you try to access them.
In doing that I found this book to be invaluable: PCI System Architecture (4th Edition). I notice there is also a version for PCIe: PCI Express System Architecture (1st Edition). I would definitely get one of those if you haven't already. These books contain detailed algorithms and explanations about how to do all of this. At the time I didn't really use or refer to any code to speak of, but...
The best code resource I have found is U-Boot. It operates at a similarly low level, is totally self-contained, and is still fairly small and as simple as possible. For example, the enumeration appears to start with the function pci_init(), which calls a board-specific pci_xxx_init(). This then sets up the root bridge and calls pci_hose_scan_bus() in drivers/pci/pci.c to do the real work. Also check out the routines in drivers/pci/pci_auto.c, as well as the rest of the folder.
For your task you probably only need a very small subset and could just hack out parts of these files into a simple driver. Basically a for() loop and some pci_read/write_config() calls with logic to recognize your device and bridge IDs.
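As a starting point, here is a skeleton of the single-bus case in C; cfg_read32()/cfg_write32() are stand-ins for your model's configuration transactors, the MMIO window base is arbitrary, and 64-bit BARs, bridge recursion and the SR-IOV/MSI-X capability walk are left at the marked spots.

#include <stdint.h>

#define PCI_VENDOR_ID   0x00
#define PCI_COMMAND     0x04
#define PCI_BAR0        0x10

uint32_t cfg_read32(int bus, int dev, int fn, int off);	/* your model */
void cfg_write32(int bus, int dev, int fn, int off, uint32_t val);

static uint32_t next_mem = 0x80000000u;	/* arbitrary MMIO window base */

void scan_bus0(void)
{
	int dev, bar;

	for (dev = 0; dev < 32; dev++) {
		uint32_t id = cfg_read32(0, dev, 0, PCI_VENDOR_ID);
		if ((id & 0xffff) == 0xffff)
			continue;		/* no device here */

		/* Size and program each 32-bit memory BAR */
		for (bar = 0; bar < 6; bar++) {
			int off = PCI_BAR0 + 4 * bar;
			uint32_t probe, size;

			cfg_write32(0, dev, 0, off, 0xffffffff);
			probe = cfg_read32(0, dev, 0, off);
			if (probe == 0 || (probe & 1))
				continue;	/* unimplemented or I/O BAR */
			size = ~(probe & ~0xfu) + 1;
			next_mem = (next_mem + size - 1) & ~(size - 1);	/* align */
			cfg_write32(0, dev, 0, off, next_mem);
			next_mem += size;
		}

		/* Enable memory decode and bus mastering */
		cfg_write32(0, dev, 0, PCI_COMMAND,
			    cfg_read32(0, dev, 0, PCI_COMMAND) | 0x6);

		/* If the header type (offset 0x0e) marks a bridge, assign
		 * secondary/subordinate bus numbers and recurse; walk the
		 * capability list here for SR-IOV and MSI-X setup. */
	}
}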
I'm working on Linux kernel version 2.6.39.1, and am developing a block device driver. In this regard, I want to combine multiple struct bios into a single struct request, which is then added to the request_queue for processing by the device driver, namely -- scsi_request_fn().
I tried using the ->bi_next field of struct bio to link multiple struct bios that I have composed, thereby creating a linked list of struct bios. When I call submit_bio() to submit a bio to the block device layer for I/O, this BUG_ON() is triggered because the code expects bio->bi_next to be NULL.
Is there a way to link several struct bios into a single struct request before sending it to lower layers for servicing?
I'm not sure how to string multiple struct bio together, but you might want to take a look at the "task collector" implementation in libsas and the aic94xx driver for an alternate approach. There isn't much documentation, but the libsas documentation describes it as
Some hardware (e.g. aic94xx) has the capability to DMA more
than one task at a time (interrupt) from host memory. Task
Collector Mode is an optional feature for HAs which support
this in their hardware. (Again, it is completely optional
even if your hardware supports it.)
In Task Collector Mode, the SAS Layer would do natural
coalescing of tasks and at the appropriate moment it would
call your driver to DMA more than one task in a single HA
interrupt. DMBS may want to use this by insmod/modprobe
setting the lldd_max_execute_num to something greater than 1.
Effectively, this lets the block layer (a.k.a. BIO) remain unchanged, but multiple requests are accumulated at the driver layer and submitted together.
Thanks for the reply, @ctuffli. I've decided to use a structure similar to the one described here. Basically, I allocate a struct packet_data which contains pointers to all the struct bios that should be merged to form one single struct bio (and later on, one single struct request). In addition, I store some driver-related information in this struct packet_data as well. Next, I allocate a new struct bio (let's call it "merged_bio"), copy all the pages from the list of original BIOs, and then make merged_bio->bi_private point to the struct packet_data. This last hack allows me to keep track of the list of original BIOs, and also to call bio_endio() to end I/O on all the individual BIOs once the merged_bio has been successfully transferred.
Not sure if this is the smartest way to do this, but it does what I intended! :^)
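For reference, a condensed sketch of that scheme against the 2.6.39-era bio API; the names and the MAX_MERGED bound are illustrative, not the actual driver code.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/slab.h>

#define MAX_MERGED 16			/* arbitrary illustrative bound */

struct packet_data {
	struct bio *orig[MAX_MERGED];	/* the bios being merged */
	int nr_orig;
};

/* Completion for the merged bio: finish every original bio */
static void merged_end_io(struct bio *merged, int err)
{
	struct packet_data *pkt = merged->bi_private;
	int i;

	for (i = 0; i < pkt->nr_orig; i++)
		bio_endio(pkt->orig[i], err);
	bio_put(merged);
	kfree(pkt);
}

static void submit_merged(struct packet_data *pkt, struct block_device *bdev,
			  sector_t sector, int nr_pages)
{
	struct bio *merged = bio_alloc(GFP_NOIO, nr_pages);
	int i, j;

	merged->bi_bdev    = bdev;
	merged->bi_sector  = sector;
	merged->bi_private = pkt;
	merged->bi_end_io  = merged_end_io;

	/* Re-add every page of every original bio to the merged one */
	for (i = 0; i < pkt->nr_orig; i++) {
		struct bio *src = pkt->orig[i];
		for (j = 0; j < src->bi_vcnt; j++)
			bio_add_page(merged, src->bi_io_vec[j].bv_page,
				     src->bi_io_vec[j].bv_len,
				     src->bi_io_vec[j].bv_offset);
	}

	submit_bio(WRITE, merged);	/* one bio -> one request downstream */
}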