Linux PCIe DMA Driver (Xilinx XDMA) - linux

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device; instead, the userspace application simply requests data when it needs it). As such, the DMA transfer is built up, the data is transferred, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer where the data is meant to continuously flow into. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgot about the driver and can then later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and the CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I cannot keep up with the device!), since these operate on the entire buffer. Funnily enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be aware of? Can I use the "single" variants (i.e., pci_dma_sync_single_for_cpu()) via some translation mechanism so as not to sync the entire buffer? Alternatively, is there perhaps some function that can make a circular buffer allocated with vmalloc() coherent?

Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
1. If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
2. You don't need to sync the entire scatterlist. Instead, in my read() call, I now figure out which pages will be affected by the copy_to_user() call (i.e., what will be copied out of the circular buffer) and sync only the pages I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE), where sgl_index is where I figured the copy will start and pages_to_sync is the size of the data in pages.
With the above two changes my code now meets my throughput requirements.

I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.

Pete, I am the original developer of the driver code (before the X of XDMA came into place).
The ring buffer was always an unorthodox thing, indeed meant for cache-coherent systems, and disabled by default. Its initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (and zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

Related

Is it possible to drop down packets

I am trying to write some sort of very basic packet filtering in Linux (Ubuntu) user space.
Is it possible to drop packets in user space via a C program using a raw socket (AF_PACKET), without any kernel intervention (such as writing a kernel module) or netfilter?
Thanks a lot
Tali
It is possible (assuming I understand what you're asking). There are a number of "zero-copy" driver implementations that allow user space to obtain a large memory-mapped buffer into which (or from which) packets are directly DMA'd.
That pretty much precludes having the kernel process those same packets though (possible but very difficult to properly coordinate user-space packet sniffing with kernel processing of the same packets). But it's fine if you're creating your own IDS/IPS or whatever and don't need to "terminate" connections on the local machine.
It would definitely not be the standard AF_PACKET; you have to either create your own or use an existing implementation: look into netmap, DPDK, and PF_RING (maybe PF_RING/ZC? not sure). I worked on a couple of proprietary implementations in a previous career so I know it's possible.
The basic idea is either (1) completely duplicate everything the driver is responsible for -- that is, move the driver implementation completely into user space (DPDK basically does this). This is straightforward on paper, but it is a lot of work and makes the driver pretty much fully custom.
Or (2) modify driver source so that key network buffer allocation requests get satisfied with an address that is also mmap'd by the user-space process. You then have the problem of communicating buffer life-cycles / reference counts between user-space and kernel. That's very messy but can be done and is probably less work overall. (I dunno -- there may be a way to automate this latter method if you're clever enough -- I haven't been in this space in some years.)
Whichever way you go, there are several pieces you need to put together to do this right. For example, if you want really high performance, you'll need to use the adapter's "RSS" type mechanisms to split the traffic into multiple queues and pin each to a particular CPU -- then make sure the corresponding application components are pinned to the same CPU.
All that being said, unless your need is pretty severe, you're best staying with plain old AF_PACKET.
You can use iptable rules to drop packets for a given criteria, but dropping using packet filters is not possible, because the packet filters get a copy of the packet while the original packet flows through usual path.

Single buffer; multiple sockets; single syscall under Linux

Does Linux have any native kernel facility that enables send()ing a supplied buffer to a set of sockets? A sort-of vectored I/O, except for socket handles rather than for a buffer.
The objective is to reduce the number of user/kernel transitions in situations where, for example, you need to broadcast some state update to n clients, which would otherwise require iterating through each socket and sending.
One restriction is that TCP sockets must be supported (this is not under my control).
The answer is no: neither Linux nor POSIX systems have the call you want. I fear you wouldn't gain any advantage from having it, either, as each of the data streams follows a different path, which makes copying the buffers in the kernel no better than copying them in user space. Avoiding copies from user space to kernel space doesn't necessarily mean that doing the work in kernel mode is better.
Either way, on Linux you can implement this kind of mwrite() (or msend()) system call yourself, as you have the source code. But I'm afraid you'll get nothing but a headache for it. The only approach to such an implementation is some kind of copy-on-divert method, and I don't think you'd gain any advantage from it.
Finally, once you have finished the first write(2) call, the probability that the user buffer gets swapped out before the next call is far too low, so the second and subsequent copies of the buffer will have very low overhead, as the pages will all still be in core memory. You'd need a very heavily loaded system for the user buffer to be swapped out in the time between syscalls.

Control V4L2/VB2 Buffer Allocation?

I am trying to write a V4L2-compliant driver for a special camera device I have, but the device doesn't seem particularly friendly with V4L2's buffer system. Instead of separately allocated buffers, it wants a single contiguous block of memory capable of holding a set number of buffers (usually 4), and it then provides a status register telling you which is the latest (updated after each frame is DMA'd to the host). So it basically needs a single large DMA-allocated memory chunk to work with, not four most likely separate ones.
How can I use this with V4L? Everything I see about VIDIOC_CREATE_BUFS, VIDIOC_REQBUFS, and such does internal allocation of the buffers, and I can't get anything V4L-based (like qv4l2) to work without a successful QBUF and DQBUF that uses their internal structure.
How can this be done?
Just for completeness: I finally found a solution in the "meye" driver. I removed everything VB2 and wrote my own reqbuf, querybuf, qbuf, and dqbuf, along with my own mmap routines to handle the allocation. And it all works!

How to ensure read() reads data from the real device each time?

I'm periodically reading from a file and checking the readout to decide subsequent action. As this file may be modified by some mechanism that bypasses the block file I/O layer in the Linux kernel, I need to ensure that the read operation reads data from the real underlying device instead of from the kernel buffer.
I know fsync() can make sure all I/O write operations completed with all data written to the real device, but it's not for I/O read operations.
The file has to be kept opened.
So could anyone please kindly tell me how I can meet such a requirement on a Linux system? Is there an API similar to fsync() that can be called?
Really appreciate your help!
I believe that you want to use the O_DIRECT flag to open().
I think memory mapping in combination with madvise() and/or posix_fadvise() should satisfy your requirements... Linus contrasts this with O_DIRECT at http://kerneltrap.org/node/7563 ;-).
You are going to be in trouble if another device is writing to the block device at the same time as the kernel.
The kernel assumes that the block device won't be written by any other party than itself. This is true even if the filesystem is mounted readonly.
Even if you used direct IO, the kernel may cache filesystem metadata, so a change in the location of those blocks of the file may result in incorrect behaviour.
So in short - don't do that.
If you wanted, you could access the block device directly - which might be a more successful scheme, but still potentially allowing harmful race-conditions (you cannot guarantee the order of the metadata and data updates by the other device). These could cause you to end up reading junk from the device (if the metadata were updated before the data). You'd better have a mechanism of detecting junk reads in this case.
I am, of course, assuming some very simple braindead filesystem such as FAT. That might reasonably be implemented in userspace (mtools, for instance, does this).

Is forcing I2C communication safe?

For a project I'm working on I have to talk to a multi-function chip via I2C. I can do this from linux user-space via the I2C /dev/i2c-1 interface.
However, it seems that a driver is talking to the same chip at the same time. This results in my I2C_SLAVE accesses failing with an errno value of EBUSY. Well, I can override this via the ioctl I2C_SLAVE_FORCE. I tried it, and it works. My commands reach the chip.
Question: Is it safe to do this? I know for sure that the address ranges I write to are never accessed by any kernel driver. However, I am not sure whether forcing I2C communication that way may confuse some internal state machine or so. (I'm not that into I2C, I just use it...)
For reference, the hardware facts:
OS: Linux
Architecture: TI OMAP3 3530
I2C-Chip: TWL4030 (does power, audio, usb and lots of other things..)
I don't know that particular chip, but often you have commands that require a sequence of writes, first to one address to set a certain mode, then you read or write another address -- where the function of the second address changes based on what you wrote to the first one. So if the driver is in the middle of one of those operations, and you interrupt it (or vice versa), you have a race condition that will be difficult to debug. For a reliable solution, you better communicate through the chip's driver...
I mostly agree with @Wim. But I would like to add that this can definitely cause irreversible problems, or destruction, depending on the device.
I know of a Gyroscope (L3GD20) that requires that you don't write to certain locations. The way that the chip is setup, these locations contain manufacturer's settings which determine how the device functions and performs.
This may seem like an easy problem to avoid, but if you think about how I2C works, all of the bytes are passed one bit at a time. If you interrupt in the middle of the transmission of another byte, the results are not only truly unpredictable, but you also greatly increase the risk of permanent damage. This is, of course, entirely up to the chip and how it handles the problem.
Since microcontrollers tend to operate at speeds much faster than the bus speeds allowed on I2C, and since the bus speeds themselves are dynamic based on the speeds at which devices process the information, the best bet is to insert pauses or loops between transmissions that wait for things to finish. If you have to, you can even insert a timeout. If these pauses aren't working, then something is wrong with the implementation.
