Control V4L2/VB2 Buffer Allocation? - linux

I am trying to write a V4L2-compliant driver for a special camera device I have, but the device doesn't seem particularly friendly with V4L2's buffer system. Instead of separately allocated buffers, it wants a single contiguous block of memory capable of holding a set number of buffers (usually 4), and it then provides a status register telling you which is the latest (updated after each frame is DMA'ed to the host). So it basically needs only a single large DMA-allocated memory chunk to work with, not 4 most likely separate ones.
How can I use this with V4L? Everything I see about VIDIOC_CREATE_BUFS, VIDIOC_REQBUFS and such does internal allocation of the buffers, and I can't get anything V4L-based (like qv4l2) to work without a successful QBUF and DQBUF that uses their internal structure.
How can this be done?

Just for completeness, I finally found a solution in the "meye" driver. I removed everything VB2 and wrote my own reqbuf, querybuf, qbuf, and dqbuf, along with my own mmap routines to handle the allocation. And it all works!
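To illustrate the layout described above, here is a minimal user-space sketch of how frame slots can be addressed inside one contiguous block. It is not taken from the meye driver or from the poster's code: the struct, the function names, and the meaning of the status register value (index of the most recently completed slot) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous allocation split into slots. */
struct frame_ring {
    uint8_t *base;       /* start of the single contiguous block */
    size_t   frame_size; /* bytes reserved per frame slot */
    unsigned nframes;    /* number of slots (4 in the question) */
};

/* Byte offset of slot `index` from the start of the block. A hand-written
 * querybuf could report an offset like this for a later mmap() call. */
size_t frame_offset(const struct frame_ring *r, unsigned index)
{
    assert(r != NULL && index < r->nframes);
    return (size_t)index * r->frame_size;
}

/* Address of the newest frame, assuming (hypothetically) that the status
 * register holds the index of the slot the device finished last. */
uint8_t *latest_frame(const struct frame_ring *r, unsigned status_reg)
{
    return r->base + frame_offset(r, status_reg % r->nframes);
}
```

In a real driver the block would come from a DMA allocation and the status value from a register read; here both are plain parameters so the arithmetic can be checked in isolation.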

Related

Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device, instead the userspace application simply requests data when it needs to). As such, the DMA transfer is built up, the data is transferred, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer where the data is meant to continuously flow into. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgot about the driver and can then later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I can not keep up with the device!), since this works on the entire buffer. Funny enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be made aware of? Can I use the "single" variants (i.e., pci_dma_sync_single_for_cpu()) via some translation mechanism to not sync the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?
Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.
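The page-range calculation behind the second point can be sketched in plain C. This is only an illustration, not the poster's code: `struct page_span` and `pages_touched` are hypothetical names, and the sketch assumes the circular buffer starts on a page boundary and has one scatterlist entry per page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: first page index and number of pages to sync. */
struct page_span {
    size_t first;
    size_t count;
};

/* Which pages does the byte range [offset, offset + len) touch?
 * `offset` is measured from the start of the circular buffer. */
struct page_span pages_touched(size_t offset, size_t len, size_t page_size)
{
    struct page_span s;

    assert(page_size > 0 && len > 0);
    s.first = offset / page_size;
    /* Index of the page holding the last byte, minus the first, plus one. */
    s.count = (offset + len - 1) / page_size - s.first + 1;
    return s;
}
```

The two fields would play the role of sgl_index and pages_to_sync in the sync call quoted above. A copy that wraps around the end of the ring would need two such ranges, one up to the end of the buffer and one from its start.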
I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
Pete, I am the original developer of the driver code (before the X of XDMA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems and disabled by default. Its initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

Is it possible to drop down packets

I am trying to write some sort of very basic packet filtering in Linux (Ubuntu) user space.
Is it possible to drop packets in user space from a C program using a raw socket (AF_PACKET), without any kernel intervention (such as writing a kernel module) and without net filtering?
Thanks a lot
Tali
It is possible (assuming I understand what you're asking). There are a number of "zero-copy" driver implementations that allow user-space to obtain a large memory-mapped buffer into which (/ from which) packets are directly DMA'd.
That pretty much precludes having the kernel process those same packets though (possible but very difficult to properly coordinate user-space packet sniffing with kernel processing of the same packets). But it's fine if you're creating your own IDS/IPS or whatever and don't need to "terminate" connections on the local machine.
It would definitely not be the standard AF_PACKET; you have to either create your own or use an existing implementation: look into netmap, DPDK, and PF_RING (maybe PF_RING/ZC? not sure). I worked on a couple of proprietary implementations in a previous career so I know it's possible.
The basic idea is either (1) completely duplicate everything the driver is responsible for -- that is, move the driver implementation completely into user space (DPDK basically does this). This is straight-forward on paper, but is a lot of work and makes the driver pretty much fully custom.
Or (2) modify driver source so that key network buffer allocation requests get satisfied with an address that is also mmap'd by the user-space process. You then have the problem of communicating buffer life-cycles / reference counts between user-space and kernel. That's very messy but can be done and is probably less work overall. (I dunno -- there may be a way to automate this latter method if you're clever enough -- I haven't been in this space in some years.)
Whichever way you go, there are several pieces you need to put together to do this right. For example, if you want really high performance, you'll need to use the adapter's "RSS" type mechanisms to split the traffic into multiple queues and pin each to a particular CPU -- then make sure the corresponding application components are pinned to the same CPU.
All that being said, unless your need is pretty severe, you're best staying with plain old AF_PACKET.
You can use iptables rules to drop packets matching given criteria, but dropping with packet filters is not possible, because the packet filter gets a copy of the packet while the original packet continues along the usual path.

Single buffer; multiple sockets; single syscall under Linux

Does Linux have any native kernel facility that enables send()ing a supplied buffer to a set of sockets? A sort-of vectored I/O, except for socket handles rather than for a buffer.
The objective being to reduce the number of u/k transitions involved in situations where-in for example you need to broadcast some state update to n clients which would require iterating through each socket and sending.
One restriction is that TCP sockets must be supported (not under my control)
The answer is no: neither Linux nor POSIX has the call you want. I also doubt you would gain anything from having it, since each data stream follows a different path, so the buffer would still have to be copied once per socket, only in the kernel rather than in user space. Avoiding repeated user-to-kernel copies doesn't necessarily mean that doing the work in kernel mode is better.
Either way, on Linux you could implement this kind of mwrite (or msend) system call yourself, as you have the source code. But I'm afraid you won't get anything but a headache. The only approach to this implementation is some kind of copy-on-divert method, and I don't think you'll get any advantage from it.
Finally, once the first write(2) call has finished, the probability of having to swap the user buffer back in for the next call is very low, so the second and subsequent copies of the buffer have very little overhead, as the pages will all be in core memory. You need a very heavily loaded system for the user buffer to be swapped out in the time between syscalls.
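For reference, the plain user-space loop that the question wants to avoid can be sketched as follows. The helper name `broadcast_buf` is made up for this example, and it assumes blocking sockets, so one slow client stalls the rest.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Send the same buffer to every socket in `fds`, one send() loop per socket.
 * Returns how many sockets accepted the whole buffer. */
size_t broadcast_buf(const int *fds, size_t nfds, const void *buf, size_t len)
{
    size_t ok = 0;

    assert(fds != NULL && buf != NULL);
    for (size_t i = 0; i < nfds; i++) {
        size_t sent = 0;

        while (sent < len) {
            /* MSG_NOSIGNAL: report a closed peer as an error, not SIGPIPE. */
            ssize_t n = send(fds[i], (const char *)buf + sent,
                             len - sent, MSG_NOSIGNAL);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted by a signal, retry */
                break;          /* give up on this socket only */
            }
            sent += (size_t)n;
        }
        if (sent == len)
            ok++;
    }
    return ok;
}
```

This costs at least one system call per socket, which is exactly the overhead discussed above; the user buffer itself stays resident between calls.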

Why are these special device file reads a minimum of PAGE SIZE bytes?

I am coding my 2nd kernel module ever. I am attempting to provide user-space access to a firmware core, as a demo. The demo is under petalinux (an embedded OS specifically tailored to Zynq or Microblaze). I added virtual file system hooks to go between user space and the kernel module, and it seems to work, both on read and write. The only hiccup is that, somewhere between my user application and my kernel module, the OS balloons the size of my request up to PAGE SIZE (4096).
A co-worker commented that I might be mounting the module as a block device rather than a character device. This makes a lot of sense. Someone upstream of my module is certainly caching my results (which, if my understanding of block drivers is accurate, would make perfect sense for, say, the hard drive), but we're tied to a volatile device, so this isn't appropriate. But all the diagnostics I've been able to find suggest that it is mounted as a character device...
mknod /dev/myModule **c** (Dynamically specified Major Number) (Zero)
ls -la /dev/myModule
**c**rw-r--r-- 1 root root 252, - Jan 1 01:05 myModule
Here is the module source I am using to register the virtual file IO hooks.....
alloc_chrdev_region (&moduleMajorNumber, 0, 1, "moduleLayerCDMA");
register_chrdev_region (&moduleMajorNumber, 1, "moduleLayerCDMA");
cdevP = cdev_alloc();
cdevP->ops = &moduleLayerCDMA_fileOperations;
cdevP->owner = THIS_MODULE;
cdev_add(cdevP, moduleMajorNumber, 1);
Any clues?
Your problem comes from the fact that the standard C library buffered I/O routines (fopen, fclose, fread, fgetc & their friends) keep a user-space buffer for every opened file/device, and when your program tries to read from that file/device, the library routines try to do read-ahead, to prepare for later read calls, to increase the efficiency of the I/O. Similarly, writes with fwrite go through a write buffer, and only get flushed to the system with a system call when the buffer gets full or when closing the file/device or explicitly doing fflush.
There are two ways to solve the issue:
The easier might be to simply convert your user-space program to use non-buffered I/O (open, close, read, write & their friends), these are simply making the corresponding system call on a 1:1 basis.
Or handle the problem in your kernel module: disregard the number of bytes asked for in a read if it is more than what you'd like to return in a single system call. You can look at that value as the length of the buffer provided by the caller, and you don't necessarily have to fill it up completely. Of course, in the return value, you have to indicate how many bytes were actually read.
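A minimal sketch of the first option, using unbuffered system calls from user space. The helper name `read_once` is made up for this example; the path would be the device node from the question.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Open `path` and issue exactly one read(2) asking for `want` bytes.
 * With no stdio buffer in between, the driver's read handler sees the
 * requested size as-is. Returns bytes read, or -1 on error. */
ssize_t read_once(const char *path, void *buf, size_t want)
{
    ssize_t n;
    int fd;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, want);
    close(fd);
    return n;
}
```

For example, `read_once("/dev/myModule", buf, 4)` would ask the driver for 4 bytes rather than a full stdio buffer.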

How to portably extend a file accessed using mmap()

We're experimenting with changing SQLite, an embedded database system,
to use mmap() instead of the usual read() and write() calls to access
the database file on disk. Using a single large mapping for the entire
file. Assume that the file is small enough that we have no trouble
finding space for this in virtual memory.
So far so good. In many cases using mmap() seems to be a little faster
than read() and write(). And in some cases much faster.
Resizing the mapping in order to commit a write-transaction that
extends the database file seems to be a problem. In order to extend
the database file, the code could do something like this:
ftruncate(); // extend the database file on disk
munmap(); // unmap the current mapping (it's now too small)
mmap(); // create a new, larger, mapping
then copy the new data into the end of the new memory mapping.
However, the munmap/mmap is undesirable as it means the next time each
page of the database file is accessed a minor page fault occurs and
the system has to search the OS page cache for the correct frame to
associate with the virtual memory address. In other words, it slows
down subsequent database reads.
On Linux, we can use the non-standard mremap() system call instead
of munmap()/mmap() to resize the mapping. This seems to avoid the
minor page faults.
QUESTION: How should this be dealt with on other systems, like OSX,
that do not have mremap()?
We have two ideas at present. And a question regarding each:
1) Create mappings larger than the database file. Then, when extending
the database file, simply call ftruncate() to extend the file on
disk and continue using the same mapping.
This would be ideal, and seems to work in practice. However, we're
worried about this warning in the man page:
"The effect of changing the size of the underlying file of a
mapping on the pages that correspond to added or removed regions of
the file is unspecified."
QUESTION: Is this something we should be worried about? Or an anachronism
at this point?
2) When extending the database file, use the first argument to mmap()
to request a mapping corresponding to the new pages of the database
file located immediately after the current mapping in virtual
memory. Effectively extending the initial mapping. If the system
can't honour the request to place the new mapping immediately after
the first, fall back to munmap/mmap.
In practice, we've found that OSX is pretty good about positioning
mappings in this way, so this trick works there.
QUESTION: if the system does allocate the second mapping immediately
following the first in virtual memory, is it then safe to eventually
unmap them both using a single big call to munmap()?
#2 will work, but you don't have to rely on the OS happening to have space available: you can reserve your address space beforehand so your fixed mappings will always succeed.
For instance, to reserve one gigabyte of address space, do a
mmap(NULL, 1U << 30, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
This will reserve one gigabyte of contiguous address space without actually allocating any memory or resources. You can then perform future mappings over this space and they will succeed. So mmap the file into the beginning of the space returned, then mmap further sections of the file as needed using the MAP_FIXED flag. The mmaps will succeed because your address space is already allocated and reserved by you.
Note: Linux also has the MAP_NORESERVE flag, which is the behavior you would want for the initial mapping if you were allocating RAM, but in my testing it is ignored, as PROT_NONE is sufficient to say you don't want any resources allocated yet.
I think #2 is the best currently available solution. In addition to this, on 64-bit systems you may create your mapping explicitly at an address that the OS would never choose for a mapping (for example 0x6000 0000 0000 0000 in Linux) to avoid the case where the OS cannot place the new mapping immediately after the first one.
It is always safe to unmap multiple mappings with a single munmap call. You can even unmap a part of the mapping if you wish to do so.
Use fallocate() instead of ftruncate() where available. If not, just open the file in O_APPEND mode and grow it by writing some amount of zeroes. This greatly reduces fragmentation.
Use "Huge pages" if available - this greatly reduces overhead on big mappings.
pread()/pwrite()/pwritev()/preadv() with a not-so-small block size are not really slow. They run much faster than the I/O can actually be performed.
I/O errors when using mmap() will just raise a signal (typically SIGBUS) instead of returning EIO or similar.
Most SQLite write performance problems come down to good transactional use (i.e., you should check when COMMIT is actually performed).

Resources