Why are these special device file reads a minimum of PAGE SIZE bytes? - linux

I am coding my 2nd kernel module ever. I am attempting to provide user-space access to a firmware core, as a demo. The demo is under petalinux (an embedded OS specifically tailored to Zynq or Microblaze). I added virtual file system hooks to go between user space and the kernel module, and it seems to work, both on read and write. The only hiccup is that, somewhere between my user application and my kernel module, the OS balloons the size of my request up to PAGE SIZE (4096).
A co-worker commented that I might be mounting the module as a block device rather than a character device. This makes a lot of sense. Someone upstream of my module is certainly caching my results (which, if my understanding of block drivers is accurate, would make perfect sense for, say, the hard drive), but we're tied to a volatile device, so this isn't appropriate. But all the diagnostics I've been able to find suggest that it is mounted as a character device...
mknod /dev/myModule **c** (Dynamically specified Major Number) (Zero)
ls -la /dev/myModule
**c**rw-r--r-- 1 root root 252, - Jan 1 01:05 myModule
Here is the module source I am using to register the virtual file IO hooks.....
alloc_chrdev_region (&moduleMajorNumber, 0, 1, "moduleLayerCDMA");
register_chrdev_region (&moduleMajorNumber, 1, "moduleLayerCDMA");
cdevP = cdev_alloc();
cdevP->ops = &moduleLayerCDMA_fileOperations;
cdevP->owner = THIS_MODULE;
cdev_add(cdevP, moduleMajorNumber, 1);
Any clues?

Your problem comes from the fact that the standard C library buffered I/O routines (fopen, fclose, fread, fgetch & their friends) keep a user-space buffer for every opened file/device, and when your program tries to read from that file/device, the library routines try to do read-ahead, to prepare for later read calls, to increase the efficiency of the I/O. Similarly, writes with fwrite go through a write buffer, and only get flushed to the system with a system call when the buffer gets full or when closing the file/device or explicitly doing fflush.
There are two ways to solve the issue:
The easier might be to simply convert your user-space program to use non-buffered I/O (open, close, read, write & their friends), these are simply making the corresponding system call on a 1:1 basis.
Or handle the problem in your kernel module: disregard the number of bytes asked in a read if it is more than what you'd like to return in a single system call. You can look at that value as the length of the buffer provided by the caller, and you don't neccessarily have to fill it up completely. Of course, in the return value, you have to indicate how many bytes were actually read.

Related

If I private-`mmap` a file and read it, then another process writes to the same file, will another read at the same location return the same value?

(Context: I'm trying to establish which sequences of mmap operations are safe from the "memory safety" point of view, i.e. what assumptions I can make about mmaped memory without risking security bugs as a consequence of undefined behaviour, or miscompiles due to compilers making incorrect assumptions about how memory could behave. I'm currently working on Linux but am hoping to port the program to other operating systems in the future, so although I'm primarily interested in Linux, answers about how other operating systems behave would also be appreciated.)
Suppose I map a portion into file into memory using mmap with MAP_PRIVATE. Now, assuming that the file doesn't change while I have it mapped, if I access part of the returned memory, I'll be given information from the file at that offset; and (because I used MAP_PRIVATE) if I write to the returned memory, my writes will persist in my process's memory but will have no effect on the underlying file.
However, I'm interested in what will happen if the file does change while I have it mapped (because some other process also has the file open and is writing to it). There are several cases that I know the answers to already:
If I map the file with MAP_SHARED, then if any other process writes to the file via a shared mmap, my own process's memory will also be updated. (This is the intended behaviour of MAP_SHARED, as one of its intended purposes is for shared-memory concurrency.) It's less clear what will happen if another process writes to the file via other means, but I'm not interested in that case.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
A portion of the file I haven't accessed yet is written by another process;
I read that portion of the file via my mapping;
then, at least on Linux, the read might return either the old value or the new value:
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
— man 2 mmap on Linux
(This case – which is not the case I'm asking about – is covered in this existing StackOverflow question.)
I also checked the POSIX definition of mmap, but (unless I missed it) it doesn't seem to cover this case at all, leaving it unclear whether all POSIX systems would act the same way.
Linux's behaviour makes sense here: at the time of the access, the kernel might have already mapped the requested part of the file into memory, in which case it doesn't want to change the portion that's already there, but it might need to load it from disk, in which case it will see any new value that may have been written to the file since it was opened. So there are performance reasons to use the new value in some cases and the old value in other cases.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
I write to a memory address within the file mapping;
Another process changes that part of the file;
then although I don't know this for certain, I think it's very likely that the rule is that the memory address in question continues to reflect the old value, that was written by our process. The reason is that the kernel needs to maintain two copies of that part of the file anyway: the values as seen by our process (which, because it used MAP_PRIVATE, can write to its view of the file without changing the underlying file), and the values that are actually in the file on disk. Writes by other processes obviously need to change the second copy here, so it would be bizarre to also change the first copy; doing so would make the interface less usable and also come at a performance cost, and would have no advantages.
There is one sequence of events, though, where I don't know what happens (and for which the behaviour is hard to determine experimentally, given the number of possible factors that might be relevant):
I map the file with MAP_PRIVATE;
I read some portion of the file via the mapping, without writing;
Another process changes part of the file that I just read;
I read the same portion of the file via the mapping, again.
In this situation, am I guaranteed to read the same data twice? Or is it possible to read the old data the first time and the new data the second time?

Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device, instead the userspace application simply requests data when it needs to). As such, the DMA transfer is built up, the data is transfered, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer where the data is meant to continuously flow into. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgot about the driver and can then later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I can not keep up with the device!), since this works on the entire buffer. Funny enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be made aware of? Can I use the "single" variants (i.e., pci dma_sync_single_for_cpu()) via some translation mechanism to not sync the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?
Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.
I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
Pete, I am the original developer of the driver code (before the X of XMDA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems and disabled by default. It's initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

Single buffer; multiple sockets; single syscall under Linux

Does Linux have any native kernel facility that enables send()ing a supplied buffer to a set of sockets? A sort-of vectored I/O, except for socket handles rather than for a buffer.
The objective being to reduce the number of u/k transitions involved in situations where-in for example you need to broadcast some state update to n clients which would require iterating through each socket and sending.
One restriction is that TCP sockets must be supported (not under my control)
The answer is no, neither linux nor posix systems have the call you want. I fear that you don't get any advantage of having it, as each of the data streams will follow different paths and that makes copying the buffers in kernel than in user space. Not making copies in user-to-kernel doesn't neccesarily mean doing in kernel mode is better.
Either way, in linux you can implement this kind of mwrite (or msend) system call, as you have the source code. But I'm afraid you won't get anything but a head pain. The only approach to this implementation is some kind of copy-on-divert method, but I don't think you'll get any advantage.
Finnaly, once you have finished the first write(2) call, the probability of having to swap in the user buffer again in the next is far too low, making the second and next copies of the buffer will have very low overhead, as the pages will be all in core memory. You need a very high loaded system to get the user buffer swapped out in the time between syscalls.

Linux I2C file handles - safe to cache?

I am just starting to look at the I2C support on (embedded) linux (Beaglebone Black, to be precise). Since it's linux, everything is a file, so it's no surprise that I2C is too.
int file = open( "/dev/i2c-0", O_RDWR );
and then the actual address on that bus is selected through ioctl(). My question is this - is it safe, or even legal, to cache file for the duration of application execution? It would seem to my naive eyes that the overheading of opening a resource to read every 250ms would be an unnecessary strain on the kernel. So it is valid to open, and then just use ioctl() to switch address whenever I need to, or must I close() the descriptor between reads and writes?
is it safe, or even legal, to cache file for the duration of application execution?
The file descriptor (that is returned from open()) is valid for as long as your program needs to keep executing.
Device nodes in /dev may resemble filenames, but they are treated differently from filesystem entries once you look past the syscall interface. An open() or read() on a file descriptor for a device will invoke the device driver, whereas for an actual file its filesystem is invoked, which may eventually call a storage device driver.
It would seem to my naive eyes that the overheading of opening a resource to read every 250ms would be an unnecessary strain on the kernel.
Yes it would since those open() and close() syscalls are unnecessary.
So it is valid to open, and then just use ioctl() to switch address whenever I need to,
Yes, that is the proper useage.
or must I close() the descriptor between reads and writes?
That is not necessary nor advised.

How to ensure read() to read data from the real device each time?

I'm periodically reading from a file and checking the readout to decide subsequent action. As this file may be modified by some mechanism which will bypass the block file I/O manipulation layer in the Linux kernel, I need to ensure the read operation reading data from the real underlying device instead of the kernel buffer.
I know fsync() can make sure all I/O write operations completed with all data written to the real device, but it's not for I/O read operations.
The file has to be kept opened.
So could anyone please kindly tell me how I can do to meet such requirement in Linux system? is there such a API similar to fsync() that can be called?
Really appreciate your help!
I believe that you want to use the O_DIRECT flag to open().
I think memory mapping in combination with madvise() and/or posix_fadvise() should satisfy your requirements... Linus contrasts this with O_DIRECT at http://kerneltrap.org/node/7563 ;-).
You are going to be in trouble if another device is writing to the block device at the same time as the kernel.
The kernel assumes that the block device won't be written by any other party than itself. This is true even if the filesystem is mounted readonly.
Even if you used direct IO, the kernel may cache filesystem metadata, so a change in the location of those blocks of the file may result in incorrect behaviour.
So in short - don't do that.
If you wanted, you could access the block device directly - which might be a more successful scheme, but still potentially allowing harmful race-conditions (you cannot guarantee the order of the metadata and data updates by the other device). These could cause you to end up reading junk from the device (if the metadata were updated before the data). You'd better have a mechanism of detecting junk reads in this case.
I am of course, assuming some very simple braindead filesystem such as FAT. That might reasonably be implemented in userspace (mtools, for instance, does)

Resources