Single buffer; multiple sockets; single syscall under Linux

Single buffer; multiple sockets; single syscall under Linux - linux

Does Linux have any native kernel facility that enables send()ing a supplied buffer to a set of sockets? A sort-of vectored I/O, except for socket handles rather than for a buffer.
The objective being to reduce the number of u/k transitions involved in situations where-in for example you need to broadcast some state update to n clients which would require iterating through each socket and sending.
One restriction is that TCP sockets must be supported (not under my control)

The answer is no, neither linux nor posix systems have the call you want. I fear that you don't get any advantage of having it, as each of the data streams will follow different paths and that makes copying the buffers in kernel than in user space. Not making copies in user-to-kernel doesn't neccesarily mean doing in kernel mode is better.
Either way, in linux you can implement this kind of mwrite (or msend) system call, as you have the source code. But I'm afraid you won't get anything but a head pain. The only approach to this implementation is some kind of copy-on-divert method, but I don't think you'll get any advantage.
Finnaly, once you have finished the first write(2) call, the probability of having to swap in the user buffer again in the next is far too low, making the second and next copies of the buffer will have very low overhead, as the pages will be all in core memory. You need a very high loaded system to get the user buffer swapped out in the time between syscalls.

Related

Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device, instead the userspace application simply requests data when it needs to). As such, the DMA transfer is built up, the data is transfered, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer where the data is meant to continuously flow into. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgot about the driver and can then later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I can not keep up with the device!), since this works on the entire buffer. Funny enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be made aware of? Can I use the "single" variants (i.e., pci dma_sync_single_for_cpu()) via some translation mechanism to not sync the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?

Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.

I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.

Pete, I am the original developer of the driver code (before the X of XMDA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems and disabled by default. It's initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

Is it possible to drop down packets

I am trying to write some sort of very basic packet filtering in Linux (Ubuntu) user space.
Is it possible to drop down packets in user space via c program using raw socket (AF_PACKET), without any kernel intervention (such as writing kernel module) and net filtering?
Thanks a lot
Tali

It is possible (assuming I understand what you're asking). There are a number of "zero-copy" driver implementations that allow user-space to obtain a large memory-mapped buffer into which (/ from which) packets are directly DMA'd.
That pretty much precludes having the kernel process those same packets though (possible but very difficult to properly coordinate user-space packet sniffing with kernel processing of the same packets). But it's fine if you're creating your own IDS/IPS or whatever and don't need to "terminate" connections on the local machine.
It would definitely not be the standard AF_PACKET; you have to either create your own or use an existing implementation: look into netmap, DPDK, and PF_RING (maybe PF_RING/ZC? not sure). I worked on a couple of proprietary implementations in a previous career so I know it's possible.
The basic idea is either (1) completely duplicate everything the driver is responsible for -- that is, move the driver implementation completely into user space (DPDK basically does this). This is straight-forward on paper, but is a lot of work and makes the driver pretty much fully custom.
Or (2) modify driver source so that key network buffer allocation requests get satisfied with an address that is also mmap'd by the user-space process. You then have the problem of communicating buffer life-cycles / reference counts between user-space and kernel. That's very messy but can be done and is probably less work overall. (I dunno -- there may be a way to automate this latter method if you're clever enough -- I haven't been in this space in some years.)
Whichever way you go, there are several pieces you need to put together to do this right. For example, if you want really high performance, you'll need to use the adapter's "RSS" type mechanisms to split the traffic into multiple queues and pin each to a particular CPU -- then make sure the corresponding application components are pinned to the same CPU.
All that being said, unless your need is pretty severe, you're best staying with plain old AF_PACKET.

You can use iptable rules to drop packets for a given criteria, but dropping using packet filters is not possible, because the packet filters get a copy of the packet while the original packet flows through usual path.

Why are these special device file reads a minimum of PAGE SIZE bytes?

I am coding my 2nd kernel module ever. I am attempting to provide user-space access to a firmware core, as a demo. The demo is under petalinux (an embedded OS specifically tailored to Zynq or Microblaze). I added virtual file system hooks to go between user space and the kernel module, and it seems to work, both on read and write. The only hiccup is that, somewhere between my user application and my kernel module, the OS balloons the size of my request up to PAGE SIZE (4096).
A co-worker commented that I might be mounting the module as a block device rather than a character device. This makes a lot of sense. Someone upstream of my module is certainly caching my results (which, if my understanding of block drivers is accurate, would make perfect sense for, say, the hard drive), but we're tied to a volatile device, so this isn't appropriate. But all the diagnostics I've been able to find suggest that it is mounted as a character device...
mknod /dev/myModule **c** (Dynamically specified Major Number) (Zero)
ls -la /dev/myModule
**c**rw-r--r-- 1 root root 252, - Jan 1 01:05 myModule
Here is the module source I am using to register the virtual file IO hooks.....
alloc_chrdev_region (&moduleMajorNumber, 0, 1, "moduleLayerCDMA");
register_chrdev_region (&moduleMajorNumber, 1, "moduleLayerCDMA");
cdevP = cdev_alloc();
cdevP->ops = &moduleLayerCDMA_fileOperations;
cdevP->owner = THIS_MODULE;
cdev_add(cdevP, moduleMajorNumber, 1);
Any clues?

Your problem comes from the fact that the standard C library buffered I/O routines (fopen, fclose, fread, fgetch & their friends) keep a user-space buffer for every opened file/device, and when your program tries to read from that file/device, the library routines try to do read-ahead, to prepare for later read calls, to increase the efficiency of the I/O. Similarly, writes with fwrite go through a write buffer, and only get flushed to the system with a system call when the buffer gets full or when closing the file/device or explicitly doing fflush.
There are two ways to solve the issue:
The easier might be to simply convert your user-space program to use non-buffered I/O (open, close, read, write & their friends), these are simply making the corresponding system call on a 1:1 basis.
Or handle the problem in your kernel module: disregard the number of bytes asked in a read if it is more than what you'd like to return in a single system call. You can look at that value as the length of the buffer provided by the caller, and you don't neccessarily have to fill it up completely. Of course, in the return value, you have to indicate how many bytes were actually read.

Writing to a remote file: When does write() really return?

I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).
What I want to understand is:
When I write() (or pwrite()), when exactly does the write call return?
I see three possibilities:
write returns immediately after queueing the I/O operation on the client side:
In this case, write can return before data has actually left the client node (If you are writing to a local hard drive, then the write call employs delayed writes, where data is simply queued up for writing. But does this also happen when you are writing to a remote hard disk?). I wrote a testcase in which I write a large matrix (1GByte) to file. Without fsync, it showed very high bandwidth values, whereas with fsync, results looked more realistic. So looks like it could be using delayed writes.
write returns after the data has been transferred to the server buffer:
Now data is on the server, but resides in a buffer in its main memory, but not yet permanently stored away on the hard drive. In this case, I/O time should be dominated by the time to transfer the data over the network.
write returns after data has been actually stored on the hard drive:
Which I am sure does not happen by default (unless you write really large files which causes your RAM to get filled and ultimately get flushed out and so on...).
Additionally, what I would like to be sure about is:
Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency, bandwidth, and the hard drive bandwidth do not feature in the program's execution time at all? Consider we do not do an fsync or something similar.
EDIT: I am using the pvfs2 parallel file system

Option 3. is of course simple, and safe. However, a production quality POSIX compatible parallel file system with performance good enough that anyone actually cares to use it, will typically use option 1 combined with some more or less involved mechanism to avoid conflicts when e.g. several clients cache the same file.
As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".
If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (alternatively, which parts of POSIX semantics it skips, a la NFS). What makes this, err, interesting is that the POSIX fs semantics harks back to the 1970's with little to no though of how to support network filesystems.
I don't know about pvfs2 specifically, but typically in order to conform to POSIX and provide decent performance, option 1 can be used together with some kind of cache coherency protocol (which e.g. Lustre does). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write()'s to the file until some have been transferred to the server.

You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.
The following are all taken from Linux. Solaris and others may differ.
Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.
Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.

Limiting the File System Usage Programmatically in Linux

I was assigned to write a system call for Linux kernel, which oddly determines (and reduces) users´ maximum transfer amount per minute (for file operations). This system call will be called lim_fs_usage and will take a parameter for maximum number of bytes all users can access in a minute. For short, I am going to determine bandwidth of all filesystem operations in Linux. The project also asks for choosing appropriate method for distribution of this restricted resource (file access) among the users but I think this
won´t be a big problem.
I did a long long search and scan but could not find a method for managing file system access programmatically. I thought of mapping (mmap())hard drive to memory and manage memory operations but this turned to be useless. I also tried to find an API for virtual file system in order to monitor and limit it but I could not find one. Any ideas, please... Any help is greatly appreciated. Thank you in advance...

I wonder if you could do this as an IO scheduler implementation.
The main difficulty of doing IO bandwidth limitation under Linux is, by the time it reaches anywhere near the device, the kernel has probably long since forgotten who caused it.
Likewise, you can get on some very tricky ground in determining who is responsible for a given piece of IO:
If a binary is demand-loaded, who owns the IO doing that?
A mapped section of memory (demand-loaded executable or otherwise) might be kicked out of memory because someone else used too much ram, thus causing the kernel to choose to evict those pages, which places an unfair burden on the quota of the other user to then page it back in
IO operations can be combined, and might come from different users
A write operation might cause an IO sooner or later depending on how the kernel schedules it; a later schedule may mean that fewer IOs need to be done in the long run, as another write gets done to the same block in the interim; writing to an already dirty block in cache does not make it any dirtier.
If you understand all these and more caveats, and still want to, I imagine doing it as an IO scheduler is the way to go.
IO schedulers are pluggable under Linux (2.6) and can be changed dynamically - the kernel waits for all IO on the device (IO scheduler is switchable per block device) to end and then switches to the new one.

Since it's urgent I'll give you an idea out of the top of my head without doing any research on the feasibility -- what about inserting a hook to monitor system calls that deal with file system access?
You might end up writing specialised kernel modules to handle the various filesystems (ext3, ext4, etc) but as a proof-of-concept you can start with one. Do not forget that root has reserved blocks in memory, process space and disk for his own operations.
Managing memory operations does not sound related to what you're trying to do (but perhaps I am mistaken here).

After a long period of thinking and searching, I decided to use the ¨hooking¨ method proposed. I am thinking of creating a new system call which initializes and manages a global variable like hdd_ bandwith _limit. This variable will be used in Read() and Write() system calls´ modified implementation (instead of ¨count¨ variable). Then I will decide distribution of this resource which is the real issue. Probably I will find out how many users are using the system for a certain moment and divide this resource equally. Will be a Round-Robin-like distribution. But still, I am open to suggestions on this distribution issue. Will it be a SJF or FCFS or Round-Robin? Synchronization is another issue. How can I know a user´s job is short or long? Or whether he is done with the operation or not?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string