Is there a benefit using PACKET_MMAP? (Linux) - linux

I'm struggeling with why I should try the multi packet send PACKET_MMAP method.
I got a data mass of around 3 milion bytes every 20ms that i'm going to send over a 10gbps interface.
I need to process all data in an packet so the data is going to be in the cache then I just send it the 'normal' way (sendto). in this case the move to kernel would be from cache so thats one memory transfer.
Since I need to process all data in the packet using PACKET_MMAP would also be one move of the data from userspace to userspace then DMA from userspace. So would PACKET_MMAP gain me anything? my guess is that it would not since both methods will move the data once and even thoug it looks like two times in the (sendto) case since the data will reside in the cache it will effectivly be only once..
Am I wrong?
Thanks for any help.
/Anders.

Related

Latency in ioread

Suppose you have a PCIE device presenting a single BAR and one DMA area declared with pci_alloc_consistent(..). The BAR's flags indicate non-prefetchable, non-cacheable, memory region.
What are the principle causes for latency in reading the DMA area, and similarly, what are the causes of latency reading the BAR?
Thank you for answering this simple question :D!
This smells a bit like homework but I suspect the concepts are not well understood by many so I'll add an answer.
The best way to think through this is to consider what needs to happen in order for a read to complete. The CPU and the device are on separate sides of the PCIe link. It's helpful to view PCI-Express as a mini network. Each link is point-to-point (like your PC connected to another PC). There may also be intermediate switches (aka bridges in PCI). In that case, it's like your PC is connected to a switch that is in turn connected to the other PC.
So, if the CPU wants to read its own memory (the "DMA" region you allocated), it's relatively fast. It has a high speed bus that is designed to make that happen fast. Also, there are multiple layers of caching built in to keep frequently (or recently) used data "close" to the CPU.
But if the CPU wants to read from the BAR in the device, the CPU (actually the PCIe root complex integrated with the CPU) must compose a PCIe read request, send the request, and wait while the device decodes the request, accesses the BAR location and sends back the requested data. Tick tock. Your CPU is doing nothing else while it waits for this to complete.
This is pretty much analogous to asking for a web page from another computer. You formulate an HTTP request, send it and wait while the web server accesses the content, formulates a return packet and sends it to you.
If the device wishes to access memory residing "in" the CPU, it's pretty much the exact same thing in reverse. ("Direct memory access" just means that it doesn't need to interrupt the CPU to handle it, but something [the root complex here] is still responsible for decoding the request, fulfilling the read and sending back the resulting data.)
Also, if there are intermediate PCIe switches between CPU and device, those may add additional buffering/queuing delays (exactly as a switch or router might in a network). And any such delays are doubled since they're incurred in both directions.
Of course PCIe is very fast, so all of that happens in mere nanoseconds, but that's still orders of magnitude slower than a "local" read.

Large Data Flow between User and Kernel

What is the best way(performance) to have a bi-directional data flow between user-level and kernel-level ?
I understand that you can open a NETLINK socket and transfer the data through there. But, we have to adopt some other user-kernel interaction(system calls, ioctl) for sending control information across. Is this the most efficient way to transfer large amount of data across user-kernel boundary ?
Passing large buffers of data into the kernel driver/thread/whatever is no problem - the kernel has the privilege to read it, no problem. For returning stuff, the ususal way is to provide the kernel thingy with a sufficiently large user-space buffer, or buffer pool, for it to return data in. That's how its done for the usual stuff - file/network read/write, for example.
What is the problem, more exactly - do you need to transfer the data to/from kernel level on a different machine?
Rgds,
Martin

udp send from driver

I have a driver that needs to:
receive data from an FPGA
DMA data to another another device (DSP) for encoding
send the encoded data via UDP to an external host
The original plan was to have the application handle step 3, but the application doesn't get the processor in time to process the data before the next set of data arrives from the FPGA.
Is there a way to force the scheduler (from the driver) to run my application?
If not, I think work queues are likely the solution I need to use, but I'm not sure how/where to call into the network stack/driver to accomplish the UDP transfers from the work queues.
Any ideas?
You should try to discover why the application "can't get the data fast enough".
Your memory bandwith is probably vastly superior to the thypical ethernet bandwith, so even if passing data from the driver to the application involves copying.
If the udp link is not fast enough in userspace, it won't be faster in kernelspace.
What you need to do is :
understand why your application is not fast enough, maybe by stracing it.
implement queuing in userspace.
You can probably split your application in two thread, sharing buffer list
thread A waits for the driver to have data available, and puts it at the tail of the list.
thread B reads data from the head of the list, and sends it through UDP. If for some reason thread B is busy waiting for a particular buffer to be sent, the fifo fills a bit, but as long as the UDP link bandwith is larger than the rate of data coming from the DSP you should be fine.
Moving things into the kernel does not makes things magically faster, it is just MUCH harder to code and debug and trace.

Avoid copying of data between user and kernel space and vice-versa

I am developing a active messaging protocol for parallel computation that replaces TCP/IP. My goal is to decrease the latency of a packet. Since the environment is a LAN, i can replace TCP/IP with simpler protocol to reduce the packet latency. I am not writing any device driver and i am just trying to replace the TCP/IP stack with something simpler. Now I wanted to avoid copying of a packet's data from user space to kernel space and vice-versa. I heard of the mmap(). Is it the best way to do this? If yes, it will be nice if you can give links to some examples. I am a linux newbie and i really appreciate your help.. Thank you...
Thanks,
Bala
You should use UDP, that is already pretty fast. At least it was fast enough for W32/SQLSlammer to spread through the whole internet.
About your initial question, see the (vm)splice and tee Linux system calls.
From the manpage:
The three system calls splice(2),
vmsplice(2), and tee(2)), provide
userspace programs with full control
over an arbitrary kernel buffer,
implemented within the kernel using
the same type of buffer that is used
for a pipe. In overview, these system
calls perform the following tasks:
splice(2)
moves data from the buffer to an arbitrary file descriptor, or vice
versa, or from one buffer to another.
tee(2)
"copies" the data from one buffer to another.
vmsplice(2)
"copies" data from user space into the buffer.
Though we talk of copying, actual
copies are generally avoided. The
kernel does this by implementing a
pipe buffer as a set of
reference-counted pointers to pages of
kernel memory. The kernel creates
"copies" of pages in a buffer by
creating new pointers (for the output
buffer) referring to the pages, and
increasing the reference counts for
the pages: only pointers are copied,
not the pages of the buffer.
Since the environment is a LAN, i can replace TCP/IP with simpler protocol to reduce the packet latency
Generally, even in LAN UDP packets tend to be lost, also they will be lost if client
do not have enough time to consume it...
SO no, do not replace TCP with something else (UDP). Because if you do need reliable delivery TCP would be the fastest (because everything connected to acknowledgments and retransmission is done in kernel space).
Generally in normal case there is no latency drawbacks using TCP (of course do not forget TCP_NODELAY option)
About sharing the memory. Actually all memory you allocate is created with mmap. So the kernel will need to copy it somehow in any case when it creates a packet from driver.
If you are talking about reducing copying it is usually done for files/sockets and
sendfile() used that indeed prevents copying data between kernel and user. But I assume
you do not need to send files.

What happens after a packet is captured?

I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in the kernel space, then to the user space for whatever application that then works on the packet data. Then I read about DMA, where the NIC directly copies the packet into memory, bypassing the CPU. So is the NIC -> kernel memory -> User space memory flow still valid? Also, do most NIC (e.g. Myricom) use DMA to improve packet capture rates?
Secondly, does RSS (Receive Side Scaling) work similarly in both Windows and Linux systems? I can only find detailed explanations on how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concept of RSS and MSI-X should still apply for linux systems, right?
Thank you.
Regards,
Rayne
How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
At driver initialization, it will allocate some number of buffers and give these to the NIC.
When a packet is received by the NIC, it pulls the next address off its list of buffers, DMAs the data directly into it, and notifies the driver via an interrupt.
The driver gets the interrupt, and can either turn the buffer over to the kernel or it will allocate a new kernel buffer and copy the data. "Zero copy networking" is the former and obviously requires support from the operating system. (more below on this)
The driver needs to either allocate a new buffer (in the zero-copy case) or it will re-use the buffer. In either case, the buffer is given back to the NIC for future packets.
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it'd may as well copy the data, too: looking at the data incurs cache misses and copying it elsewhere can be for free with tuned code).
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and both processes are able to recv() incoming data.
TCP packets on the Internet are usually 1460 bytes of payload (MTU of 1500 = 20 byte IP header + 20 byte TCP header + 1460 bytes data). 1460 is not a power of 2 and won't match a page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented. There is no distinction between sender writes, and two 1000 byte writes waiting at the received will be consumed entirely in a 2000 byte read.
Taking this further, consider the user buffers. These are allocated by the application. In order to be used for zero-copy all the way down, the buffer needs to be page-aligned and not share that memory page with anything else. At recv() time, the kernel could theoretically remap the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above since successive packets will be on separate pages. The kernel could limit the data it hands back to each packet's payload, but this will mean a lot of additional system calls, page remapping and likely lower throughput overall.
I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.
take a look at this paper, http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf it might help clearing out some of the memory management questions

Resources