What happens after a packet is captured? - linux

I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in kernel space, then to user space for whatever application then works on the packet data. Then I read about DMA, where the NIC copies the packet directly into memory, bypassing the CPU. So is the NIC -> kernel memory -> user space memory flow still valid? Also, do most NICs (e.g. Myricom) use DMA to improve packet capture rates?
Secondly, does RSS (Receive Side Scaling) work similarly on both Windows and Linux systems? I can only find detailed explanations of how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concepts of RSS and MSI-X should still apply to Linux systems, right?
Thank you.
Regards,
Rayne

How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
At driver initialization, it will allocate some number of buffers and give these to the NIC.
When a packet is received by the NIC, it pulls the next address off its list of buffers, DMAs the data directly into it, and notifies the driver via an interrupt.
The driver gets the interrupt, and either turns the buffer over to the kernel or allocates a new kernel buffer and copies the data. "Zero copy networking" is the former and obviously requires support from the operating system. (More on this below.)
The driver then needs to either allocate a new buffer (in the zero-copy case) or reuse the existing one. In either case, a buffer is given back to the NIC for future packets, as in the sketch below.
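Roughly, in C (all names and the "hardware" hand-off here are invented for illustration; a real driver would use DMA-able memory and the device's own descriptor format):

/* Illustrative receive ring; everything hardware-facing is a stand-in. */
#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 256
#define BUF_SIZE  2048

struct rx_slot {
    void    *buf;   /* kernel buffer the NIC will DMA into */
    uint16_t len;   /* filled in by the hardware on completion */
    uint8_t  done;  /* set by the hardware once the packet has landed */
};

static struct rx_slot ring[RING_SIZE];
static unsigned next_to_clean;

/* Step 1: at init, allocate buffers and hand them to the NIC. */
void rx_ring_init(void)
{
    for (unsigned i = 0; i < RING_SIZE; i++) {
        ring[i].buf  = malloc(BUF_SIZE); /* real drivers: DMA-coherent memory */
        ring[i].done = 0;
        /* ...tell the NIC it may DMA into ring[i].buf (device-specific)... */
    }
}

/* Steps 3 and 4: on interrupt (or poll), hand completed buffers up the
 * stack and give the NIC a fresh buffer for each slot. */
void rx_poll(void (*deliver)(void *buf, uint16_t len))
{
    while (ring[next_to_clean].done) {
        struct rx_slot *s = &ring[next_to_clean];

        deliver(s->buf, s->len);       /* zero-copy hand-off to the stack */
        s->buf  = malloc(BUF_SIZE);    /* replacement buffer for the NIC */
        s->done = 0;
        /* ...re-arm the descriptor so the NIC can reuse this slot... */

        next_to_clean = (next_to_clean + 1) % RING_SIZE;
    }
}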
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it may as well copy the data too: looking at the data incurs cache misses, and copying it elsewhere can be nearly free with tuned code).
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and both processes are able to recv() incoming data.
TCP packets on the Internet are usually 1460 bytes of payload (MTU of 1500 = 20 byte IP header + 20 byte TCP header + 1460 bytes data). 1460 is not a power of 2 and won't match the page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented: there is no distinction between sender writes, and two 1000 byte writes waiting at the receiver will be consumed entirely by a single 2000 byte read.
Taking this further, consider the user buffers. These are allocated by the application. In order to be used for zero-copy all the way down, the buffer needs to be page-aligned and not share that memory page with anything else. At recv() time, the kernel could theoretically remap the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above since successive packets will be on separate pages. The kernel could limit the data it hands back to each packet's payload, but this will mean a lot of additional system calls, page remapping and likely lower throughput overall.
I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.

Take a look at this paper, http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf; it might help clear up some of the memory management questions.

Related

"zero copy networking" vs "kernel bypass"?

What is the difference between "zero-copy networking" and "kernel bypass"? Are they two phrases meaning the same thing, or different? Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
TL;DR - They are different concepts, but it is quite likely that zero copy is supported within kernel bypass API/framework.
User Bypass
This mode of communicating should also be considered. It may be possible to have DMA-to-DMA transactions which do not involve the CPU at all. The idea is to use splice() or similar functions to avoid user space entirely. Note that with splice(), the entire data stream does not need to bypass user space: headers can be read in user space while the payload is streamed directly to disk, as in the sketch below. The most common downfall of this is that splice() doesn't do checksum offloading.
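A minimal sketch of that splice()-to-disk path, assuming sock_fd is a connected stream socket and file_fd an ordinary file (both names are illustrative; error handling is abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move `total` bytes from a socket to a file through a pipe with splice(),
 * so the payload never passes through a user-space buffer. */
ssize_t stream_to_disk(int sock_fd, int file_fd, size_t total)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    size_t done = 0;
    while (done < total) {
        /* socket -> pipe buffer (kernel pages, no user copy) */
        ssize_t in = splice(sock_fd, NULL, p[1], NULL,
                            total - done, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0)
            break;

        /* pipe buffer -> file (still no user copy); drain what we queued */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(p[0], NULL, file_fd, NULL,
                                 (size_t)left, SPLICE_F_MOVE);
            if (out <= 0)
                goto finish;
            left -= out;
            done += (size_t)out;
        }
    }
finish:
    close(p[0]);
    close(p[1]);
    return (ssize_t)done;
}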
Zero copy
The zero-copy concept is only that the network buffers are fixed in place and are not moved around. In many cases, this is not really beneficial. Most modern network hardware supports scatter-gather, also known as buffer descriptors, etc. The idea is that the network hardware understands physical pointers. A buffer descriptor typically consists of the following (a sketch follows the list):
Data pointer
Length
Next buffer descriptor
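A sketch of what such a descriptor might look like; the field names and widths are illustrative, since every controller defines its own layout (often with ownership and status bits packed in):

#include <stdint.h>

/* Illustrative scatter-gather buffer descriptor. */
struct buf_desc {
    uint64_t data_phys;  /* physical (bus) address of the data buffer */
    uint32_t len;        /* number of valid bytes at that address */
    uint64_t next_phys;  /* physical address of the next descriptor, 0 = last */
};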
The benefit is that the network headers do not need to sit side-by-side with the payload: the IP, TCP, and application headers can reside physically separate from the application data.
If a controller doesn't support this, then the TCP/IP headers must precede the user data so that they can be filled in before sending to the network controller.
Zero copy also implies some kernel-user MMU setup so that pages are shared.
Kernel Bypass
Of course, you can bypass the kernel. This is what pcap and other sniffer software has been doing for some time. However, pcap does not prevent the normal kernel processing; still, the concept is similar to what a kernel-bypass framework would allow, i.e., directly delivering packets to user space, where header processing would happen.
However, it is difficult to see a case where user space will have a definite win unless it is tied to the particular hardware. Some network controllers may support scatter-gather in the controller and others may not.
There are various incarnations of kernel interfaces to accomplish kernel bypass. A difficulty is what happens with the received data and how to produce data for transmission. Often this involves other devices, and so there are many solutions.
To put this together...
Are they two phrases meaning the same thing, or different?
They are different, as the above hopefully explains.
Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
It is the opposite. Kernel bypass can use zero copy, and most likely will support it, as the buffers are completely under the control of the application. Also, there is no memory sharing between the kernel and user space (meaning no need for MMU shared pages and whatever cache/TLB effects that may cause). So if you are using kernel bypass, it will often be advantageous to support zero copy as well; this is why the two may seem the same at first.
If scatter-gather DMA is available (as on most modern controllers), either user space or the kernel can use it. Zero copy is not as useful in this case.
Reference:
Technical reference on OnLoad, a high-bandwidth kernel-bypass system.
PF_RING, available as of 2.6.32, if configured
Linux kernel network buffer management by David Miller. This gives an idea of how the protocols headers/trailers are managed in the kernel.
Zero-copy networking
You're doing zero-copy networking when you never copy the data between user space and kernel space (I mean memory space). For example:
C language
recv(fd, buffer, BUFFER_SIZE, 0);
By default the data are copied:
The kernel gets the data from the network stack
The kernel copies this data to the buffer, which is in the user-space.
With a zero-copy method, the data is not copied; it comes to user space directly from the network stack.
Kernel Bypass
Kernel bypass is when you manage the network stack and the hardware yourself, in user space. It is hard, but you gain a lot of performance (and you get zero copy, since all the data stays in user space). This link could be interesting if you want more information.
ZERO-COPY:
When transmitting and receiving packets, all packet data must be copied from user-space buffers to kernel-space buffers when transmitting, and vice versa when receiving. A zero-copy driver avoids this by having user space and the driver share packet buffer memory directly.
Instead of having the transmit and receive descriptors point to buffers in kernel space, which would later require a copy, a region of memory in user space is allocated and mapped to a given region of physical memory, shared between the kernel buffers and the user-space buffers; each descriptor buffer is then pointed at its corresponding place in the newly allocated memory.
Other examples of kernel bypass and zero copy are DPDK and RDMA. When an application uses DPDK, it bypasses the kernel TCP/IP stack: the application creates the Ethernet frames and the NIC grabs those frames with DMA directly from user-space memory, so it is zero copy because there is no copy from user space to kernel space. Applications can do similar things with RDMA: the application writes to queue pairs that the NIC directly accesses and transmits. RDMA's libibverbs is used inside the kernel as well, so when iSER uses RDMA it is not kernel bypass, but it is zero copy.
http://dpdk.org/
https://www.openfabrics.org/index.php/openfabrics-software.html

How to cope with 320 million 272-byte UDP packets?

So, I have an incoming UDP stream composed of 272 byte packets at a data rate of about 5.12Gb/s (around 320e6 packets per second). This data is being sent by an FPGA-based custom board. The packet size is a limit of the digital design being run, so although theoretically it could be possible to increase it to make things more efficient, it would require a large amount of work. At the receiving end these packets are read and interpreted by a network thread and placed in a circular buffer shared with a buffering thread, which will copy this data to a GPU for processing.
The above setup at the receiving end could cope with 5.12Gb/s for 4096 KB packets (used on a different design) using simple recv calls; however, with the current packet size I'm having a hard time keeping up with the packet flow: too much time is being "wasted" in context switching and copying small data segments from kernel space to user space. I did a quick test implementation using recvmmsg, however things didn't improve by much. On average I can process about 40% of the incoming packets.
So I was wondering whether it was possible to get a handle on the kernel's UDP data buffer from my application (mmap style), or use some sort of zero copy from kernel to user space?
Alternatively, do you know of any other method which would reduce this overhead and be capable of performing the required processing?
This is running on a Linux machine (kernel 3.2.0-40) using C code.
There is support for mmap packet receiving in Linux.
It's not as easy to use as UDP sockets, because you will receive packets as if from a RAW socket.
See this for more information.
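Roughly, it looks like the sketch below (PACKET_RX_RING with the TPACKET_V1 layout, which needs root or CAP_NET_RAW; the ring sizes are arbitrary and error handling is abbreviated). The kernel writes received frames into a ring that is mmap()ed into the process, so there is no per-packet copy into a user buffer:

#include <stdio.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Ask the kernel for a ring of 64 blocks, 4 KiB each, one frame per block. */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 4096,
        .tp_frame_nr   = 64,
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING"); return 1;
    }

    /* Map the ring into user space; the kernel fills it with received frames. */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    for (unsigned i = 0; ; i = (i + 1) % req.tp_frame_nr) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)i * req.tp_frame_size);

        /* Wait until the kernel hands this frame to user space. */
        while (!(hdr->tp_status & TP_STATUS_USER)) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
        }

        unsigned char *frame = (unsigned char *)hdr + hdr->tp_mac;
        printf("got %u bytes\n", hdr->tp_snaplen);
        (void)frame; /* parse the Ethernet/IP/UDP headers here */

        /* Return the frame slot to the kernel. */
        hdr->tp_status = TP_STATUS_KERNEL;
    }
}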

ioctl vs netlink vs memmap to communicate between kernel space and user space

We have some statistics information from our custom hardware that is displayed whenever the user asks for it with a command in Linux user space. The implementation currently uses the PROC interface. When we started adding more statistics information, we ran into a problem: a particular statistics command had to be executed twice to get all the data, because the PROC interface is restricted to one page.
As mentioned above, the data transfer between the kernel and user space is not critical, but the user may make some decisions based on the data. Our requirements for this interface design are that it should be capable of transferring amounts of data possibly greater than 8192 bytes, and that the command should use minimal kernel resources (like locks, etc.) and be quick.
Using ioctl could solve the issue, but since the command is not exactly controlling the device, only collecting some statistics information, I'm not sure whether it is a good mechanism to use on Linux. We are currently using the 3.4 kernel; I'm not sure whether Netlink is lossy in this version (in previous versions I came across issues such as the socket starting to drop data when the queue becomes full). mmap is another option. Can anyone suggest what would be the best interface to use?
Kernel services can send information directly to user applications over Netlink, while you'd have to explicitly poll the kernel with ioctl functions, a relatively expensive operation.
Netlink comms is very much asynchronous, with each side receiving messages at some point after the other side sends them. ioctls are purely synchronous: “Hey kernel, WAKE UP! I need you to process my request NOW! CHOP CHOP!”
Netlink supports multicast communications between the kernel and multiple user-space processes, while ioctls are strictly one-to-one.
Netlink messages can be lost for various reasons (e.g. out of memory), while ioctls are generally more reliable due to their immediate-processing nature.
So if you are asking the kernel for statistics from user space (the application), it is more reliable and easier to use ioctl, while if you generate statistics in kernel space and you want the kernel to push that data to user space (the application), you have to use Netlink sockets.
You can do an ioctl _IO call (rather than _IOR, _IOW, or _IOWR). Ioctls can be very useful for collecting information. You'll have a lot of flexibility this way, in that you can pass different-size buffers or structs to fill with data.
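A hedged sketch of the user-space side of such a statistics ioctl; /dev/mystats, the MYSTATS_* names, and the struct layout are hypothetical, and the sketch uses _IOR since it only reads a struct back (the matching driver would implement the same request number in its ioctl handler and copy_to_user() the struct):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Filled in by the (hypothetical) driver. */
struct mystats {
    unsigned long rx_packets;
    unsigned long rx_dropped;
    unsigned long tx_packets;
};

#define MYSTATS_MAGIC 'S'
#define MYSTATS_GET   _IOR(MYSTATS_MAGIC, 1, struct mystats)

int main(void)
{
    int fd = open("/dev/mystats", O_RDONLY);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    struct mystats st;
    if (ioctl(fd, MYSTATS_GET, &st) < 0) { perror("ioctl"); return 1; }

    printf("rx=%lu dropped=%lu tx=%lu\n",
           st.rx_packets, st.rx_dropped, st.tx_packets);
    close(fd);
    return 0;
}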

udp send from driver

I have a driver that needs to:
receive data from an FPGA
DMA the data to another device (DSP) for encoding
send the encoded data via UDP to an external host
The original plan was to have the application handle step 3, but the application doesn't get the processor in time to process the data before the next set of data arrives from the FPGA.
Is there a way to force the scheduler (from the driver) to run my application?
If not, I think work queues are likely the solution I need to use, but I'm not sure how/where to call into the network stack/driver to accomplish the UDP transfers from the work queues.
Any ideas?
You should try to discover why the application "can't get the data fast enough".
Your memory bandwidth is probably vastly superior to the typical Ethernet bandwidth, so passing data from the driver to the application should not be the bottleneck even if it involves copying.
If the UDP link is not fast enough in user space, it won't be faster in kernel space.
What you need to do is :
understand why your application is not fast enough, maybe by stracing it.
implement queuing in userspace.
You can probably split your application into two threads sharing a buffer list, as in the sketch after this list:
thread A waits for the driver to have data available, and puts it at the tail of the list.
thread B reads data from the head of the list and sends it through UDP. If for some reason thread B is busy waiting for a particular buffer to be sent, the FIFO fills up a bit, but as long as the UDP link bandwidth is larger than the rate of data coming from the DSP you should be fine.
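A minimal sketch of that two-thread split using POSIX threads; read_from_driver() and send_udp() are hypothetical placeholders for the real driver read and the UDP sendto() path:

#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical placeholders: blocking read from the driver, and a UDP send. */
size_t read_from_driver(uint8_t *buf, size_t cap);
void   send_udp(const uint8_t *buf, size_t len);

#define NBUF  64
#define BUFSZ 4096

static uint8_t  bufs[NBUF][BUFSZ];
static size_t   lens[NBUF];
static unsigned head, tail, count;           /* ring of filled buffers */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

/* Thread A: wait for the driver, append data at the tail of the list. */
void *reader_thread(void *arg)
{
    (void)arg;
    for (;;) {
        /* Reserve a free slot (single producer, so `tail` is ours alone). */
        pthread_mutex_lock(&lock);
        while (count == NBUF)
            pthread_cond_wait(&not_full, &lock);
        unsigned slot = tail;
        pthread_mutex_unlock(&lock);

        lens[slot] = read_from_driver(bufs[slot], BUFSZ);  /* may block */

        /* Publish the filled slot. */
        pthread_mutex_lock(&lock);
        tail = (slot + 1) % NBUF;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Thread B: take the oldest filled slot from the head and send it. */
void *sender_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);
        unsigned slot = head;
        pthread_mutex_unlock(&lock);

        send_udp(bufs[slot], lens[slot]);                  /* may block */

        /* Release the slot back to the producer. */
        pthread_mutex_lock(&lock);
        head = (slot + 1) % NBUF;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}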
Moving things into the kernel does not make things magically faster; it is just MUCH harder to code, debug, and trace.

Avoid copying of data between user and kernel space and vice-versa

I am developing an active messaging protocol for parallel computation that replaces TCP/IP. My goal is to decrease packet latency. Since the environment is a LAN, I can replace TCP/IP with a simpler protocol to reduce the packet latency. I am not writing any device driver; I am just trying to replace the TCP/IP stack with something simpler. Now I want to avoid copying a packet's data from user space to kernel space and vice versa. I have heard of mmap(). Is it the best way to do this? If yes, it would be nice if you could give links to some examples. I am a Linux newbie and I really appreciate your help. Thank you.
Thanks,
Bala
You should use UDP, which is already pretty fast. At least it was fast enough for W32/SQLSlammer to spread through the whole Internet.
About your initial question, see the (vm)splice and tee Linux system calls.
From the manpage:
The three system calls splice(2), vmsplice(2), and tee(2) provide userspace programs with full control over an arbitrary kernel buffer, implemented within the kernel using the same type of buffer that is used for a pipe. In overview, these system calls perform the following tasks:
splice(2)
moves data from the buffer to an arbitrary file descriptor, or vice versa, or from one buffer to another.
tee(2)
"copies" the data from one buffer to another.
vmsplice(2)
"copies" data from user space into the buffer.
Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.
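A minimal, self-contained sketch of vmsplice() feeding a pipe and splice() moving the result on to a file ("out.bin" is an arbitrary name; error handling is abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    static char payload[] = "handed to the pipe by vmsplice\n";
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) - 1 };

    int p[2];
    if (pipe(p) < 0) { perror("pipe"); return 1; }

    int out = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { perror("open"); return 1; }

    /* Reference the user pages from the pipe buffer (no bulk copy). */
    if (vmsplice(p[1], &iov, 1, 0) < 0) { perror("vmsplice"); return 1; }

    /* Move those pages on to the file descriptor. */
    if (splice(p[0], NULL, out, NULL, iov.iov_len, SPLICE_F_MOVE) < 0) {
        perror("splice"); return 1;
    }
    return 0;
}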
Since the environment is a LAN, I can replace TCP/IP with a simpler protocol to reduce the packet latency
Generally, even on a LAN, UDP packets tend to get lost, and they will also be lost if the client does not have enough time to consume them...
So no, do not replace TCP with something else (UDP). If you do need reliable delivery, TCP will be the fastest, because everything connected to acknowledgments and retransmission is done in kernel space.
Generally, in the normal case there are no latency drawbacks to using TCP (of course, do not forget the TCP_NODELAY option).
About sharing the memory: actually, all the memory you allocate is created with mmap, so the kernel will need to copy it somehow in any case when it creates a packet from the driver.
If you are talking about reducing copying, that is usually done for files/sockets, and sendfile() is used for that; it indeed prevents copying data between kernel and user space. But I assume you do not need to send files.
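For completeness, a small sketch of the sendfile() path mentioned above, assuming sock_fd is an already-connected socket and path names a regular file:

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a whole file over a socket; the kernel moves the file pages straight
 * to the socket, so the data never passes through a user-space buffer. */
int send_whole_file(int sock_fd, const char *path)
{
    int in = open(path, O_RDONLY);
    if (in < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); close(in); return -1; }

    off_t off = 0;
    while (off < st.st_size) {
        ssize_t n = sendfile(sock_fd, in, &off, (size_t)(st.st_size - off));
        if (n <= 0) { perror("sendfile"); close(in); return -1; }
    }
    close(in);
    return 0;
}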

Resources