How to cope with 320 million 272-byte UDP packets? - linux

So, I have an incoming UDP stream composed of 272-byte packets at a data rate of about 5.12 Gb/s (around 320e6 packets per second). This data is being sent by an FPGA-based custom board. The packet size is a limit of the digital design being run, so although it could theoretically be increased to make things more efficient, doing so would require a large amount of work. At the receiving end these packets are read and interpreted by a network thread and placed in a circular buffer shared with a buffering thread, which copies the data to a GPU for processing.
The above setup at the receiving end could cope with 5.12 Gb/s for 4096-byte packets (used in a different design) using simple recv calls, but with the current packet size I'm having a hard time keeping up with the packet flow: too much time is being "wasted" on context switching and on copying small data segments from kernel space to user space. I did a quick test implementation using recvmmsg, but things didn't improve much. On average I can process about 40% of the incoming packets.
So I was wondering whether it is possible to get a handle on the kernel's UDP data buffer from my application (mmap style), or to use some sort of zero-copy transfer from kernel space to user space?
Alternatively, do you know of any other method which would reduce this overhead and be capable of performing the required processing?
This is running on a Linux machine (kernel 3.2.0-40) using C code.

There is support for mmap packet receiving in Linux (PACKET_MMAP).
It's not as easy to use as UDP sockets, because you receive packets as from a RAW socket.
See the kernel's packet_mmap documentation for more information.

Related

Should I send data in chunks, or send it all at once?

I have Python code that sends data to a socket (a rather large file). Should I divide it into 1 KB chunks, or would just conn.sendall(file.read()) be acceptable?
It will make little difference to the sending operation. (I assume you are using a TCP socket for the purposes of this discussion.)
When you attempt to send 1K, the kernel will take that 1K, copy it into kernel TCP buffers, and return success (and probably begin sending to the peer at the same time). At which point, you will send another 1K and the same thing happens. Eventually if the file is large enough, and the network can't send it fast enough, or the receiver can't drain it fast enough, the kernel buffer space used by your data will reach some internal limit and your process will be blocked until the receiver drains enough data. (This limit can often be pretty high with TCP -- depending on the OS, you may be able to send a megabyte or two without ever hitting it.)
If you try to send in one shot, pretty much the same thing will happen: data will be transferred from your buffer into kernel buffers until/unless some limit is reached. At that point, your process will be blocked until data is drained by the receiver (and so forth).
However, with the first mechanism, you can send a file of any size without using undue amounts of memory -- your in-memory buffer (not including the kernel TCP buffers) only needs to be 1K long. With the sendall approach, file.read() will read the entire file into your program's memory. If you attempt that with a truly giant file (say 40G or something), that might take more memory than you have, even including swap space.
So, as a general purpose mechanism, I would definitely favor the first approach. For modern architectures, I would use a larger buffer size than 1K though. The exact number probably isn't too critical; but you could choose something that will fit several disk blocks at once, say, 256K.

What's the practical limit on the size of single packet transmitted over domain socket?

Let us assume that there is a Unix domain socket created for a typical server-client program. The client sends a 10GB buffer over the socket, and it is consumed by the server in the meantime.
Does the OS (Linux/BSD) split the 10GB buffer into many packets and send/consume them, or is it sent in one go?
If it is not possible to send a 10GB buffer over a domain socket in one go, then what is the practical size limit of a single packet?
Constraints:
The program will run on both Linux 2.6.32+ and FreeBSD 9+
Size of the buffer to be sent ranges from 3 bytes to 10GB maximum.
There are a number of factors which will determine the maximum size of a packet that can be sent over a Unix socket:
The wmem_max socket send buffer maximum size kernel setting, which determines the maximum size of the send buffer that can be set using setsockopt (SO_SNDBUF). The current setting can be read from /proc/sys/net/core/wmem_max and can be set using sysctl net.core.wmem_max=VALUE (add the setting to /etc/sysctl.conf to make the change persistent across reboots). Note this setting applies to all sockets and socket protocols, not just to Unix sockets.
If multiple packets are sent to a Unix socket (using SOCK_DGRAM), then the maximum amount of data which can be sent without blocking depends on both the size of the socket send buffer (see above) and the maximum number of unread packets on the Unix socket (kernel parameter net.unix.max_dgram_qlen).
Finally, a packet (SOCK_DGRAM) requires contiguous kernel memory (as per What is the max size of AF_UNIX datagram message that can be sent in linux?). How much contiguous memory is available in the kernel will depend on many factors (e.g. the I/O load on the system, etc.).
So to maximize the performance on your application, you need a large socket buffer size (to minimize the user/kernel space context switches due to socket write system calls) and a large Unix socket queue (to decouple the producer and consumer as much as possible). However, the product of the socket send buffer size and queue length must not be so large as to cause the kernel to run out of contiguous memory areas (causing write failures).
The actual figures will depend on your system configuration and usage. You will need to determine the limits by testing... start, say, with wmem_max at 256 KB and max_dgram_qlen at 32, and keep doubling wmem_max until you notice things starting to break. You will need to adjust max_dgram_qlen to balance the activity of the producer and consumer to a certain extent (although if the producer is much faster or much slower than the consumer, the queue size won't have much effect).
Note your producer will have to specifically set up the socket send buffer size to wmem_max bytes with a call to setsockopt (SO_SNDBUF), and will have to split data into wmem_max-byte chunks (and the consumer will have to reassemble them).
Best guess: the practical limits will be around wmem_max ~8 MB and max_dgram_qlen ~32.
There are no "packets" per se with domain sockets. The semantics of tcp "streams" or udp "datagrams" are sort of simulated within the kernel to look similar to user-space apps, but that's about as far as it goes. The mechanics aren't as involved as network sockets using network protocols. What you are really interested in here is how much the kernel will buffer for you.
From your program's perspective it doesn't really matter. Think of the socket as a pipe or FIFO. When the buffer fills you are going to block; if the socket is non-blocking you are going to get short writes (assuming streams) or an EAGAIN error. This is true regardless of the size of the buffer. However, you should be able to query the buffer size with getsockopt and increase it with setsockopt, but I doubt you are going to get anywhere near 10GB.
Alternatively, you might look at sendfile.
There are two things here: the size of a single packet sent using SOCK_DGRAM, and the size of the buffer backing the domain socket. Both depend on the variables set for the domain socket; the size can also depend on whether it is a memory-backed file socket.
If you're talking about SOCK_DGRAM, the limit is easily determined by experiment. It seems a lot more likely that you're talking about SOCK_STREAM, in which case it simply does not matter: SOCK_STREAM will sort it out for you. Just write in whatever size chunks you like; the larger the better.

udp send from driver

I have a driver that needs to:
receive data from an FPGA
DMA the data to another device (DSP) for encoding
send the encoded data via UDP to an external host
The original plan was to have the application handle step 3, but the application doesn't get the processor in time to process the data before the next set of data arrives from the FPGA.
Is there a way to force the scheduler (from the driver) to run my application?
If not, I think work queues are likely the solution I need to use, but I'm not sure how/where to call into the network stack/driver to accomplish the UDP transfers from the work queues.
Any ideas?
You should try to discover why the application "can't get the data fast enough".
Your memory bandwidth is probably vastly superior to typical Ethernet bandwidth, so even if passing data from the driver to the application involves copying, the copy should not be the bottleneck.
If the UDP link is not fast enough from user space, it won't be faster from kernel space.
What you need to do is :
understand why your application is not fast enough, maybe by stracing it.
implement queuing in userspace.
You can probably split your application into two threads sharing a buffer list:
thread A waits for the driver to have data available, and puts it at the tail of the list.
thread B reads data from the head of the list and sends it through UDP. If for some reason thread B is stuck waiting for a particular buffer to be sent, the FIFO fills up a bit, but as long as the UDP link bandwidth is larger than the rate of data coming from the DSP you should be fine.
Moving things into the kernel does not make things magically faster; it is just MUCH harder to code, debug, and trace.

What happens after a packet is captured?

I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in kernel space, then to user space for whatever application then works on the packet data. Then I read about DMA, where the NIC copies the packet directly into memory, bypassing the CPU. So is the NIC -> kernel memory -> user-space memory flow still valid? Also, do most NICs (e.g. Myricom) use DMA to improve packet capture rates?
Secondly, does RSS (Receive Side Scaling) work similarly in both Windows and Linux systems? I can only find detailed explanations of how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concepts of RSS and MSI-X should still apply to Linux systems, right?
Thank you.
Regards,
Rayne
How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
At driver initialization, it will allocate some number of buffers and give these to the NIC.
When a packet is received by the NIC, it pulls the next address off its list of buffers, DMAs the data directly into it, and notifies the driver via an interrupt.
The driver gets the interrupt and can either turn the buffer over to the kernel, or allocate a new kernel buffer and copy the data. "Zero copy networking" is the former and obviously requires support from the operating system. (More on this below.)
The driver then needs either to allocate a new buffer (in the zero-copy case) or to re-use the old one. Either way, a buffer is given back to the NIC for future packets.
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it may as well copy the data too: looking at the data incurs cache misses, and copying it elsewhere can be nearly free with tuned code).
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and both processes are able to recv() incoming data.
TCP packets on the Internet are usually 1460 bytes of payload (MTU of 1500 = 20-byte IP header + 20-byte TCP header + 1460 bytes of data). 1460 is not a power of 2 and won't match a page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented: there are no boundaries between sender writes, and two 1000-byte writes waiting at the receiver may be consumed entirely in one 2000-byte read.
Taking this further, consider the user buffers. These are allocated by the application. In order to be used for zero-copy all the way down, the buffer needs to be page-aligned and not share that memory page with anything else. At recv() time, the kernel could theoretically remap the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above since successive packets will be on separate pages. The kernel could limit the data it hands back to each packet's payload, but this will mean a lot of additional system calls, page remapping and likely lower throughput overall.
I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.
Take a look at this paper: http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf. It might help clear up some of the memory-management questions.

Where are possible locations of queueing/buffering delays in Linux multicast?

We make heavy use of multicast messaging across many Linux servers on a LAN. We are seeing a lot of delays. We basically send an enormous number of small packets. We are more concerned with latency than throughput. The machines are all modern, multi-core (at least four, generally eight, 16 if you count hyperthreading) machines, always with a load of 2.0 or less, usually with a load less than 1.0. The networking hardware is also under 50% capacity.
The delays we see look like queueing delays: the packets will quickly start increasing in latency, until it looks like they jam up, then return back to normal.
The messaging structure is basically this: in the "sending thread", pull messages from a queue, add a timestamp (using gettimeofday()), then call send(). The receiving program receives the message, timestamps the receive time, and pushes it in a queue. In a separate thread, the queue is processed, analyzing the difference between sending and receiving timestamps. (Note that our internal queues are not part of the problem, since the timestamps are added outside of our internal queuing.)
We don't really know where to start looking for an answer to this problem. We're not familiar with Linux internals. Our suspicion is that the kernel is queuing or buffering the packets, either on the send side or the receive side (or both). But we don't know how to track this down and trace it.
For what it's worth, we're using CentOS 4.x (RHEL kernel 2.6.9).
This is a great question. On CentOS, like most flavors of *nix, there is a UDP receive/send buffer for every multicast socket. The size of this buffer is controlled via sysctl.conf; you can view the size of your buffers by calling /sbin/sysctl -a.
The items below show my default and max UDP receive buffer sizes in bytes. The larger these numbers, the more buffering, and therefore latency, the network/kernel can introduce if your application is too slow in consuming the data. If you have built in good tolerance for data loss, you can make these buffers very tiny and you will not see the latency build-up and recovery you described above. The trade-off is data loss as the buffer overflows -- something you may be seeing already.
[~]$ /sbin/sysctl -a | grep mem
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
In most cases you need to set the default equal to your max, unless you are controlling this when you create your socket.
The last thing you can do (depending on your kernel version) is view the UDP stats for your process's PID, or at the very least for the box overall.
cat /proc/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
cat /proc/PID/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
If it wasn't clear from my post, the latency is due to your application not consuming the data fast enough and forcing the kernel to buffer traffic in the structures above. The network, kernel, and even your network card's ring buffers can play a role in latency, but all those items typically only add a few milliseconds.
Let me know your thoughts and I can give you more information on where to look in your app to squeeze some more performance.
Packets can queue up in the send and receive side kernel, the NIC and the networking infrastructure. You will find a plethora of items you can test and tweak.
For the NIC you can usually find interrupt coalescing parameters - how long the NIC will wait before notifying the kernel or sending to the wire whilst waiting to batch packets.
For Linux you have the send and receive "buffers", the larger they are the more likely you are to experience higher latency as packets get handled in batched operations.
For your architecture and Linux version, you have to be aware of how expensive context switches are and whether locks or pre-emptive scheduling are enabled. Consider minimizing the number of applications running, and using process affinity to pin processes to particular cores.
Don't forget timing: the Linux kernel version you are using has pretty terrible accuracy for the gettimeofday() clock (2-4 ms), and it is quite an expensive call. Consider alternatives such as reading from the core TSC or an external HPET device.
Diagram from Intel:
http://www.theinquirer.net/IMG/142/96142/latency-580x358.png
If you decide you need to capture packets in the production environment, it may be worth looking at using monitor ports on your switches and capture the packets using non-production machines. That'll also allow you to capture the packets on multiple points across the transmission path and compare what you're seeing.
