What is the main difference between RSS, RPS and RFS? - linux

As documented in https://www.kernel.org/doc/Documentation/networking/scaling.txt, there are:
RSS: Receive Side Scaling
RPS: Receive Packet Steering
RFS: Receive Flow Steering
Does it mean that:
RSS - allows us to use many CPU cores to process soft-IRQs from Ethernet (one CPU core per Ethernet queue)
RPS - allows us to process the soft-IRQs for all packets of one and the same connection on one and the same CPU core
RFS - allows us to process the soft-IRQs for all packets of one and the same connection on the CPU core on which a thread of our application processes this connection
Is that correct?

Quotes are from https://www.kernel.org/doc/Documentation/networking/scaling.txt.
RSS: Receive Side Scaling - is implemented in hardware and hashes some bytes of each packet ("hash function over the network and/or transport layer headers-- for example, a 4-tuple hash over IP addresses and TCP ports of a packet"). Implementations differ; some may not hash the most useful bytes or may be limited in other ways. This filtering and queue distribution is fast (only a few additional cycles are needed in hardware to classify a packet), but it is not portable between some network cards, cannot be used with tunneled packets or some rare protocols, and sometimes the hardware does not support enough queues to give one queue per logical CPU core.
RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length.
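To illustrate the idea only (real NICs use a Toeplitz hash with a programmable key; this is a deliberately simplified stand-in): hashing the 4-tuple deterministically sends every packet of a flow to the same queue, and therefore to the same core:
#include <stdint.h>
#include <stdio.h>
/* Simplified illustration of RSS-style steering: hash the 4-tuple and
 * reduce it to a queue index. Not the Toeplitz hash real hardware uses;
 * it only shows why all packets of one flow land in the same queue. */
static unsigned pick_rx_queue(uint32_t saddr, uint32_t daddr,
                              uint16_t sport, uint16_t dport,
                              unsigned nr_queues)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;                /* mix the bits a little */
    return h % nr_queues;        /* same flow -> same queue -> same core */
}
int main(void)
{
    /* Two packets of the same TCP flow always map to the same queue. */
    printf("%u\n", pick_rx_queue(0x0a000001, 0x0a000002, 40000, 80, 8));
    printf("%u\n", pick_rx_queue(0x0a000001, 0x0a000002, 40000, 80, 8));
    return 0;
}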
Receive Packet Steering (RPS) "is logically a software implementation of RSS. Being in software, it is necessarily called later in the datapath." So this is the software alternative to hardware RSS (it still parses some bytes and hashes them to a queue ID), useful when your hardware has no RSS, when you want to classify on a more complex rule than the hardware can, or when your protocol cannot be parsed by the hardware RSS classifier. The cost is that RPS uses more CPU resources and adds inter-CPU traffic.
RPS has some advantages over RSS: 1) it can be used with any NIC,
2) software filters can easily be added to hash over new protocols,
3) it does not increase hardware device interrupt rate (although it does
introduce inter-processor interrupts (IPIs)).
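In practice RPS is enabled per receive queue by writing a CPU bitmap to the rps_cpus file described in scaling.txt. A minimal sketch, assuming a hypothetical device eth0 and that CPUs 0-3 should share the soft-IRQ load (device name and mask are examples only; this needs root):
#include <stdio.h>
/* Enable RPS on eth0's first receive queue: the hex value is a CPU
 * bitmap, here 0xf = CPUs 0-3. Equivalent to:
 *   echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
 * Device name and mask are examples; adjust for your system. */
int main(void)
{
    FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");
    if (!f) {
        perror("rps_cpus");
        return 1;
    }
    fprintf(f, "f\n");
    fclose(f);
    return 0;
}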
RFS: Receive Flow Steering is like RPS (a software mechanism with more CPU overhead), but instead of just hashing into a pseudo-random queue ID it takes "into account application locality" (so packet processing will probably be faster thanks to good cache locality). Queues are tracked so that they stay local to the thread that will consume the received data, and packets are delivered to the correct CPU core.
The goal of RFS is to increase datacache hitrate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU. ... In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed.
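Turning RFS on also means sizing the flow tables described in scaling.txt: the global rps_sock_flow_entries and the per-queue rps_flow_cnt. A minimal sketch, assuming a hypothetical single-queue eth0; 32768 is the starting point suggested in the kernel documentation for a moderately loaded server, not a tuned value:
#include <stdio.h>
static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", val);
    fclose(f);
    return 0;
}
int main(void)
{
    /* Global socket flow table, shared by all receive queues. */
    write_str("/proc/sys/net/core/rps_sock_flow_entries", "32768");
    /* Per-queue flow count; for a single-queue device it can equal the
     * global table size. Device name is an example. */
    write_str("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt", "32768");
    return 0;
}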
Accelerated RFS - RFS with hardware support (check whether your network driver implements ndo_rx_flow_steer). "Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load balancing mechanism that uses soft state to steer flows based on where the application thread consuming the packets of each flow is running."
There is a similar mechanism for packet transmission (there the packet is already generated and ready to be sent; we only need to select the best queue to send it on, which also makes post-processing such as freeing the skb easier):
XPS: Transmit Packet Steering: "a mapping from CPU to hardware queue(s) is
recorded. The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set"
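XPS is configured analogously, per transmit queue, through the xps_cpus bitmap. A minimal sketch, again with a hypothetical eth0 and CPU 0 steering to the first transmit queue:
#include <stdio.h>
/* Map CPU 0 to eth0's first transmit queue. Equivalent to:
 *   echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
 * Device name and mask are examples; needs root. */
int main(void)
{
    FILE *f = fopen("/sys/class/net/eth0/queues/tx-0/xps_cpus", "w");
    if (!f) {
        perror("xps_cpus");
        return 1;
    }
    fprintf(f, "1\n");
    fclose(f);
    return 0;
}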

osgx's answer covers the main differences, but it is important to point out that it is also possible to use RSS and RPS in unison.
RSS controls which HW queue is selected for receiving a stream of packets. Once certain conditions are met, an interrupt is issued to the SW. The interrupt handler, which is defined by the NIC's driver, is the SW starting point for processing received packets. The code there polls the packets from the relevant receive queue, may perform some initial processing, and then moves the packets up for higher-level protocol processing.
At this point the RPS mechanism may kick in, if configured. The driver calls netif_receive_skb(), which (eventually) checks the RPS configuration. If it exists, the SKB is enqueued for continued processing on the selected CPU:
int netif_receive_skb(struct sk_buff *skb)
{
    ...
    return netif_receive_skb_internal(skb);
}

static int netif_receive_skb_internal(struct sk_buff *skb)
{
    ...
    /* rflow and ret are declared in the elided code above. Ask RPS which
     * CPU should continue processing this skb; get_rps_cpu() returns a
     * negative value when RPS is not configured. */
    int cpu = get_rps_cpu(skb->dev, skb, &rflow);

    if (cpu >= 0) {
        /* Queue the skb on the chosen CPU's backlog and notify it with
         * an inter-processor interrupt. */
        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
        rcu_read_unlock();
        return ret;
    }
    ...
}
In some scenarios, it would be smart to use RSS and RPS together in order to avoid CPU utilization bottlenecks on the receiving side. A good example is IPoIB (IP over Infiniband). Without diving into too many details, IPoIB has a mode which can only open a single channel. This means all the incoming traffic would be handled by a single core. By properly configuring RPS, some of the processing load can be shared by multiple cores, which dramatically improves performance for this scenario.
Since transmitting was mentioned, it is worth noting that packet transmission which results from the receiving process (ACKs, forwarding) is processed on the same core selected by netif_receive_skb().
Hope this helps.

Related

What is the difference between Memory and IO bandwidth and how do we measure each one?

What is the difference between memory and io bandwidth, and how do you measure each one?
I have so many assumptions; forgive the verbosity of this two-part question.
The inspiration for these questions came from: What is the meaning of IB read, IB write, OB read and OB write? They came as output of Intel® PCM while monitoring PCIe bandwidth, where Hadi explains:
DATA_REQ_OF_CPU is NOT used to measure memory bandwidth but i/o bandwidth.
I'm wondering if the difference between memory and IO bandwidth is similar to the difference between DMA (direct memory access) and MMIO (memory-mapped IO), or if the bandwidth of both IS IO bandwidth?
I’m trying to use this picture to help visualize:
(Hopefully I have this right.) In x86 there are two address spaces: memory and IO. Would IO bandwidth be the measure between the CPU (or DMA controller) and the IO device, and memory bandwidth the measure between the CPU and main memory? Is all data in these two scenarios running through the memory bus? Just for clarity, do we all agree that the memory bus is the combination of the address and data buses? If so, that part of the image might be a little misleading...
If we can measure IO bandwidth with Intel® Performance Counter Monitor (PCM) by utilizing the pcm-iio program, how would we measure memory bandwidth? Now I'm wondering why they would differ if they run through the same wires. Unless I just have this all wrong. The GitHub page for a lot of this test code is a bit overwhelming: https://github.com/opcm/pcm
Thank you
The DATA_REQ_OF_CPU event cannot be used to measure memory bandwidth for the following reasons:
Not all inbound memory requests from an IIO controller are serviced by a memory controller, because a request could also be serviced by the LLC (or an LLC in the case of multiple sockets). Note, however, that on Intel processors that don't support DDIO, IO memory read requests may cause speculative read requests to be sent to memory in parallel with the LLC lookup.
The DATA_REQ_OF_CPU event has many subevents. The inbound memory metrics measured by the pcm-iio tool don't include all types of memory requests. Specifically, they don't include atomic memory reads and writes and IOMMU memory requests, which may consume memory bandwidth.
Some subevents count non-memory requests. For example, there are peer-to-peer requests (from one IIO to another).
An IO device may want to access memory on a NUMA node that is different from the node to which it's connected. In this case, it will consume memory bandwidth on a different NUMA node.
Now I realize the statement you quoted is a little ambiguous; I don't remember whether I was talking specifically about the metrics measured by pcm-iio or the event in general, or whether "memory bandwidth" refers to total memory bandwidth or only the portion consumed by IO devices attached to an IIO. However, the statement is correct under any of these interpretations, for the reasons mentioned above.
The pcm-iio tool only measures IO bandwidth. Use the pcm-memory tool instead for measuring memory bandwidth; it utilizes the performance events of the IMCs. It appears to me that none of the PCM tools can measure the memory bandwidth consumed by IO devices, which requires using the CBox events.
The main source of information on uncore performance events is the Intel uncore manuals. You'll find nice figures in the Introduction chapters of these manuals that show how the different units of a processor are connected to each other.

how to achieve best tcpip socket data-rate performance

Packing as much data as possible per TCP packet will obviously decrease the relative weight of overhead. Increasing the buffer size increases robustness against peaks of CPU usage.
But what else can be done to achieve highest data rates?
Is increasing the priority of the data reader thread a good idea? If the highest priority was used, could this thread compete for CPU usage with the Network driver and actually harm performance?
Is blocking or non-blocking best in terms of achievable data rates?
At very high data rates, can overflow of the receive buffers be detected as the buffer gets to, say, 90% and trigger a high priority read?
Other techniques for high data-rate over tcpip sockets?
One way would be to use busy polling for getting data from the NIC. This can improve the data rate by reducing interrupt overhead. This is done in high-performance packet processing frameworks such as DPDK.
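Short of a full kernel-bypass framework, Linux itself supports busy polling per socket via the SO_BUSY_POLL option (kernel 3.11+, with NIC driver support; setting it may require CAP_NET_ADMIN). A minimal sketch, where the 50-microsecond budget and the TCP socket are only example choices:
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* Busy-poll the device queue for up to 50 us on blocking reads
     * instead of sleeping until the interrupt arrives. Example value;
     * larger budgets trade CPU time for lower latency. */
    int usec = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0)
        perror("SO_BUSY_POLL");
    return 0;
}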
Another way would be to avoid copying packets from kernel space to userspace. I don't know whether this is possible in your case. The copies can be avoided by mapping kernel memory into userspace memory. Copying data to userspace is one of the most time-consuming steps in network stacks. Again, this is done in DPDK.

How to cope with 320 million 272-byte UDP packets?

So, I have an incoming UDP stream composed of 272 byte packets at a data rate of about 5.12Gb/s (around 320e6 packets per second). This data is being sent by an FPGA-based custom board. The packet size is a limit of the digital design being run, so although theoretically it could be possible to increase it to make things more efficient, it would require a large amount of work. At the receiving end these packets are read and interpreted by a network thread and placed in a circular buffer shared with a buffering thread, which will copy this data to a GPU for processing.
The above setup at the receiving end could cope with 5.12Gb/s for 4096 KB packets (used on a different design) using simple recv calls; however, with the current packet size I'm having a hard time keeping up with the packet flow - too much time is being "wasted" in context switching and copying small data segments from kernel space to user space. I did a quick test implementation using recvmmsg, but things didn't improve by much. On average I can process about 40% of the incoming packets.
So I was wondering whether it is possible to get a handle on the kernel's UDP data buffer for my application (mmap style), or to use some sort of zero-copy from kernel to user space?
Alternatively, do you know of any other method which would reduce this overhead and be capable of performing the required processing?
This is running on a Linux machine (kernel 3.2.0-40) using C code.
There is support for mmap packet receiving in Linux.
It's not as easy to use as UDP sockets, because you will receive packets as from a RAW socket.
See this for more information.
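The mechanism referred to is PACKET_MMAP: a memory-mapped RX ring on an AF_PACKET socket, documented in the kernel's Documentation/networking/packet_mmap.txt. A minimal setup sketch, assuming CAP_NET_RAW and an arbitrary ring geometry; note the socket sees raw Ethernet frames, so the application still has to pick out its own UDP port:
#include <stdio.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
int main(void)
{
    /* AF_PACKET delivers raw frames (link-layer header included);
     * requires CAP_NET_RAW. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* Describe the shared ring: 64 blocks of 4 KiB, 2 KiB frames.
     * These numbers are arbitrary for the sketch. */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 64 * 4096 / 2048,
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING");
        return 1;
    }
    /* Map the ring once; the kernel then writes frames directly into it
     * and the application reads them without a per-packet copy to
     * userspace. */
    size_t len = (size_t)req.tp_block_size * req.tp_block_nr;
    void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Each frame slot starts with a struct tpacket_hdr whose tp_status
     * field says whether it holds a packet (TP_STATUS_USER) or is free. */
    return 0;
}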

What happens after a packet is captured?

I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in the kernel space, then to the user space for whatever application that then works on the packet data. Then I read about DMA, where the NIC directly copies the packet into memory, bypassing the CPU. So is the NIC -> kernel memory -> User space memory flow still valid? Also, do most NIC (e.g. Myricom) use DMA to improve packet capture rates?
Secondly, does RSS (Receive Side Scaling) work similarly in both Windows and Linux systems? I can only find detailed explanations of how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concepts of RSS and MSI-X should still apply to Linux systems, right?
Thank you.
Regards,
Rayne
How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
At driver initialization, it will allocate some number of buffers and give these to the NIC.
When a packet is received by the NIC, it pulls the next address off its list of buffers, DMAs the data directly into it, and notifies the driver via an interrupt.
The driver gets the interrupt, and can either turn the buffer over to the kernel or allocate a new kernel buffer and copy the data. "Zero copy networking" is the former and obviously requires support from the operating system (more below on this).
The driver then needs to either allocate a new buffer (in the zero-copy case) or re-use the old buffer. In either case, a buffer is given back to the NIC for future packets.
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it may as well copy the data too: looking at the data incurs cache misses, and copying it elsewhere can be nearly free with tuned code).
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and both processes are able to recv() incoming data.
TCP packets on the Internet are usually 1460 bytes of payload (MTU of 1500 = 20-byte IP header + 20-byte TCP header + 1460 bytes of data). 1460 is not a power of 2 and won't match a page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented. There is no distinction between the sender's writes: two 1000-byte writes waiting at the receiver will be consumed entirely by a 2000-byte read.
Taking this further, consider the user buffers. These are allocated by the application. In order to be used for zero-copy all the way down, the buffer needs to be page-aligned and not share that memory page with anything else. At recv() time, the kernel could theoretically remap the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above since successive packets will be on separate pages. The kernel could limit the data it hands back to each packet's payload, but this will mean a lot of additional system calls, page remapping and likely lower throughput overall.
I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.
Take a look at this paper, http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf - it might help clear up some of the memory management questions.

Where are possible locations of queueing/buffering delays in Linux multicast?

We make heavy use of multicast messaging across many Linux servers on a LAN. We are seeing a lot of delays. We basically send an enormous number of small packets. We are more concerned with latency than throughput. The machines are all modern, multi-core (at least four, generally eight, 16 if you count hyperthreading) machines, always with a load of 2.0 or less, usually with a load less than 1.0. The networking hardware is also under 50% capacity.
The delays we see look like queueing delays: the packets will quickly start increasing in latency, until it looks like they jam up, then return back to normal.
The messaging structure is basically this: in the "sending thread", pull messages from a queue, add a timestamp (using gettimeofday()), then call send(). The receiving program receives the message, timestamps the receive time, and pushes it in a queue. In a separate thread, the queue is processed, analyzing the difference between sending and receiving timestamps. (Note that our internal queues are not part of the problem, since the timestamps are added outside of our internal queuing.)
We don't really know where to start looking for an answer to this problem. We're not familiar with Linux internals. Our suspicion is that the kernel is queuing or buffering the packets, either on the send side or the receive side (or both). But we don't know how to track this down and trace it.
For what it's worth, we're using CentOS 4.x (RHEL kernel 2.6.9).
This is a great question. On CentOS, like most flavors of *nix, there is a UDP receive/send buffer for every multicast socket. The size of this buffer is controlled by sysctl.conf; you can view the size of your buffers by calling /sbin/sysctl -a.
The items below show my default and max UDP receive sizes in bytes. The larger these numbers, the more buffering and therefore latency the network/kernel can introduce if your application is too slow in consuming the data. If you have built in good tolerance for data loss, you can make these buffers very tiny and you will not see the latency build-up and recovery you described above. The trade-off is data loss as the buffer overflows - something you may be seeing already.
[~]$ /sbin/sysctl -a | grep mem
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
In most cases you need to set default equal to your max, unless you are controlling this when you create your socket.
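Controlling it at socket creation means calling setsockopt() with SO_RCVBUF before traffic starts; a minimal sketch (the 64 KiB value is only an example, and the kernel clamps the request to net.core.rmem_max):
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* Request a 64 KiB receive buffer for this socket. The kernel
     * doubles the value for bookkeeping and clamps it to
     * net.core.rmem_max; a smaller buffer limits how much latency the
     * kernel can hide, at the cost of drops when the app falls behind. */
    int bytes = 64 * 1024;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        perror("SO_RCVBUF");
    int actual;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("effective receive buffer: %d bytes\n", actual);
    return 0;
}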
The last thing you can do (depending on your kernel version) is view the UDP stats for your process's PID, or at the very least for the box overall.
cat /proc/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
cat /proc/PID/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
If it wasn't clear from my post, the latency is due to your application not consuming the data fast enough and forcing the kernel to buffer traffic in the structure above. The network, the kernel, and even your network card's ring buffers can play a role in latency, but all those items typically only add a few milliseconds.
Let me know your thoughts and I can give you more information on where to look in your app to squeeze some more performance.
Packets can queue up in the kernel on both the send and receive side, in the NIC, and in the networking infrastructure. You will find a plethora of items you can test and tweak.
For the NIC you can usually find interrupt coalescing parameters - how long the NIC will wait before notifying the kernel or sending to the wire whilst waiting to batch packets.
For Linux you have the send and receive "buffers", the larger they are the more likely you are to experience higher latency as packets get handled in batched operations.
For the architecture and Linux version you have to be aware of how expensive context switches are and whether there are locks or pre-emptive scheduling enabled. Consider minimizing the number of applications running, using process affinity to lock processes to particular cores.
Don't forget timing: the Linux kernel version you are using has pretty terrible accuracy for the gettimeofday() clock (2-4 ms) and it is quite an expensive call. Consider alternatives such as reading from the core TSC or an external HPET device.
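A minimal sketch of the timing alternatives: reading the TSC through the compiler intrinsic (x86/GCC-specific), plus clock_gettime(CLOCK_MONOTONIC) as an additional option worth comparing; on newer kernels the latter is cheap and fine-grained, while on a 2.6.9-era kernel the raw TSC read is the truly low-overhead path:
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(), x86 only */
int main(void)
{
    /* Raw TSC read: a handful of cycles, but it counts CPU cycles, so it
     * must be calibrated against wall-clock time and assumes a constant,
     * synchronized TSC across cores. */
    unsigned long long c0 = __rdtsc();
    /* clock_gettime(CLOCK_MONOTONIC): nanosecond-resolution kernel clock;
     * on modern kernels it is serviced from the vDSO without a syscall.
     * May need -lrt on older glibc. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    unsigned long long c1 = __rdtsc();
    printf("%ld.%09ld s, ~%llu cycles for one clock_gettime()\n",
           (long)ts.tv_sec, ts.tv_nsec, c1 - c0);
    return 0;
}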
Diagram from Intel: http://www.theinquirer.net/IMG/142/96142/latency-580x358.png?1272514422
If you decide you need to capture packets in the production environment, it may be worth looking at using monitor ports on your switches and capture the packets using non-production machines. That'll also allow you to capture the packets on multiple points across the transmission path and compare what you're seeing.
