differences between DPDK and Netfilter - linux

I want to bypass the Linux network stack and transform raw packets to my custom codes in userland and handle them in there.
I know that you can make your custom drivers using pf-rings or DPDK and others. But I can not understand why should I make these kinds of drivers while I can use the Netfilter and hook my module to NF_IP_PRE_ROUTING state and send the packets to userland.
It would be a great help for me if anyone can explain me the main differences between them.

There is a huge difference between DPDK and Netfilter hooks. When using Netfilter / hooking NF_IP_PRE_ROUTING you hijack the packet flow and copy packets form kernel space to user space. This copy causes a large overhead.
When using DPDK you're actually mapping you network card's packet buffers to a userspace memory area. Meaning that instead of the kernel getting an interrupt from the NIC, then passing it through all its queues until it reaches NF_IP_PRE_ROUTING which in turn will copy the packer to userland upon request, DPDK offers you the possibility to access the mapped packet buffers directly from userspace, bypassing all meta-handling by the kernel, effectively improving performance (at the cost of code complexity and security).

There are a variety of techniques to grab raw packets and deliver them to a userspace application. The devil as usual in the details.
If all we need is to deliver packets to a userspace application -- there is no difference what solution to use. Libpcap, or tun/taps, or Netfilter, or pf-ring, or whatever. All will do just fine.
But if we need to process 100 million packets per second (~30 CPU cycles per packet on 3GHz) -- I don't think we have other options at the moment but DPDK. Google for "DPDK performance report" and have a look.
DPDK is a framework which works well on many platforms (x86, ARM, POWER etc) and supports many NICs. There is no need to write a driver, the support for the most popular NICs is already there.
There is also a support to manage CPU cores, huge pages, memory buffers, encryption, IP fragmentation etc etc. All designed to be able to forward 100 Mpps. If we need that performance...

Related

Ethernet frames from NIC

I'm searching for help and an opinion-advice for a network project, in which I'm working lately. This requires a Linux machine to be a passive network appliance.
Network packets come in from one network interface and come out from another interface ( net--eth0-->Linux PC--eth1-->net) without making any modifications on data.
The application, which is going to run on the Linux system, will change only the order of the packets. It is going to be a "silly" network emulator application.
The first implementation was made with RAW sockets, where read() is called every time a packet arrives to user space and write() is called when an Ethernet packet should be sent down to the NIC.
I would like to know if there is a more practical and direct way than RAW sockets, bypassing Linux's network stack.
If what you want is to bypass the kernel, DPDK in Linux and NetMap in FreeBSD are options to do just that.
Indeed this can be done in dpdk in Linux. There are l3fw and l2fwd sample applications in the examples folder of the dpdk tree, which may inspire you. Also consider using vpp, a fd.io project hosted by Linux Foundation, which can use dpdk.
Rami Rosen

Embedded Linux on Zynq 7000, dropping almost all UDP packets

I am working with the Xilinx distribution of Linux on a Zynq 7000 board. This has two ARM processors, some L2 cache, a DRAM interface, and a large amount of FPGA fabric. Our appliance collects data being processed by the FPGA and then sends it over the gigabit network to other systems.
One of the services we need to support on this appliance is SNMP, which relies on UDP datagrams, and although SNMP does have TCP support, we cannot force the clients to use that.
What I am finding is that this system is losing almost all SNMP requests.
It is important to note that neither the network nor the CPUs are being overloaded. The data rate isn't particularly high, and the CPUs are usually somewhere around 30% load. Plus, we're using SNMP++ and Agent++ libraries for SNMP, so we have control over those, so it's not a problem with a system daemon breaking. However, if we do stop the processing and network activity, SNMP requests are not lost. SNMP is being handled in its own thread, and we've made sure to keep requests rare and spread-out so that there really should be no more than one request buffered at any one time. With the low CPU load, there should be no problem context-switching to the receiving process to service the request.
Since it's not a CPU or ethernet bandwidth problem, my best guess is that the problem lies in the Linux kernel. Despite the low network load, I'm guessing that there are limited network stack buffers being overfilled, and this is why it's dropping UDP datagrams.
When googling this, I find examples of how to use netstat to report lost packets, but that doesn't seem to work on this system, because there is no "-s" option. How can I monitor these packet drops? How can I diagnose the cause? How can I tune kernel parameters to minimize this loss?
Thanks!
Wireshark or tcpdump is a good approach.
You may want to take a look at the settings in /proc/sys/net/ipv4/ or try an older kernel (3.x instead of 4.x). We had an issue with tcp connections on the Zynq with the 4.4 kernel but this could be seen in the system logs (A warning regarding SYN cookies and possible flooding).

Ip reassembly at intermediate node

I have the following requirment,
I have a Linux PC connected directly to an embedded board.
The Linux PC receives IP traffic from the Internet - it needs to forward this to the embedded board. However the embedded board does not have ability to reassemble IP fragments. Currently what we do is receive the reassembled packet in the linux pc and then sendto() to the emmbeded board. However given the high load of traffic this consumes too much CPU cycles in the Linux PC - since this invovles a copy from kernel space to user space and again the same packet is copied from user space to kernel space.
Is there a way for the kernel to reassemble the fragements and IP forward it to the embedded board without the packet having to come to user space? Note: I have the flexibility to make the destination IP of the IP packets as either the Linux PC or the embedded board.
Thanks
Broadly speaking, no this is not built into the kernel, particularly if your reassembled packet exceeds the MTU size and therefore cannot be transmitted to your embedded board. If you wanted to do it, I'd suggest routing via a tun device and reassembling in user space, or (if you are just using tcp) using any old tcp proxy. If written efficiently it's hard to see why a linux PC would not be able to keep up with this if the embedded board can manage to process the output. If you insist on using the kernel, I think there is a tcp splice technique (see kernel-based (Linux) data relay between two TCP sockets) though whether that works at a segment level and thus does not reassemble, I don't know.
However, do you really need it? See:
http://en.wikipedia.org/wiki/Path_MTU_Discovery
Here tcp sessions are sent with the DF bit set precisely so no fragmentation occurs. This means that most such tcp sessions won't actually need to support fragmentation.
Based on the title of the question, it appears you need to perform reassembly on the intermediate node (linux device). This doesn't mean you have to do it in the kernel space.
Take a look at DPDK. It is an opensource dataplane development kit. It may sound complicated, but all it does is use Poll Mode Drivers to get packets up to the user space with out the copying and interrupt overhead.
Please not, it uses poll mode drivers and will take up CPU cycles. You can use dpdk on a x86_64 hardware if you are ready to give up a couple of cores assuming you also want to fragment the packets in the reverse path.
Take a look at the sample application guide of DPDK for packet fragmentation and reassembly.

How can I reduce system call overhead when sending a large number of UDP packets? (both Windows & Linux)

For example, I am sending 100000 UDP packets on Windows. For each packet, I need to call WSASendTo() once, so probably a lot of system call overhead is introduced. Is there a way to do bulk sending and reduce this overhead? I could not find a solution for Windows after googling for a while. Also, I would like to know if this is possible on Linux. Thanks.
On Windows you can use the new Windows Registered I/O API (RIO) on Server 2012 and Windows 8 and later.
I've written quite a lot about it here and have done several performance comparisons with the previous APIs that are available on Windows. The performance tests can be found here.
In summary: "The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking for increased networking performance with lower latency and jitter. These extensions are targeted primarily for server applications and use pre-registered data buffers and completion queues to increase performance. The increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued, instead relying on pre-locked buffers, fixed sized completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go."
The results of my performance tests seem to imply that it works.
Use TransmitPackets() in Windows.
Microsoft has just recently adopted the QUIC protocol. To that end Microsoft had to improve the performance of UDP in Windows.
Therefore we now have a new "UDP send segmentation" and "UDP receive coalescing" API in Windows.
This solves the call overhead problem and takes advantage of hardware offloading on NICs which support it. This is the fastest possible method of transmitting UDP today. But with the caveat that it mostly requires all UDP packets in a call to have the same size.
Read more about it here:
https://techcommunity.microsoft.com/t5/networking-blog/making-msquic-blazing-fast/ba-p/2268963

How can I use DPDK to write a DNS server?

I want to write a high performance DNS server using Intel DPDK. How can I use Intel DPDK to process TCP packets effectively?
Sure, implement a net stack on DPDK is the solution. But it's too complicated.
As DNS server handles much more UDP queries than TCP queries, I intend to use DPDK to handle UDP queries and use linux net stack to handle TCP queries.
How can I do this on a single machine?
What Gabe has suggested is correct -- However there is a better way to achieve what you really want. You need to use a bifurcated driver.
The problem with using a KNI as suggested by Gabe is this:
Its your software in user space that will decide what it needs to retain (UDP) and what all needs to be routed via a network stack (TCP). You will then pass them to the kernel via DPDK software rings that would consume CPU and memory cycles.
There will be a memory copy between your mbuf and the kernel socket buffer which will affect your KNI performance.
Also note that if you handle UDP in your user space then you need to construct the L2 header before pushing the packet out. This means that you also perhaps need to trap all ARP packets so that you can build your ARP cache, as you will need that to construct the L2 header.
Look at the KNI documentation and Kernel NIC Interface example in the DPDK.
After allocating your KNI device, in you main DPDK polling loop you will pull packets off of the NIC. You'll have to parse the header yourself, or you could get fancy and set up multiple RX queues and then use RSS 5-tuple filters to send UDP packets to your processing queue, and the rest to the default queue.
Anyway, regardless of the method chosen, if it is a UDP packet
you can handle it your self (like you requested); otherwise, you will queue the packet on to the KNI thread. You will also need the other half of the KNI interface as well, where you poll packets off of the KNI thread, and send it out the interface.
This is exactly the way we do it in our application where we still want a linux networking stack for all operations other than our specific traffic.

Resources