I want to write a high performance DNS server using Intel DPDK. How can I use Intel DPDK to process TCP packets effectively?
Sure, implementing a network stack on top of DPDK is one solution, but it's too complicated.
As a DNS server handles far more UDP queries than TCP queries, I intend to use DPDK to handle the UDP queries and the Linux network stack to handle the TCP queries.
How can I do this on a single machine?
What Gabe has suggested is correct. However, there is a better way to achieve what you really want: you need to use a bifurcated driver.
The problem with using a KNI as suggested by Gabe is this:
It's your software in user space that decides what it needs to retain (UDP) and what needs to be routed via the network stack (TCP). You then pass the latter to the kernel via DPDK software rings, which consumes CPU and memory cycles.
There will be a memory copy between your mbuf and the kernel socket buffer, which will hurt your KNI performance.
Also note that if you handle UDP in user space, you need to construct the L2 header before pushing the packet out. This means you probably also need to trap all ARP packets so that you can build an ARP cache, as you will need it to construct that L2 header.
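To make that last point concrete, here is a minimal plain-C sketch (no DPDK APIs; the ARP-cache lookup and the MAC addresses are invented for illustration) of what "constructing the L2 header" involves when you transmit UDP replies yourself:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ARP-cache lookup: maps a next-hop IPv4 address to a MAC.
 * In a real bifurcated/KNI setup this table would be populated from
 * trapped ARP replies; here it holds a single made-up entry. */
static int arp_lookup(uint32_t next_hop_ip, uint8_t mac_out[6]) {
    static const uint8_t gw_mac[6] = {0x02, 0x00, 0x00, 0x00, 0x00, 0x01};
    (void)next_hop_ip;               /* one static entry, for illustration */
    memcpy(mac_out, gw_mac, 6);
    return 0;
}

/* Write the 14-byte Ethernet II header (dst MAC, src MAC, EtherType) in
 * front of an IPv4 payload that starts at frame + 14. */
static int build_l2_header(uint8_t *frame, const uint8_t src_mac[6],
                           uint32_t next_hop_ip) {
    uint8_t dst_mac[6];
    if (arp_lookup(next_hop_ip, dst_mac) != 0)
        return -1;                   /* no ARP entry: caller must resolve first */
    memcpy(frame, dst_mac, 6);
    memcpy(frame + 6, src_mac, 6);
    frame[12] = 0x08;                /* EtherType 0x0800 = IPv4, big-endian */
    frame[13] = 0x00;
    return 0;
}
```

When the lookup fails in a real application, the packet has to be held while an ARP request is resolved, which is exactly why the ARP traffic needs to be trapped.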
Look at the KNI documentation and Kernel NIC Interface example in the DPDK.
After allocating your KNI device, in your main DPDK polling loop you will pull packets off of the NIC. You'll have to parse the headers yourself, or you could get fancy and set up multiple RX queues and use RSS 5-tuple filters to steer UDP packets to your processing queue and everything else to the default queue.
Anyway, regardless of the method chosen, if it is a UDP packet you can handle it yourself (as you requested); otherwise, you will queue the packet onto the KNI thread. You will also need the other half of the KNI interface, where you poll packets off of the KNI thread and send them out the interface.
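"Parse the header yourself" amounts to a few field checks. A plain-C sketch of the classifier (the DPDK mbuf handling is omitted; a real loop would run the same checks on the mbuf's packet data):

```c
#include <stdint.h>

/* Decide whether a received frame is IPv4/UDP (handled in our fast path)
 * or anything else (queued to the KNI interface). */
static int is_ipv4_udp(const uint8_t *frame, uint32_t len) {
    if (len < 14 + 20)
        return 0;                                /* Ethernet + minimal IPv4 */
    uint16_t ethertype = (uint16_t)(frame[12] << 8 | frame[13]);
    if (ethertype != 0x0800)
        return 0;                                /* not IPv4 (ARP, VLAN, ...) */
    const uint8_t *ip = frame + 14;
    if ((ip[0] >> 4) != 4)
        return 0;                                /* IP version check */
    return ip[9] == 17;                          /* protocol 17 = UDP */
}
```

Frames for which this returns 0 (TCP, ARP, ICMP, and so on) are queued to the KNI thread; frames returning 1 go to the DNS fast path.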
This is exactly the way we do it in our application, where we still want a Linux networking stack for all operations other than our specific traffic.
Edit: I am using DPDK 21.11.2 (LTS).
I'm working on implementing a DPDK-based DNS server. I have researched and studied both concepts. In my understanding, we are bypassing the kernel with DPDK and thus losing all the processing ability of the kernel. DPDK is limited to Layer 2, as the NIC we bind to it loses its IP address. To implement a Layer 3 protocol, is l3fwd the best option?
I read up on KNI as well, as it offers a simple command to assign an IP address, but it is no longer available in the latest versions.
Basically, what I want to do is send a DNS packet to the DPDK NIC (from another VM). The DPDK VM then processes it and sends an answer packet back through the same NIC. What is the least complicated way to do this?
I have only tried basic DPDK examples like basicfwd and l2fwd where simple packets are sent and received by the same VM.
I want to bypass the Linux network stack, deliver raw packets to my custom code in userland, and handle them there.
I know that you can build this kind of capture path with PF_RING or DPDK, among others. But I cannot understand why I should use those when I can use Netfilter, hook my module into the NF_IP_PRE_ROUTING state, and send the packets to userland.
It would be a great help if anyone could explain the main differences between them.
There is a huge difference between DPDK and Netfilter hooks. When using Netfilter and hooking NF_IP_PRE_ROUTING, you hijack the packet flow and copy packets from kernel space to user space. This copy causes a large overhead.
When using DPDK, you are actually mapping your network card's packet buffers into a userspace memory area. Instead of the kernel getting an interrupt from the NIC and passing the packet through all of its queues until it reaches NF_IP_PRE_ROUTING, which in turn copies the packet to userland upon request, DPDK lets you access the mapped packet buffers directly from userspace, bypassing all of the kernel's intermediate handling and effectively improving performance (at the cost of code complexity and security).
There are a variety of techniques to grab raw packets and deliver them to a userspace application. The devil, as usual, is in the details.
If all we need is to deliver packets to a userspace application, it makes little difference which solution we use: libpcap, tun/tap devices, Netfilter, PF_RING, or whatever. All will do just fine.
But if we need to process 100 million packets per second (~30 CPU cycles per packet on a 3 GHz CPU), I don't think we have any option at the moment but DPDK. Google for "DPDK performance report" and have a look.
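The figure in parentheses is just the clock rate divided by the packet rate; as a one-liner:

```c
/* Per-packet CPU budget: at 3 GHz and 100 Mpps there are
 * 3e9 / 100e6 = 30 cycles per packet, which rules out per-packet
 * syscalls, copies, or cache misses. */
static double cycles_per_packet(double clock_hz, double pps) {
    return clock_hz / pps;
}
```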
DPDK is a framework which works on many platforms (x86, ARM, POWER, etc.) and supports many NICs. There is no need to write a driver; support for the most popular NICs is already there.
There is also support for managing CPU cores, huge pages, memory buffers, encryption, IP fragmentation, and so on, all designed to be able to forward 100 Mpps. If we need that performance...
Consider the following:
A server with multiple processes listening to the same multicast address. The processes are able to handle the packets at varying rates.
From observation, when the slowest process causes the receive buffer to fill, the kernel's policy is to drop the newest incoming packet. This results in all processes losing data, as opposed to only the slow process losing data.
I haven't been able to find any documentation outlining the policy in the linux kernel for this situation.
Is anyone aware of a way to drop the oldest packet in the buffer and allow the newest to be queued for processing?
FYI kernel 2.6.32-504.16.2.el6.x86_64
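I'm not aware of a kernel option that changes this policy (certainly not on a 2.6.32 kernel). A userspace approximation is to have each reader drain its own queue and keep only the newest datagram, so stale packets are never serviced even though the kernel's tail-drop still applies while the buffer is full. A sketch:

```c
#include <errno.h>
#include <sys/socket.h>

/* Userspace approximation of "drop oldest, keep newest": drain every
 * datagram currently queued on the (non-blocking) socket and return only
 * the most recent one. Older queued datagrams are read and discarded. */
static ssize_t recv_newest(int fd, void *buf, size_t len) {
    ssize_t got = -1;
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return got;      /* queue drained: last datagram read wins */
            return -1;           /* real error */
        }
        got = n;                 /* overwrite: older datagrams discarded */
    }
}
```

Note this trades latency fairness for freshness: a reader that calls this always processes the most recent data, at the cost of never seeing the intermediate datagrams.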
I am working with the Xilinx distribution of Linux on a Zynq 7000 board. This has two ARM processors, some L2 cache, a DRAM interface, and a large amount of FPGA fabric. Our appliance collects data being processed by the FPGA and then sends it over the gigabit network to other systems.
One of the services we need to support on this appliance is SNMP, which relies on UDP datagrams, and although SNMP does have TCP support, we cannot force the clients to use that.
What I am finding is that this system is losing almost all SNMP requests.
It is important to note that neither the network nor the CPUs are being overloaded. The data rate isn't particularly high, and the CPUs are usually somewhere around 30% load. We're also using the SNMP++ and Agent++ libraries for SNMP, so we have control over that code; this is not a case of a system daemon misbehaving. Moreover, if we stop the processing and network activity, SNMP requests are not lost. SNMP is handled in its own thread, and we've made sure to keep requests rare and spread out, so there should be no more than one request buffered at any one time. With the low CPU load, there should be no problem context-switching to the receiving process to service a request.
Since it's not a CPU or Ethernet bandwidth problem, my best guess is that the problem lies in the Linux kernel. Despite the low network load, I suspect some limited network-stack buffer is being overfilled, and that this is why UDP datagrams are dropped.
When googling this, I find examples of how to use netstat to report lost packets, but that doesn't seem to work on this system, because there is no "-s" option. How can I monitor these packet drops? How can I diagnose the cause? How can I tune kernel parameters to minimize this loss?
Thanks!
Wireshark or tcpdump is a good approach.
You may want to take a look at the settings in /proc/sys/net/ipv4/, or try an older kernel (3.x instead of 4.x). We had an issue with TCP connections on the Zynq with the 4.4 kernel, but that showed up in the system logs (a warning regarding SYN cookies and possible flooding).
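Since the system's netstat lacks "-s", the same counters netstat would report can be read directly from /proc/net/snmp; in the Udp: section there, RcvbufErrors (datagrams dropped because a socket receive buffer was full) is the usual indicator for this symptom. A minimal reader, as a sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The "Udp:" section of /proc/net/snmp is one header line of field names
 * followed by one line of values. This helper returns the value of one
 * named field (e.g. "RcvbufErrors"), or -1 if it cannot be found. */
static long long udp_counter(const char *field) {
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f)
        return -1;
    char hdr[512], val[512];
    long long result = -1;
    while (fgets(hdr, sizeof hdr, f)) {
        if (strncmp(hdr, "Udp:", 4) != 0)
            continue;                          /* skip Ip:/Icmp:/Tcp: lines */
        if (!fgets(val, sizeof val, f))
            break;                             /* value line follows header */
        int idx = -1, col = 0;
        char *tok = strtok(hdr, " \n");        /* consume leading "Udp:" */
        while ((tok = strtok(NULL, " \n")) != NULL) {
            col++;
            if (strcmp(tok, field) == 0) { idx = col; break; }
        }
        if (idx > 0) {
            col = 0;
            tok = strtok(val, " \n");          /* consume leading "Udp:" */
            while ((tok = strtok(NULL, " \n")) != NULL) {
                if (++col == idx) { result = atoll(tok); break; }
            }
        }
        break;
    }
    fclose(f);
    return result;
}
```

If RcvbufErrors climbs under load, raising net.core.rmem_max and net.core.rmem_default (or calling setsockopt() with SO_RCVBUF in the SNMP thread) is the usual first tuning step.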
I have the following requirement:
I have a Linux PC connected directly to an embedded board.
The Linux PC receives IP traffic from the Internet and needs to forward it to the embedded board. However, the embedded board cannot reassemble IP fragments. Currently, we receive the reassembled packet on the Linux PC and then sendto() it to the embedded board. Given the high traffic load, this consumes too many CPU cycles on the Linux PC, since it involves a copy from kernel space to user space, and then the same packet is copied back from user space to kernel space.
Is there a way for the kernel to reassemble the fragments and IP-forward them to the embedded board without the packet having to come up to user space? Note: I have the flexibility to make the destination IP of the packets either the Linux PC or the embedded board.
Thanks
Broadly speaking, no, this is not built into the kernel, particularly if your reassembled packet exceeds the MTU and therefore cannot be transmitted to your embedded board. If you wanted to do it, I'd suggest routing via a tun device and reassembling in user space, or (if you are just using TCP) using any old TCP proxy. If written efficiently, it's hard to see why a Linux PC would not be able to keep up, provided the embedded board can process the output. If you insist on using the kernel, there is a TCP splice technique (see kernel-based (Linux) data relay between two TCP sockets), though whether that works at the segment level, and thus does not reassemble, I don't know.
However, do you really need it? See:
http://en.wikipedia.org/wiki/Path_MTU_Discovery
Here, TCP sessions are sent with the DF bit set precisely so that no fragmentation occurs. This means most such TCP sessions won't actually need fragmentation support.
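For the tun-device route, reassembling in user space is mostly bookkeeping on the IPv4 fragment fields (the offset in 8-byte units and the MF flag). A deliberately simplified single-datagram sketch, assuming in-order, non-overlapping, non-duplicated fragments:

```c
#include <stdint.h>
#include <string.h>

struct reasm {
    uint8_t  buf[65536];  /* an IPv4 datagram is at most 64 KiB */
    uint32_t received;    /* payload bytes collected so far */
    uint32_t total;       /* total payload length, known once MF==0 is seen */
};

/* Feed one fragment (pkt points at its IPv4 header, len is the bytes
 * available). Returns the reassembled payload length once the datagram is
 * complete, 0 while fragments are still missing. A real implementation
 * keys contexts by (src, dst, id, protocol), times out partial datagrams,
 * and tracks received byte ranges to survive duplicates and overlaps. */
static uint32_t reasm_feed(struct reasm *r, const uint8_t *pkt, uint32_t len) {
    uint32_t ihl  = (uint32_t)(pkt[0] & 0x0f) * 4;       /* header length */
    uint32_t tot  = (uint32_t)pkt[2] << 8 | pkt[3];      /* total length */
    uint16_t frag = (uint16_t)(pkt[6] << 8 | pkt[7]);    /* flags + offset */
    uint32_t off  = (uint32_t)(frag & 0x1fff) * 8;       /* offset in bytes */
    int more      = (frag & 0x2000) != 0;                /* MF flag */
    if (len < tot || tot < ihl)
        return 0;                                        /* truncated/bogus */
    uint32_t plen = tot - ihl;
    if (off + plen > sizeof r->buf)
        return 0;                                        /* would overflow */
    memcpy(r->buf + off, pkt + ihl, plen);
    r->received += plen;
    if (!more)
        r->total = off + plen;  /* the last fragment fixes the total size */
    return (r->total && r->received >= r->total) ? r->total : 0;
}
```

The kernel's own reassembly does exactly this with more bookkeeping; the point of the sketch is that the per-datagram work is small, so the cost of the approach is dominated by the tun-device copies, not the reassembly itself.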
Based on the title of the question, it appears you need to perform reassembly on the intermediate node (the Linux device). This doesn't mean you have to do it in kernel space.
Take a look at DPDK. It is an open-source dataplane development kit. It may sound complicated, but all it does is use poll-mode drivers to get packets up to user space without the copying and interrupt overhead.
Please note that because it uses poll-mode drivers, it will consume CPU cycles. You can use DPDK on x86_64 hardware if you are ready to give up a couple of cores, assuming you also want to fragment the packets on the reverse path.
Take a look at the sample application guide of DPDK for packet fragmentation and reassembly.