IP reassembly at an intermediate node - Linux

I have the following requirement:
I have a Linux PC connected directly to an embedded board.
The Linux PC receives IP traffic from the Internet and needs to forward it to the embedded board. However, the embedded board does not have the ability to reassemble IP fragments. Currently, we receive the reassembled packet on the Linux PC and then sendto() it to the embedded board. Given the high traffic load, this consumes too many CPU cycles on the Linux PC, since it involves a copy from kernel space to user space and then a copy of the same packet from user space back to kernel space.
Is there a way for the kernel to reassemble the fragments and IP-forward them to the embedded board without the packet having to come up to user space? Note: I have the flexibility to make the destination IP of the packets either the Linux PC or the embedded board.
Thanks

Broadly speaking, no, this is not built into the kernel, particularly if your reassembled packet exceeds the MTU and therefore cannot be transmitted to your embedded board. If you wanted to do it, I'd suggest routing via a tun device and reassembling in user space, or (if you are just using TCP) using any old TCP proxy. If written efficiently, it's hard to see why a Linux PC would not be able to keep up with this if the embedded board can manage to process the output. If you insist on using the kernel, I think there is a TCP splice technique (see "kernel-based (Linux) data relay between two TCP sockets"), though whether that works at a segment level and thus does not reassemble, I don't know.
However, do you really need it? See:
http://en.wikipedia.org/wiki/Path_MTU_Discovery
Here TCP sessions are sent with the DF bit set precisely so that no fragmentation occurs. This means that most such TCP sessions won't actually need fragmentation to be supported.
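If you go the tun route, the userspace side starts roughly like the sketch below; it assumes the tun interface is named "tun0" and is configured and routed separately (ip addr / ip route), and it leaves out the actual reassembly and the forwarding toward the embedded board.

    /* Minimal sketch: open a tun device and read IP packets from it.
     * "tun0" is an assumed name; reassembly and the write() toward the
     * embedded board are not shown. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>

    int main(void)
    {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) { perror("open /dev/net/tun"); return 1; }

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI;        /* raw IP packets, no extra header */
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); return 1; }

        unsigned char buf[65536];                   /* big enough for a full IP datagram */
        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf)); /* one IP packet per read() */
            if (n <= 0)
                break;
            /* ...collect fragments, reassemble, then send toward the embedded board... */
            printf("got %zd bytes\n", n);
        }
        close(fd);
        return 0;
    }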

Based on the title of the question, it appears you need to perform reassembly on the intermediate node (the Linux device). This doesn't mean you have to do it in kernel space.
Take a look at DPDK. It is an open-source data plane development kit. It may sound complicated, but all it does is use poll mode drivers to get packets up to user space without the copying and interrupt overhead.
Please note: it uses poll mode drivers and will take up CPU cycles. You can use DPDK on x86_64 hardware if you are ready to give up a couple of cores, assuming you also want to fragment the packets on the reverse path.
Take a look at the DPDK sample applications guide for packet fragmentation and reassembly.
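For a rough idea of what the reassembly path looks like with DPDK's librte_ip_frag (a sketch modeled loosely on the ip_reassembly sample; table sizing, the surrounding rx/tx loop, and error handling are omitted):

    /* Sketch of per-packet IPv4 reassembly with librte_ip_frag.
     * frag_tbl would be created at init time with rte_ip_frag_table_create()
     * (not shown); like the sample, this assumes no IP options. */
    #include <rte_cycles.h>
    #include <rte_ether.h>
    #include <rte_ip.h>
    #include <rte_ip_frag.h>
    #include <rte_mbuf.h>

    static struct rte_ip_frag_tbl *frag_tbl;        /* created once per lcore */
    static struct rte_ip_frag_death_row death_row;  /* expired/overwritten fragments */

    /* Returns the reassembled mbuf, or NULL while fragments are still missing. */
    static struct rte_mbuf *
    handle_ipv4(struct rte_mbuf *m)
    {
        struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
                                                          sizeof(struct rte_ether_hdr));
        if (!rte_ipv4_frag_pkt_is_fragmented(ip))
            return m;                               /* not a fragment: nothing to do */

        m->l2_len = sizeof(struct rte_ether_hdr);
        m->l3_len = sizeof(struct rte_ipv4_hdr);    /* assumes no IP options */
        m = rte_ipv4_frag_reassemble_packet(frag_tbl, &death_row, m,
                                            rte_rdtsc(), ip);
        rte_ip_frag_free_death_row(&death_row, 3);  /* free timed-out fragments */
        return m;                                   /* NULL until the last fragment arrives */
    }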

Related

differences between DPDK and Netfilter

I want to bypass the Linux network stack, pass raw packets to my custom code in userland, and handle them there.
I know that you can make your own custom drivers using PF_RING or DPDK and others. But I cannot understand why I should make these kinds of drivers when I can use Netfilter, hook my module at the NF_IP_PRE_ROUTING stage, and send the packets to userland.
It would be a great help if anyone could explain the main differences between them.
There is a huge difference between DPDK and Netfilter hooks. When using Netfilter / hooking NF_IP_PRE_ROUTING, you hijack the packet flow and copy packets from kernel space to user space. This copy causes a large overhead.
When using DPDK you're actually mapping your network card's packet buffers to a userspace memory area. Instead of the kernel getting an interrupt from the NIC and then passing the packet through all its queues until it reaches NF_IP_PRE_ROUTING, which in turn copies the packet to userland upon request, DPDK lets you access the mapped packet buffers directly from userspace, bypassing all meta-handling by the kernel and effectively improving performance (at the cost of code complexity and security).
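For comparison, the Netfilter side looks roughly like the sketch below on recent kernels (where the hook point is spelled NF_INET_PRE_ROUTING); getting the packet to userland from here, for example via an NFQUEUE verdict, is exactly where the extra copy comes in.

    /* Sketch of a Netfilter PRE_ROUTING hook (kernel module, recent kernels).
     * Anything you want handled in userland still has to be copied there,
     * e.g. by verdicting packets into an NFQUEUE. */
    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/skbuff.h>
    #include <net/net_namespace.h>

    static unsigned int pre_routing_hook(void *priv, struct sk_buff *skb,
                                         const struct nf_hook_state *state)
    {
        /* Inspect skb here; NF_ACCEPT lets it continue up the stack,
         * NF_QUEUE would hand it to a userspace listener (with a copy). */
        return NF_ACCEPT;
    }

    static struct nf_hook_ops ops = {
        .hook     = pre_routing_hook,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
    };

    static int __init hook_init(void)  { return nf_register_net_hook(&init_net, &ops); }
    static void __exit hook_exit(void) { nf_unregister_net_hook(&init_net, &ops); }

    module_init(hook_init);
    module_exit(hook_exit);
    MODULE_LICENSE("GPL");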
There are a variety of techniques to grab raw packets and deliver them to a userspace application. The devil, as usual, is in the details.
If all we need is to deliver packets to a userspace application, it makes no real difference which solution we use: libpcap, tun/tap devices, Netfilter, PF_RING, or whatever. All will do just fine.
But if we need to process 100 million packets per second (~30 CPU cycles per packet on a 3 GHz core), I don't think we have any option at the moment other than DPDK. Google for "DPDK performance report" and have a look.
DPDK is a framework which works well on many platforms (x86, ARM, POWER, etc.) and supports many NICs. There is no need to write a driver; support for the most popular NICs is already there.
There is also support for managing CPU cores, huge pages, memory buffers, encryption, IP fragmentation, etc. All of it is designed to be able to forward 100 Mpps. If we need that performance...

Ethernet frames from NIC

I'm looking for help and advice on a network project I've been working on lately. It requires a Linux machine to be a passive network appliance.
Network packets come in through one network interface and go out through another interface (net --eth0--> Linux PC --eth1--> net) without any modification of the data.
The application, which is going to run on the Linux system, will change only the order of the packets. It is going to be a "silly" network emulator application.
The first implementation was made with RAW sockets, where read() is called every time a packet arrives in user space and write() is called whenever an Ethernet frame should be sent down to the NIC.
I would like to know if there is a more practical and direct way than RAW sockets, bypassing Linux's network stack.
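For reference, the RAW-socket relay described above boils down to something like this sketch (interface names are placeholders; error handling and the reordering logic are omitted):

    /* Sketch of the AF_PACKET relay described above: read whole Ethernet
     * frames from eth0 and write them out eth1. Interface names are
     * placeholders; buffering/reordering logic is omitted. */
    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <sys/socket.h>

    static int open_raw(const char *ifname)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        struct sockaddr_ll sll = {
            .sll_family   = AF_PACKET,
            .sll_protocol = htons(ETH_P_ALL),
            .sll_ifindex  = (int)if_nametoindex(ifname),
        };
        bind(fd, (struct sockaddr *)&sll, sizeof(sll));  /* one socket per NIC */
        return fd;
    }

    int main(void)
    {
        int in = open_raw("eth0"), out = open_raw("eth1");
        unsigned char frame[ETH_FRAME_LEN];
        for (;;) {
            ssize_t n = read(in, frame, sizeof(frame));  /* one frame per read() */
            if (n > 0)
                write(out, frame, (size_t)n);            /* reorder here if desired */
        }
    }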
If what you want is to bypass the kernel, DPDK in Linux and NetMap in FreeBSD are options to do just that.
Indeed this can be done with DPDK in Linux. There are l3fwd and l2fwd sample applications in the examples folder of the DPDK tree, which may inspire you. Also consider using VPP, an fd.io project hosted by the Linux Foundation, which can use DPDK.
Rami Rosen
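For a flavor of what the l2fwd example's forwarding path boils down to, here is a minimal sketch (not taken from the sample itself; port numbers and the omitted EAL/port initialization are assumptions):

    /* Minimal l2fwd-flavored forwarding loop: burst-receive on port 0,
     * burst-transmit on port 1. EAL and port initialization (as done in
     * the DPDK l2fwd sample) are omitted; port numbers are assumptions. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    static void forward_loop(void)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */, pkts, BURST);
            if (nb_rx == 0)
                continue;                       /* poll mode: just spin */

            /* This is where a "silly" emulator could reorder the burst. */
            uint16_t nb_tx = rte_eth_tx_burst(1 /* port */, 0 /* queue */, pkts, nb_rx);

            /* Anything the TX queue didn't accept must be freed by us. */
            for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(pkts[i]);
        }
    }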

Embedded Linux on Zynq 7000, dropping almost all UDP packets

I am working with the Xilinx distribution of Linux on a Zynq 7000 board. This has two ARM processors, some L2 cache, a DRAM interface, and a large amount of FPGA fabric. Our appliance collects data being processed by the FPGA and then sends it over the gigabit network to other systems.
One of the services we need to support on this appliance is SNMP, which relies on UDP datagrams, and although SNMP does have TCP support, we cannot force the clients to use that.
What I am finding is that this system is losing almost all SNMP requests.
It is important to note that neither the network nor the CPUs are being overloaded. The data rate isn't particularly high, and the CPUs are usually somewhere around 30% load. We're also using the SNMP++ and Agent++ libraries for SNMP and have control over those, so it's not a problem with a system daemon breaking. Moreover, if we stop the processing and network activity, SNMP requests are not lost. SNMP is handled in its own thread, and we've made sure to keep requests rare and spread out so that there should be no more than one request buffered at any one time. With the low CPU load, there should be no problem context-switching to the receiving process to service the request.
Since it's not a CPU or ethernet bandwidth problem, my best guess is that the problem lies in the Linux kernel. Despite the low network load, I'm guessing that there are limited network stack buffers being overfilled, and this is why it's dropping UDP datagrams.
When googling this, I find examples of how to use netstat to report lost packets, but that doesn't seem to work on this system, because there is no "-s" option. How can I monitor these packet drops? How can I diagnose the cause? How can I tune kernel parameters to minimize this loss?
Thanks!
Wireshark or tcpdump is a good approach.
You may want to take a look at the settings in /proc/sys/net/ipv4/ or try an older kernel (3.x instead of 4.x). We had an issue with TCP connections on the Zynq with the 4.4 kernel, but that could be seen in the system logs (a warning regarding SYN cookies and possible flooding).
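If netstat -s isn't available, the same counters can be read straight from /proc/net/snmp (assuming your kernel exposes it). A tiny sketch: the Udp lines contain InErrors and RcvbufErrors, and a climbing RcvbufErrors usually means the socket receive buffer is overflowing, which net.core.rmem_max / SO_RCVBUF on the SNMP socket can address.

    /* Print the kernel's UDP counters from /proc/net/snmp (the same data
     * "netstat -s" would show). The two "Udp:" lines are a header row and
     * a value row; RcvbufErrors counting up indicates socket-buffer drops. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/net/snmp", "r");
        if (!f) { perror("/proc/net/snmp"); return 1; }

        char line[512];
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Udp:", 4) == 0)
                fputs(line, stdout);    /* header row, then value row */

        fclose(f);
        return 0;
    }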

How does changing a network interface's MAC address with ifconfig affect a NIC that isn't in promiscuous mode?

How does changing a network interface's MAC address (ifconfig eth2 hw ether BB:BB:BB:BB:BB:BB) affect a NIC that isn't in promiscuous mode?
When I use ifconfig to change my network card's MAC, it is reverted to the original MAC after a reboot.
As far as I understand, it means that the network card has its original MAC saved in its own nonvolatile memory and Linux reads this original MAC every time it boots.
It must mean that ifconfig does not change the MAC value saved in the NIC's own nonvolatile memory. This value is left untouched.
Nevertheless, when I change the MAC with ifconfig, Linux starts sending Ethernet frames with this new MAC as source-MAC.
From what I have learnt, a NIC that is not in so called "promiscuous mode" rejects all Ethernet frames whose destination MAC addresses differ from the NIC's MAC address (unless it's broadcast or multicast).
It means that Linux will not even see those frames. They will be dropped before reaching the CPU.
I suppose the NIC does this job by checking the Ethernet frame's destination MAC address against the NIC's MAC address saved in its own nonvolatile memory.
Now there comes an issue that I don't understand.
Since the NIC uses its internally saved original MAC to decide whether to drop a frame before passing it to the CPU, and Linux may use a totally different MAC as the source MAC for outgoing frames, how does a response to such frames ever reach Linux?
What do I misunderstand in this topic?
I will present an example to better describe what I mean.
NIC has AA:AA:AA:AA:AA:AA stored in its own internal nonvolatile memory as its original MAC.
It is not in promiscuous mode, so it prevents all frames not containing AA:AA:AA:AA:AA:AA as the destination MAC from reaching the CPU (and Linux).
Now someone types: ifconfig eth2 hw ether BB:BB:BB:BB:BB:BB.
From now on, outgoing frames sent by Linux from this interface will have BB:BB:BB:BB:BB:BB as source MAC.
Eventually another host will reply to this frame by sending a frame with BB:BB:BB:BB:BB:BB as destination MAC.
Such a frame will arrive at the first host's NIC. What will the NIC do now? Will it compare BB:BB:BB:BB:BB:BB with AA:AA:AA:AA:AA:AA (stored internally in the NIC's ROM) and decide not to pass it to the CPU?!? So the frame will never reach Linux?
Where's the catch?
The MAC address in modern network interfaces is a volatile value held in a configuration register of the chip. That's the only value used in the chip's operation outside of initialization. This value is initialized on start-up from nonvolatile memory, either by the chip hardware itself or by the driver, depending on the particular chip's design.
The nonvolatile value is not used for anything but initialization. It is certainly not used to filter incoming packets; the nonvolatile memory is much too slow for that. During the network interface's operation, the nonvolatile memory is idle and is not used.
The promiscuity of the NIC has nothing to do with your question. As soon as you assign a new MAC, it takes effect, no matter what the nonvolatile memory contents might be, and no matter whether the interface is promiscuous or not.
Finally, the nonvolatile memory in a NIC is optional. On many mobile systems (say, laptops) there's no nonvolatile memory dedicated to the NIC; the MAC address of the NIC is held in the nonvolatile memory that stores other system-specific configuration data. This saves power and money.
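As an aside, changing the MAC programmatically is just an ioctl that updates that volatile value. A rough sketch of what ifconfig eth2 hw ether BB:BB:BB:BB:BB:BB does under the hood ("eth2" is just the interface from the question; many drivers require the interface to be down first):

    /* Sketch of what "ifconfig eth2 hw ether BB:BB:..." boils down to:
     * an SIOCSIFHWADDR ioctl that updates the driver's (volatile) MAC,
     * not the NIC's nonvolatile memory. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_arp.h>
    #include <unistd.h>

    int main(void)
    {
        const unsigned char mac[6] = { 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB };
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for ioctls */

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);
        ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
        memcpy(ifr.ifr_hwaddr.sa_data, mac, sizeof(mac));

        if (ioctl(fd, SIOCSIFHWADDR, &ifr) < 0)
            perror("SIOCSIFHWADDR");               /* e.g. EBUSY if the driver wants the link down */

        close(fd);
        return 0;
    }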

How can I use DPDK to write a DNS server?

I want to write a high performance DNS server using Intel DPDK. How can I use Intel DPDK to process TCP packets effectively?
Sure, implementing a network stack on top of DPDK would be a solution, but it's too complicated.
As a DNS server handles far more UDP queries than TCP queries, I intend to use DPDK to handle the UDP queries and the Linux network stack to handle the TCP queries.
How can I do this on a single machine?
What Gabe has suggested is correct. However, there is a better way to achieve what you really want: you need to use a bifurcated driver.
The problem with using KNI as suggested by Gabe is this:
It is your software in user space that decides what it needs to retain (UDP) and what needs to be routed via the network stack (TCP). You then pass the latter to the kernel via DPDK software rings, which consumes CPU and memory cycles.
There will be a memory copy between your mbuf and the kernel socket buffer, which will hurt your KNI performance.
Also note that if you handle UDP in user space, you need to construct the L2 header before pushing the packet out. This means you probably also need to trap all ARP packets so that you can build an ARP cache, as you will need it to construct the L2 header.
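To make the split concrete, assuming a NIC/driver with rte_flow support (port and queue numbers are placeholders; the full bifurcated setup is described in the DPDK flow-bifurcation how-to), a rule like the sketch below steers only UDP destination port 53 to the queue your DPDK application polls:

    /* Sketch: steer only UDP/53 (DNS) to a dedicated RX queue with an
     * rte_flow rule -- the same idea as the bifurcated split, expressed
     * as a flow rule. Port/queue ids are assumptions and the NIC/driver
     * must support rte_flow. */
    #include <rte_byteorder.h>
    #include <rte_flow.h>

    static struct rte_flow *steer_dns_to_queue(uint16_t port_id, uint16_t queue_id)
    {
        struct rte_flow_attr attr = { .ingress = 1 };

        struct rte_flow_item_udp udp_spec = { .hdr.dst_port = rte_cpu_to_be_16(53) };
        struct rte_flow_item_udp udp_mask = { .hdr.dst_port = RTE_BE16(0xffff) };

        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
            { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_spec, .mask = &udp_mask },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        struct rte_flow_action_queue queue = { .index = queue_id };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        struct rte_flow_error err;
        return rte_flow_create(port_id, &attr, pattern, actions, &err); /* NULL on error */
    }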
Look at the KNI documentation and Kernel NIC Interface example in the DPDK.
After allocating your KNI device, in your main DPDK polling loop you will pull packets off of the NIC. You'll have to parse the headers yourself, or you could get fancy and set up multiple RX queues and then use RSS/5-tuple filters to send UDP packets to your processing queue and the rest to the default queue.
Anyway, regardless of the method chosen, if it is a UDP packet you can handle it yourself (as you requested); otherwise, you queue the packet onto the KNI thread. You will also need the other half of the KNI interface, where you poll packets off of the KNI thread and send them out the interface.
This is exactly the way we do it in our application, where we still want a Linux networking stack for all operations other than our specific traffic.
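For illustration, a sketch along those lines (not the poster's actual code): UDP is consumed in the application, everything else is pushed through an already-allocated KNI device. Port/queue numbers and the handle_udp() helper are assumptions.

    /* Sketch of the split described above: UDP stays in the application,
     * everything else goes to the kernel through an already-allocated KNI
     * device (rte_kni_alloc not shown). Unsent mbufs should be freed;
     * that bookkeeping is omitted for brevity. */
    #include <rte_ethdev.h>
    #include <rte_ether.h>
    #include <rte_ip.h>
    #include <rte_kni.h>
    #include <rte_mbuf.h>
    #include <netinet/in.h>   /* IPPROTO_UDP */

    #define BURST 32

    void handle_udp(struct rte_mbuf *m);        /* your DNS/UDP fast path (not shown) */

    static void split_loop(struct rte_kni *kni, uint16_t port)
    {
        struct rte_mbuf *pkts[BURST], *to_kernel[BURST];

        for (;;) {
            uint16_t nb = rte_eth_rx_burst(port, 0, pkts, BURST), nk = 0;

            for (uint16_t i = 0; i < nb; i++) {
                struct rte_ether_hdr *eth =
                    rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
                struct rte_ipv4_hdr *ip =
                    (struct rte_ipv4_hdr *)(eth + 1);   /* assumes no VLAN tag */

                if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4) &&
                    ip->next_proto_id == IPPROTO_UDP)
                    handle_udp(pkts[i]);                /* keep UDP in userspace */
                else
                    to_kernel[nk++] = pkts[i];          /* TCP, ARP, ICMP, ... */
            }

            rte_kni_tx_burst(kni, to_kernel, nk);       /* hand the rest to Linux */

            /* Other direction: kernel-originated frames back out the NIC. */
            uint16_t nb_k = rte_kni_rx_burst(kni, pkts, BURST);
            rte_eth_tx_burst(port, 0, pkts, nb_k);

            rte_kni_handle_request(kni);                /* ifconfig up/down etc. */
        }
    }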
