UDP packet greater than 1500 bytes dropped - linux

I'm developing a tftp client and server and I want to dynamically select the udp payload size to boost transfer performance.
I have tested it with two linux machines ( one has a gigabit ethernet card, the other a fast ethernet one ). I changed the MTU of the gigabit card to 2048 bytes and left the other to 1500.
I have used setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &optval, sizeof(optval)) to set the MTU_DISCOVER flag to IP_PMTUDISC_DO.
From what I have read this option should set the DF bit to one and so it should be possible to find the minimum MTU of the network ( the MTU of the host that has the lowest MTU ). However this thing only gives me an error when I send a packet which size is bigger than the MTU of the machine from which I'm sending packets.
Also the other machine ( the server in this case ) doesn't receive the oversized packets ( the server has a MTU of 1500 ). All the UDP packets are dropped, the only way is to send packets of 1472 bytes.
Why the hosts do this? From what I have read, if I send a packet larger than MTU, the ip layer should fragment it.

I fail to see the problem. You are setting the "don't fragment" bit, and you send a package smaller than the sending host's MTU, but larger than the receiving host's MTU. Of course nobody will fragment here (doing so would violate the DF bit). Instead, the sending host should get an ICMP message back.
Edit: IP specifies that an ICMP error message type 3 (destination unreachable) code 4 (Fragmentation Required but DF Bit Is Set) is sent to the originating host at the point where the fragmentation would have occurred. The TCP layer handles this on its own for PMTU discovery. On connection-less sockets, Linux reports the error in the socket's error queue if the IP_RECVERR option is activated; see ip(7).

That "DF bit" you're setting, stands for "Don't Fragment". The IP layer should not be expected to fragment packets when you've told it not to.

It is not correct to run hosts with different interface MTUs on the same subnet1.
This is a host/network misconfiguration, and IP path MTU discovery is not expected to work correctly in this situation.
If you wish to test your application's path MTU discovery, you will need to set up multiple subnets connected by a router2, with different MTUs. In this situation, the router is the device that will pick up the MTU mismatch, and send back an ICMP "Fragmentation Needed" error.
1. Well, technically, same broadcast domain.
2. The devices sold as "home routers" are really router/switches - they route between the WAN and the LAN, but switch between the ethernet ports on the LAN. This isn't sufficient to separate networks with different MTUs.

Related

Linux TCP: packet segmentation?

I am working on a virtualization environment (Linux over HyperV). The Linux driver for the virtual NIC supports TSO and GSO (tcp segmentation is ON and generic segmentation is ON).
Now, I create TCP socket and the send buffer set to 128K.
But based on ifconfig data (TX bytes and TX packets), the average packet size is about 11 K.
So my question is, where is my packet be segmented (from 128K to 11K)? How do I control/configure this in socket options or TCP options?
thanks!
===========EDIT==================
I have an application which can reach 8Gbps throughput in a 10G network with 32 TCP connections - in this case, the average packet size is about 20 Kbytes which is pretty good; but when I increased the TCP connections to 256, then the throughput is just about 1Gbps as the packet size on NIC is down to about 3 KBytes.
I know the packet size is critical to the performance as the cost of processing traffic is per packet, not per bytes, so the packet on NIC, it is better if bigger.
SO, MY QUESTION IS: how do I increase the TCP packet size? Is there any TCP settings control this?
Your question seems a little bit confusing, but there are a number of settings that you need to play with to get 10GigE to work right on Linux.
See here:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe/
Setting the socket option SO_SNDBUF, SO_RCVBUF might help, but the TCP IP does not guarantees the chunks size when received / sent.

TCP_MAXSEG inaccurate? (Was: Linux path MTU probing not working on accept():ed socket if requested using setsockopt())

Question update: In addition to the problem below, it seems our client/server application using the Linux PLPMTUD mechanism gets too large path MTU. Has anyone seen this, i.e. actual path MTU being 1500, but getsockopt() w TCP_MAXSEG returning the MTU:s of the endpoints, in our case 3000? I have tried turning of GRO, GSO and TSO with ethtool but the error persists. Normal ping only manages to push through packets 1472 bytes or smaller. Also worth mentioning is that PLPMTUD works perfectly for smaller MTU:s. For example, w endpoints at 1500 MTU and one interface of the intermediate router set to e.g 1200 MTU, the kernel TCP probes and reports correct TCP_MAXSEG (1200 - headers).
I am using the Linux RFC4821-compliant packetization layer path MTU discovery in an application. Basically, the client does a setsockopt on a TCP socket:
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &sopt, sizeof(sopt));
with option value set to IP_PMTUDISC_PROBE. The setsockopt() does not return an error.
The client sends large tcp packets to a discard server, and the path MTU is calibrated by Linux kernel - tcpdump shows tcp packets with DF bit set being sent, packet size varies until the kernel knows the path MTU. However, to get this to work in the other direction (listening server accept:ing connections from clients, sending data and calibrating PMTU in direction from server to client) I have to set the global option for tcp path mtu discovery, /proc/sys/net/ipv4/tcp_mtu_probing. If I do not, server will stupidly continue to send too large packets, which get discarded by an intermediate router without ICMP sent back. Both endpoints have an MTU set to 3000, while the intermediate hops have MTU 1500.
I hope someone has an idea on what goes wrong. If more info is needed, let me know and I edit the question. Problems exist on both Linux kernel 4.2.0 and 3.19.0, both are stock Kubuntu LTS kernels. (x86/x86-64)
I do set the same socket option server-side as well, on all accept:ed sockets, before sending data in reverse direction.
FWIW, I have found workarounds/solutions for the problems, will do more testing but shortly describe my findings here, in case it helps someone else.
The problem with not being able to set path mtu discovery per socket was fixed, brute force, by enabling it system-wide during execution of my program, then disabling again.
The second problem, of incorrect path mtu returned by getsockopt TCP_MAXSEG, was fixed by waiting for TCP ACK of sent TCP data, also using getsockopt (tcp_info.tcpi_unacked). That way, I can be sure that probing has finished before I get TCP_MAXSEG.
Finally, there was a patchset for improving path mtu probing accuracy merged to mainline Linux kernel in March 2015. Without those patches, the probing is very imprecise. Patchset is part of 4.1.y-series kernels and onward.

how is TCP's checksum calculated when we use tcpdump to capture packets which we send out

I am trying to generate a series of packets to simulate the TCP 3-way handshake procedure, my first step is to capture the real connecting packets, and try to re-send the same packets from the same machine, but it didn't work at first.
finally I found it out that the packet I captured with tcpdump is not exactly what my computer sent out, the TCP's checksum field is changed and it lead me to thinkk that I can establish a tcp connection even the TCP checksum is incorrect.
so my question is how is the checksum field calculated? is it modified by tcpdump or hardware? why is it changed? Is it a bug of tcpdump? or it's because the calculation is omitted.
the following is the screenshot I captured from my host machine and a virtual machinne, you can see that the same packet captured on differnet machine are all the same except for the TCP checksum.
and the small window is my virtual machine, I used command "ssh 10.82.25.138" from the host to generate these packets
What you are seeing may be the result of checksum offloading. To quote from the wireshark wiki (http://wiki.wireshark.org/CaptureSetup/Offloading):
Most modern operating systems support some form of network offloading,
where some network processing happens on the NIC instead of the CPU.
Normally this is a great thing. It can free up resources on the rest
of the system and let it handle more connections. If you're trying to
capture traffic it can result in false errors and strange or even
missing traffic.
On systems that support checksum offloading, IP, TCP, and UDP
checksums are calculated on the NIC just before they're transmitted on
the wire. In Wireshark these show up as outgoing packets marked black
with red Text and the note [incorrect, should be xxxx (maybe caused by
"TCP checksum offload"?)].
Wireshark captures packets before they are sent to the network
adapter. It won't see the correct checksum because it has not been
calculated yet. Even worse, most OSes don't bother initialize this
data so you're probably seeing little chunks of memory that you
shouldn't.
Although this is for wireshark, the same principle applies. In your host machine, you see the wrong checksum because it just hasn't been filled in yet. It looks right on the guest, because before it's sent out on the "wire" it is filled in. Try disabling checksum offloading on the interface which is handling this traffic, e.g.:
ethtool -K eth0 rx off tx off
if it's eth0.

How to set the maximum TCP Maximum Segment Size on Linux?

In Linux, how do you set the maximum segment size that is allowed on a TCP connection? I need to set this for an application I did not write (so I cannot use setsockopt to do it). I need to set this ABOVE the mtu in the network stack.
I have two streams sharing the same network connection. One sends small packets periodically, which need absolute minimum latency. The other sends tons of data--I am using SCP to simulate that link.
I have setup traffic control (tc) to give the minimum latency traffic high priority. The problem I am running into, though, is that the TCP packets that are coming down from SCP end up with sizes up to 64K bytes. Yes, these are broken into smaller packets based on mtu, but this unfortunately occurs AFTER tc prioritizes the packets. Thus, my low latency packet gets stuck behind up to 64K bytes of SCP traffic.
This article indicates that on Windows you can set this value.
Is there something on Linux I can set? I've tried ip route and iptables, but these are applied too low in the network stack. I need to limit the TCP packet size before tc, so it can prioritize the high priority packets appropriately.
Are you using tcp segmentation offload to the nic? (You can use "ethtool -k $your_network_device" to see the offload settings.) This is the only way as far as I know that you would see 64k tcp packets with a device MTU of 1500. Not that this answers the question, but it might help avoid misdiagnosis.
ip route command with option advmss helps to set MSS value.
ip route add 192.168.1.0/24 dev eth0 advmss 1500
The upper bound of the advertised TCP MSS is the MTU of the first hop route. If you're seeing 64k segments, that tends to indicate that the first hop route MTU is excessively large - are you using loopback or something for testing?
MSS = MTU – 40bytes (standard TCP/IP overhead of 40 bytes [20+20])
If the MTU is 1500 bytes then the MSS will be 1460 bytes.
You are definitely misdiagnosing the problem; as someone else pointed out, tc doesn't see TCP packets, it sees IP packets, and they'd already be in chunks at that point.
You are probably just experiencing bufferbloat: you're overloading your outbound queue in a totally separate device (probably a DSL modem or cable modem). The only fix is to tell tc to limit your outbound bandwidth to less than the modem's bandwidth, eg. using TBF.

How frequently IP packets are fragmented at the source host?

I know that if IP payload > MTU then routers usually fragment the IP packet. Finally all the fragmented packets are assembled at the destination using the fields IP-ID, IP fragment offsets and fragmentation flags. Max length of IP payload is 64K. Thus its very plausible for L4 to hand over payload which is 64K. If the L2 protocol is Ethernet, which often is the case, then the MTU will be about 1600 bytes. Hence IP packet will be fragmented at the source host itself.
However, a quick search about IP implementation in Linux tells me that in recent kernels, L4 protocols are fragment friendly i.e. they try to save the fragmentation work for IP by handing over buffers of size which is close to MTU.
Considering these two facts, I am wondering about how frequently does the IP packet gets fragmented at the source host itself.
Does it occur sometimes/rarely/never?
Does anyone know if there are exceptions to the rule of fragmentation in linux kernel (i.e. are there situations where L4 protocols are not fragment friendly)?
How is this handled in other common OSes like windows?
In general how frequently IP packets are fragmented?
Though technically there shouldn't be any protocols that don't handle IP fragmentation correctly, there's a couple (such as NFS) that benefit greatly from lack of fragmentation.
How often you see fragmented packets is largely a function of your network environment. Packet encapsulation for VPNs, poorly-designed or implemented UDP protocols, and L1/L2 protocols that drop the end-to-end MTU below endpoint values can all cause IP fragmentation.
Most modern hosts implement PTMUD, which will automatically detect the MTU size unless non-compliant devices or over-paranoid firewalls are involved. In these days of Ethernet Everywhere, I don't expect them to be particularly common on the internet at large.

Resources