How frequently are IP packets fragmented at the source host? - Linux

I know that if the IP payload > MTU then routers usually fragment the IP packet, and all the fragments are reassembled at the destination using the IP-ID, fragment offset, and fragmentation flag fields. The maximum length of an IP payload is about 64K, so it is quite plausible for L4 to hand over a payload close to 64K. If the L2 protocol is Ethernet, which it often is, the MTU will be about 1500 bytes. Hence the IP packet would be fragmented at the source host itself.
However, a quick look at the IP implementation in Linux tells me that in recent kernels, L4 protocols are fragmentation-friendly, i.e. they try to save IP the fragmentation work by handing over buffers whose size is close to the MTU.
Considering these two facts, I am wondering how frequently IP packets actually get fragmented at the source host itself.
Does it occur sometimes/rarely/never?
Does anyone know if there are exceptions to this rule in the Linux kernel (i.e. are there situations where L4 protocols are not fragmentation-friendly)?
How is this handled in other common OSes like Windows?
In general, how frequently are IP packets fragmented?

Though technically there shouldn't be any protocols that don't handle IP fragmentation correctly, there are a couple (such as NFS) that benefit greatly from the absence of fragmentation.
How often you see fragmented packets is largely a function of your network environment. Packet encapsulation for VPNs, poorly-designed or implemented UDP protocols, and L1/L2 protocols that drop the end-to-end MTU below endpoint values can all cause IP fragmentation.
Most modern hosts implement PMTUD, which automatically discovers the path MTU unless non-compliant devices or over-paranoid firewalls get in the way. In these days of Ethernet everywhere, I don't expect fragmented packets to be particularly common on the internet at large.
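For what it's worth, you can watch PMTUD from user space on Linux: the kernel caches the discovered path MTU per route, and a connected socket can read it back through the IP_MTU socket option described in ip(7). A minimal sketch (the destination address is just a placeholder, and error handling is omitted):

#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Connect a datagram socket so the kernel associates a route
     * (and therefore a cached path MTU) with it. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9) };
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder address */

    int pmtu_mode = IP_PMTUDISC_DO;   /* set DF on outgoing packets */
    setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu_mode, sizeof(pmtu_mode));
    connect(fd, (struct sockaddr *)&dst, sizeof(dst));

    int mtu = 0;
    socklen_t len = sizeof(mtu);
    if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
        printf("current known path MTU: %d\n", mtu);

    close(fd);
    return 0;
}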

Related

How is TCP's checksum calculated when we use tcpdump to capture packets which we send out?

I am trying to generate a series of packets to simulate the TCP 3-way handshake. My first step was to capture the packets of a real connection and then try to re-send the same packets from the same machine, but it didn't work at first.
I finally found out that the packets I captured with tcpdump are not exactly what my computer sent out: the TCP checksum field is changed, which led me to think that I can establish a TCP connection even if the TCP checksum is incorrect.
So my question is: how is the checksum field calculated? Is it modified by tcpdump or by hardware? Why is it changed? Is it a bug in tcpdump, or is the calculation simply omitted?
The following is a screenshot I captured from my host machine and a virtual machine; you can see that the same packet captured on different machines is identical except for the TCP checksum.
The small window is my virtual machine; I used the command "ssh 10.82.25.138" from the host to generate these packets.
What you are seeing may be the result of checksum offloading. To quote from the Wireshark wiki (http://wiki.wireshark.org/CaptureSetup/Offloading):
Most modern operating systems support some form of network offloading,
where some network processing happens on the NIC instead of the CPU.
Normally this is a great thing. It can free up resources on the rest
of the system and let it handle more connections. If you're trying to
capture traffic it can result in false errors and strange or even
missing traffic.
On systems that support checksum offloading, IP, TCP, and UDP
checksums are calculated on the NIC just before they're transmitted on
the wire. In Wireshark these show up as outgoing packets marked black
with red Text and the note [incorrect, should be xxxx (maybe caused by
"TCP checksum offload"?)].
Wireshark captures packets before they are sent to the network
adapter. It won't see the correct checksum because it has not been
calculated yet. Even worse, most OSes don't bother to initialize this
data so you're probably seeing little chunks of memory that you
shouldn't.
Although this is for Wireshark, the same principle applies. On your host machine you see the wrong checksum because it just hasn't been filled in yet; it looks right on the guest because it is filled in before the packet goes out on the "wire". Try disabling checksum offloading on the interface which is handling this traffic, e.g.:
ethtool -K eth0 rx off tx off
if it's eth0.
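As for how the checksum itself is calculated (once something does fill it in): it is the standard Internet checksum, i.e. the 16-bit one's-complement of the one's-complement sum taken over a pseudo-header (source and destination IP, protocol, TCP length), the TCP header with the checksum field zeroed, and the payload. A rough sketch in C, with struct and function names of my own choosing (not from any particular stack):

#include <stdint.h>
#include <stddef.h>
#include <arpa/inet.h>

/* Pseudo-header that is prepended (conceptually) to the TCP segment
 * when computing the checksum (RFC 793 / RFC 1071). */
struct pseudo_hdr {
    uint32_t src;      /* source IPv4 address, network byte order      */
    uint32_t dst;      /* destination IPv4 address, network byte order */
    uint8_t  zero;     /* always 0                                     */
    uint8_t  proto;    /* IPPROTO_TCP = 6                              */
    uint16_t tcp_len;  /* TCP header + payload length, network order   */
};

/* Accumulate data into a one's-complement sum, two bytes at a time. */
static uint32_t csum_add(uint32_t sum, const void *data, size_t len)
{
    const uint8_t *p = data;
    while (len > 1) {
        sum += (uint32_t)(p[0] << 8 | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)                       /* odd trailing byte is padded with zero */
        sum += (uint32_t)(p[0] << 8);
    return sum;
}

/* Compute the TCP checksum for 'segment' (header + payload, with the
 * checksum field already zeroed). Store the result with htons(). */
uint16_t tcp_checksum(uint32_t src, uint32_t dst,
                      const void *segment, size_t seg_len)
{
    struct pseudo_hdr ph = { src, dst, 0, 6, htons((uint16_t)seg_len) };
    uint32_t sum = csum_add(0, &ph, sizeof(ph));
    sum = csum_add(sum, segment, seg_len);
    while (sum >> 16)              /* fold carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}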

UDP packet greater than 1500 bytes dropped

I'm developing a TFTP client and server and I want to select the UDP payload size dynamically to boost transfer performance.
I have tested it with two Linux machines (one has a gigabit Ethernet card, the other a Fast Ethernet one). I changed the MTU of the gigabit card to 2048 bytes and left the other at 1500.
I have used setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &optval, sizeof(optval)) to set the MTU_DISCOVER flag to IP_PMTUDISC_DO.
From what I have read, this option should set the DF bit and so make it possible to find the minimum MTU along the path (the MTU of the hop that has the lowest MTU). However, it only gives me an error when I send a packet bigger than the MTU of the machine I'm sending from.
Also, the other machine (the server in this case, which has an MTU of 1500) doesn't receive the oversized packets. All such UDP packets are dropped; the only thing that works is to send packets of at most 1472 bytes.
Why do the hosts do this? From what I have read, if I send a packet larger than the MTU, the IP layer should fragment it.
I fail to see the problem. You are setting the "don't fragment" bit, and you send a packet smaller than the sending host's MTU but larger than the receiving host's MTU. Of course nobody will fragment here (doing so would violate the DF bit). Instead, the sending host should get an ICMP message back.
Edit: IP specifies that an ICMP error message type 3 (destination unreachable) code 4 (Fragmentation Required but DF Bit Is Set) is sent to the originating host at the point where the fragmentation would have occurred. The TCP layer handles this on its own for PMTU discovery. On connection-less sockets, Linux reports the error in the socket's error queue if the IP_RECVERR option is activated; see ip(7).
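A minimal sketch of reading that error queue on a connectionless socket (assuming IP_RECVERR has already been enabled with setsockopt; the function name is mine, and this is illustrative rather than production code):

#include <errno.h>
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/errqueue.h>   /* struct sock_extended_err, SO_EE_ORIGIN_ICMP */

/* Drain one message from the socket's error queue and report the path
 * MTU if the error came from an ICMP "fragmentation needed". Assumes:
 *   int on = 1;
 *   setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
 * was done earlier on 'fd'. */
void check_error_queue(int fd)
{
    char cbuf[512];
    struct msghdr msg = { 0 };
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
        return;   /* nothing queued */

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_RECVERR) {
            struct sock_extended_err *ee = (struct sock_extended_err *)CMSG_DATA(cm);
            if (ee->ee_origin == SO_EE_ORIGIN_ICMP && ee->ee_errno == EMSGSIZE)
                printf("fragmentation needed, reported path MTU: %u\n", ee->ee_info);
        }
    }
}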
That "DF bit" you're setting, stands for "Don't Fragment". The IP layer should not be expected to fragment packets when you've told it not to.
It is not correct to run hosts with different interface MTUs on the same subnet [1].
This is a host/network misconfiguration, and IP path MTU discovery is not expected to work correctly in this situation.
If you wish to test your application's path MTU discovery, you will need to set up multiple subnets connected by a router [2], with different MTUs. In this situation, the router is the device that will pick up the MTU mismatch and send back an ICMP "Fragmentation Needed" error.
1. Well, technically, same broadcast domain.
2. The devices sold as "home routers" are really router/switches - they route between the WAN and the LAN, but switch between the ethernet ports on the LAN. This isn't sufficient to separate networks with different MTUs.

Where does the Linux kernel routing code replace the MAC address?

The router will replace the source MAC address of the packet it received with the address of the previous hop, and the destination MAC address with the address of the next hop.
Linux provides the functionality to work as a router. My question is: how does the kernel code implement the MAC address update during its packet forwarding process, and where is that code?
I tried to find the code in net/ipv4, but could not find anything...
That is not what actually happens.
IP is not dependent on ethernet, so what happens is dependent upon the underlying protocol of the lower layer.
The same thing happens if it is a locally-originated IP packet, or if it is one which has been routed for another host.
Linux's IPv4 stack is not ethernet-dependent in any way; in fact, lots of other link-layer protocols are supported by the kernel. Since IP is a WAN protocol, you can route between different underlying protocols. Some examples are:
ppp, slip (serial lines)
PPTP, GRE (for tunnels, mostly VPNs)
IP over ATM
Token ring (mostly legacy, I think)
Loopback and dummy (for local communication only)
Wifi (although this is actually mostly identical to ethernet)
So what actually happens when routing IP packets from one ethernet interface to another is that the link-layer header is stripped off completely, and a new link-layer header is built after routing. If the protocol were not ethernet, an appropriate link-layer header for that protocol would be used instead.
So nobody "changes the MAC address"; rather, the link-layer header is just completely rebuilt.

Tune MTU on a per-socket basis?

I was wondering if there is any way to tune, on a Linux system, the MTU for a given socket (to make the IP layer fragment into chunks smaller than the actual device MTU).
When I say for a given socket, I don't mean programmatically in the code of the application owning the socket, but rather externally, for example via a sysfs entry.
If there is currently no way to do that, do you have any ideas about where to hook/patch the Linux kernel to implement such a possibility?
Thanks.
EDIT: why the hell do I want to do that?
I'm doing some Layer3-in-Layer4 tunneling (e.g. tunneling IP and above through a TCP tunnel). Unlike VPN-like solutions, I'm not using a virtual interface to achieve that. I'm capturing packets using iptables, dropping them from their normal path, and writing them to the tunnel socket.
Think about the case of a big file transfer: all packets are filled up to the MTU size. When I tunnel them, I add some overhead, so every original packet produces two tunneled packets, which is sub-optimal.
If the socket is created such that DF is set on outgoing packets, you might have some luck spoofing (injecting) an ICMP "fragmentation needed" message back at yourself until you end up with the desired MTU. Rather ugly, but depending on how desperate you are it might be appropriate.
You could, for example, generate these packets with iptables rules, so the matching and sending is simple and external to your application. It looks like the REJECT target for iptables doesn't have a reject-with of "fragmentation needed" though; it probably wouldn't be too tricky to add one.
The other approach, if it's only TCP packets you care about, is that you might have some luck with the TCP_MAXSEG socket option or the TCPMSS iptables target, if that's appropriate to your problem (see the sketch after this answer).
For UDP or raw sockets you're free to send() packets as small as you fancy!
Update:
Based on the "why would I want to do that?" answer, it seems like fragmenting packets if DF isn't set or raising ICMP "fragmentation needed" and dropping would actually be the correct solution.
It's what a more "normal" router would do and provided firewalls don't eat the ICMP packet then it will behave sanely in all scenarios, whereas retrospectively changing things is a recipe for odd behaviour.
The iptables clamp mss is quite a good fix for TCP over this "VPN" though, especially as you're already making extensive use of iptables it seems.
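If changing the application (or a wrapper around it) were acceptable, clamping the MSS per socket looks roughly like this; a minimal sketch only, with an arbitrary example value, and the kernel may still lower the MSS further based on the route:

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_MAXSEG */
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Cap the MSS this socket will advertise and use. This must be set
     * before connect()/listen() to affect the value announced in the
     * SYN. 1200 is just an example value. */
    int mss = 1200;
    if (setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)) < 0)
        perror("setsockopt(TCP_MAXSEG)");

    /* Read back what the kernel recorded for this socket. */
    socklen_t len = sizeof(mss);
    getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len);
    printf("MSS recorded for this socket: %d\n", mss);

    return 0;
}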
MTU is a property of a link, not of a socket; they belong to different layers of the stack. That said, TCP performs path MTU discovery during the three-way handshake and tries very hard to avoid fragmentation, so you'll have a hard time making TCP send fragments. With UDP the easiest approach is to force some smallish MTU on an interface with ifconfig(8) and then send packets larger than that value.

How to set the maximum TCP Maximum Segment Size on Linux?

In Linux, how do you set the maximum segment size that is allowed on a TCP connection? I need to set this for an application I did not write (so I cannot use setsockopt to do it), and I need to set it ABOVE the MTU in the network stack.
I have two streams sharing the same network connection. One sends small packets periodically, which need absolute minimum latency. The other sends tons of data--I am using SCP to simulate that link.
I have set up traffic control (tc) to give the minimum-latency traffic high priority. The problem I am running into, though, is that the TCP packets coming down from SCP end up with sizes of up to 64K bytes. Yes, these are broken into smaller packets based on the MTU, but this unfortunately occurs AFTER tc prioritizes the packets. Thus, my low-latency packet gets stuck behind up to 64K bytes of SCP traffic.
This article indicates that on Windows you can set this value.
Is there something on Linux I can set? I've tried ip route and iptables, but these are applied too low in the network stack. I need to limit the TCP packet size before tc, so it can prioritize the high priority packets appropriately.
Are you using TCP segmentation offload to the NIC? (You can use "ethtool -k $your_network_device" to see the offload settings.) This is the only way, as far as I know, that you would see 64K TCP packets with a device MTU of 1500. Not that this answers the question, but it might help avoid misdiagnosis.
The ip route command with the advmss option helps to set the MSS value, e.g. for a 1500-byte MTU:
ip route add 192.168.1.0/24 dev eth0 advmss 1460
The upper bound of the advertised TCP MSS is the MTU of the first hop route. If you're seeing 64k segments, that tends to indicate that the first hop route MTU is excessively large - are you using loopback or something for testing?
MSS = MTU - 40 bytes (the standard TCP/IP overhead of 40 bytes [20 + 20]).
If the MTU is 1500 bytes, then the MSS will be 1460 bytes.
You are definitely misdiagnosing the problem; as someone else pointed out, tc doesn't see TCP packets, it sees IP packets, and they'd already be in chunks at that point.
You are probably just experiencing bufferbloat: you're overloading the outbound queue in a totally separate device (probably a DSL or cable modem). The only fix is to tell tc to limit your outbound bandwidth to less than the modem's bandwidth, e.g. using TBF.
