Significance of MTU for the loopback interface - Linux

I'm exploring/benchmarking various IPC mechanisms for low-latency communication between two processes on the same system. I'm using a RHEL 6 system for benchmarking.
I'm currently looking into socket-based communication through loopback. Since it is the loopback device, the packets never hit the NIC; instead, the Linux loopback driver loops them back to the destination.
However, looking at the output of netstat -i, I see an MTU defined for the loopback interface. What is its role, and what is the potential impact on bandwidth?
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
lo0 16384 localhost ::1 1738 - 1738 - -

Loopback isn't a physical interface, but the TCP/IP stack still runs a lot of operations on it.
To improve performance of local transfers, kernel developers bumped its MTU from 16 KB to 64 KB.
See this commit in the Linux kernel and its rationale:
loopback current mtu of 16436 bytes allows no more than 3 MSS TCP
segments per frame, or 48 Kbytes. Changing mtu to 64K allows TCP
stack to build large frames and significantly reduces stack overhead.
Performance boost on bulk TCP transfers can be up to 30%, partly because we now have one ACK message for two 64KB segments, and a lower
probability of hitting /proc/sys/net/ipv4/tcp_reordering default limit.
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
static void loopback_setup(struct net_device *dev)
{
- dev->mtu = (16 * 1024) + 20 + 20 + 12;
+ dev->mtu = 64 * 1024;
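If you want to check or experiment with the loopback MTU yourself, a minimal sketch (assuming the usual Linux interface name lo; 65536 is just the newer kernel default, not something you normally need to set by hand):
# show the current loopback MTU
ip link show lo
# temporarily raise it to the 64 KB default used by newer kernels (needs root)
sudo ip link set dev lo mtu 65536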

Related

Why isn't increasing the networking buffer sizes reducing packet drops?

Running Ubuntu 18.04.4 LTS
I have a high-bandwidth file transfer application (UDP) that I'm testing locally using the loopback interface.
With no simulated latency, I can transfer a 1GB file at maximum speed with <1% packet loss. To achieve this, I had to increase the networking buffer sizes from ~200KB to 8MB:
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
sudo sysctl -p
For additional testing, I wanted to add a simulated latency of 100ms. This is intended to simulate propagation delay, not queuing delay. I accomplished this using the Linux traffic control (tc) tool:
sudo tc qdisc add dev lo root netem delay 100ms
After adding the latency, packet loss for the 1GB transfer at maximum speed went from <1% to ~97%. In a real network, latency caused by propagation delay shouldn't cause packet loss, so I think the issue is that in order to simulate latency the kernel has to store packets in RAM while applying the delay. Since my buffers were only set to 8MB, it made sense that a significant number of packets would be dropped once simulated latency was added.
I increased my buffer sizes to 50MB:
sudo sysctl -w net.core.rmem_max=52428800
sudo sysctl -w net.core.wmem_max=52428800
sudo sysctl -p
However, there was no noticeable reduction in packet loss. I also attempted 1GB buffer sizes with similar results (my system has >90GB of RAM available).
Why did increasing system network buffer sizes not work in this case?
For some versions of tc, if you do not specify a buffer count limit, tc will default to 1000 buffers.
You can check how many buffers tc is currently using by running:
tc -s qdisc ls dev <device>
For example on my system, where I’ve simulated a 0.1s delay on the eth0 interface I get:
$ tc -s qdisc ls dev eth0
qdisc netem 8024: root refcnt 2 limit 1000 delay 0.1s
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
This shows that I have a limit of 1000 buffers available to fill during my 0.1s delay period. If I go over this many buffers in my delay timeframe, the system will start dropping packets. This means I have a packets-per-second (pps) limit of:
pps = buffers / delay
pps = 1000 / 0.1
pps = 10000
If I go beyond this limit, the system will be forced to either drop the incoming packet right away or replace a queued packet, dropping it instead.
Since we don’t normally think of network flows in pps, it’s useful to convert from pps to Bps, KBps, or GBps. This can be done by multiplying by either the network MTU (generally 1500 bytes), the buffer size (varies by system), or ideally by the observed average number of bytes per packet seen by your system on the given interface. Since we don’t know the average bytes per packet or the buffer size of your system at the moment, we’ll fall back to using the typical MTU.
byte rate = pps * bytes per packet
byte rate = 10000pps * 1500 bytes per packet
byte rate = 15000000 Bytes per second
byte rate = 15 MBps
If we are talking about a loopback interface that normally runs at an average of, say, ~5 GBps, such as what iperf3 reports for the loopback interface on this MacBook, we can see the problem right away: our tc limit of 15 MBps is far less than the interface’s practical limit of ~5 GBps.
So if we were transferring a 1GB file over the loopback interface of this system, it should take:
time = file size / byte rate
time = 1 GB / 5 GBps
time = 0.2 seconds
to transfer the file across the loopback interface. And the loss, assuming packet size matches buffer size, would be:
packets lost = packets - ((packets that fit in buffers) + (drain rate of buffers * timeframe))
packets lost = (file size / MTU) - ((buffer count) + (drain rate * timeframe))
packets lost = (1 GB / 1500 bytes) - ((10000) + (10000Hz * 0.2 seconds))
packets lost = 654667
And that’s out of:
packets = (file size / MTU)
packets = (1 GB / 1500 bytes)
packets = 666667
So in all that would be a loss percentage of:
loss % = 100 * (lost) / (total)
loss % = 100 * 654667 / 666667
loss % = 98.2%
Which happens to be roughly in line with what you are seeing.
So why didn’t increasing the system buffer size impact your losses? After all, the buffer size is part of the computation.
The answer is that the method you are using to transmit your file is likely chunking according to its best guess at the MTU (likely 1500 bytes), so the packets only make use of the first 1500 bytes of your extra-large buffers.
Thus the solution should probably be to increase the number of buffers available to tc instead of increasing the system buffer size. But how many buffers do you need for this link? Based on this answer, the recommendation is to use 150% of the expected number of packets for your delay, so that’s:
buffers = (network rate / avg packet size) * delay * 150%
buffers = (5GBps / 1500B) * 0.1s * 150%
buffers = 333000 * 150%
buffers = 500000
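As a quick sanity check, the same sizing can be done on the command line (a rough sketch; the 5 GBps rate, 1500-byte packets, and 0.1 s delay are the assumed figures from above):
# required netem limit ~= (rate / packet size) * delay * 1.5
awk 'BEGIN { printf "limit ~ %.0f packets\n", (5e9 / 1500) * 0.1 * 1.5 }'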
You can see right away that that’s 500 times as many buffers as tc tries to use by default, or to put it another way you only had 2% of the buffers you needed so you saw 98% loss.
Thus to fix your problem, try changing your tc command from something like:
sudo tc qdisc add dev <device> root netem delay 0.1s
To something like:
sudo tc qdisc add dev <device> root netem delay 0.1s limit 500000
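After changing the qdisc, a reasonable follow-up (the device name here is just an example) is to confirm the new limit and watch the drop counters while the transfer runs:
# update the existing qdisc in place, then inspect its statistics
sudo tc qdisc change dev lo root netem delay 0.1s limit 500000
tc -s qdisc ls dev lo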
To my knowledge, even though it's not what you are trying to achieve, you should probably throttle the rate at which you are sending UDP packets, because, as pointed out by @user3878723, the buffers will quickly fill up and packets will be lost. Said differently - much as @Ron Maupin noted - when the delay is applied, the interface gets congested. The emitting process isn't aware of the 100ms delay, so it can quickly overwhelm all available resources.
Instead, you may have to add something like a Token Bucket Filter (TBF) if you want to go further with this particular use case. Also consider "rate control"; a sketch follows below.
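A hedged sketch of that idea, pairing a TBF rate limiter with the netem delay on the loopback device (the rate, burst, latency, and limit values are placeholders you would tune for your link):
sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root handle 1: tbf rate 1gbit burst 1mb latency 50ms
sudo tc qdisc add dev lo parent 1:1 handle 10: netem delay 100ms limit 500000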
UPDATE
It could be worth modifying these parameters and making them persistent (one way to do that is sketched below):
net.core.rmem_default
net.core.wmem_default
And/or make sure you are correctly using these options in your emitter/receiver:
SO_SNDBUF
SO_RCVBUF
so that the whole chain has enough buffer space.
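One way to persist those defaults (the file name is just an example) is a drop-in under /etc/sysctl.d:
echo 'net.core.rmem_default = 8388608' | sudo tee -a /etc/sysctl.d/90-udp-buffers.conf
echo 'net.core.wmem_default = 8388608' | sudo tee -a /etc/sysctl.d/90-udp-buffers.conf
sudo sysctl --system   # reload all sysctl configuration files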

iperf UDP over IPv6

I'm doing some UDP bandwidth tests using iperf (https://iperf.fr/) over IPv6. I have very bad results when using a Linux UDP client with the following command line:
iperf -u -V -c fe80::5910:d4ff:fe31:5b10%usb0 -b 100m
Investigating the issue with Wireshark, I can see some fragmentation while the client is sending data. To be more precise, I see outgoing UDP client packets of 1510 bytes and 92 bytes, alternating.
For example, the UDP packets that I see have the following pattern (in size): 1510, 92, 1510, 92, 1510, 92,...,1510, 92,...
Reading the iperf2 documentation, I found the following for the -l option:
The length of buffers to read or write. iPerf works by writing an array of len bytes a number of times. Default is 8 KB for TCP, 1470 bytes for UDP. Note for UDP, this is the datagram size and needs to be lowered when using IPv6 addressing to 1450 or less to avoid fragmentation. See also the -n and -t options.
I have tried to do the same bandwidth test by replacing the Linux iperf UDP client command line with the following:
iperf -u -V -c fe80::5910:d4ff:fe31:5b10%usb0 -b 100m -l1450
and I see good results. Looking at the Wireshark capture I see no fragmentation anymore.
Doing the same test over IPv4, I don't need to change the default UDP datagram size (I don't need to use the '-l' option) to get good results.
So my conclusion is that fragmentation (over IPv6) is responsible for the poor bandwidth performance.
Anyway, I'm wondering what really happens when setting the UDP datagram size to 1450 over IPv6. Why do I get fragmentation over IPv6 but not over IPv4 with the default UDP datagram size? Moreover, why is there no fragmentation when the UDP datagram size is reduced to 1450?
Thank you.
The base IPv4 header is 20 bytes, the base IPv6 header is 40 bytes and the UDP header is 8 bytes.
With IPv4 the total packet size is 1470+8+20=1498, which is less than the default ethernet MTU of 1500.
With IPv6 the total packet size is 1470+8+40=1518, which is more than 1500 and has to be fragmented.
Now let's look into your observations. You see packets of size 1510 and 92. Those include the ethernet header, which is 14 bytes. Your IPv6 packets are therefore 1496 and 78 bytes. The contents of the big packet are: IPv6 header (40 bytes), a fragmentation header (8), the UDP header (8) and 1440 bytes of data. The smaller packet contains the IPv6 header (40), a fragmentation header (8) and the remaining 30 bytes of data.
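Working backwards, the largest UDP payload that avoids fragmentation on a 1500-byte MTU link over IPv6 is 1500 - 40 - 8 = 1452 bytes, so a variant of your own command that should just fit in a single frame would be:
iperf -u -V -c fe80::5910:d4ff:fe31:5b10%usb0 -b 100m -l 1452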
The most common MTU for Ethernet is 1500, not including the Ethernet frame headers. This means you can send 1500 bytes in one packet over the wire, including IP headers. IPv6 headers are larger than IPv4 headers for several reasons, the most important being that IPv6 addresses are larger than IPv4 addresses. So when you run with the default value over IPv6, your packet size exceeds the MTU and the packet needs to be split in two, a procedure known as fragmentation.

Traffic shaping with tc is inaccurate with high bandwidth and delay

I'm using tc with kernel 2.6.38.8 for traffic shaping. Limiting bandwidth works, adding delay works, but when shaping both bandwidth and delay together, the achieved bandwidth is always much lower than the limit if the limit is >1.5 Mbps or so.
Example:
tc qdisc del dev usb0 root
tc qdisc add dev usb0 root handle 1: tbf rate 2Mbit burst 100kb latency 300ms
tc qdisc add dev usb0 parent 1:1 handle 10: netem limit 2000 delay 200ms
Yields a delay (from ping) of 201 ms, but a capacity of just 1.66 Mbps (from iperf). If I eliminate the delay, the bandwidth is precisely 2 Mbps. If I specify a bandwidth of 1 Mbps and 200 ms RTT, everything works. I've also tried ipfw + dummynet, which yields similar results.
I've tried rebuilding the kernel with HZ=1000 in Kconfig -- that didn't fix the problem. Other ideas?
It's actually not a problem; it behaves just as it should. Because you've added 200ms of latency, the full 2Mbps pipe isn't used at its full potential. I would suggest you study the TCP/IP protocol in more detail, but here is a short summary of what is happening with iperf: your default window size is maybe 3 packets (likely 1500 bytes each). You fill your pipe with 3 packets, but now have to wait until you get an acknowledgement back (this is part of the congestion control mechanism). Since you delay the sending by 200ms, this will take a while. Now your window size doubles and you can send 6 packets, but then you have to wait 200ms again. Then the window size doubles again, but by the time your window is completely open, the default 10-second iperf test is close to over and your average bandwidth will obviously be smaller.
Think of it like this:
Suppose you set your latency to 1 hour, and your speed to 2 Mbit/s.
2 Mbit/s requires (for example) 50 Kbit/s for TCP ACKs. Because the ACKs take over an hour to reach the source, the source can't keep sending at 2 Mbit/s: the TCP window is still stuck waiting on the first acknowledgement.
Latency and bandwidth are more related than you might think (in TCP at least; UDP is a different story).
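For a concrete number, the bandwidth-delay product of the shaped link above is what the TCP window has to cover before the pipe can be kept full (a rough sketch):
# 2 Mbit/s * 0.2 s RTT = 50 KB "in flight" before the first ACK can return
awk 'BEGIN { printf "BDP = %.0f bytes\n", (2e6 / 8) * 0.2 }'
If iperf's TCP window (the -w option) is smaller than that, the 2 Mbit pipe can never be filled no matter how long the test runs, so it may be worth forcing a larger window, e.g. iperf -c <host> -w 64k.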

Do MTU modifications impact both directions?

ifconfig 1.2.3.4 mtu 1492
Will this set the MTU to 1492 for incoming packets, outgoing packets, or both? I think it is only for incoming.
TLDR: Both. It will only transmit packets with a payload length less than or equal to that size. Similarly, it will only accept packets with a payload length within your MTU. If a device sends a larger packet, it should respond with an ICMP unreachable (oversized) message.
The nitty gritty:
Tuning the MTU for your device is useful because other hops between you and your destination may encapsulate your packet in another form (for example, a VPN or PPPoE.) This layer around your packet results in a bigger packet being sent along the wire. If this new, larger packet exceeds the maximum size of the layer, then the packet will be split into multiple packets (in a perfect world) or will be dropped entirely (in the real world.)
As a practical example, consider having a computer connected over ethernet to an ADSL modem that speaks PPPoE to an ISP. Ethernet allows for a 1500 byte payload, of which 8 bytes will be used by PPPoE. Now we're down to 1492 bytes that can be delivered in a single packet to your ISP. If you were to send a full-size ethernet payload of 1500 bytes, it would get "fragmented" by your router and split into two packets (one with a 1492 byte payload, the other with an 8 byte payload.)
The problem comes when you want to send more data over this connection - let's say you wanted to send 3000 bytes: your computer would split this up based on your MTU - in this case, two packets of 1500 bytes each - and send them to your ADSL modem, which would then split them up so that it can fulfill its MTU. Now your 3000 bytes of data have been fragmented into four packets: two with a payload of 1492 bytes and two with a payload of 8 bytes. This is obviously inefficient; we really only need three packets to send this data. Had your computer been configured with the correct MTU for the network, it would have sent this as three packets in the first place (two 1492 byte packets and one 16 byte packet).
To avoid this inefficiency, many IP stacks flip a bit in the IP header called "Don't Fragment." In this case, we would have sent our first 1500 byte packet to the ADSL modem and it would have rejected the packet, replying with an Internet Control Message Protocol (ICMP) message informing us that our packet is too large. We then would have retried the transmission with a smaller packet. This is called Path MTU Discovery. Similarly, a layer above, at the TCP layer, another factor in avoiding fragmentation is the MSS (Maximum Segment Size) option, where each host advertises the largest segment it can receive without fragmenting. This is typically computed from the MTU.
The problem here arises when misconfigured firewalls drop all ICMP traffic. When you connect to (say) a web server, you build a TCP session and announce that you're willing to accept TCP packets based on your 1500 byte MTU (since you're connected over Ethernet to your router). If the foreign web server wanted to send you a lot of data, it would split this into chunks that (when combined with the TCP and IP headers) came out to 1500 byte payloads and send them to you. Your ISP would receive one of these and then try to wrap it into a PPPoE packet to send to your ADSL modem, but it would be too large to send. So it would reply with an ICMP unreachable, which would (in a perfect world) cause the remote computer to downsize its MSS for the connection and retransmit. If there was a broken firewall in the way, however, this ICMP message would never reach the foreign web server and the packet would never make it to you.
Ultimately, setting the MTU on your Ethernet device is desirable so that you send the right size frames to your ADSL modem (avoiding it asking you to retransmit with a smaller frame), but it's also critical because it influences the MSS you advertise to remote hosts when building TCP connections.
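One common mitigation on the router that does the encapsulation (not part of the answer above, just a sketch of the usual approach) is to clamp the TCP MSS to the path MTU, so that even with ICMP blocked somewhere the endpoints still negotiate small enough segments:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu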
ifconfig ... mtu <value> sets the MTU for layer-2 payloads sent out the interface, and will reject larger layer-2 payloads received on this interface. You must ensure your MTU matches on both sides of an Ethernet link; you should not have mismatched MTU values anywhere in the same Ethernet broadcast domain. Note that the Ethernet headers are not included in the MTU you are setting.
Also, ifconfig has not been maintained in Linux for ages and is old and deprecated; sadly, Linux distributions still include it because they're afraid of breaking old scripts. This has the very negative effect of encouraging people to continue using it. You should be using the iproute2 family of commands:
[mpenning@hotcoffee ~]$ sudo ip link set mtu 1492 eth0
[mpenning@hotcoffee ~]$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1492 qdisc mq state UP qlen 1000
    link/ether 00:1e:c9:cd:46:c8 brd ff:ff:ff:ff:ff:ff
[mpenning@hotcoffee ~]$
Large incoming packets may be dropped based on the interface MTU size.
For example, the default MTU of 1500 on Linux 2.6 CentOS (tested with an Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)) drops jumbo packets larger than 1504 bytes. The errors show up in ifconfig output, and ethtool -S reports them as rx_long_length_errors.
Increasing the MTU allows jumbo packets to be accepted.
The threshold for dropping packets as too large appears to depend on the MTU (-4096, -8192, etc.).
Oren
It's the Maximum Transmission Unit, so it definitely sets the maximum outgoing packet size. I'm not sure whether it will reject incoming packets larger than the MTU.
There is no doubt that the MTU configured by ifconfig affects Tx IP fragmentation; I have no further comments on that.
But for the Rx direction, whether the parameter affects incoming IP packets depends on the hardware; different manufacturers behave differently.
I tested all the devices I had on hand and found the three cases below.
Test case:
Device0 eth0 (192.168.225.1, mtu 2000) <--ETH cable--> Device1 eth0 (192.168.225.34, mtu MTU_SIZE)
On Device0, run ping 192.168.225.34 -s ICMP_SIZE and check how MTU_SIZE affects Rx on Device1.
case 1:
Device1 = Linux 4.4.0 with Intel I218-LM:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1476, fails at ICMP_SIZE=1477 and above. It seems that there is a PRACTICAL MTU=1504 (20B(IP header)+8B(ICMP header)+1476B(ICMP data)).
When MTU_SIZE=1490, ping succeeds at ICMP_SIZE=1476 and fails at ICMP_SIZE=1477 and above, behaving the same as MTU_SIZE=1500.
When MTU_SIZE=1501, ping succeeds at ICMP_SIZE=1476, 1478, 1600, 1900. It seems that jumbo frames are switched on once MTU_SIZE is set above 1500, and there is no 1504 restriction any more.
case 2:
Device1 = Linux 3.18.31 with Qualcomm Atheros AR8151 v2.0 Gigabit Ethernet:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1476, fails at ICMP_SIZE=1477 and above.
When MTU_SIZE=1490, ping succeeds at ICMP_SIZE=1466, fails at ICMP_SIZE=1467 and above.
When MTU_SIZE=1501, ping succeeds at ICMP_SIZE=1477, fails at ICMP_SIZE=1478 and above.
When MTU_SIZE=500, ping succeeds at ICMP_SIZE=476, fails at ICMP_SIZE=477 and above.
When MTU_SIZE=1900, ping succeeds at ICMP_SIZE=1876, fails at ICMP_SIZE=1877 and above.
This case behaves exactly as Edward Thomson said, except that in my test the PRACTICAL MTU=MTU_SIZE+4.
case 3:
Device1 = Linux 4.4.50 with Raspberry Pi 2 Module B ETH:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1472, fails at ICMP_SIZE=1473 and above. So there is a PRACTICAL MTU=1500 (20B(IP header)+8B(ICMP header)+1472B(ICMP data)) working there.
When MTU_SIZE=1490, it behaves the same as MTU_SIZE=1500.
When MTU_SIZE=1501, it behaves the same as MTU_SIZE=1500.
When MTU_SIZE=2000, it behaves the same as MTU_SIZE=1500.
When MTU_SIZE=500, it behaves the same as MTU_SIZE=1500.
This case behaves exactly as Ron Maupin said in Why MTU configuration doesn't take effect on receiving direction?.
To sum it all up, in the real world, after you set the MTU with ifconfig:
sometimes the Rx IP packets get dropped when they exceed 1504, no matter what MTU value you set (unless jumbo frames are enabled);
sometimes the Rx IP packets get dropped when they exceed the MTU+4 you set on the receiving device;
sometimes the Rx IP packets get dropped when they exceed 1500, no matter what MTU value you set.
... ...
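If you want to repeat this kind of Rx test on your own hardware, the commands amount to something like the following sketch (addresses and sizes taken from the test case above; adjust for your interfaces):
# on Device1: set the MTU under test
sudo ip link set dev eth0 mtu 1490
# on Device0: probe with growing ICMP payloads and watch where replies stop
ping -c 3 -s 1462 192.168.225.34   # 1462 + 8 (ICMP) + 20 (IP) = 1490 bytes of IP packet
ping -c 3 -s 1463 192.168.225.34   # one byte over; whether it is dropped depends on the NIC, as the cases above show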

UDP IP Fragmentation and MTU

I'm trying to understand some behavior I'm seeing in the context of sending UDP packets.
I have two little Java programs: one that transmits UDP packets, and the other that receives them. I'm running them locally on my network between two computers that are connected via a single switch.
The MTU setting (reported by /sbin/ifconfig) is 1500 on both network adapters.
If I send packets with a size < 1500, I receive them. Expected.
If I send packets with 1500 < size < 24258 I receive them. Expected. I have confirmed via wireshark that the IP layer is fragmenting them.
If I send packets with size > 24258, they are lost. Not Expected. When I run wireshark on the receiving side, I don't see any of these packets.
I was able to see similar behavior with ping -s.
ping -s 24258 hostA works but
ping -s 24259 hostA fails.
Does anyone understand what may be happening, or have ideas of what I should be looking for?
Both computers are running CentOS 5 64-bit. I'm using a 1.6 JDK, but I don't really think it's a programming problem, it's a networking or maybe OS problem.
Implementations of the IP protocol are not required to be capable of handling arbitrarily large packets. In theory, the maximum possible IP packet size is 65,535 octets, but the standard only requires that implementations support at least 576 octets.
It would appear that your host's implementation supports a maximum size much greater than 576, but still significantly smaller than the theoretical maximum of 65,535. (I don't think the switch is the problem, because it shouldn't need to do any reassembly -- it's not even operating at the IP layer.)
The IP standard further recommends that hosts not send packets larger than 576 bytes unless they are certain that the receiving host can handle the larger packet size. You should consider whether it would be better for your program to send smaller packets. 24,258 bytes seems awfully large to me; I think there is a real possibility that many hosts won't handle packets that large.
Note that these packet size limits are entirely separate from MTU (the maximum frame size supported by the data link layer protocol).
I found the following which may be of interest:
Determine the maximum size of a UDP datagram packet on Linux
Set the DF bit in the IP header and send increasingly larger packets to determine at what point a packet would need to be fragmented, as per Path MTU Discovery. Exceeding the path MTU should then result in an ICMP type 3, code 4 message ("fragmentation needed and DF set"), indicating that the packet was too large to be sent without being fragmented.
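A minimal way to run that probe with standard tools (iputils ping, where -M do sets the DF bit; hostA is the host from the question):
ping -M do -s 1472 hostA   # 1472 + 8 (ICMP) + 20 (IP) = 1500, fits a standard Ethernet path MTU
ping -M do -s 1473 hostA   # 1501 bytes; should fail with "message too long" if the path MTU is 1500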
Dan's answer is useful but note that after headers you're really limited to 65507 bytes.
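That 65,507 figure is simply the 16-bit IPv4 total-length limit minus the headers:
# 65535 (max IPv4 packet) - 20 (IP header) - 8 (UDP header) = 65507 bytes of UDP payload
echo $((65535 - 20 - 8))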

Resources