UDP IP Fragmentation and MTU

UDP IP Fragmentation and MTU - linux

I'm trying to understand some behavior I'm seeing in the context of sending UDP packets.
I have two little Java programs: one that transmits UDP packets, and the other that receives them. I'm running them locally on my network between two computers that are connected via a single switch.
The MTU setting (reported by /sbin/ifconfig) is 1500 on both network adapters.
If I send packets with a size < 1500, I receive them. Expected.
If I send packets with 1500 < size < 24258 I receive them. Expected. I have confirmed via wireshark that the IP layer is fragmenting them.
If I send packets with size > 24258, they are lost. Not Expected. When I run wireshark on the receiving side, I don't see any of these packets.
I was able to see similar behavior with ping -s.
ping -s 24258 hostA works but
ping -s 24259 hostA fails.
Does anyone understand what may be happening, or have ideas of what I should be looking for?
Both computers are running CentOS 5 64-bit. I'm using a 1.6 JDK, but I don't really think it's a programming problem, it's a networking or maybe OS problem.

Implementations of the IP protocol are not required to be capable of handling arbitrarily large packets. In theory, the maximum possible IP packet size is 65,535 octets, but the standard only requires that implementations support at least 576 octets.
It would appear that your host's implementation supports a maximum size much greater than 576, but still significantly smaller than the maximum theoretical size of 65,535. (I don't think the switch should be a problem, because it shouldn't need to do any defragmentation -- it's not even operating at the IP layer).
The IP standard further recommends that hosts not send packets larger than 576 bytes, unless they are certain that the receiving host can handle the larger packet size. You should maybe consider whether or not it would be better for your program to send a smaller packet size. 24,529 seems awfully large to me. I think there may be a possibility that a lot of hosts won't handle packets that large.
Note that these packet size limits are entirely separate from MTU (the maximum frame size supported by the data link layer protocol).

I found the following which may be of interest:
Determine the maximum size of a UDP datagram packet on Linux
Set the DF bit in the IP header and send continually larger packets to determine at what point a packet is fragmented as per Path MTU Discovery. Packet fragmentation should then result in a ICMP type 3 packet with code 4 indicating that the packet was too large to be sent without being fragmented.
Dan's answer is useful but note that after headers you're really limited to 65507 bytes.

Related

Why TCP/IP speed depends on the size of sending data?

When I sent small data (16 bytes and 128 bytes) continuously (use a 100-time loop without any inserted delay), the throughput of TCP_NODELAY setting seems not as good as normal setting. Additionally, TCP-slow-start appeared to affect the transmission in the beginning.
The reason is that I want to control a device from PC via Ethernet. The processing time of this device is around several microseconds, but the huge latency of sending command affected the entire system. Could you share me some ways to solve this problem? Thanks in advance.
Last time, I measured the transfer performance between a Windows-PC and a Linux embedded board. To verify the TCP_NODELAY, I setup a system with two Linux PCs connecting directly with each other, i.e. Linux PC <--> Router <--> Linux PC. The router was only used for two PCs.
The performance without TCP_NODELAY is shown as follows. It is easy to see that the throughput increased significantly when data size >= 64 KB. Additionally, when data size = 16 B, sometimes the received time dropped until 4.2 us. Do you have any idea of this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share with me your thinking. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket library are winsock2.h and sys/socket.h for Windows and Linux, respectively. The binary file is copied to an array in Windows before sending, and the received data are stored in an array in Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all data properly. However, it seems to me that the transfer time depends on the size of sending data. When data size is small, the received throughput is quite low, as shown below.
Could you please shown me some documents explaining this problem? Thank you in advance.

To establish a tcp connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender will start to send some data. How much depends on the initial congestion window (configurable on linux, don't know on windows). As long as the sender receives timely ACKs, it will continue to send, as long as the receivers advertised window has the space (use socket option SO_RCVBUF to set). Finally, to close the connection also requires a FIN, FIN-ACK, ACK.
So my best guess without more information is that the overhead of setting up and tearing down the TCP connection has a huge affect on the overhead of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much affect as long as the writer is effectively writing quickly. It only prevents sending less than full MSS segements, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less than full MSS segment might need to wait for an ACK, which again would have more impact on the short transfers as compared to the longer transfers.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2KB/second. I think you can guess that if I sent 2 bytes, it would take essentially the same amount of time for an effective rate of 8.2KB/second, a 100% improvement!
The best way to diagnose performance problems in networks is to get a network capture and analyze it.

When you make your test with a significative amount of data, for example your bigger test (512Mib, 536 millions bytes), the following happens.
The data is sent by TCP layer, breaking them in segments of a certain length. Let assume segments of 1460 bytes, so there will be about 367,000 segments.
For every segment transmitted there is a overhead (control and management added data to ensure good transmission): in your setup, there are 20 bytes for TCP, 20 for IP, and 16 for ethernet, for a total of 56 bytes every segment. Please note that this number is the minimum, not accounting the ethernet preamble for example; moreover sometimes IP and TCP overhead can be bigger because optional fields.
Well, 56 bytes for every segment (367,000 segments!) means that when you transmit 512Mib, you also transmit 56*367,000 = 20M bytes on the line. The total number of bytes becomes 536+20 = 556 millions of bytes, or 4.448 millions of bits. If you divide this number of bits by the time elapsed, 4.6 seconds, you get a bitrate of 966 megabits per second, which is higher than what you calculated not taking in account the overhead.
From the above calculus, it seems that your ethernet is a gigabit. It's maximum transfer rate should be 1,000 megabits per second and you are getting really near to it. The rest of the time is due to more overhead we didn't account for, and some latencies that are always present and tend to be cancelled as more data is transferred (but they will never be defeated completely).
I would say that your setup is ok. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, latencies of the protocol and other nice things get more and more important. For example, if you transmit 16 bytes in 165 microseconds (first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bitrate would be about 31 Mbps (40 times bigger). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16+56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also bigger. But, you see, transmitting 16 bytes on a TCP/IP link is the same as measuring the flow rate of an empty acqueduct by dropping some tears of water in it: the tears get lost before they reach the other end. This is because TCP/IP and ethernet are designed to carry much more data, with reliability.
Comments and answers in this page point out many of those mechanisms that trade bitrate and reactivity for reliability: the 3-way TCP handshake, the Nagle algorithm, checksums and other overhead, and so on.
Given the design of TCP+IP and ethernet, it is very normal that, for little data, performances are not optimal. From your tests you see that the transfer rate climbs steeply when the data size reaches 64Kbytes. This is not a coincidence.
From a comment you leaved above, it seems that you are looking for a low-latency communication, instead than one with big bandwidth. It is a common mistake to confuse different kind of performances. Moreover, in respect to this, I must say that TCP/IP and ethernet are completely non-deterministic. They are quick, of course, but nobody can say how much because there are too many layers in between. Even in your simple setup, if a single packet get lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want something with low latency, you should use something else, for example a CAN. Its design is exactly what you want: it transmits little data with high speed, low latency, deterministic time (just microseconds after you transmitted a packet, you know if it has been received or not. To be more precise: exactly at the end of the transmission of a packet you know if it reached the destination or not).

TCP sockets typically have a buffer size internally. In many implementations, it will wait a little bit of time before sending a packet to see if it can fill up the remaining space in the buffer before sending. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but due to the fact that the TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or function called something like TcpNoDelay which can be false (default) or true. I would try messing with that and seeing if that affects your throughput. Essentially these flags will enable/disable Nagle's algorithm.

Finding out the number of dropped packets in raw sockets

I am developing a program that sniffs network packets using a raw socket (AF_PACKET, SOCK_RAW) and processes them in some way.
I am not sure whether my program runs fast enough and succeeds to capture all packets on the socket. I am worried that the recieve buffer for this socket occainally gets full (due to traffic bursts) and some packets are dropped.
How do I know if packets were dropped due to lack of space in the
socket's receive buffer?
I have tried running ss -f link -nlp.
This outputs the number of bytes that are currently stored in the revice buffer for that socket, but I can not tell if any packets were dropped.
I am using Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-52-generic x86_64).
Thanks.

I was having a similar problem as you. I knew that tcpdump was able to to generate statistics about packet drops, so I tried to figure out how it did that. By looking at the code of tcpdump, I noticed that it is not generating those statistic by itself, but that it is using the libpcap library to get those statistics. The libpcap is on the other hand getting those statistics by accessing the if_packet.h header and calling the PACKET_STATISTICS socket option (at least I think so, but I'm no C expert).
Therefore, I saw only two solutions to the problem:
I had to interact somehow with the linux header files from my Pyhton script to get the packet statistics, which seemed a bit complicated.
Use the Python version of libpcap which is pypcap to get those information.
Since I had no clue how to do the first thing, I implemented the second option. Here is an example how to get packet statistics using pypcap and how to get the packet data using dpkg:
import pcap
import dpkt
import socket
pc=pcap.pcap(name="eth0", timeout_ms=10000, immediate=True)
def packet_handler(ts,pkt):
#printing packet statistic (packets received, packets dropped, packets dropped by interface
print pc.stats()
#example packet parsing using dpkt
eth=dpkt.ethernet.Ethernet(pkt)
if eth.type != dpkt.ethernet.ETH_TYPE_IP:
return
ip =eth.data
layer4=ip.data
ipsrc=socket.inet_ntoa(ip.src)
ipdst=socket.inet_ntoa(ip.dst)
pc.loop(0,packet_handler)
tpacket_stats structure is defined in linux/packet.h header file
Create variable using the tpacket_stats structre and pass it to getSockOpt with PACKET_STATISTICS SOL_SOCKET options will give packets received and dropped count.
-- some times drop can be due to buffer size
-- so if you want to decrease the drop count check increasing the buffersize using setsockopt function

First off, switch your operating system.
You need a reliable, network oriented operating system. Not some pink fluffy "ease of use" with "security" functionality enabled. NetBSD or Gentoo/ArchLinux (the bare installations, not the GUI kitted ones).
Start a simultaneous tcpdump on a network tap and capture the traffic you're supposed to receive along side of your program and compare the results.
There's no efficient way to check if you've received all the packets you intended to on the receiving end since the packets might be dropped on a lower level than you anticipate.
Also this is a question for Unix # StackOverflow, there's no programming here what I can see, at least there's no code.
The only certain way to verify packet drops is to have a much more beefy sender (perhaps a farm of machines that send packets) to a single client, record every packet sent to your reciever. Have the statistical data analyzed and compared against your senders and see how much you dropped.
The cheaper way is to buy a network tap or even more ad-hoc enable port mirroring in your switch if possible. This enables you to dump as much traffic as possible into a second machine.
This will give you a more accurate result because your application machine will be busy as it is taking care of incoming traffic and processing it.
Further more, this is why network taps are effective because they split the communication up into two channels, the receiving and sending directions of your traffic if you will. This enables you to capture traffic on two separate machines (also using tcpdump, but instead of a mirrored port, you get a more accurate traffic mirroring).
So either use port mirroring
Or you buy one of these:

socket UDP recvfrom not working for large packets even if buffer given is enough on LINUX

I am calling a recvfrom api of a valid address , where i am trying to read data of size 9600 bytes , the buffer i have provided i of size 12KB , I am not even getting select read events.
Even tough recommended MTU size is 1.5 KB, I am able to send and receive packets of 4 KB.
I am using android NDK , (Linux) for development.
Please help . Is there a socket Option i have to set to read large buffers ?

If you send a packet larger than the MTU, it will be fragmented. That is, it'll be broken up into smaller pieces, each which fits within the MTU. The problem with this is that if even one of those pieces is lost (quite likely on a cellular connection...), the entire packet will effectively disappear.
To determine whether this is the case you'll need to use a packet sniffer on one (or both) ends of the connection. Wireshark is a good choice on a PC end, or tcpdump on the android side (you'll need root). Keep in mind that home routers may reassemble fragmented packets - this means that if you're sniffing packets from inside a home router/firewall, you might not see any fragments arrive until all of them arrive at the router (and obviously if some are getting lost this won't happen).
A better option would be to simply ensure that you're always sending packets smaller than the MTU, of course. Fragmentation is almost never the right thing to be doing. Keep in mind that the MTU may vary at various hops along the path between server and client - you can either use the common choice of a bit less than 1500 (1400 ought to be safe), or try to probe for it by setting the MTU discovery flag on your UDP packets (via IP_MTU_DISCOVER) and always sending less than the value returned by getsockopt's IP_MTU option (including on retransmits!)

How to programmatically increase the per-socket buffer for UDP sockets on LInux?

I'm trying to understand the correct way to increase the socket buffer size on Linux for our streaming network application. The application receives variable bitrate data streamed to it on a number of UDP sockets. The volume of data is substantially higher at the start of the stream and I've used:
# sar -n UDP 1 200
to show that the UDP stack is discarding packets and
# ss -un -pa
to show that each socket Recv-Q length grows to the nearly the limit (124928. from sysctl net.core.rmem_default) before packets are discarded. This implies that the application simply can't keep up with the start of the stream. After discarding enough initial packets the data rate slows down and the application catches up. Recv-Q trends towards 0 and remains there for the duration.
I'm able to address the packet loss by substantially increasing the rmem_default value which increases the socket buffer size and gives the application time to recover from the large initial bursts. My understanding is that this changes the default allocation for all sockets on the system. I'd rather just increase the allocation for the specific UDP sockets and not modify the global default.
My initial strategy was to modify rmem_max and to use setsockopt(SO_RCVBUF) on each individual socket. However, this question makes me concerned about disabling Linux autotuning for all sockets and not just UDP.
udp(7) describes the udp_mem setting but I'm confused how these values interact with the rmem_default and rmem_max values. The language it uses is "all sockets", so my suspicion is that these settings apply to the complete UDP stack and not individual UDP sockets.
Is udp_rmem_min the setting I'm looking for? It seems to apply to individual sockets but global to all UDP sockets on the system.
Is there a way to safely increase the socket buffer length for the specific UDP ports used in my application without modifying any global settings?
Thanks.

Jim Gettys is armed and coming for you. Don't go to sleep.
The solution to network packet floods is almost never to increase buffering. Why is your protocol's queueing strategy not backing off? Why can't you just use TCP if you're trying to send so much data in a stream (which is what TCP was designed for).

Do MTU modifications impact both directions?

ifconfig 1.2.3.4 mtu 1492
This will set MTU to 1492 for incoming, outgoing packets or both? I think it is only for incoming

TLDR: Both. It will only transmit packets with a payload length less than or equal to that size. Similarly, it will only accept packets with a payload length within your MTU. If a device sends a larger packet, it should respond with an ICMP unreachable (oversized) message.
The nitty gritty:
Tuning the MTU for your device is useful because other hops between you and your destination may encapsulate your packet in another form (for example, a VPN or PPPoE.) This layer around your packet results in a bigger packet being sent along the wire. If this new, larger packet exceeds the maximum size of the layer, then the packet will be split into multiple packets (in a perfect world) or will be dropped entirely (in the real world.)
As a practical example, consider having a computer connected over ethernet to an ADSL modem that speaks PPPoE to an ISP. Ethernet allows for a 1500 byte payload, of which 8 bytes will be used by PPPoE. Now we're down to 1492 bytes that can be delivered in a single packet to your ISP. If you were to send a full-size ethernet payload of 1500 bytes, it would get "fragmented" by your router and split into two packets (one with a 1492 byte payload, the other with an 8 byte payload.)
The problem comes when you want to send more data over this connection - lets say you wanted to send 3000 bytes: your computer would split this up based on your MTU - in this case, two packets of 1500 bytes each, and send them to your ADSL modem which would then split them up so that it can fulfill its MTU. Now your 3000 byte data has been fragmented into four packets: two with a payload of 1492 bytes and two with a payload of 8 bytes. This is obviously inefficient, we really only need three packets to send this data. Had your computer been configured with the correct MTU for the network, it would have sent this as three packets in the first place (two 1492 byte packets and one 16 byte packet.)
To avoid this inefficiency, many IP stacks flip a bit in the IP header called "Don't Fragment." In this case, we would have sent our first 1500 byte packet to the ADSL modem and it would have rejected the packet, replying with an Internet Control (ICMP) message informing us that our packet is too large. We then would have retried the transmission with a smaller packet. This is called Path MTU discovery. Similarly, a layer below, at the TCP layer, another factor in avoiding fragmentation is the MSS (Maximum Segment Size) option where both hosts reply with the maximum size packet they can transfer without fragmenting. This is typically computed from the MTU.
The problem here arises when misconfigured firewalls drop all ICMP traffic. When you connect to (say) a web server, you build a TCP session and send that you're willing to accept TCP packets based on your 1500 byte MTU (since you're connected over ethernet to your router.) If the foreign web server wanted to send you a lot of data, they would split this into chunks that (when combined with the TCP and IP headers) came out to 1500 byte payloads and send them to you. Your ISP would receive one of these and then try to wrap it into a PPPoE packet to send to your ADSL modem, but it would be too large to send. So it would reply with an ICMP unreachable, which would (in a perfect world) cause the remote computer to downsize its MSS for the connection and retransmit. If there was a broken firewall in the way, however, this ICMP message would never be reached by the foreign web server and this packet would never make it to you.
Ultimately setting your MTU on your ethernet device is desirable to send the right size frames to your ADSL modem (to avoid it asking you to retransmit with a smaller frame), but it's critical to influence the MSS size you send to remote hosts when building TCP connections.

ifconfig ... mtu <value> sets the MTU for layer2 payloads sent out the interface, and will reject larger layer2 payloads received on this interface. You must ensure your MTU matches on both sides of an ethernet link; you should not have mismatched mtu values anywhere in the same ethernet broadcast domain. Note that the ethernet headers are not included in the MTU you are setting.
Also, ifconfig has not been maintained in linux for ages and is old and deprecated; sadly linux distributions still include it because they're afraid of breaking old scripts. This has the very negative effect of encouraging people to continue using it. You should be using the iproute2 family of commands:
[mpenning#hotcoffee ~]$ sudo ip link set mtu 1492 eth0
[mpenning#hotcoffee ~]$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1492 qdisc mq state UP qlen 1000
link/ether 00:1e:c9:cd:46:c8 brd ff:ff:ff:ff:ff:ff
[mpenning#hotcoffee ~]$

Large incoming packets may be dropped based on the interface MTU size.
For example, the default MTU 1500 on
Linux 2.6 CentOS (tested with Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01))
drops Jumbo packets >1504. Errors appear in ifconfig and there are rx_long_length_errors indications for this in ethtool -S output.
Increasing MTU indicates Jumbo packets should be supported.
The threshold for when to drop packets based on their size being too large appears to depend on MTU (-4096, -8192, etc.)
Oren

It's the Maximum Transmission Unit, so it definitely sets the outgoing maximum packet size. I'm not sure if will reject incoming packets larger than the MTU.

There is no doubt that MTU configured by ifconfig impacts Tx ip fragmentation, I have no more comments.
But for Rx direction, I find whether the parameter impacts incoming IP packets, it depends. Different manufacturer behaves differently.
I tested all the devices on hand and found 3 cases below.
Test case:
Device0 eth0 (192.168.225.1, mtu 2000)<--ETH cable-->Device1 eth0
(192.168.225.34, mtu MTU_SIZE)
On Device0 ping 192.168.225.34 -s ICMP_SIZE,
Checking how MTU_SIZE impacts Rx of Device1.
case 1:
Device1 = Linux 4.4.0 with Intel I218-LM:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1476, fails at ICMP_SIZE=1477 and above. It seems that there is a PRACTICAL MTU=1504 (20B(IP header)+8B(ICMP header)+1476B(ICMP data)).
When MTU_SIZE=1490, ping succeeds at ICMP_SIZE=1476, fails at ICMP_SIZE=1477 and above, behave the same as MTU_SIZE=1500.
When MTU_SIZE=1501, ping succeeds at ICMP_SIZE=1476, 1478, 1600, 1900. It seems that jumbo frame is switched on once MTU_SIZE is set >1500 and there is no 1504 restriction any more.
case 2:
Device1 = Linux 3.18.31 with Qualcomm Atheros AR8151 v2.0 Gigabit Ethernet:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1476, fails at ICMP_SIZE=1477 and above.
When MTU_SIZE=1490, ping succeeds at ICMP_SIZE=1466, fails at ICMP_SIZE=1467 and above.
When MTU_SIZE=1501, ping succeeds at ICMP_SIZE=1477, fails at ICMP_SIZE=1478 and above.
When MTU_SIZE=500, ping succeeds at ICMP_SIZE=476, fails at ICMP_SIZE=477 and above.
When MTU_SIZE=1900, ping succeeds at ICMP_SIZE=1876, fails at ICMP_SIZE=1877 and above.
This case behaves exactly as Edward Thomson said, except that in my test the PRACTICAL MTU=MTU_SIZE+4.
case 3:
Device1 = Linux 4.4.50 with Raspberry Pi 2 Module B ETH:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1472, fails at ICMP_SIZE=1473 and above. So there is a PRACTICAL MTU=1500 (20B(IP header)+8B(ICMP header)+1472B(ICMP data)) working there.
When MTU_SIZE=1490, behave the same as MTU_SIZE=1500.
When MTU_SIZE=1501, behave the same as MTU_SIZE=1500.
When MTU_SIZE=2000, behave the same as MTU_SIZE=1500.
When MTU_SIZE=500, behave the same as MTU_SIZE=1500.
This case behaves exactly as Ron Maupin said in Why MTU configuration doesn't take effect on receiving direction?.
To sum it all, in real world, after you set ifconfig mtu,
sometimes the Rx IP packts get dropped when exceed 1504 , no matter what MTU value you set (except that the jumbo frame is enabled).
sometimes the Rx IP packts get dropped when exceed the MTU+4 you set on receiving device.
sometimes the Rx IP packts get dropped when exceed 1500, no matter what MTU value you set.
... ...

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string