UDP packet drop - InErrors vs RcvbufErrors - Linux

I wrote a simple UDP Server program to understand more about possible network bottlenecks.
UDP Server: Creates a UDP socket, binds it to a specified address and port, and adds the socket file descriptor to an epoll interest list. It then waits in epoll_wait for incoming packets. When a packet arrives (EPOLLIN), it reads the packet and just prints the received packet length. Pretty simple, right :)
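For reference, a minimal sketch of a server along those lines (not the exact program from this question; error handling omitted, port 9996 taken from the hping command below):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sockfd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(9996);                 /* port used in the hping test */
    bind(sockfd, (struct sockaddr *)&addr, sizeof(addr));

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

    char buf[2048];
    for (;;) {
        struct epoll_event events[8];
        int n = epoll_wait(epfd, events, 8, -1);        /* block until packets arrive  */
        for (int i = 0; i < n; i++) {
            ssize_t len = recv(events[i].data.fd, buf, sizeof(buf), 0);
            if (len >= 0)
                printf("received %zd bytes\n", len);    /* just report the length      */
        }
    }
}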
UDP Client: I used hping as shown below:
hping3 192.168.1.2 --udp -p 9996 --flood -d 100
When I send UDP packets at 100 packets per second, I don't see any UDP packet loss. But when I flood UDP packets (as shown in the command above), I see significant packet loss.
Test1:
When 26356 packets are flooded from the UDP client, my sample program receives only 12127 packets; the remaining 14230 packets are dropped by the kernel, as the /proc/net/snmp output shows.
cat /proc/net/snmp | grep Udp:
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 12372 0 14230 218 14230 0
For Test1, the packet loss percentage is ~54%.
I verified with "ethtool -S ethX" on both the client and the server that there is not much loss at the hardware level, while at the application level I see a loss of ~54% as noted above.
Hence, to reduce the packet loss, I tried the following:
- Increased the priority of my sample program using renice command.
- Increased Receive Buffer size (both at system level and process level)
Bump up the priority to -20:
renice -20 2022
2022 (process ID) old priority 0, new priority -20
Bump up the receive buf size to 16MB:
At Process Level:
int sockbufsize = 16777216;
setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &sockbufsize, sizeof(sockbufsize));
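It is worth reading back what the kernel actually granted: Linux doubles the value passed to SO_RCVBUF to allow for bookkeeping overhead and silently caps the request at rmem_max, so getsockopt shows the effective size:

int granted = 0;
socklen_t optlen = sizeof(granted);
getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &granted, &optlen);
printf("effective receive buffer: %d bytes\n", granted);   /* roughly 2x the requested value */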
At Kernel Level:
cat /proc/sys/net/core/rmem_default
16777216
cat /proc/sys/net/core/rmem_max
16777216
After these changes, performed Test2.
Test2:
When 1985076 packets are flooded from the UDP client, my sample program receives 1848791 packets; the remaining 136286 packets are dropped by the kernel, as the /proc/net/snmp output shows.
cat /proc/net/snmp | grep Udp:
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 1849064 0 136286 236 0 0
For Test2, the packet loss percentage is ~7%.
Packet loss is reduced significantly, but I have the following questions:
Can the packet loss be reduced further? I know I am being greedy here :) but I am just trying to find out whether it's possible to reduce the loss further.
Unlike in Test1, in Test2 InErrors doesn't match RcvbufErrors, and RcvbufErrors is always zero. Can someone please explain the reason behind this? What exactly is the difference between InErrors and RcvbufErrors? I understand RcvbufErrors but not InErrors.
Thanks for your help and time!!!

Tuning the Linux kernel's networking stack to reduce packet drops is a bit involved, as there are a lot of tuning options from the driver all the way up through the networking stack.
I wrote a long blog post that walks through all the tuning parameters from top to bottom and explains what each of the fields in /proc/net/snmp means, so you can figure out why those errors are happening. Take a look; I think it should help you get your network drops down to 0.

If there aren't drops at the hardware level, then it should mostly be a question of memory; you should be able to tweak the kernel configuration parameters to reach 0 drops (obviously you need hardware reasonably balanced for the network traffic you're receiving).
I think you're missing netdev_max_backlog, which is important for incoming packets:
Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them.
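For example (the value 2000 is purely illustrative; the right number depends on your traffic):

sysctl net.core.netdev_max_backlog              # show the current value
sysctl -w net.core.netdev_max_backlog=2000      # raise it for this boot; persist via /etc/sysctl.conf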

InErrors is composed of:
- corrupted packets (incorrect headers or checksum)
- packets dropped because the receive buffer was full
So my guess is that you have fixed the buffer-overflow problem (RcvbufErrors is 0), and what is left are packets with incorrect checksums.
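One way to test that guess: on newer kernels the Udp line in /proc/net/snmp also carries an InCsumErrors column, and the NIC counters can be cross-checked as well (the exact ethtool counter names are driver-specific):

cat /proc/net/snmp | grep Udp:
ethtool -S ethX | grep -i -e err -e csum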

Related

iperf UDP over IPv6

I'm doing some UDP bandwidth tests using iperf (https://iperf.fr/) over IPv6. I have very bad results when using a Linux UDP client with the following command line:
iperf -u -V -c fe80::5910:d4ff:fe31:5b10%usb0 -b 100m
Investigating the issue with Wireshark, I have seen that there is some fragmentation while the client is sending data. To be more precise, I see outgoing UDP client packets of 1510 bytes and 92 bytes, alternating.
For example, the UDP packets that I see have the following pattern (in size): 1510, 92, 1510, 92, 1510, 92,...,1510, 92,...
Reading the iperf2 documentation, I found the following for the -l option:
The length of buffers to read or write. iPerf works by writing an array of len bytes a number of times. Default is 8 KB for TCP, 1470 bytes for UDP. Note for UDP, this is the datagram size and needs to be lowered when using IPv6 addressing to 1450 or less to avoid fragmentation. See also the -n and -t options.
I have tried to do the same bandwidth test by replacing the Linux iperf UDP client command line with the following:
iperf -u -V -c fe80::5910:d4ff:fe31:5b10%usb0 -b 100m -l1450
and I see good results. Looking at the Wireshark capture I see no fragmentation anymore.
Doing the same test over IPv4 I don't need to change the default UDP datagram size (I don't need to use '-l' option) to get good results.
So my conclusion is that fragmentation (over IPv6) is responsible for the poor bandwidth performance.
Anyway, I'm wondering what really happens when setting the UDP datagram size to 1450 over IPv6. Why do I get fragmentation over IPv6 but not over IPv4 with the default UDP datagram size? Moreover, why is there no fragmentation when I reduce the UDP datagram size to 1450?
Thank you.
The base IPv4 header is 20 bytes, the base IPv6 header is 40 bytes and the UDP header is 8 bytes.
With IPv4 the total packet size is 1470+8+20=1498, which is less than the default ethernet MTU of 1500.
With IPv6 the total packet size is 1470+8+40=1518, which is more than 1500 and has to be fragmented.
Now let's look into your observations. You see packets of size 1510 and 92. Those include the ethernet header, which is 14 bytes. Your IPv6 packets are therefore 1496 and 78 bytes. The contents of the big packet are: IPv6 header (40 bytes), a fragmentation header (8), the UDP header (8) and 1440 bytes of data. The smaller packet contains the IPv6 header (40), a fragmentation header (8) and the remaining 30 bytes of data.
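The same arithmetic in a tiny C sketch (assumed standard header sizes; the numbers line up with the iperf behaviour above: the 1470-byte default fits under IPv4's 1472-byte limit but exceeds IPv6's 1452-byte limit, while -l 1450 fits):

#include <stdio.h>

int main(void)
{
    const int mtu = 1500;                  /* typical Ethernet MTU            */
    const int udp_hdr = 8, ipv4_hdr = 20, ipv6_hdr = 40;

    printf("max unfragmented UDP payload over IPv4: %d\n", mtu - ipv4_hdr - udp_hdr);  /* 1472 */
    printf("max unfragmented UDP payload over IPv6: %d\n", mtu - ipv6_hdr - udp_hdr);  /* 1452 */
    return 0;
}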
The most common MTU for Ethernet is 1500, not including the Ethernet frame headers. This means you can send 1500 bytes in one packet over the wire, including IP headers. IPv6 headers are larger than IPv4 headers for several reasons, the most important being that IPv6 addresses are larger than IPv4 addresses. So when you run with the default value over IPv6, your packet size goes over the MTU and the packet needs to be split in two, a procedure known as fragmentation.

UDP tuning on Linux

I have a C application which transmits a UDP stream. It works well on most servers, but it goes crazy on a few servers.
I have a 100 Mbps network connection, say eth1, on the server. Over this connection I usually transmit (TX) UDP streams of around 10-30 Mbps, and the same connection carries around 100-300 Kbps RX to the server. I have another network connection, say eth0, from which the C application receives UDP streams and forwards them to the 100 Mbps connection, eth1.
My application uses the blocking sendto() function to transmit UDP packets on eth1. The packets are of variable length, from 17 bytes up to a maximum of 1333 bytes, but most of the time more than 1000 bytes.
The problem is: sometimes the sendto() call on eth1 blocks for a very long time, around 1 second. This happens once every 30 seconds to 3 minutes. While sendto() is blocked, a lot of UDP packets from eth0 pile up in the kernel's receive buffer, from which the C application reads. Once sendto() returns from the long blocking call on eth1, the C application has a large backlog of packets to transmit, and it sends them all with the next sendto() calls. This creates a spike in rate at the other endpoint receiving the UDP stream from eth1, giving a Z-like rate graph. This Z-like spike in rate is my problem.
I tried increasing wmem_default from around 131 KB to 5 MB in the kernel settings to overcome the spike, and that resolved it: I no longer get the Z-like spike in rate at the other endpoint. But I got a new issue: I now see a lot of packet loss in place of the spike. I think it may be because the send buffer of eth1 accumulates a lot of packets while sending the current packet takes a long time (which may be why sendto blocks for so long). Then, when the NIC sends all the accumulated packets from the send buffer in a short burst, this may congest the network, and I may be losing packets instead of seeing the spike.
So that is the second problem. But I think the root cause is: why does the NIC sometimes pause for a long time while sending traffic, once every 30 seconds to 3 minutes?
Maybe I need to look at the TX ring buffer of the eth1 driver? When the socket send buffer fills up because the NIC is not transmitting in time (due to the random long TX pauses), the next call to sendto blocks waiting for room in the socket send buffer; does it also block waiting for room in the driver's TX ring buffer?
Please don't tell me that UDP is unreliable and we can't control packet loss. I know it's unreliable and that UDP packets can be lost, but I am sure we can still do something to minimize the loss.
EDIT
I tried increasing wmem_default from around 131 KB to 5 MB in the kernel settings to overcome the spike, and I have also removed the blocking sendto call. Now I use sendto(sockfd, buf, len, MSG_DONTWAIT, dest_addr, addrlen); together with the large send buffer from wmem_default. I am not getting any EAGAIN or EWOULDBLOCK errors from sendto thanks to the large send buffer, but packets are still being lost in place of the spike.
EDIT
With the non-blocking sendto call and the huge wmem_default, and with no EAGAIN or EWOULDBLOCK errors from sendto, the spikes are gone, because packets no longer pile up in the eth0 receive buffer. I think that is a possible solution on the application side. But the main question remains: why does the NIC stall every few moments? What could the reasons be? When it resumes from a long TX pause it may have a lot of packets accumulated in the send buffer, which then go out as a burst the next moment, congesting the network and causing a lot of packet loss.
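For what it's worth, here is a sketch of that non-blocking send path with an explicit, bounded wait for buffer space instead of relying solely on a huge wmem_default (send_or_wait is a hypothetical helper name; error handling kept minimal):

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

ssize_t send_or_wait(int fd, const void *buf, size_t len,
                     const struct sockaddr *dst, socklen_t dlen)
{
    for (;;) {
        ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT, dst, dlen);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;                              /* sent, or a genuine error   */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        poll(&pfd, 1, 100);                        /* wait up to 100 ms for room */
    }
}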
More update
I use this same C application to transmit locally on the machine (127.0.0.1), and I never get any spikes or packet-loss problems locally.
The problem is: sometimes the sendto() call on eth1 blocks for a very long time, around 1 second.
Blocking sendto may block, surprisingly.
The problem is: sometimes the sendto() call on eth1 blocks for a very long time, around 1 second.
It could be that the IP stack is performing path MTU discovery:
While MTU discovery is in progress, initial packets from datagram sockets may be dropped. Applications using UDP should be aware of this and not take it into account for their packet retransmit strategy.
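If path MTU discovery is indeed the cause, one thing worth trying (a sketch; whether it helps depends on why the pauses happen) is to turn PMTU discovery off for that socket, so the kernel fragments locally instead of probing. The constants are declared in <netinet/in.h>:

int pmtud = IP_PMTUDISC_DONT;   /* don't set DF; let the kernel fragment if needed */
setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtud, sizeof(pmtud));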
I tried increasing wmem_default from around 131 KB to 5 MB in the kernel settings to overcome the spike.
Be careful with increasing buffer sizes. After a certain limit increasing buffer sizes only increases the amount of queuing and hence delay, leading to the infamous bufferbloat.
You may also want to look at the NIC queueing disciplines (qdiscs); they are responsible for dropping outgoing packets.
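For example, the per-qdisc statistics, including drop counters, can be inspected with (eth1 being the interface from the question):

tc -s qdisc show dev eth1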

How to programmatically increase the per-socket buffer for UDP sockets on Linux?

I'm trying to understand the correct way to increase the socket buffer size on Linux for our streaming network application. The application receives variable bitrate data streamed to it on a number of UDP sockets. The volume of data is substantially higher at the start of the stream and I've used:
# sar -n UDP 1 200
to show that the UDP stack is discarding packets and
# ss -un -pa
to show that each socket's Recv-Q length grows to nearly the limit (124928, from sysctl net.core.rmem_default) before packets are discarded. This implies that the application simply can't keep up with the start of the stream. After discarding enough initial packets, the data rate slows down and the application catches up; Recv-Q trends towards 0 and remains there for the duration.
I'm able to address the packet loss by substantially increasing the rmem_default value which increases the socket buffer size and gives the application time to recover from the large initial bursts. My understanding is that this changes the default allocation for all sockets on the system. I'd rather just increase the allocation for the specific UDP sockets and not modify the global default.
My initial strategy was to modify rmem_max and to use setsockopt(SO_RCVBUF) on each individual socket. However, this question makes me concerned about disabling Linux autotuning for all sockets and not just UDP.
udp(7) describes the udp_mem setting, but I'm confused about how these values interact with rmem_default and rmem_max. The language it uses is "all sockets", so my suspicion is that these settings apply to the UDP stack as a whole and not to individual UDP sockets.
Is udp_rmem_min the setting I'm looking for? It seems to apply to individual sockets but globally to all UDP sockets on the system.
Is there a way to safely increase the socket buffer length for the specific UDP ports used in my application without modifying any global settings?
Thanks.
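One hedged option, assuming the process runs with CAP_NET_ADMIN: SO_RCVBUFFORCE (Linux 2.6.14 and later) raises the buffer for a single socket past rmem_max without touching any global defaults, and you can fall back to plain SO_RCVBUF otherwise:

int sz = 8 * 1024 * 1024;   /* 8 MB, purely illustrative */
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &sz, sizeof(sz)) < 0)
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));   /* capped at rmem_max */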
Jim Gettys is armed and coming for you. Don't go to sleep.
The solution to network packet floods is almost never to increase buffering. Why is your protocol's queueing strategy not backing off? Why can't you just use TCP if you're trying to send so much data in a stream (which is what TCP was designed for)?

UDP IP Fragmentation and MTU

I'm trying to understand some behavior I'm seeing in the context of sending UDP packets.
I have two little Java programs: one that transmits UDP packets, and the other that receives them. I'm running them locally on my network between two computers that are connected via a single switch.
The MTU setting (reported by /sbin/ifconfig) is 1500 on both network adapters.
If I send packets with a size < 1500, I receive them. Expected.
If I send packets with 1500 < size < 24258, I receive them. Expected. I have confirmed via Wireshark that the IP layer is fragmenting them.
If I send packets with size > 24258, they are lost. Not expected. When I run Wireshark on the receiving side, I don't see any of these packets.
I was able to see similar behavior with ping -s.
ping -s 24258 hostA works but
ping -s 24259 hostA fails.
Does anyone understand what may be happening, or have ideas of what I should be looking for?
Both computers are running CentOS 5 64-bit. I'm using a 1.6 JDK, but I don't really think it's a programming problem, it's a networking or maybe OS problem.
Implementations of the IP protocol are not required to be capable of handling arbitrarily large packets. In theory, the maximum possible IP packet size is 65,535 octets, but the standard only requires that implementations support at least 576 octets.
It would appear that your host's implementation supports a maximum size much greater than 576, but still significantly smaller than the maximum theoretical size of 65,535. (I don't think the switch should be a problem, because it shouldn't need to do any defragmentation -- it's not even operating at the IP layer).
The IP standard further recommends that hosts not send packets larger than 576 bytes unless they are certain that the receiving host can handle the larger packet size. You should maybe consider whether it would be better for your program to send a smaller packet size. 24,258 seems awfully large to me; I think there may be a possibility that a lot of hosts won't handle packets that large.
Note that these packet size limits are entirely separate from MTU (the maximum frame size supported by the data link layer protocol).
I found the following which may be of interest:
Determine the maximum size of a UDP datagram packet on Linux
Set the DF bit in the IP header and send continually larger packets to determine at what point a packet is fragmented, as per Path MTU Discovery. Packet fragmentation should then result in an ICMP type 3 packet with code 4, indicating that the packet was too large to be sent without being fragmented.
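A rough sketch of the same idea using the socket API, assuming dest is a filled-in sockaddr_in for the peer (error checks omitted): on a connected UDP socket, Linux reports its current path-MTU estimate through the IP_MTU socket option.

int on = IP_PMTUDISC_DO;                          /* always set the DF bit                */
setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &on, sizeof(on));
connect(fd, (struct sockaddr *)&dest, sizeof(dest));
/* ...send a few datagrams, then ask the kernel what it has learned... */
int mtu = 0;
socklen_t len = sizeof(mtu);
getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len);   /* path MTU as currently known          */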
Dan's answer is useful but note that after headers you're really limited to 65507 bytes.

Discard incoming UDP packet without reading

In some cases, I'd like to explicitly discard packets waiting on the socket with as little overhead as possible. It seems there's no explicit "drop udp buffer" system call, but maybe I'm wrong?
The next best way would probably be to recv the packet into a temporary buffer and just drop it. It seems I can't receive 0 bytes, since the man page says about recv: "The return value will be 0 when the peer has performed an orderly shutdown." So 1 is the minimum in this case.
Is there any other way to handle this?
Just in case - this is not a premature optimisation. The only thing this server does is forward / dispatch UDP packets in a specific way. Although recv with len=1 won't kill me, I'd rather just discard the whole queue in one go with some more specific function (hopefully lowering the latency).
You can have the kernel discard your UDP packets by setting the UDP receive buffer to 0.
int UdpBufSize = 0;
socklen_t optlen = sizeof(UdpBufSize);
setsockopt(socket, SOL_SOCKET, SO_RCVBUF, &UdpBufSize, optlen);
Whenever you see fit to receive packets, you can then set the buffer to, for example, 4096 bytes.
I'd rather just discard the whole queue in one go
Since this is UDP we are talking about here: close(udp_server_socket) and then socket()/bind() again?
To my understanding, that should work.
man says about recv: the return value will be 0 when the peer has performed an orderly shutdown.
That doesn't apply to UDP. There is no "connection" to shut down in UDP. A return value of 0 is perfectly valid; it just means a datagram with no payload was received (i.e. the IP and UDP headers only).
Not sure whether that helps your problem or not. I really don't understand where you are going with the len=1 stuff.
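A hedged alternative to both suggestions above: leave the receive buffer alone and drain whatever is queued with non-blocking reads, stopping as soon as the kernel reports the queue is empty (EAGAIN/EWOULDBLOCK):

char scratch[2048];
while (recv(fd, scratch, sizeof(scratch), MSG_DONTWAIT) >= 0)
    ;   /* discard every queued datagram; loop ends on EAGAIN/EWOULDBLOCK */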
