Linux TCP: high Send-Q on sender, zero Recv-Q on receiver

How can it be that:
There is a TCP socket between two machines
After some successful bidirectional communication, the sender application is stuck writing to the socket and the receiver is stuck reading from it
netstat reports high Send-Q (a few megabytes) for the socket on the sender (and the value does not change even after a couple of hours of waiting)
netstat reports zero Recv-Q for the socket on the receiver
tcpdump reports that the only activity on the socket is a periodic (biminutely) ACK with no data from the sender and an immediate ACK response with no data from the receiver
Why doesn't the sender machine attempt to send queued data to the receiver?
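To watch the queues without netstat, here is a minimal Python sketch that decodes one row of /proc/net/tcp. It assumes the usual Linux column layout, where the fifth column is tx_queue:rx_queue in hex; as far as I know these are the same counters netstat shows as Send-Q and Recv-Q. The sample row is made up for illustration.

```python
import socket
import struct

def parse_proc_net_tcp_row(row):
    """Return ((local_ip, port), (remote_ip, port), tx_queue, rx_queue)."""
    fields = row.split()

    def endpoint(text):
        ip_hex, port_hex = text.split(":")
        # The kernel prints the 32-bit address as a host-order integer,
        # so packing it back in native order restores the address bytes.
        ip = socket.inet_ntoa(struct.pack("=I", int(ip_hex, 16)))
        return ip, int(port_hex, 16)

    tx_hex, rx_hex = fields[4].split(":")
    return endpoint(fields[1]), endpoint(fields[2]), int(tx_hex, 16), int(rx_hex, 16)

# Hypothetical row for illustration (the header line of the file is omitted).
SAMPLE_ROW = ("   1: 0100007F:1F90 0200007F:C350 01 003D0900:00000000 "
              "01:00000014 00000000  1000        0 12345")
```

Running this over the matching row on both machines every few seconds would show whether the sender's tx_queue ever moves.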

In my case, the client was writing data in 8 KB chunks, and the server was reading 8 KB at a time and then writing it to RAID0 disks. When uploading large files I faced a similar situation, and increasing the amount of data read from the socket on the server side helped: I bumped the internal buffer used for reading from the socket from 8 KB to 1 MB. I don't know for sure whether the cause was RAID or TCP, but it is another thing you might want to try.
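A rough Python sketch of the idea, assuming a blocking socket that is read until EOF. The function name and the 1 MiB default are only illustrative; the original server code is not shown in this answer.

```python
import socket

def drain(sock, bufsize=1 << 20):
    """Read until EOF, asking for up to bufsize bytes per recv() call."""
    chunks = []
    while True:
        data = sock.recv(bufsize)  # may return fewer bytes than requested
        if not data:               # empty result means the peer closed
            break
        chunks.append(data)
    return b"".join(chunks)
```

A real server would hand each chunk to the disk writer instead of collecting everything in memory.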

This is more likely caused by some other problem, but the steps below might help if you haven't tried them (these numbers are examples; find your own):
Estimate your sender and receiver file system read/write speeds as well as the network speed, and set an appropriate bandwidth limit in rsync: --bwlimit=1024 (1024 KBps)
If the sender and receiver have dedicated NICs on this local network, do yourself a favor and increase the MTU on these NICs: ifconfig eth1 mtu 65744
Increase the sender transmission queue length: ifconfig eth1 txqueuelen 4096
Increase kernel send/receive memory by adding these lines to /etc/sysctl.conf:
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem=4096 262144 16777216
net.ipv4.tcp_wmem=4096 262144 16777216
Run sysctl -p afterwards.
If you rsync a very large file system, make sure fs.file-max is large enough. To check it: sysctl fs.file-max
To increase it, add the line fs.file-max=327679 to /etc/sysctl.conf,
and as your rsync user run: ulimit -n 327679
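To confirm what the kernel is actually using after sysctl -p, a small Python sketch that reads a setting back from /proc/sys. The helper name and the root parameter are my own additions, and it assumes the usual mapping from dots in the sysctl name to path separators.

```python
from pathlib import Path

def read_sysctl(name, root="/proc/sys"):
    """Return the integer fields of a sysctl, e.g. 'net.ipv4.tcp_rmem'."""
    path = Path(root, *name.split("."))
    return [int(value) for value in path.read_text().split()]
```

For example, read_sysctl("net.ipv4.tcp_wmem") should return the three min/default/max values if the setting took effect.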

I have the same problem. Maybe the conntrack entry has been deleted on the receiver side. Check your
/proc/net/nf_conntrack file: if there is no entry for this socket, that is the problem.
Also see: Connection Reset By Peer - with driver 2.8.0 and mongo 4.0.9 on a k8s cluster
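A minimal Python sketch of that check, assuming the usual key=value layout of /proc/net/nf_conntrack lines. It only compares the first src/dst/sport/dport tuple on each line (the original direction), and the addresses in the sample line are made up.

```python
def conntrack_has_flow(lines, src, dst, sport, dport):
    """True if some line's original-direction tuple matches the given flow."""
    wanted = {"src": src, "dst": dst, "sport": str(sport), "dport": str(dport)}
    for line in lines:
        seen = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            # Keep only the first occurrence of each key: the reply tuple
            # repeats the same keys later on the line.
            if sep and key in wanted and key not in seen:
                seen[key] = value
        if seen == wanted:
            return True
    return False

# Hypothetical entry for illustration.
SAMPLE_LINE = ("ipv4     2 tcp      6 431999 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 "
               "sport=40000 dport=27017 src=10.0.0.2 dst=10.0.0.1 sport=27017 "
               "dport=40000 [ASSURED] mark=0 use=2")
```

In practice you would pass it open("/proc/net/nf_conntrack"), which normally needs root.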

Related

Linux UDP: where does the UDP datagram lose?

I have an application running on a sender and a receiver, using UDP. The UDP buffer size is about 70 or 1024 bytes, so no UDP fragmentation happens.
At the ifconfig/sar level, I do not see significant UDP loss.
But at the application level, I see ~30% loss. I see the same with iperf3/ntttcp-for-Linux/netperf.
Where does the loss happen? Is it caused by UDP datagrams arriving at the IP stack out of order? How can I confirm this assumption?
thanks!

It turned out that the receive buffer was too small.
On the receiver side, netstat -s reports a high "UDP: packet receive errors" count.
The problem was solved by enlarging the receive buffer:
# sysctl -w net.core.rmem_max=33554432
# sysctl -w net.core.rmem_default=33554432
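To track the counter without netstat, a Python sketch that parses the Udp lines of /proc/net/snmp. I am assuming the usual header-line-then-values-line format; as far as I know InErrors is what netstat -s prints as "packet receive errors" and RcvbufErrors is the subset caused by a full receive buffer. The sample numbers are invented.

```python
def udp_counters(snmp_text):
    """Map the Udp counter names in /proc/net/snmp to their integer values."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    header, values = rows[0], rows[1]
    return dict(zip(header[1:], (int(v) for v in values[1:])))

# Hypothetical excerpt for illustration.
SAMPLE_SNMP = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 5 300 900 300 0\n"
)
```

If RcvbufErrors grows in step with InErrors while traffic is flowing, the receive buffer is the likely culprit.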

Linux TCP: packet segmentation?

I am working in a virtualization environment (Linux on Hyper-V). The Linux driver for the virtual NIC supports TSO and GSO (TCP segmentation is ON and generic segmentation is ON).
Now, I create a TCP socket with the send buffer set to 128K.
But based on ifconfig data (TX bytes and TX packets), the average packet size is about 11 K.
So my question is: where are my packets being segmented (from 128K to 11K)? How do I control or configure this via socket options or TCP options?
thanks!
===========EDIT==================
I have an application which can reach 8 Gbps throughput on a 10G network with 32 TCP connections. In this case, the average packet size is about 20 KBytes, which is pretty good. But when I increase the number of TCP connections to 256, the throughput is only about 1 Gbps, as the packet size on the NIC drops to about 3 KBytes.
I know the packet size is critical to performance, as the cost of processing traffic is per packet, not per byte, so bigger packets on the NIC are better.
SO, MY QUESTION IS: how do I increase the TCP packet size? Are there any TCP settings that control this?
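For reference, the "average packet size" figure above is just TX bytes divided by TX packets over an interval. A Python sketch of that arithmetic, assuming the counters are read from /sys/class/net/<interface>/statistics; the function names are illustrative.

```python
from pathlib import Path

def read_tx_counters(ifname, root="/sys/class/net"):
    """Return (tx_bytes, tx_packets) for one interface."""
    stats = Path(root, ifname, "statistics")
    return (int((stats / "tx_bytes").read_text()),
            int((stats / "tx_packets").read_text()))

def average_packet_size(before, after):
    """Average bytes per packet between two (tx_bytes, tx_packets) samples."""
    d_bytes = after[0] - before[0]
    d_packets = after[1] - before[1]
    return d_bytes / d_packets if d_packets else 0.0
```

Sampling read_tx_counters() twice a few seconds apart and passing both tuples to average_packet_size() gives the per-interval value rather than the average since boot.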

Your question seems a little bit confusing, but there are a number of settings that you need to play with to get 10GigE to work right on Linux.
See here:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe/

Setting the socket options SO_SNDBUF and SO_RCVBUF might help, but TCP/IP does not guarantee the chunk sizes in which data is sent or received.
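A short Python sketch of setting both options and reading back what the kernel actually granted. The helper name is mine. On Linux the value read back is typically double the request and is capped by net.core.wmem_max / net.core.rmem_max, so do not expect it to equal what you asked for.

```python
import socket

def set_buffers(sock, sndbuf, rcvbuf):
    """Request buffer sizes and return the (send, receive) sizes in effect."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```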

TCP_MAXSEG inaccurate? (Was: Linux path MTU probing not working on accept():ed socket if requested using setsockopt())

Question update: In addition to the problem below, it seems our client/server application using the Linux PLPMTUD mechanism gets too large a path MTU. Has anyone seen this, i.e. the actual path MTU being 1500, but getsockopt() with TCP_MAXSEG returning the MTUs of the endpoints, in our case 3000? I have tried turning off GRO, GSO and TSO with ethtool, but the error persists. Normal ping only manages to push through packets of 1472 bytes or smaller. Also worth mentioning is that PLPMTUD works perfectly for smaller MTUs. For example, with endpoints at 1500 MTU and one interface of the intermediate router set to e.g. 1200 MTU, the kernel TCP probes and reports the correct TCP_MAXSEG (1200 - headers).
I am using the Linux RFC4821-compliant packetization layer path MTU discovery in an application. Basically, the client does a setsockopt on a TCP socket:
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &sopt, sizeof(sopt));
with option value set to IP_PMTUDISC_PROBE. The setsockopt() does not return an error.
The client sends large TCP packets to a discard server, and the path MTU is calibrated by the Linux kernel: tcpdump shows TCP packets with the DF bit set being sent, and the packet size varies until the kernel knows the path MTU. However, to get this to work in the other direction (a listening server accepting connections from clients, sending data and calibrating the PMTU in the direction from server to client) I have to set the global option for TCP path MTU discovery, /proc/sys/net/ipv4/tcp_mtu_probing. If I do not, the server will stupidly continue to send packets that are too large, which get discarded by an intermediate router without any ICMP being sent back. Both endpoints have an MTU set to 3000, while the intermediate hops have MTU 1500.
I hope someone has an idea of what goes wrong. If more info is needed, let me know and I will edit the question. The problems exist on both Linux kernel 4.2.0 and 3.19.0; both are stock Kubuntu LTS kernels (x86/x86-64).
I do set the same socket option server-side as well, on all accepted sockets, before sending data in the reverse direction.
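For anyone reproducing this from Python, the per-socket request looks roughly like the sketch below. The numeric fallbacks (10 and 3) are the Linux header values as far as I know, used only if the socket module does not export the names; IPPROTO_IP is the same level as SOL_IP in the C call above.

```python
import socket

# Assumption: Linux constant values, used when the module lacks the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_PROBE = getattr(socket, "IP_PMTUDISC_PROBE", 3)

def enable_pmtu_probe(sock):
    """Set IP_PMTUDISC_PROBE on sock and return the mode now in effect."""
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_PROBE)
    return sock.getsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER)
```

On the server side this would be called on each connection object returned by accept(), which is the case that did not behave as expected here.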

FWIW, I have found workarounds/solutions for the problems. I will do more testing, but I'll briefly describe my findings here in case they help someone else.
The problem with not being able to set path mtu discovery per socket was fixed, brute force, by enabling it system-wide during execution of my program, then disabling again.
The second problem, of incorrect path mtu returned by getsockopt TCP_MAXSEG, was fixed by waiting for TCP ACK of sent TCP data, also using getsockopt (tcp_info.tcpi_unacked). That way, I can be sure that probing has finished before I get TCP_MAXSEG.
Finally, there was a patchset for improving path mtu probing accuracy merged to mainline Linux kernel in March 2015. Without those patches, the probing is very imprecise. Patchset is part of 4.1.y-series kernels and onward.
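The second workaround can be sketched in Python roughly as follows. This is not the original code: the offset of tcpi_unacked (24 bytes into struct tcp_info) is my reading of the Linux header and should be checked against your kernel's linux/tcp.h, and the timeout and poll interval are arbitrary.

```python
import socket
import struct
import time

TCPI_UNACKED_OFFSET = 24  # assumption: 8 one-byte fields, then four u32s

def unacked_from_tcp_info(buf):
    """Extract tcpi_unacked from a raw struct tcp_info buffer."""
    return struct.unpack_from("=I", buf, TCPI_UNACKED_OFFSET)[0]

def mss_after_ack(sock, timeout=5.0, poll=0.05):
    """Wait until all sent data is ACKed, then return TCP_MAXSEG (or None)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
        if unacked_from_tcp_info(info) == 0:
            return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
        time.sleep(poll)
    return None  # still unacknowledged data when the timeout expired
```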

UDP packet greater than 1500 bytes dropped

I'm developing a tftp client and server and I want to dynamically select the udp payload size to boost transfer performance.
I have tested it with two linux machines ( one has a gigabit ethernet card, the other a fast ethernet one ). I changed the MTU of the gigabit card to 2048 bytes and left the other to 1500.
I have used setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &optval, sizeof(optval)) to set the MTU_DISCOVER flag to IP_PMTUDISC_DO.
From what I have read, this option should set the DF bit to one, and so it should be possible to find the minimum MTU of the network (the MTU of the host that has the lowest MTU). However, it only gives me an error when I send a packet whose size is bigger than the MTU of the machine from which I'm sending packets.
Also, the other machine (the server in this case) doesn't receive the oversized packets (the server has an MTU of 1500). All the larger UDP packets are dropped; the only way is to send packets of 1472 bytes.
Why do the hosts do this? From what I have read, if I send a packet larger than the MTU, the IP layer should fragment it.
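For reference, the 1472 figure is the 1500-byte MTU minus a 20-byte IPv4 header (no options) and an 8-byte UDP header. A Python sketch of the same setup, where the numeric fallbacks (10 and 2) are the Linux values as far as I know and the helper names are illustrative:

```python
import socket

# Assumption: Linux constant values, used when the module lacks the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes

def max_udp_payload(mtu):
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - UDP_HEADER

def udp_socket_with_df():
    """UDP socket whose datagrams go out with the DF bit set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    return sock
```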

I fail to see the problem. You are setting the "don't fragment" bit, and you send a packet smaller than the sending host's MTU, but larger than the receiving host's MTU. Of course nobody will fragment here (doing so would violate the DF bit). Instead, the sending host should get an ICMP message back.
Edit: IP specifies that an ICMP error message type 3 (destination unreachable) code 4 (Fragmentation Required but DF Bit Is Set) is sent to the originating host at the point where the fragmentation would have occurred. The TCP layer handles this on its own for PMTU discovery. On connection-less sockets, Linux reports the error in the socket's error queue if the IP_RECVERR option is activated; see ip(7).
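A rough Python sketch of reading that error queue, assuming the struct sock_extended_err layout described in ip(7) (u32 errno, u8 origin, u8 type, u8 code, one pad byte, u32 info, u32 data). The numeric fallbacks are the Linux values as far as I know, and the socket must already have IP_RECVERR enabled with setsockopt(IPPROTO_IP, IP_RECVERR, 1).

```python
import socket
import struct

# Assumption: Linux constant values, used when the module lacks the names.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)
MSG_ERRQUEUE = getattr(socket, "MSG_ERRQUEUE", 0x2000)

def parse_extended_err(data):
    """Decode the leading struct sock_extended_err of an IP_RECVERR message."""
    err, origin, icmp_type, icmp_code, info, _data = struct.unpack_from(
        "=IBBBxII", data)
    return {"errno": err, "origin": origin, "type": icmp_type,
            "code": icmp_code, "info": info}

def read_error_queue(sock):
    """Return the decoded errors currently queued on sock (possibly none)."""
    try:
        _payload, ancdata, _flags, _addr = sock.recvmsg(2048, 512, MSG_ERRQUEUE)
    except BlockingIOError:
        return []  # nothing queued
    return [parse_extended_err(cdata) for level, kind, cdata in ancdata
            if level == socket.IPPROTO_IP and kind == IP_RECVERR]
```

For a "fragmentation needed" report, errno should be EMSGSIZE with type 3 and code 4, and per ip(7) the info field carries the discovered MTU.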

That "DF bit" you're setting, stands for "Don't Fragment". The IP layer should not be expected to fragment packets when you've told it not to.

It is not correct to run hosts with different interface MTUs on the same subnet [1].
This is a host/network misconfiguration, and IP path MTU discovery is not expected to work correctly in this situation.
If you wish to test your application's path MTU discovery, you will need to set up multiple subnets connected by a router [2], with different MTUs. In this situation, the router is the device that will pick up the MTU mismatch, and send back an ICMP "Fragmentation Needed" error.
[1] Well, technically, same broadcast domain.
[2] The devices sold as "home routers" are really router/switches - they route between the WAN and the LAN, but switch between the ethernet ports on the LAN. This isn't sufficient to separate networks with different MTUs.

How to set the maximum TCP Maximum Segment Size on Linux?

In Linux, how do you set the maximum segment size that is allowed on a TCP connection? I need to set this for an application I did not write (so I cannot use setsockopt to do it). I need to set this ABOVE the mtu in the network stack.
I have two streams sharing the same network connection. One sends small packets periodically, which need absolute minimum latency. The other sends tons of data--I am using SCP to simulate that link.
I have set up traffic control (tc) to give the minimum latency traffic high priority. The problem I am running into, though, is that the TCP packets coming down from SCP end up with sizes of up to 64K bytes. Yes, these are broken into smaller packets based on the MTU, but this unfortunately occurs AFTER tc prioritizes the packets. Thus, my low latency packet gets stuck behind up to 64K bytes of SCP traffic.
This article indicates that on Windows you can set this value.
Is there something on Linux I can set? I've tried ip route and iptables, but these are applied too low in the network stack. I need to limit the TCP packet size before tc, so it can prioritize the high priority packets appropriately.

Are you using tcp segmentation offload to the nic? (You can use "ethtool -k $your_network_device" to see the offload settings.) This is the only way as far as I know that you would see 64k tcp packets with a device MTU of 1500. Not that this answers the question, but it might help avoid misdiagnosis.

The ip route command with the advmss option lets you set the MSS value:
ip route add 192.168.1.0/24 dev eth0 advmss 1500

The upper bound of the advertised TCP MSS is the MTU of the first hop route. If you're seeing 64k segments, that tends to indicate that the first hop route MTU is excessively large - are you using loopback or something for testing?

MSS = MTU - 40 bytes (standard TCP/IP overhead of 40 bytes [20+20])
If the MTU is 1500 bytes then the MSS will be 1460 bytes.

You are definitely misdiagnosing the problem; as someone else pointed out, tc doesn't see TCP packets, it sees IP packets, and they'd already be in chunks at that point.
You are probably just experiencing bufferbloat: you're overloading your outbound queue in a totally separate device (probably a DSL modem or cable modem). The only fix is to tell tc to limit your outbound bandwidth to less than the modem's bandwidth, e.g. using TBF.
