I am working in a virtualization environment (Linux on Hyper-V). The Linux driver for the virtual NIC supports TSO and GSO (TCP segmentation offload and generic segmentation offload are both ON).
Now I create a TCP socket and set the send buffer to 128 KB.
But based on ifconfig data (TX bytes / TX packets), the average packet size is about 11 KB.
So my question is: where do my packets get segmented (from 128 KB down to 11 KB)? How do I control/configure this via socket options or TCP options?
thanks!
===========EDIT==================
I have an application that can reach 8 Gbps of throughput on a 10G network with 32 TCP connections; in this case the average packet size is about 20 KB, which is pretty good. But when I increase the number of TCP connections to 256, the throughput drops to about 1 Gbps and the packet size on the NIC falls to about 3 KB.
I know the packet size is critical to performance, because the cost of processing traffic is per packet, not per byte, so bigger packets on the NIC are better.
SO, MY QUESTION IS: how do I increase the TCP packet size? Is there any TCP setting that controls this?
Your question is a little confusing, but there are a number of settings you need to tune to get 10GigE working well on Linux.
See here:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe/
Setting the socket options SO_SNDBUF and SO_RCVBUF might help, but TCP/IP does not guarantee the chunk sizes in which data is sent or received.
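A minimal sketch of what that looks like in C, assuming a plain stream socket; the 1 MiB value is only illustrative, and the kernel may double or clamp whatever you ask for (see socket(7)):

/* Sketch only: enlarge the per-socket send/receive buffers.
 * TCP still decides how the byte stream is segmented on the wire. */
#include <stdio.h>
#include <sys/socket.h>

static int enlarge_buffers(int sockfd)
{
    int bufsize = 1 << 20;  /* 1 MiB, illustrative value */

    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF,
                   &bufsize, sizeof(bufsize)) < 0) {
        perror("setsockopt(SO_SNDBUF)");
        return -1;
    }
    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF,
                   &bufsize, sizeof(bufsize)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    return 0;
}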
How can it be that:
There is a TCP socket between two machines
After some successful bidirectional communication, the sender application is stuck writing to the socket and the receiver is stuck reading from it
netstat reports high Send-Q (a few megabytes) for the socket on the sender (and the value does not change even after a couple of hours of waiting)
netstat reports zero Recv-Q for the socket on the receiver
tcpdump reports that the only activity on the socket is a periodic (biminutely) ACK with no data from the sender and an immediate ACK response with no data from the receiver
Why doesn't the sender machine attempt to send queued data to the receiver?
In my case, the client was writing data in chunks of 8 KB, the server was trying to read 8 KB at a time, and the server would then write the data to RAID0 disks. When uploading large files I faced a similar situation, and increasing the amount of data read from the socket on the server side helped: I bumped the internal buffer used for reading from the socket up to 1 MB (from 8 KB). I don't know for sure whether the original limit came from the RAID or from TCP, but it is another thing you might want to try.
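For what it's worth, a minimal sketch in C of the change described above, assuming a plain blocking socket; the 1 MB figure and the write-to-disk step are placeholders for whatever the server actually does with the data:

/* Sketch only: drain the socket in 1 MiB chunks instead of 8 KiB ones,
 * so the kernel's receive buffer (and the TCP window) empties faster. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define READ_BUF_SIZE (1 << 20)   /* 1 MiB instead of the original 8 KiB */

static void drain_socket(int sockfd, int diskfd)
{
    char *buf = malloc(READ_BUF_SIZE);
    ssize_t n;

    if (buf == NULL)
        return;

    while ((n = read(sockfd, buf, READ_BUF_SIZE)) > 0) {
        if (write(diskfd, buf, (size_t)n) != n) {   /* stand-in for the RAID write */
            perror("write");
            break;
        }
    }
    free(buf);
}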
This is more likely caused by another problem, but the steps below might help if you haven't tried them (the numbers are examples; find your own):
Estimate your sender and receiver file-system read/write speeds as well as the network speed, and set an appropriate bandwidth limit in rsync: --bwlimit=1024 (1024 KBps)
If the sender and receiver have dedicated NICs on this local network, do yourself a favor and increase the MTU on those NICs: ifconfig eth1 mtu 65744
Increase the sender's transmit queue length: ifconfig eth1 txqueuelen 4096
Increase the kernel send/receive memory: add these lines to /etc/sysctl.conf:
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem=4096 262144 16777216
net.ipv4.tcp_wmem=4096 262144 16777216
run sysctl -p afterwards.
If you rsync a very large file system, make sure fs.file-max is large enough. To check it: sysctl fs.file-max
To increase it, add the line fs.file-max=327679 to /etc/sysctl.conf.
As your rsync user, run: ulimit -n 327679
I had the same problem; maybe the conntrack entry has been deleted on the receiver side. Check your /proc/net/nf_conntrack file: if there is no information about this socket, that is the problem.
Also see: Connection Reset By Peer - with driver 2.8.0 and mongo 4.0.9 on a k8s cluster
On a Linux system, I use Scapy to send a high-frequency UDP ping: every 20 milliseconds, send a UDP packet, 100 packets in total. But I only get the first few ICMP port-unreachable answers.
from scapy.all import IP, UDP, RandShort, sr   # imports needed to run this
pkt = IP(dst=dst)/UDP(dport=RandShort())        # UDP probe to a random destination port
ans, _ = sr(pkt*100, inter=0.02, timeout=3)     # 100 packets, 20 ms apart, 3 s timeout
I used tcpdump to capture packets and found that all the UDP packets reach the target machine, but only a few ICMP packets come back to the source machine. What would cause this?
If I use an ICMP ping, this does not happen.
I guess:
it may be caused by a kernel parameter on the target machine that controls how ICMP packets are processed
it may be caused by the ICMP routing/switching policies along the path
The rate of ICMP packets is hard-limited by the kernel to prevent DDoS attacks, usually to only about 1 packet per second. It is almost impossible to get anything faster than that from any external (Internet) router.
How can I make a tcpdump capture where it is guaranteed that all the packets that actually pass through the network are captured, with nothing missed?
Details:
We have an issue with a third-party vendor who provides a solution on top of an SCTP stack, which they also implement.
Under quite high throughput (52,000 messages/sec with an average message size of 500 bytes) the SCTP link breaks.
We believe that the bug is in the vendor SCTP stack.
But the vendor says this happens because the SCTP stack sends a message, does not receive an ACK for it, sends a number of retransmits, does not receive ACKs for them either, and then closes the SCTP link.
So the vendor says the network is at fault, because it loses packets.
In the tcpdump captures on both sides, client and server, we see that the original messages reach the server and that the server does not answer with an ACK. But the vendor says the captures are not reliable: when capturing, some packets can be missed, because the libpcap library runs in a single hardware thread, and that may not be enough processing power to log every packet.
Technical data:
52,000 messages/sec with an average message size of 500 bytes, so 26 MB/sec in total; 4 SCTP links are used.
Hardware: CPU E5-2670, 2.6 GHz, 8 HW threads
Network: 10 Gbit; the traffic is between HP blades located in one rack.
RHEL 6.
I'm developing a TFTP client and server, and I want to dynamically select the UDP payload size to boost transfer performance.
I have tested it with two Linux machines (one has a gigabit Ethernet card, the other a Fast Ethernet one). I changed the MTU of the gigabit card to 2048 bytes and left the other at 1500.
I have used setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &optval, sizeof(optval)) to set the MTU_DISCOVER flag to IP_PMTUDISC_DO.
From what I have read, this option should set the DF bit, so it should be possible to find the minimum MTU of the path (the MTU of the hop with the lowest MTU). However, it only gives me an error when I send a packet whose size is bigger than the MTU of the machine I'm sending from.
Also, the other machine (the server in this case) doesn't receive the oversized packets (the server has an MTU of 1500). All such UDP packets are dropped; the only thing that works is sending packets of 1472 bytes.
Why do the hosts do this? From what I have read, if I send a packet larger than the MTU, the IP layer should fragment it.
I fail to see the problem. You are setting the "don't fragment" bit, and you send a packet smaller than the sending host's MTU but larger than the receiving host's MTU. Of course nobody will fragment here (doing so would violate the DF bit). Instead, the sending host should get an ICMP message back.
Edit: IP specifies that an ICMP error message type 3 (destination unreachable) code 4 (Fragmentation Required but DF Bit Is Set) is sent to the originating host at the point where the fragmentation would have occurred. The TCP layer handles this on its own for PMTU discovery. On connection-less sockets, Linux reports the error in the socket's error queue if the IP_RECVERR option is activated; see ip(7).
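A minimal sketch in C of that combination on a datagram socket, per ip(7); the function names are made up for illustration and error handling is abbreviated:

/* Sketch only: enable strict path-MTU discovery (sets DF) and ask for
 * ICMP-derived errors to be reported on the socket's error queue. */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

static void setup_pmtu_discovery(int sockfd)
{
    int val = IP_PMTUDISC_DO;   /* set DF; never fragment locally */
    int on = 1;

    if (setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER,
                   &val, sizeof(val)) < 0)
        perror("setsockopt(IP_MTU_DISCOVER)");

    /* Report "fragmentation needed" and similar errors via the error queue. */
    if (setsockopt(sockfd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on)) < 0)
        perror("setsockopt(IP_RECVERR)");
}

static int query_path_mtu(int sockfd)
{
    /* Valid on connected sockets once the kernel has learned the path MTU. */
    int mtu = 0;
    socklen_t len = sizeof(mtu);

    if (getsockopt(sockfd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0) {
        perror("getsockopt(IP_MTU)");
        return -1;
    }
    return mtu;
}

After a send fails with EMSGSIZE (or an ICMP error arrives), the current path MTU can be read back with IP_MTU as above.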
That "DF bit" you're setting, stands for "Don't Fragment". The IP layer should not be expected to fragment packets when you've told it not to.
It is not correct to run hosts with different interface MTUs on the same subnet [1].
This is a host/network misconfiguration, and IP path MTU discovery is not expected to work correctly in this situation.
If you wish to test your application's path MTU discovery, you will need to set up multiple subnets connected by a router [2], with different MTUs. In this situation, the router is the device that will pick up the MTU mismatch and send back an ICMP "Fragmentation Needed" error.
[1] Well, technically, the same broadcast domain.
[2] The devices sold as "home routers" are really router/switches: they route between the WAN and the LAN, but switch between the Ethernet ports on the LAN. This isn't sufficient to separate networks with different MTUs.
In Linux, how do you set the maximum segment size that is allowed on a TCP connection? I need to set this for an application I did not write (so I cannot use setsockopt to do it). I need to set this ABOVE the MTU in the network stack.
I have two streams sharing the same network connection. One sends small packets periodically, which need absolute minimum latency. The other sends tons of data; I am using SCP to simulate that link.
I have set up traffic control (tc) to give the minimum-latency traffic high priority. The problem I am running into, though, is that the TCP packets coming down from SCP end up with sizes of up to 64 KB. Yes, these are broken into smaller packets based on the MTU, but this unfortunately occurs AFTER tc prioritizes them. Thus, my low-latency packet gets stuck behind up to 64 KB of SCP traffic.
This article indicates that on Windows you can set this value.
Is there something on Linux I can set? I've tried ip route and iptables, but these are applied too low in the network stack. I need to limit the TCP packet size before tc, so it can prioritize the high-priority packets appropriately.
Are you using TCP segmentation offload on the NIC? (You can use "ethtool -k $your_network_device" to see the offload settings.) As far as I know, that is the only way you would see 64 KB TCP packets with a device MTU of 1500. Not that this answers the question, but it might help avoid misdiagnosis.
The ip route command with the advmss option helps to set the MSS value.
ip route add 192.168.1.0/24 dev eth0 advmss 1500
The upper bound of the advertised TCP MSS is the MTU of the first-hop route. If you're seeing 64 KB segments, that tends to indicate that the first-hop route MTU is excessively large; are you using loopback or something similar for testing?
MSS = MTU - 40 bytes (the standard TCP/IP header overhead of 40 bytes [20 + 20])
If the MTU is 1500 bytes then the MSS will be 1460 bytes.
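The question above rules out setsockopt because the application cannot be modified, but for code you do control, the per-socket knob for this is TCP_MAXSEG; a minimal, hedged sketch (the 1400-byte figure is only illustrative, and the kernel may clamp whatever you request):

/* Sketch only: cap the MSS on a socket you control. TCP_MAXSEG should be
 * set before the connection is established to influence the MSS advertised
 * in the SYN; the kernel may clamp the value. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static int cap_mss(int sockfd, int mss)
{
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_MAXSEG,
                   &mss, sizeof(mss)) < 0) {
        perror("setsockopt(TCP_MAXSEG)");
        return -1;
    }
    return 0;   /* e.g. cap_mss(fd, 1400) before connect() or listen() */
}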
You are definitely misdiagnosing the problem; as someone else pointed out, tc doesn't see TCP packets, it sees IP packets, and they'd already be in chunks at that point.
You are probably just experiencing bufferbloat: you're overloading your outbound queue in a totally separate device (probably a DSL or cable modem). The only fix is to tell tc to limit your outbound bandwidth to less than the modem's bandwidth, e.g. using TBF.