TCP receive window size larger than net.core.rmem_max - Linux

I am running iperf measurements between two servers, connected through a 10 Gbit link. I am trying to correlate the maximum window size that I observe with the system configuration parameters.
In particular, I have observed that the maximum window size is 3 MiB. However, I cannot find the corresponding values in the system files.
By running sysctl -a I get the following values:
net.ipv4.tcp_rmem = 4096 87380 6291456
net.core.rmem_max = 212992
The first value tells us that the maximum receiver window size is 6 MiB. However, TCP tends to allocate twice the requested size, so the maximum receiver window size should be 3 MiB, exactly as I have measured it. From man tcp:
Note that TCP actually allocates twice the size of the buffer requested in the setsockopt(2) call, and so a succeeding getsockopt(2) call will not return the same size of buffer as requested in the setsockopt(2) call. TCP uses the extra space for administrative purposes and internal kernel structures, and the /proc file values reflect the larger sizes compared to the actual TCP windows.
However, the second value, net.core.rmem_max, states that the maximum receiver window size cannot be more than 208 KiB. And this is supposed to be the hard limit, according to man tcp:
tcp_rmem
max: the maximum size of the receive buffer used by each TCP socket. This value does not override the global net.core.rmem_max. This is not used to limit the size of the receive buffer declared using SO_RCVBUF on a socket.
So how come I observe a maximum window size larger than the one specified in net.core.rmem_max?
NB: I have also calculated the bandwidth-delay product: window_size = bandwidth x RTT, which is about 3 MiB (10 Gbps x 2 ms RTT), thus verifying my traffic capture.
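For reference, here is a quick sketch of that bandwidth-delay-product calculation (assuming the 10 Gbit/s link and ~2 ms RTT mentioned above):

#include <stdio.h>

int main(void) {
    double bandwidth_bps = 10e9;   /* 10 Gbit/s link */
    double rtt_s = 0.002;          /* ~2 ms round-trip time */

    /* bandwidth-delay product: bits in flight during one RTT, converted to bytes */
    double bdp_bytes = bandwidth_bps * rtt_s / 8.0;

    printf("BDP = %.1f MiB\n", bdp_bytes / (1024.0 * 1024.0));
    /* prints about 2.4 MiB, i.e. on the order of the 3 MiB window I measured */
    return 0;
}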

A quick search turned up:
https://github.com/torvalds/linux/blob/4e5448a31d73d0e944b7adb9049438a09bc332cb/net/ipv4/tcp_output.c
in void tcp_select_initial_window()
if (wscale_ok) {
        /* Set window scaling on max possible window
         * See RFC1323 for an explanation of the limit to 14
         */
        space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
        space = min_t(u32, space, *window_clamp);
        while (space > 65535 && (*rcv_wscale) < 14) {
                space >>= 1;
                (*rcv_wscale)++;
        }
}
max_t takes the higher value of the arguments. So the bigger value takes precedence here.
One other reference to sysctl_rmem_max is made where it is used to limit the argument to SO_RCVBUF (in net/core/sock.c).
All other tcp code refers to sysctl_tcp_rmem only.
So without looking deeper into the code, you can conclude that a bigger net.ipv4.tcp_rmem will override net.core.rmem_max in all cases except when setting SO_RCVBUF (whose check can be bypassed using SO_RCVBUFFORCE).
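To make that concrete, here is a small user-space sketch of the selection loop above (not the kernel code itself; the *window_clamp step is ignored for simplicity), fed with the sysctl values from the question:

#include <stdio.h>

int main(void) {
    unsigned int tcp_rmem_max  = 6291456;  /* net.ipv4.tcp_rmem, 3rd value */
    unsigned int core_rmem_max = 212992;   /* net.core.rmem_max */

    /* max_t: the larger of the two wins, so tcp_rmem[2] is used here */
    unsigned int space = tcp_rmem_max > core_rmem_max ? tcp_rmem_max : core_rmem_max;
    unsigned int wscale = 0;

    while (space > 65535 && wscale < 14) {
        space >>= 1;
        wscale++;
    }

    printf("selected window scale: %u\n", wscale);  /* prints 7 for these values */
    return 0;
}

With a scale of 7 the advertised 16-bit window can cover up to about 8 MiB, which is why a window around 3 MiB (half of the 6 MiB buffer, per the doubling described above) shows up in the capture despite the much smaller rmem_max.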

net.ipv4.tcp_rmem takes precedence over net.core.rmem_max according to https://serverfault.com/questions/734920/difference-between-net-core-rmem-max-and-net-ipv4-tcp-rmem:
It seems that the tcp setting will take precedence over the common max setting
But I agree with what you say, this seems to conflict with what's written in man tcp, and I can reproduce your findings. Maybe the documentation is wrong? Please find out and comment!

Related

Why isn't increasing the networking buffer sizes reducing packet drops?

Running Ubuntu 18.04.4 LTS
I have a high-bandwidth file transfer application (UDP) that I'm testing locally using the loopback interface.
With no simulated latency, I can transfer a 1GB file at maximum speed with <1% packet loss. To achieve this, I had to increase the networking buffer sizes from ~200KB to 8MB:
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
sudo sysctl -p
For additional testing, I wanted to add a simulated latency of 100ms. This is intended to simulate propagation delay, not queuing delay. I accomplished this using the Linux traffic control (tc) tool:
sudo tc qdisc add dev lo root netem delay 100ms
After adding the latency, packet loss for the 1GB transfer at maximum speed went from <1% to ~97%. In a real network, latency caused by propagation delay shouldn't cause packet loss, so I think the issue is that to simulate latency the kernel would have to store packets in RAM while applying the delay. Since my buffers were only set to 8MB, it made sense that a significant amount of packets would be dropped if simulated latency was added.
I increased my buffer sizes to 50MB:
sudo sysctl -w net.core.rmem_max=52428800
sudo sysctl -w net.core.wmem_max=52428800
sudo sysctl -p
However, there was no noticeable reduction in packet loss. I also attempted 1GB buffer sizes with similar results (my system has >90GB of RAM available).
Why did increasing system network buffer sizes not work in this case?
For some versions of tc, if you do not specify a buffer count limit, tc will default to 1000 buffers.
You can check how many buffers tc is currently using by running:
tc -s qdisc ls dev <device>
For example on my system, where I’ve simulated a 0.1s delay on the eth0 interface I get:
$ tc -s qdisc ls dev eth0
qdisc netem 8024: root refcnt 2 limit 1000 delay 0.1s
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
This shows that I have a limit of 1000 buffers available to fill during my 0.1 s delay period. If I go over this many buffers in my delay timeframe, the system will start dropping packets. This means I have a packets-per-second (pps) limit of:
pps = buffers / delay
pps = 1000 / 0.1
pps = 10000
If I go beyond this limit, the system will be forced to either drop the incoming packet right away or replace a queued packet, dropping it instead.
Since we don’t normally think of network flows in pps, it’s useful to convert from pps to Bps, KBps, or GBps. This can be done by multiplying by either the network MTU (generally 1500 bytes), the buffer size (varies by system), or ideally by the observed average number of bytes per packet seen by your system on the given interface. Since we don’t know the average bytes per packet, or buffer size of your system at the moment, we’ll fallback to using the typical MTU.
byte rate = pps * bytes per packet
byte rate = 10000pps * 1500 bytes per packet
byte rate = 15000000 Bytes per second
byte rate = 15 MBps
If we are talking about a loopback interface that normally runs at an average of, say, ~5 GBps, such as what iperf3 reports for the loopback interface on this MacBook, we can see the problem right away: our tc limit of 15 MBps is far less than the interface’s practical limit of ~5 GBps.
So if we were transferring a 1GB file over the loopback interface of this system, it should take:
time = file size / transfer rate
time = 1 GB / 5 GBps
time = 0.2 seconds
to transfer the file across the loopback interface. And the loss, assuming the packet size matches the buffer size, would be:
packets lost = packets - ((packets that fit in buffers) + (drain rate of buffers * timeframe))
packets lost = (file size / MTU) - ((buffer count) + (drain rate * timeframe))
packets lost = (1 GB / 1500 bytes) - ((1000) + (10000Hz * 0.2 seconds))
packets lost = 663667
And that’s out of:
packets = (file size / MTU)
packets = (1 GB / 1500 bytes)
packets = 666667
So in all that would be a loss percentage of:
loss % = 100 * (lost) / (total)
loss % = 100 * 663667 / 666667
loss % = 99.55%
Which happens to be roughly in line with what you are seeing.
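As a rough cross-check, the whole back-of-the-envelope estimate can be reproduced in a few lines of C (same simplifying assumptions as above: 1500-byte packets, the default netem limit of 1000, and a transfer lasting about 0.2 s):

#include <stdio.h>

int main(void) {
    double file_size   = 1e9;    /* 1 GB file */
    double packet_size = 1500;   /* assume MTU-sized packets */
    double netem_limit = 1000;   /* default netem queue limit, in packets */
    double delay_s     = 0.1;    /* simulated delay */
    double xfer_time_s = 0.2;    /* approximate transfer time at ~5 GBps */

    double pps_limit = netem_limit / delay_s;                  /* sustainable packet rate */
    double packets   = file_size / packet_size;                /* packets to send */
    double kept      = netem_limit + pps_limit * xfer_time_s;  /* queued + drained in time */
    double lost      = packets - kept;

    printf("pps limit %.0f, packets %.0f, lost %.0f (%.2f%%)\n",
           pps_limit, packets, lost, 100.0 * lost / packets);
    /* prints a loss of roughly 99.5%, in the same ballpark as the ~97% observed */
    return 0;
}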
So why didn’t increasing the system buffer size impact your losses? After all, the buffer size is part of the computation.
The answer is that the method you are using to transmit your file is likely chunking according to its best guess at the MTU (likely 1500 bytes), and the packets only make use of the first 1500 bytes of your extra-large buffers.
Thus the solution should probably be to increase the number of buffers available to tc instead of increasing the system buffer size. But how many buffers do you need for this link? Based off of this answer the recommendation is to use 150% of the expected number of packets for your delay, so that’s:
buffers = (network rate / avg packet size) * delay * 150%
buffers = (5GBps / 1500B) * 0.1s * 150%
buffers = 333000 * 150%
buffers = 500000
You can see right away that that’s 500 times as many buffers as tc tries to use by default; to put it another way, you had only about 0.2% of the buffers you needed, which is why you lost nearly every packet.
Thus to fix your problem, try changing your tc command from something like:
sudo tc qdisc add dev <device> root netem delay 0.1s
To something like:
sudo tc qdisc add dev <device> root netem delay 0.1s limit 500000
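If you want to derive that limit for other rates or delays, the same 150% rule is easy to sketch (the 5 GBps figure is just the loopback estimate used above, and 1500 bytes per packet is an assumption):

#include <stdio.h>

int main(void) {
    double link_rate_Bps = 5e9;    /* estimated loopback throughput, bytes per second */
    double packet_size   = 1500;   /* assumed average bytes per packet */
    double delay_s       = 0.1;    /* netem delay */
    double headroom      = 1.5;    /* the 150% safety margin */

    double limit = (link_rate_Bps / packet_size) * delay_s * headroom;

    printf("suggested netem limit: %.0f packets\n", limit);  /* about 500000 */
    return 0;
}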
To my knowledge, even though it's not what you are trying to achieve, you should probably throttle the rate at which you are sending UDP packets because, as pointed out by @user3878723, buffers will quickly fill up and packets will be lost. Said differently, much like @Ron Maupin pointed out, when applying delay the interface gets congested. I don't think the emitting process is aware of the 100 ms delay, so it might overwhelm all available resources quickly.
Instead, you may have to tweak something like a Token Bucket Filter (TBF) if you want to go further in your particular use case. Also consider "rate control".
UPDATE
It could be worth modifying these parameters and making them persistent:
net.core.rmem_default
net.core.wmem_default
And/or make sure you are using these options correctly in your emitter/receiver:
SO_SNDBUF
SO_RCVBUF
so that the whole chain has enough buffer space.
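For example, a minimal sketch of requesting larger buffers on a UDP socket (the 8 MB figure simply mirrors the sysctl values used earlier; the effective size is still capped by net.core.wmem_max / net.core.rmem_max unless those are raised as well):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int size = 8 * 1024 * 1024;   /* example: request 8 MB in each direction */

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) == -1)
        perror("SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) == -1)
        perror("SO_RCVBUF");
    return 0;
}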

Setting socket receive buffer size gets truncated to 244 KB

I'm trying to increase the size of my socket receive buffer using setsockopt() on Linux. I can set it successfully to any value below 244 KB. Any value above 244 KB gets truncated to 244 KB.
There appears to be some sort of system limit in place, but I can't figure out where it is coming from, as it doesn't correspond to the values below:
$ cat /proc/sys/net/ipv4/tcp_rmem
4096 87380 4194304
$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 4194304
$ cat /proc/sys/net/core/rmem_default
124928
$ cat /proc/sys/net/core/wmem_default
124928
The default value is 87380, as expected, but I can't increase it to 4194304; it gets limited to 244 KB. Interestingly, that value is 2x rmem_default; do I need to change that?
Thanks
From man page for TCP:
The maximum sizes for socket buffers declared via the SO_SNDBUF and SO_RCVBUF mechanisms are limited by the global net.core.rmem_max and net.core.wmem_max sysctls. Note that TCP actually allocates twice the size of the buffer requested in the setsockopt(2) call, and so a succeeding getsockopt(2) call will not return the same size of buffer as requested in the setsockopt(2) call.
So what you pass for SO_SNDBUF/SO_RCVBUF gets doubled during allocation, and the requested value itself is capped by net.core.rmem_max/net.core.wmem_max, so you cannot reach the tcp_rmem maximum (4194304) via setsockopt.
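You can see both effects (the rmem_max cap and the doubling) with a short test program along these lines:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 4194304;   /* ask for tcp_rmem's maximum */
    int actual = 0;
    socklen_t len = sizeof(actual);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);

    /* On a box like the one above this prints roughly twice net.core.rmem_max,
     * not 2 * 4194304: the request is capped at rmem_max first and then doubled. */
    printf("requested %d, got %d\n", requested, actual);
    return 0;
}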

How to find the socket buffer size of linux

What's the default socket buffer size on Linux? Is there any command to see it?
If you want to see your buffer sizes in the terminal, you can take a look at:
/proc/sys/net/ipv4/tcp_rmem (for read)
/proc/sys/net/ipv4/tcp_wmem (for write)
They contain three numbers, which are minimum, default and maximum memory size values (in byte), respectively.
To get the buffer size in a C/C++ program, the flow is as follows:
#include <sys/socket.h>

int n;
socklen_t m = sizeof(n);   // getsockopt() expects a socklen_t length
int fdsocket;

fdsocket = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);  // example: a UDP socket
getsockopt(fdsocket, SOL_SOCKET, SO_RCVBUF, (void *)&n, &m);
// now the variable n holds the receive buffer size, in bytes
Whilst, as has been pointed out, it is possible to see the current default socket buffer sizes in /proc, it is also possible to check them using sysctl (Note: Whilst the name includes ipv4 these sizes also apply to ipv6 sockets - the ipv6 tcp_v6_init_sock() code just calls the ipv4 tcp_init_sock() function):
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
However, the default socket buffers are just set when the sock is initialised; the kernel then dynamically sizes them (unless fixed using setsockopt() with SO_SNDBUF/SO_RCVBUF). The actual size of the buffers for currently open sockets may be inspected using the ss command (part of the iproute/iproute2 package), which can also provide a bunch more info on sockets, like congestion control parameters etc. E.g. to list the currently open TCP (t option) sockets and associated memory (m) information:
ss -tm
Here's some example output:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.168.56.102:ssh 192.168.56.1:56328
skmem:(r0,rb369280,t0,tb87040,f0,w0,o0,bl0,d0)
Here's a brief explanation of skmem (socket memory) - for more info you'll need to look at the kernel sources (i.e. sock.h):
r:sk_rmem_alloc
rb:sk_rcvbuf # current receive buffer size
t:sk_wmem_alloc
tb:sk_sndbuf # current transmit buffer size
f:sk_forward_alloc
w:sk_wmem_queued # persistent transmit queue size
o:sk_omem_alloc
bl:sk_backlog
d:sk_drops
I'm still trying to piece together the details, but to add to the answers already given, these are some of the important commands:
cat /proc/sys/net/ipv4/udp_mem
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/ipv4/tcp_rmem
cat /proc/sys/net/ipv4/tcp_wmem
ss -m # see `man ss`
References & help pages:
Man pages
man 7 socket
man 7 udp
man 7 tcp
man ss
https://www.linux.org/threads/how-to-calculate-tcp-socket-memory-usage.32059/
The atomic (pipe) size is 4096 bytes and the max size is 65536 bytes; sendfile uses 16 pipes of 4096 bytes each.
Command: ioctl(fd, FIONREAD, &buff_size) (note that FIONREAD reports the number of bytes currently waiting to be read, not the buffer's total capacity).

What are the units of UDP buffers, and where are docs for sysctl params?

I'm running x86_64 RedHat 5.3 (kernel 2.6.18) and looking specifically at net.core.rmem_max from sysctl -a in the context of trying to set UDP buffers. The receiver application misses packets sometimes, but I think the buffer is already plenty large, depending upon what it means:
What are the units of this setting -- bits, bytes, packets, or pages? If bits or bytes, is it the datagram payload (such as 100 bytes) or the network MTU size (~1500 bytes)? If pages, what's the page size in bytes?
And is this the max per system, per physical device (NIC), per virtual device (VLAN), per process, per thread, per socket/ per multicast group?
For example, suppose my data is 100 bytes per message, and each network packet holds 2 messages, and I want to be able to buffer 50,000 messages per socket, and I open 3 sockets per thread on each of 4 threads. How big should net.core.rmem_max be? Likewise, when I set socket options inside the application, are the units payload bytes, so 5000000 on each socket in this case?
Finally, in general, how would I find details of the units for the parameters I see via sysctl -a? I have similar units and per-X questions about other parameters such as net.core.netdev_max_backlog and net.ipv4.igmp_max_memberships.
Thank you.
You'd look at these docs. That said, many of these parameters really are quite poorly documented, so do expect to do some googling to dig out the gory details from blogs and mailing lists.
rmem_max is the per-socket maximum buffer, in bytes. Digging around, this appears to be the memory where whole packets are received, so the size has to include the IP/UDP headers (and other per-packet overhead) as well - though this area is quite fuzzy to me.
Keep in mind, though, that UDP is unreliable. There are a lot of sources of loss, not least the switches and routers in between - these have buffers as well.
It is fully documented in the socket(7) man page (it is in bytes).
Moreover, the limit may be set on a per-socket basis with SO_RCVBUF (as documented in the same page).
Read the socket(7), ip(7) and udp(7) man pages for information on how these things actually work. The sysctls are documented there.
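Putting the two answers together for the example in the question: SO_RCVBUF/rmem_max are bytes per socket, so the 12 sockets don't multiply the limit (each socket can be set up to it), but the figure has to cover whole packets, not just payload. A rough sizing sketch, where the 28-byte IPv4+UDP header overhead is my assumption:

#include <stdio.h>

int main(void) {
    long messages   = 50000;   /* messages to buffer per socket (from the question) */
    long msg_bytes  = 100;     /* payload bytes per message */
    long per_packet = 2;       /* messages carried per network packet */
    long hdr_bytes  = 28;      /* assumed IPv4 (20) + UDP (8) header bytes per packet */

    long packets = messages / per_packet;
    long payload = messages * msg_bytes;            /* 5,000,000 bytes */
    long on_wire = payload + packets * hdr_bytes;   /* plus protocol headers */

    /* The kernel also charges per-packet bookkeeping (skb) overhead against the
     * buffer, so treat this as a lower bound and add generous headroom. */
    printf("payload %ld B, with headers at least %ld B per socket\n", payload, on_wire);
    return 0;
}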

Specifying UDP receive buffer size at runtime in Linux

In Linux, one can specify the system's default receive buffer size for network packets, say UDP, using the following commands:
sysctl -w net.core.rmem_max=<value>
sysctl -w net.core.rmem_default=<value>
But I wonder: is it possible for an application (say, in C) to override the system's defaults by specifying the receive buffer size per UDP socket at runtime?
You can increase the value from the default, but you can't increase it beyond the maximum value. Use setsockopt to change the SO_RCVBUF option:
int n = 1024 * 1024;
if (setsockopt(socket, SOL_SOCKET, SO_RCVBUF, &n, sizeof(n)) == -1) {
// deal with failure, or ignore if you can live with the default size
}
Note that this is the portable solution; it should work on any POSIX platform for increasing the receive buffer size. Linux has had autotuning for a while now (since 2.6.7, and with reasonable maximum buffer sizes since 2.6.17), which automatically adjusts the receive buffer size based on load. On kernels with autotuning, it is recommended that you not set the receive buffer size using setsockopt, as that will disable the kernel's autotuning. Using setsockopt to adjust the buffer size may still be necessary on other platforms, however.
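If you genuinely need a buffer larger than net.core.rmem_max without touching the sysctl, Linux also provides SO_RCVBUFFORCE (mentioned earlier), which bypasses the rmem_max cap but requires the CAP_NET_ADMIN capability. A minimal Linux-only sketch (the same caveat about disabling autotuning applies):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int n = 64 * 1024 * 1024;   /* example: 64 MB, above a typical rmem_max */

    /* Needs CAP_NET_ADMIN (e.g. run as root); otherwise this fails with EPERM
     * and you are back to the plain SO_RCVBUF path shown above. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &n, sizeof(n)) == -1)
        perror("SO_RCVBUFFORCE");
    return 0;
}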
