Linux: TCP mem in /proc/net/sockstat keeps growing - linux

Our system hangs, and the TCP mem value reported by /proc/net/sockstat keeps growing. When the hang occurs, the kernel prints:
"tcp:too many of orphaned sockets"
sockstat shows only a few sockets, yet they consume about 1500 pages of memory. Why?
So I have 2 questions:
How can I find out which process is consuming TCP socket memory?
How can I avoid "tcp:too many of orphaned sockets"?
(1)
~ # cat /proc/net/sockstat
sockets: used 56
TCP: inuse 6 orphan 0 tw 1 alloc 8 mem 1510
UDP: inuse 8 mem 6
UDPLITE: inuse 0
RAW: inuse 4
FRAG: inuse 0 memory 0
(2)
~ # cat /proc/sys/net/ipv4/tcp_mem
900 1200 1800
~ # cat /proc/sys/net/ipv4/tcp_rmem
4096 87380 87380
~ # cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 65536

For #1, memory consumption for sockets is the sum of
the socket descriptors
in-kernel send queues (stuff waiting to be sent out by the NIC)
in-kernel receive queues (stuff that's been received, but hasn't been read by user space yet).
(this post is relevant here)
For your example output from /proc/net/sockstat, the number of sockets is small, so check the size of their send/receive queues. You can do this using commands like netstat -tanp or ss -tp. Keep in mind that the send and receive buffer sizes displayed with e.g. ss -m are maximum values (constrained by tcp_rmem and tcp_wmem), not the currently allocated values.
For #2, this post explains that the "too many orphaned sockets" error is raised when the number of orphans exceeds the value in /proc/sys/net/ipv4/tcp_max_orphans. Some kinds of "bad" sockets are penalized more heavily than others, so you can hit the error even when the orphan count is 2x or 4x below the limit.
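To make the numbers in the question concrete, here is a minimal Python sketch that compares the TCP mem value from /proc/net/sockstat with the three tcp_mem thresholds (low, pressure, high, all counted in pages). The function name tcp_mem_state and the 4096-byte page size are assumptions made for illustration; check the real page size on the target system.

```python
def tcp_mem_state(mem_pages, tcp_mem, page_size=4096):
    """Classify TCP memory use against the tcp_mem (low, pressure, high) thresholds.

    page_size=4096 is an assumption; os.sysconf("SC_PAGE_SIZE") gives the
    real value on the machine being inspected.
    """
    low, pressure, high = tcp_mem
    if mem_pages >= high:
        state = "at or above high limit"
    elif mem_pages >= pressure:
        state = "above pressure threshold"
    elif mem_pages >= low:
        state = "between low and pressure"
    else:
        state = "below low threshold"
    return mem_pages * page_size, state


# Values taken from the question: "mem 1510" and tcp_mem "900 1200 1800".
bytes_used, state = tcp_mem_state(1510, (900, 1200, 1800))
print(bytes_used, state)
```

With the values from the question, TCP is using roughly 6 MB and is already past the pressure threshold, which suggests the tcp_mem limits themselves may be very small for this workload.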

Related

unreasonable netperf benchmark results

I ran the netperf benchmark with the following commands:
server side:
netserver -4 -v -d -N -p
client side:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
And I received the results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 131072 1 1 60.00 9147.83
16384 131072
But then I restricted the client to a single CPU (same machine) by adding "maxcpus=1 nr_cpus=1" to the kernel command line.
And I ran the following command:
netperf -H -p -l 60 -t TCP_RR
I received the following results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 131072 1 1 60.00 10183.33
16384 131072
Q: I don't understand why performance improved when I reduced the number of CPUs from 64 to 1.
Some technical information: I used the Standard_L64s_v3 Azure instance type; OS: sles:15:sp2
• The netperf command you ran on the client side is shown below. It is the same before and after changing the number of CPUs on the client, yet performance improves once the number of vCPUs on the client VM is reduced:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
The above command runs a TCP request/response (TCP_RR) test between the client and the server for a period of 60 seconds.
• On Linux, netperf's CPU utilization measurement reads /proc/stat to record the time spent during a test run. The code for this mechanism is in src/netcpu_procstat.c, which you can inspect to see exactly what is measured.
Also, CPU utilization measured inside a virtual guest (i.e., a virtual machine) may not reflect actual utilization the way it does on bare metal, because much of the network processing happens outside the context of the virtual machine. As the netperf documentation published by Hewlett-Packard puts it:
https://hewlettpackard.github.io/netperf/doc/netperf.html
If one is looking to measure the added overhead of a virtualization mechanism, rather than rely on CPU utilization, one can rely instead on netperf _RR tests - path-lengths and overheads can be a significant fraction of the latency, so increases in overhead should appear as decreases in transaction rate. Whatever you do, DO NOT rely on the throughput of a _STREAM test. Achieving link-rate can be done via a multitude of options that mask overhead rather than eliminate it.
As a result, I would suggest you rely on other monitoring tools available in Azure, i.e., Azure Monitor, Application insights, etc.
Looking more closely at your netperf command line:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
The -H option expects a hostname as its argument, and the -p option expects a port number. As written, "-p" will be interpreted as the hostname, and when I tried it, the command failed. I assume you've omitted some of the command line?
The -T option will bind where netperf and netserver will run (in this case on vCPU 1 on the netperf side and vCPU 1 on the netserver side) but it will not necessarily control where at least some of the network stack processing will take place. So, in your 64-vCPU setup, the interrupts for the networking traffic and perhaps the stack will run on a different vCPU. In your 1-vCPU setup, everything will be on the one vCPU. It is quite conceivable you are seeing the effects of cache-to-cache transfers in the 64-vCPU case leading to lower transaction/s rates.
Going to multi-processor will increase aggregate performance, but it will not necessarily increase single thread/stream performance. And single thread/stream performance can indeed degrade.
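Since the test header reports "first burst 0", only one transaction is outstanding at a time, so the mean round-trip time is simply the inverse of the transaction rate. The short Python sketch below converts the two rates from the question into per-transaction latency; the helper name mean_rtt_us is made up for illustration.

```python
def mean_rtt_us(transactions_per_sec):
    """Mean round-trip time in microseconds for a one-at-a-time TCP_RR test."""
    return 1_000_000 / transactions_per_sec


# Transaction rates copied from the two netperf runs in the question.
for label, rate in (("64 vCPUs", 9147.83), ("1 vCPU", 10183.33)):
    print(f"{label}: {mean_rtt_us(rate):.1f} us per transaction")
```

The gap between the two runs works out to about 11 microseconds per transaction, which is the scale at which cross-CPU wakeups and cache-to-cache transfers can plausibly matter.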

Why isn't increasing the networking buffer sizes reducing packet drops?

Running Ubuntu 18.04.4 LTS
I have a high-bandwidth file transfer application (UDP) that I'm testing locally using the loopback interface.
With no simulated latency, I can transfer a 1GB file at maximum speed with <1% packet loss. To achieve this, I had to increase the networking buffer sizes from ~200KB to 8MB:
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
sudo sysctl -p
For additional testing, I wanted to add a simulated latency of 100ms. This is intended to simulate propagation delay, not queuing delay. I accomplished this using the Linux traffic control (tc) tool:
sudo tc qdisc add dev lo root netem delay 100ms
After adding the latency, packet loss for the 1GB transfer at maximum speed went from <1% to ~97%. In a real network, latency caused by propagation delay shouldn't cause packet loss, so I think the issue is that to simulate latency the kernel would have to store packets in RAM while applying the delay. Since my buffers were only set to 8MB, it made sense that a significant amount of packets would be dropped if simulated latency was added.
I increased my buffer sizes to 50MB:
sudo sysctl -w net.core.rmem_max=52428800
sudo sysctl -w net.core.wmem_max=52428800
sudo sysctl -p
However, there was no noticeable reduction in packet loss. I also attempted 1GB buffer sizes with similar results (my system has >90GB of RAM available).
Why did increasing system network buffer sizes not work in this case?
For some versions of tc, if you do not specify a buffer count limit, tc will default to 1000 buffers.
You can check how many buffers tc is currently using by running:
tc -s qdisc ls dev <device>
For example on my system, where I’ve simulated a 0.1s delay on the eth0 interface I get:
$ tc -s qdisc ls dev eth0
qdisc netem 8024: root refcnt 2 limit 1000 delay 0.1s
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
This shows that I have limit 1000 buffers available to fill during my 0.1s delay period. If I go over this many buffers in my delay timeframe, the system will start dropping packets. Thus this means I have a packet per second (pps) limit of:
pps = buffers / delay
pps = 1000 / 0.1
pps = 10000
If I go beyond this limit, the system will be forced to either drop the incoming packet right away or replace a queued packet, dropping it instead.
Since we don’t normally think of network flows in pps, it’s useful to convert from pps to Bps, KBps, or GBps. This can be done by multiplying by either the network MTU (generally 1500 bytes), the buffer size (varies by system), or ideally by the observed average number of bytes per packet seen by your system on the given interface. Since we don’t know the average bytes per packet, or buffer size of your system at the moment, we’ll fallback to using the typical MTU.
byte rate = pps * bytes per packet
byte rate = 10000pps * 1500 bytes per packet
byte rate = 15000000 Bytes per second
byte rate = 15 MBps
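The same arithmetic can be written as a small Python helper. This is only a sketch of the reasoning above: the function name netem_ceiling is hypothetical, and the 1500-byte packet size is the same MTU assumption used in the text.

```python
def netem_ceiling(limit_packets, delay_ms, packet_bytes=1500):
    """Return (packets/s, bytes/s) a netem queue can carry before it overflows.

    limit_packets is the netem "limit" value, delay_ms the configured delay.
    packet_bytes=1500 assumes every packet is a full typical MTU.
    """
    pps = limit_packets * 1000 // delay_ms
    return pps, pps * packet_bytes


# Default limit of 1000 packets with a 100 ms delay, as in the example above.
pps, byte_rate = netem_ceiling(1000, 100)
print(pps, "packets/s,", byte_rate, "bytes/s")
```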
If we are talking about a loopback interface that normally runs at an average of, say, ~5 GBps, such as what iperf3 reports for the loopback interface on this MacBook, we can see the problem right away: our tc limit of 15 MBps is far less than the interface’s practical limit of ~5 GBps.
So if we were transferring a 1GB file over the loopback interface of this system, it should take:
time = file size / byte rate
time = 1 GB / 5 GBps
time = 0.2 seconds
to transfer the file across the loopback interface. And the loss, assuming packet size matches buffer size, would be:
packets lost = packets - ((packets that fit in buffers) + (drain rate of buffers * timeframe))
packets lost = (file size / MTU) - ((buffer count) + (drain rate * timeframe))
packets lost = (1 GB / 1500 bytes) - ((1000) + (10000Hz * 0.2 seconds))
packets lost = 663667
And that’s out of:
packets = (file size / MTU)
packets = (1 GB / 1500 bytes)
packets = 666667
So in all that would be a loss percentage of:
loss % = 100 * (lost) / (total)
loss % = 100 * 663667 / 666667
loss % = 99.5%
Which happens to be roughly in line with what you are seeing.
So why didn’t increasing the system buffer size impact your losses? After all the buffer size is part of the computation.
The answer is that the method you are using to transmit your file is likely chunking according to its best guess at the MTU (likely 1500 bytes), so the packets only make use of the first 1500 bytes of your extra-large buffers.
Thus the solution should probably be to increase the number of buffers available to tc instead of increasing the system buffer size. But how many buffers do you need for this link? Based off of this answer the recommendation is to use 150% of the expected number of packets for your delay, so that’s:
buffers = (network rate / avg packet size) * delay * 150%
buffers = (5GBps / 1500B) * 0.1s * 150%
buffers = 333000 * 150%
buffers = 500000
You can see right away that that’s 500 times as many buffers as tc tries to use by default, or to put it another way, you only had 0.2% of the buffers you needed, which is why you saw nearly total loss.
Thus to fix your problem, try changing your tc command from something like:
sudo tc qdisc add dev <device> root netem delay 0.1s
To something like:
sudo tc qdisc add dev <device> root netem delay 0.1s limit 500000
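For other link speeds or delays, the sizing rule above can be wrapped in a few lines of Python. This is a sketch under the same assumptions as the answer (1500-byte packets, 150% headroom); the function name netem_limit is invented for illustration, and integer arithmetic is used so the result does not depend on floating-point rounding.

```python
def netem_limit(rate_bytes_per_s, delay_ms, packet_bytes=1500, headroom_pct=150):
    """Packets netem must be able to hold for a given rate and delay.

    Implements: (rate / packet size) * delay * headroom, rounded up.
    packet_bytes and headroom_pct mirror the assumptions made in the text.
    """
    numerator = rate_bytes_per_s * delay_ms * headroom_pct
    denominator = packet_bytes * 1000 * 100  # ms -> s and percent -> fraction
    return -(-numerator // denominator)  # integer ceiling division


# 5 GBps loopback with a 100 ms simulated delay, as in the example above.
limit = netem_limit(5_000_000_000, 100)
print(f"sudo tc qdisc add dev <device> root netem delay 100ms limit {limit}")
```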
To my knowledge, even though it's not what you are trying to achieve, you should probably throttle the rate at which you send UDP packets, because, as pointed out by @user3878723, the buffers will quickly fill up and packets will be lost. Said differently, much as @Ron Maupin put it: when the delay is applied, the interface gets congested. I don't think the sending process is aware of the 100ms delay, so it can quickly overwhelm all available resources.
Instead, you may have to tune something like a Token Bucket Filter (TBF) if you want to go further in your specific use case. Also consider "Rate control".
UPDATE
It could be worth modifying these parameters and making them persistent:
net.core.rmem_default
net.core.wmem_default
And/or make sure you are correctly using these options in your sender/receiver:
SO_SNDBUF
SO_RCVBUF
So that the whole chain has enough buffer.
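As a rough illustration of the SO_SNDBUF / SO_RCVBUF suggestion, the Python sketch below requests buffer sizes on a UDP socket and reads back what the kernel actually granted. The helper name set_udp_buffers and the 8 MB request are illustrative only. On Linux the value read back is typically double the request (the kernel adds bookkeeping overhead) and is capped by net.core.rmem_max / net.core.wmem_max, so reading it back is the only way to know what you got.

```python
import socket


def set_udp_buffers(sock, rcv_bytes, snd_bytes):
    """Request socket buffer sizes and return the (receive, send) sizes granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd_bytes)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
    )


with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    # 8 MB mirrors the rmem_max/wmem_max values from the question.
    granted = set_udp_buffers(s, 8 * 1024 * 1024, 8 * 1024 * 1024)
    print("granted rcv/snd buffer bytes:", granted)
```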

How to scale ejabberd Server machine on CentOS to handle 200 K connections?

I am working on a reasonably powerful ejabberd instance: a machine with 40 CPU cores and 160 GB RAM.
The issue is that I am unable to scale up to 200 K parallel connections.
The sysctl config is as follows:
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
#http://linux-ip.net/html/ether-arp.html#ether-arp-flux
net.ipv4.conf.all.arp_filter = 1
kernel.exec-shield=1
kernel.randomize_va_space=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_source_route=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.ip_local_port_range = 12000 65535
fs.nr_open = 20000500
fs.file-max = 1000000
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 10000
The /etc/security/limits.conf file entries are as follows:
* soft core 900000
* hard rss 900000
* soft nofile 900000
* hard nofile 900000
* soft nproc 900000
* hard nproc 900000
The machine starts to lose connections when the server reaches around 112 K.
Things that happen around 112 K
The CPU usage goes up to 200 ~ 300 % (but that is the usual spike)
Background - When everything is normal, the CPU usage shoots up to 80 % (only two CPUs are doing actual work)
I am unable to work on the machine. I am using the top and ss commands to see what is going on on the server. The machine just stops responding at this point and the connections begin to drop.
The saving grace is that the connections don't drop abruptly, but drop at the rate at which they were connected.
I am using TSUNG to generate the load. There are 4 load generator boxes hitting 4 different ips mapped to only one machine internally.
Any suggestions, opinions are very welcome.
As a first step, you need to establish what the bottleneck is in your case:
CPU
Memory
System limits (open sockets, open files)
Application architecture
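For the "system limits" item, one quick check is to look at the file descriptor limit from inside a process started the same way as the server, because entries in limits.conf are applied at login via PAM and a service started another way may see different values. A minimal Python sketch follows; the function name fd_headroom and the 200,000 target are illustrative assumptions.

```python
import resource


def fd_headroom(target_connections):
    """Return (soft, hard, ok) for RLIMIT_NOFILE as seen by this process.

    ok is True when the soft limit leaves room for target_connections sockets.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    ok = soft == resource.RLIM_INFINITY or soft > target_connections
    return soft, hard, ok


soft, hard, ok = fd_headroom(200_000)
print(f"nofile soft={soft} hard={hard} enough for 200K connections: {ok}")
```

If the soft limit printed here is far below the values set in limits.conf, the limit is not reaching the process, and that would need fixing before any other tuning matters.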
If possible add a resource-tracking application to your node, e.g. recon. It will allow you to check the length of process queues, memory fragmentation, etc. In our production system the amount of memory consumed by Erlang VM was different when reported by the system than when reported by the Erlang VM itself due to Transparent Huge Pages (the system was virtualized). There may be other issues that may not be obvious when inspecting the node using system tools.
So I would propose:
Determine processes with the longest queue sizes - they will be responsible for slowing down the system because Erlang VM needs to scan the whole inbox of a process when it receives a message
Determine processes with the biggest amount of allocated memory
Determine how much memory Erlang itself thinks is allocated
Also, it would be good if you added parameters used to start the Erlang VM.
Addition
Forgot to mention that it may be worth looking at the tuning WhatsApp did to their Erlang nodes to handle hundreds of thousands of simultaneous connections:
The WhatsApp Architecture Facebook Bought For $19 Billion

Get network connections from /proc/net/sockstat

I found this information in /proc, which displays sockets:
$ cat /proc/net/sockstat
sockets: used 8278
TCP: inuse 1090 orphan 2 tw 18 alloc 1380 mem 851
UDP: inuse 6574
RAW: inuse 1
FRAG: inuse 0 memory 0
Can you help me find out what these values mean? Also, are these values reliable enough, or do I need to look somewhere else?
Is there another way to find information about TCP/UDP connections in Linux?
Can you help me find out what these values mean?
As per the code here, inuse is the number of sockets in use (TCP / UDP), and orphan is the number of orphan TCP sockets (sockets that applications no longer hold a handle to, because they have already called close()). tw is the number of TCP sockets in the TIME_WAIT state (the kernel tracks them in a structure named tcp_death_row); they are destroyed once their TIME_WAIT timer expires. alloc is the number of allocated TCP sockets (as I understand it, this covers TCP sockets in all states), and mem is the number of pages allocated by TCP sockets (memory usage).
This article has some discussions around this topic.
In my understanding, /proc/net/sockstat is the most reliable place to look for that information. I often use it myself; when I had a single server managing 1 million simultaneous connections, it was the only place where I could reliably count them.
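If you want these counters in a program rather than on a terminal, the file is easy to parse. The Python sketch below turns the sockstat text into nested dictionaries; parse_sockstat is a hypothetical helper name, and the sample text is the output shown in the question.

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat content into {protocol: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if name.strip() and fields:
            # Fields alternate between counter name and integer value.
            stats[name.strip()] = {
                key: int(value) for key, value in zip(fields[::2], fields[1::2])
            }
    return stats


sample = """sockets: used 8278
TCP: inuse 1090 orphan 2 tw 18 alloc 1380 mem 851
UDP: inuse 6574
RAW: inuse 1
FRAG: inuse 0 memory 0"""

stats = parse_sockstat(sample)
print(stats["TCP"]["inuse"], stats["TCP"]["tw"], stats["UDP"]["inuse"])
```

On a live system the same function can be fed the result of open("/proc/net/sockstat").read().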
You can use the netstat command which itself utilizes the /proc filesystem but prints information more readable for humans.
If you want to display the current tcp connections for example, you can issue the following command:
netstat -t
Check man netstat for the numerous options.

How to find the socket buffer size of linux

What's the default socket buffer size of linux? Is there any command to see it?
If you want to see your buffer size in a terminal, you can take a look at:
/proc/sys/net/ipv4/tcp_rmem (for read)
/proc/sys/net/ipv4/tcp_wmem (for write)
They contain three numbers, which are the minimum, default and maximum memory size values (in bytes), respectively.
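A small Python sketch of reading those three numbers programmatically is shown below. The helper names parse_triple and read_triple are made up for illustration, and the sample string is just an example of what such a file may contain, not a value every system will have.

```python
from pathlib import Path


def parse_triple(text):
    """Split a 'min default max' sysctl string into a labelled dict (bytes)."""
    minimum, default, maximum = (int(x) for x in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def read_triple(name, root="/proc/sys/net/ipv4"):
    """Read e.g. tcp_rmem or tcp_wmem from /proc on a Linux system."""
    return parse_triple(Path(root, name).read_text())


# Example content; on a real system call read_triple("tcp_rmem") instead.
print(parse_triple("4096 87380 16777216"))
```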
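To get the buffer size in a C/C++ program, the flow is as follows:
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
int main(void) {
    int n = 0;
    socklen_t m = sizeof(n); // getsockopt() expects a socklen_t length
    int fdsocket = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); // example
    if (fdsocket < 0 || getsockopt(fdsocket, SOL_SOCKET, SO_RCVBUF, &n, &m) < 0)
        return 1; // socket creation or option lookup failed
    printf("SO_RCVBUF = %d bytes\n", n); // n now holds the receive buffer size
    return 0;
}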
Whilst, as has been pointed out, it is possible to see the current default socket buffer sizes in /proc, it is also possible to check them using sysctl (Note: Whilst the name includes ipv4 these sizes also apply to ipv6 sockets - the ipv6 tcp_v6_init_sock() code just calls the ipv4 tcp_init_sock() function):
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
However, the default socket buffers are just set when the sock is initialised but the kernel then dynamically sizes them (unless set using setsockopt() with SO_SNDBUF). The actual size of the buffers for currently open sockets may be inspected using the ss command (part of the iproute/iproute2 package), which can also provide a bunch more info on sockets like congestion control parameter etc. E.g. To list the currently open TCP (t option) sockets and associated memory (m) information:
ss -tm
Here's some example output:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.168.56.102:ssh 192.168.56.1:56328
skmem:(r0,rb369280,t0,tb87040,f0,w0,o0,bl0,d0)
Here's a brief explanation of skmem (socket memory) - for more info you'll need to look at the kernel sources (i.e. sock.h):
r:sk_rmem_alloc
rb:sk_rcvbuf # current receive buffer size
t:sk_wmem_alloc
tb:sk_sndbuf # current transmit buffer size
f:sk_forward_alloc
w:sk_wmem_queued # persistent transmit queue size
o:sk_omem_alloc
bl:sk_backlog
d:sk_drops
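When collecting these numbers across many sockets, it can help to parse the skmem field rather than read it by eye. Below is a minimal Python sketch; parse_skmem is a hypothetical helper, and it assumes the "skmem:(...)" layout shown in the example output above.

```python
import re


def parse_skmem(line):
    """Extract the skmem:(...) counters from a line of `ss -m` output."""
    match = re.search(r"skmem:\(([^)]*)\)", line)
    if not match:
        return {}
    fields = {}
    for item in match.group(1).split(","):
        # Each item is a short letter tag followed by an integer, e.g. "rb369280".
        parts = re.fullmatch(r"([a-z]+)(\d+)", item.strip())
        if parts:
            fields[parts.group(1)] = int(parts.group(2))
    return fields


mem = parse_skmem("skmem:(r0,rb369280,t0,tb87040,f0,w0,o0,bl0,d0)")
print("rcvbuf:", mem["rb"], "sndbuf:", mem["tb"], "drops:", mem["d"])
```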
I'm still trying to piece together the details, but to add to the answers already given, these are some of the important commands:
cat /proc/sys/net/ipv4/udp_mem
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/ipv4/tcp_rmem
cat /proc/sys/net/ipv4/tcp_wmem
ss -m # see `man ss`
References & help pages:
Man pages
man 7 socket
man 7 udp
man 7 tcp
man ss
https://www.linux.org/threads/how-to-calculate-tcp-socket-memory-usage.32059/
Atomic size is 4096 bytes, max size is 65536 bytes. Sendfile uses 16 pipes each of 4096 bytes size.
To query how many bytes are currently waiting to be read on a descriptor, use: ioctl(fd, FIONREAD, &buff_size).
