unreasonable netperf benchmark results - azure

I used netperf benchmark with the next commands:
server side:
netserver -4 -v -d -N -p
client side:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
And I received the results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 131072 1 1 60.00 9147.83
16384 131072
But when I changed the client to single CPU (same machine) by adding "maxcpus=1 nr_cpus=1" to kernel command line.
And I ran the next command:
netperf -H -p -l 60 -t TCP_RR
I received the next results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 131072 1 1 60.00 10183.33
16384 131072
Q: I don't understand how the performance has been improved when I decreased the CPUs number from 64 to 1 CPU?
Some technique information: I used Standard_L64s_v3 instance type of Azure; OS: sles:15:sp2

• The ‘netperf’ utility command executed by you on the client side is as follows and is the same after changing the number of CPUs on the client side but you can see an improvement in performance after decreasing the number of vCPUs on the client VM: -
netperf -H -p -l 60 -I 1,1 -t TCP_RR
The above command implies that you want to test the network connectivity performance between the host ‘Server’ and ‘Client’ for TCP Request/Response and get the results in a default directory path where pipes will be created for a period of 60 seconds.
• The CPU utilization measurement mechanism uses ‘proc/stat’ on Linux OS to record the time spent for such command executions. The code for this mechanism can be found in ‘src/netcpu_procstat.c’. Thus, you can check the configuration file accordingly.
Also, the CPU utilization mechanism in a virtual guest environment, i.e., a virtual machine may not reflect the actual utilization as in a bare metal environment because much of the networking processing happens outside the context of the virtual machine. Thus, as per the below documentation link by Hewlett-Packard: -
https://hewlettpackard.github.io/netperf/doc/netperf.html
If one is looking to measure the added overhead of a virtualization mechanism, rather than rely on CPU utilization, one can rely instead on netperf _RR tests - path-lengths and overheads can be a significant fraction of the latency, so increases in overhead should appear as decreases in transaction rate. Whatever you do, DO NOT rely on the throughput of a _STREAM test. Achieving link-rate can be done via a multitude of options that mask overhead rather than eliminate it.
As a result, I would suggest you rely on other monitoring tools available in Azure, i.e., Azure Monitor, Application insights, etc.

Looking more closely at your netperf command line:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
The -H option expects to take a hostname as an argument. And the -p option expects to take a port number as an argument. As written the "-p" will be interpreted as a hostname. And when I tried it at least will fail. I assume you've omitted some of the command line?
The -T option will bind where netperf and netserver will run (in this case on vCPU 1 on the netperf side and vCPU 1 on the netserver side) but it will not necessarily control where at least some of the network stack processing will take place. So, in your 64-vCPU setup, the interrupts for the networking traffic and perhaps the stack will run on a different vCPU. In your 1-vCPU setup, everything will be on the one vCPU. It is quite conceivable you are seeing the effects of cache-to-cache transfers in the 64-vCPU case leading to lower transaction/s rates.
Going to multi-processor will increase aggregate performance, but it will not necessarily increase single thread/stream performance. And single thread/stream performance can indeed degrade.

Related

Why isn't increasing the networking buffer sizes reducing packet drops?

Running Ubuntu 18.04.4 LTS
I have a high-bandwidth file transfer application (UDP) that i'm testing locally using the loopback interface.
With no simulated latency, I can transfer a 1GB file at maximum speed with <1% packet loss. To achieve this, I had to increase the networking buffer sizes from ~200KB to 8MB:
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
sudo sysctl -p
For additional testing, I wanted to add a simulated latency of 100ms. This is intended to simulate propagation delay, not queuing delay. I accomplished this using the Linux traffic control (tc) tool:
sudo tc qdisc add dev lo root netem delay 100ms
After adding the latency, packet loss for the 1GB transfer at maximum speed went from <1% to ~97%. In a real network, latency caused by propagation delay shouldn't cause packet loss, so I think the issue is that to simulate latency the kernel would have to store packets in RAM while applying the delay. Since my buffers were only set to 8MB, it made sense that a significant amount of packets would be dropped if simulated latency was added.
I increased my buffer sizes to 50MB:
sudo sysctl -w net.core.rmem_max=52428800
sudo sysctl -w net.core.wmem_max=52428800
sudo sysctl -p
However, there was no noticeable reduction in packet loss. I also attempted 1GB buffer sizes with similar results (my system has >90GB of RAM available).
Why did increasing system network buffer sizes not work in this case?
For some versions of tc, if you do not specify a buffer count limit, tc will default to 1000 buffers.
You can check how many buffers tc is currently using by running:
tc -s qdisc ls dev <device>
For example on my system, where I’ve simulated a 0.1s delay on the eth0 interface I get:
$ tc -s qdisc ls dev eth0
qdisc netem 8024: root refcnt 2 limit 1000 delay 0.1s
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
This shows that I have limit 1000 buffers available to fill during my 0.1s delay period. If I go over this many buffers in my delay timeframe, the system will start dropping packets. Thus this means I have a packet per second (pps) limit of:
pps = buffers / delay
pps = 1000 / 0.1
pps = 10000
If I go beyond this limit, the system will be forced to either drop the incoming packet right away or replace a queued packet, dropping it instead.
Since we don’t normally think of network flows in pps, it’s useful to convert from pps to Bps, KBps, or GBps. This can be done by multiplying by either the network MTU (generally 1500 bytes), the buffer size (varies by system), or ideally by the observed average number of bytes per packet seen by your system on the given interface. Since we don’t know the average bytes per packet, or buffer size of your system at the moment, we’ll fallback to using the typical MTU.
byte rate = pps * bytes per packet
byte rate = 10000pps * 1500 bytes per packet
byte rate = 15000000 Bytes per second
byte rate = 15 MBps
If we are talking about a loopback interface that normally runs at an average of say ~5 Gbps, such as what iperf3 reports for the loopback interface on this MacBook, we can see the problem right away, in that our tc limit of 1.5 MBps is far less than the interface’s practical limit of ~5 GBps.
So if we were transferring a 1GB file over the loopback interface of this system, it should take:
time = file size / byte rate
time = 1Gb / 5GBps
time = 0.2 seconds
To transfer the file across the loopback interface.And the loss, assuming packet size matches buffer size, would be:
packets lost = packets - ((packets that fit in buffers) + (drain rate of buffers * timeframe))
packets lost = (file size / MTU) - ((buffer count) + (drain rate * timeframe))
packets lost = (1 GB / 1500 bytes) - ((10000) + (10000Hz * 0.2 seconds))
packets lost = 654667
And that’s out of:
packets = (file size / MTU)
packets = (1 GB / 1500 bytes)
packets = 666667
So in all that would be a loss percentage of:
loss % = 100 * (lost) / (total)
loss % = 100 * 654667 / 666667
loss % = 98.2%
Which happens to be roughly in line with what you are seeing.
So why didn’t increasing the system buffer size impact your losses? After all the buffer size is part of the computation.
The answer there, is that the method you are using to transmit your file is likely chunking according to it’s best guess at the MTU (likely 1500 bytes), and the packets only make use of the first 1500 bytes of your extra large buffers.
Thus the solution should probably be to increase the number of buffers available to tc instead of increasing the system buffer size. But how many buffers do you need for this link? Based off of this answer the recommendation is to use 150% of the expected number of packets for your delay, so that’s:
buffers = (network rate / avg packet size) * delay * 150%
buffers = (5GBps / 1500B) * 0.1s * 150%
buffers = 333000 * 150%
buffers = 500000
You can see right away that that’s 500 times as many buffers as tc tries to use by default, or to put it another way you only had 2% of the buffers you needed so you saw 98% loss.
Thus to fix your problem, try changing your tc command from something like:
sudo tc qdisc add dev <device> root netem delay 0.1s
To something like:
sudo tc qdisc add dev <device> root netem delay 0.1s limit 500000
To my knowledge, even though its not what you are trying to achieve.. you should probably throtlle up the speed at which you are sending UDP packets because indeed as pointed out by #user3878723 buffers will quickly fill up and packets will be lost. Said differently - quite like #Ron Maupin - when applying delay the interface gets congested. I don't think the emitting process is aware of the 100ms delay so it might overwhelm all available resources quickly.
Instead you may have to tweak something like a Token Bucket Filter (TBF) if you want to go farther in your very use case. Also consider "Rate control".
UPDATE
It could be worth modifying these parameters and make them persistent
net.core.rmem_default
net.core.wmem_default
And/Or make sure you are using correctly these options in your emitter/receiver:
SO_SNDBUF
SO_RCVBUF
So that the whole chain has enough buffer.

How to scale ejabberd Server machine on CentOS to handle 200 K connections?

I am working on a considerably good ejabberd instance with 40 core CPU machine and 160 GB RAM.
The issue is I am unable to scale up to 200 K parallel connections.
The sysctl config is as follows:
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
#http://linux-ip.net/html/ether-arp.html#ether-arp-flux
net.ipv4.conf.all.arp_filter = 1
kernel.exec-shield=1
kernel.randomize_va_space=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_source_route=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.ip_local_port_range = 12000 65535
fs.nr_open = 20000500
fs.file-max = 1000000
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 10000
The /etc/security/limits.conf file entries is as follows:
* soft core 900000
* hard rss 900000
* soft nofile 900000
* hard nofile 900000
* soft nproc 900000
* hard nproc 900000
The machine starts to lose connections when the server reaches around 112 K.
Things that happen around 112 K
The CPU usage goes up to 200 ~ 300 % (but it is the usual spike)
Background - When all things are normal the CPU usage shoots up to 80 % as seen below (only two CPUs are doing actual work)
I am unable to work on the machine. I am using top and ss command to see what is going on the server. The machine just stops responding at this point and the connections begin to drop.
What is a saving grace is that the connections don't drop abruptly, but drop at the rate they are connected.
I am using TSUNG to generate the load. There are 4 load generator boxes hitting 4 different ips mapped to only one machine internally.
Any suggestions, opinions are very welcome.
As the first call you would need to establish what's the bottleneck in your case:
CPU
Memory
System limits (open sockets, open files)
Application architecture
If possible add a resource-tracking application to your node, e.g. recon. It will allow you to check the length of process queues, memory fragmentation, etc. In our production system the amount of memory consumed by Erlang VM was different when reported by the system than when reported by the Erlang VM itself due to Transparent Huge Pages (the system was virtualized). There may be other issues that may not be obvious when inspecting the node using system tools.
So I would propose:
Determine processes with the longest queue sizes - they will be responsible for slowing down the system because Erlang VM needs to scan the whole inbox of a process when it receives a message
Determine processes with the biggest amount of allocated memory
Determine how much memory Erlang itself thinks is allocated
Also, it would be good if you added parameters used to start the Erlang VM.
Addition
Forgot to mention that it may be worth looking at the tuning WhatsApp did to their Erlang nodes to handle hundreds of thousands of simultaneous connections:
The WhatsApp Architecture Facebook Bought For $19 Billion

Node.js struggling with lots of concurrent connections

I'm working on a somewhat unusual application where 10k clients are precisely timed to all try to submit data at once, every 3 mins or so. This 'ab' command fairly accurately simulates one barrage in the real world:
ab -c 10000 -n 10000 -r "http://example.com/submit?data=foo"
I'm using Node.js on Ubuntu 12.4 on a rackspacecloud VPS instance to collect these submissions, however, I'm seeing some very odd behavior from Node, even when I remove all my business logic and turn the http request into a no-op.
When the test gets about 90% done, it hangs for a long period of time. Strangely, this happens consistently at 90% - for c=n=10k, at 9000; for c=n=5k, at 4500; for c=n=2k, at 1800. The test actually completes eventually, often with no errors. But both ab and node logs show continuous processing up till around 80-90% of the test run, then a long pause before completing.
When node is processing requests normally, CPU usage is typically around 50-70%. During the hang period, CPU goes up to 100%. Sometimes it stays near 0. Between the erratic CPU response and the fact that it seems unrelated to the actual number of connections (only the % complete), I do not suspect the garbage collector.
I've tried this running 'ab' on localhost and on a remote server - same effect.
I suspect something related to the TCP stack, possibly involving closing connections, but none of my configuration changes have helped. My changes:
ulimit -n 999999
When I listen(), I set the backlog to 10000
Sysctl changes are:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_orphans = 20000
net.ipv4.tcp_max_syn_backlog = 10000
net.core.somaxconn = 10000
net.core.netdev_max_backlog = 10000
I have also noticed that I tend to get this msg in the kernel logs:
TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
I'm puzzled by this msg since the TCP backlog queue should be deep enough to never overflow. If I disable syn cookies the "Sending cookies" goes to "Dropping connections".
I speculate that this is some sort of linux TCP stack tuning problem and I've read just about everything I could find on the net. Nothing I have tried seems to matter. Any advice?
Update: Tried with tcp_max_syn_backlog, somaxconn, netdev_max_backlog, and the listen() backlog param set to 50k with no change in behavior. Still produces the SYN flood warning, too.
Are you running ab on the same machine running node? If not do you have a 1G or 10G NIC? If you are, then aren't you really trying to process 20,000 open connections?
Also if you are changing net.core.somaxconn to 10,000 you have absolutely no other sockets open on that machine? If you do then 10,000 is not high enough.
Have you tried to use nodejs cluster to spread the number of open connections per process out?
I think you might find this blog post and also the previous ones useful
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/

Traffic shaping with tc is inaccurate with high bandwidth and delay

I'm using tc with kernel 2.6.38.8 for traffic shaping. Limit bandwidth works, adding delay works, but when shaping both bandwidth with delay, the achieved bandwidth is always much lower than the limit if the limit is >1.5 Mbps or so.
Example:
tc qdisc del dev usb0 root
tc qdisc add dev usb0 root handle 1: tbf rate 2Mbit burst 100kb latency 300ms
tc qdisc add dev usb0 parent 1:1 handle 10: netem limit 2000 delay 200ms
Yields a delay (from ping) of 201 ms, but a capacity of just 1.66 Mbps (from iperf). If I eliminate the delay, the bandwidth is precisely 2 Mbps. If I specify a bandwidth of 1 Mbps and 200 ms RTT, everything works. I've also tried ipfw + dummynet, which yields similar results.
I've tried using rebuilding the kernel with HZ=1000 in Kconfig -- that didn't fix the problem. Other ideas?
It's actually not a problem, it behaves just as it should. Because you've added a 200ms latency, the full 2Mbps pipe isn't used at it's full potential. I would suggest you study the TCP/IP protocol in more detail, but here is a short summary of what is happening with iperf: your default window size is maybe 3 packets (likely 1500 bytes each). You fill your pipe with 3 packets, but now have to wait until you get an acknowledgement back (this is part of the congestion control mechanism). Since you delay the sending for 200ms, this will take a while. Now your window size will double in size and you can next send 6 packets, but will again have to wait 200ms. Then the window size doubles again, but by the time your window is completely open, the default 10 second iperf test is close to over and your average bandwidth will obviously be smaller.
Think of it like this:
Suppose you set your latency to 1 hour, and your speed to 2 Mbit/s.
2 Mbit/s requires (for example) 50 Kbit/s for TCP ACKs. Because the ACKs take over a hour to reach the source, then the source can't continue sending at 2 Mbit/s because the TCP window is still stuck waiting on the first acknowledgement.
Latency and bandwidth are more related than you think (in TCP at least. UDP is a different story)

Increasing the maximum number of TCP/IP connections in Linux

I am programming a server and it seems like my number of connections is being limited since my bandwidth isn't being saturated even when I've set the number of connections to "unlimited".
How can I increase or eliminate a maximum number of connections that my Ubuntu Linux box can open at a time? Does the OS limit this, or is it the router or the ISP? Or is it something else?
Maximum number of connections are impacted by certain limits on both client & server sides, albeit a little differently.
On the client side:
Increase the ephermal port range, and decrease the tcp_fin_timeout
To find out the default values:
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
The ephermal port range defines the maximum number of outbound sockets a host can create from a particular I.P. address. The fin_timeout defines the minimum time these sockets will stay in TIME_WAIT state (unusable after being used once).
Usual system defaults are:
net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.tcp_fin_timeout = 60
This basically means your system cannot consistently guarantee more than (61000 - 32768) / 60 = 470 sockets per second. If you are not happy with that, you could begin with increasing the port_range. Setting the range to 15000 61000 is pretty common these days. You could further increase the availability by decreasing the fin_timeout. Suppose you do both, you should see over 1500 outbound connections per second, more readily.
To change the values:
sysctl net.ipv4.ip_local_port_range="15000 61000"
sysctl net.ipv4.tcp_fin_timeout=30
The above should not be interpreted as the factors impacting system capability for making outbound connections per second. But rather these factors affect system's ability to handle concurrent connections in a sustainable manner for large periods of "activity."
Default Sysctl values on a typical Linux box for tcp_tw_recycle & tcp_tw_reuse would be
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=0
These do not allow a connection from a "used" socket (in wait state) and force the sockets to last the complete time_wait cycle. I recommend setting:
sysctl net.ipv4.tcp_tw_recycle=1
sysctl net.ipv4.tcp_tw_reuse=1
This allows fast cycling of sockets in time_wait state and re-using them. But before you do this change make sure that this does not conflict with the protocols that you would use for the application that needs these sockets. Make sure to read post "Coping with the TCP TIME-WAIT" from Vincent Bernat to understand the implications. The net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you. Note that net.ipv4.tcp_tw_recycle has been removed from Linux 4.12.
On the Server Side:
The net.core.somaxconn value has an important role. It limits the maximum number of requests queued to a listen socket. If you are sure of your server application's capability, bump it up from default 128 to something like 128 to 1024. Now you can take advantage of this increase by modifying the listen backlog variable in your application's listen call, to an equal or higher integer.
sysctl net.core.somaxconn=1024
txqueuelen parameter of your ethernet cards also have a role to play. Default values are 1000, so bump them up to 5000 or even more if your system can handle it.
ifconfig eth0 txqueuelen 5000
echo "/sbin/ifconfig eth0 txqueuelen 5000" >> /etc/rc.local
Similarly bump up the values for net.core.netdev_max_backlog and net.ipv4.tcp_max_syn_backlog. Their default values are 1000 and 1024 respectively.
sysctl net.core.netdev_max_backlog=2000
sysctl net.ipv4.tcp_max_syn_backlog=2048
Now remember to start both your client and server side applications by increasing the FD ulimts, in the shell.
Besides the above one more popular technique used by programmers is to reduce the number of tcp write calls. My own preference is to use a buffer wherein I push the data I wish to send to the client, and then at appropriate points I write out the buffered data into the actual socket. This technique allows me to use large data packets, reduce fragmentation, reduces my CPU utilization both in the user land and at kernel-level.
There are a couple of variables to set the max number of connections. Most likely, you're running out of file numbers first. Check ulimit -n. After that, there are settings in /proc, but those default to the tens of thousands.
More importantly, it sounds like you're doing something wrong. A single TCP connection ought to be able to use all of the bandwidth between two parties; if it isn't:
Check if your TCP window setting is large enough. Linux defaults are good for everything except really fast inet link (hundreds of mbps) or fast satellite links. What is your bandwidth*delay product?
Check for packet loss using ping with large packets (ping -s 1472 ...)
Check for rate limiting. On Linux, this is configured with tc
Confirm that the bandwidth you think exists actually exists using e.g., iperf
Confirm that your protocol is sane. Remember latency.
If this is a gigabit+ LAN, can you use jumbo packets? Are you?
Possibly I have misunderstood. Maybe you're doing something like Bittorrent, where you need lots of connections. If so, you need to figure out how many connections you're actually using (try netstat or lsof). If that number is substantial, you might:
Have a lot of bandwidth, e.g., 100mbps+. In this case, you may actually need to up the ulimit -n. Still, ~1000 connections (default on my system) is quite a few.
Have network problems which are slowing down your connections (e.g., packet loss)
Have something else slowing you down, e.g., IO bandwidth, especially if you're seeking. Have you checked iostat -x?
Also, if you are using a consumer-grade NAT router (Linksys, Netgear, DLink, etc.), beware that you may exceed its abilities with thousands of connections.
I hope this provides some help. You're really asking a networking question.
To improve upon the answer given by #derobert,
You can determine what your OS connection limit is by catting nf_conntrack_max. For example:
cat /proc/sys/net/netfilter/nf_conntrack_max
You can use the following script to count the number of TCP connections to a given range of tcp ports. By default 1-65535.
This will confirm whether or not you are maxing out your OS connection limit.
Here's the script.
#!/bin/sh
OS=$(uname)
case "$OS" in
'SunOS')
AWK=/usr/bin/nawk
;;
'Linux')
AWK=/bin/awk
;;
'AIX')
AWK=/usr/bin/awk
;;
esac
netstat -an | $AWK -v start=1 -v end=65535 ' $NF ~ /TIME_WAIT|ESTABLISHED/ && $4 !~ /127\.0\.0\.1/ {
if ($1 ~ /\./)
{sip=$1}
else {sip=$4}
if ( sip ~ /:/ )
{d=2}
else {d=5}
split( sip, a, /:|\./ )
if ( a[d] >= start && a[d] <= end ) {
++connections;
}
}
END {print connections}'
In an application level, here are something a developer can do:
From server side:
Check if load balancer(if you have),works correctly.
Turn slow TCP timeouts into 503 Fast Immediate response, if you load balancer work correctly, it should pick the working resource to serve, and it's better than hanging there with unexpected error massages.
Eg: If you are using node server, u can use toobusy from npm.
Implementation something like:
var toobusy = require('toobusy');
app.use(function(req, res, next) {
if (toobusy()) res.send(503, "I'm busy right now, sorry.");
else next();
});
Why 503? Here are some good insights for overload:
http://ferd.ca/queues-don-t-fix-overload.html
We can do some work in client side too:
Try to group calls in batch, reduce the traffic and total requests number b/w client and server.
Try to build a cache mid-layer to handle unnecessary duplicates requests.
im trying to resolve this in 2022 on loadbalancers and one way I found is to attach another IPv4 (or eventualy IPv6) to NIC, so the limit is now doubled. Of course you need to configure the second IP to the service which is trying to connect to the machine (in my case another DNS entry)

Resources