Traffic shaping with tc is inaccurate with high bandwidth and delay

I'm using tc with kernel 2.6.38.8 for traffic shaping. Limiting bandwidth works, adding delay works, but when shaping both bandwidth and delay, the achieved bandwidth is always much lower than the limit if the limit is >1.5 Mbps or so.
Example:
tc qdisc del dev usb0 root
tc qdisc add dev usb0 root handle 1: tbf rate 2Mbit burst 100kb latency 300ms
tc qdisc add dev usb0 parent 1:1 handle 10: netem limit 2000 delay 200ms
Yields a delay (from ping) of 201 ms, but a capacity of just 1.66 Mbps (from iperf). If I eliminate the delay, the bandwidth is precisely 2 Mbps. If I specify a bandwidth of 1 Mbps and 200 ms RTT, everything works. I've also tried ipfw + dummynet, which yields similar results.
I've tried rebuilding the kernel with HZ=1000 in Kconfig -- that didn't fix the problem. Other ideas?

It's actually not a problem; it behaves just as it should. Because you've added 200ms of latency, the 2Mbps pipe isn't used to its full potential. I would suggest you study the TCP/IP protocol in more detail, but here is a short summary of what is happening with iperf: your default window size is maybe 3 packets (likely 1500 bytes each). You fill your pipe with 3 packets, but now have to wait until you get an acknowledgement back (this is part of the congestion control mechanism). Since you delay the sending by 200ms, this will take a while. Now your window size doubles and you can send 6 packets, but you will again have to wait 200ms. Then the window size doubles again, but by the time your window is completely open, the default 10 second iperf test is close to over and your average bandwidth will obviously be smaller.
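The ramp-up described above can be sketched in a few lines of Python. This is only an idealized model, assuming the window doubles exactly once per round trip with no loss, an initial window of 3 segments of 1500 bytes (the figures guessed in the answer), and the 2 Mbit/s and 200 ms values from the question. The function name is made up for illustration.

```python
def rtts_to_fill_pipe(rate_bps, rtt_s, init_segments=3, mss_bytes=1500):
    """Count idealized slow-start round trips until the window covers the pipe."""
    bdp_bytes = rate_bps * rtt_s / 8           # bandwidth-delay product in bytes
    window_bytes = init_segments * mss_bytes   # assumed initial window
    rounds = 0
    while window_bytes < bdp_bytes:
        window_bytes *= 2                      # idealized doubling per round trip
        rounds += 1
    return rounds, bdp_bytes


rounds, bdp = rtts_to_fill_pipe(2_000_000, 0.2)
print(f"pipe holds about {bdp:.0f} bytes; window covers it after {rounds} round trips")
```

Real TCP stacks differ in initial window, growth rule, and receive-window limits, so treat the result as a rough intuition rather than a prediction of what iperf reports.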

Think of it like this:
Suppose you set your latency to 1 hour, and your speed to 2 Mbit/s.
2 Mbit/s requires (for example) 50 Kbit/s for TCP ACKs. Because the ACKs take over an hour to reach the source, the source can't continue sending at 2 Mbit/s: the TCP window is still stuck waiting on the first acknowledgement.
Latency and bandwidth are more related than you think (in TCP at least; UDP is a different story).
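One way to put numbers on that relationship: a sender that may have at most one window of unacknowledged data in flight cannot exceed window / RTT. The Python sketch below assumes that simple model. The 16 KiB window is a hypothetical value chosen for illustration, not something measured in the question.

```python
def max_tcp_throughput_bps(window_bytes, rtt_s):
    """Upper bound on throughput with at most one window in flight per RTT."""
    return window_bytes * 8 / rtt_s


def window_needed_bytes(rate_bps, rtt_s):
    """Window required to keep a link of rate_bps busy at the given RTT."""
    return rate_bps * rtt_s / 8


# Hypothetical 16 KiB window over the 200 ms path from the question.
print(max_tcp_throughput_bps(16 * 1024, 0.2), "bit/s at most")
# Window needed to fill 2 Mbit/s at 200 ms.
print(window_needed_bytes(2_000_000, 0.2), "bytes needed")
```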

Related

Why isn't increasing the networking buffer sizes reducing packet drops?

Running Ubuntu 18.04.4 LTS
I have a high-bandwidth file transfer application (UDP) that I'm testing locally using the loopback interface.
With no simulated latency, I can transfer a 1GB file at maximum speed with <1% packet loss. To achieve this, I had to increase the networking buffer sizes from ~200KB to 8MB:
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
sudo sysctl -p
For additional testing, I wanted to add a simulated latency of 100ms. This is intended to simulate propagation delay, not queuing delay. I accomplished this using the Linux traffic control (tc) tool:
sudo tc qdisc add dev lo root netem delay 100ms
After adding the latency, packet loss for the 1GB transfer at maximum speed went from <1% to ~97%. In a real network, latency caused by propagation delay shouldn't cause packet loss, so I think the issue is that, to simulate latency, the kernel has to store packets in RAM while applying the delay. Since my buffers were only set to 8MB, it made sense that a significant number of packets would be dropped once simulated latency was added.
I increased my buffer sizes to 50MB:
sudo sysctl -w net.core.rmem_max=52428800
sudo sysctl -w net.core.wmem_max=52428800
sudo sysctl -p
However, there was no noticeable reduction in packet loss. I also attempted 1GB buffer sizes with similar results (my system has >90GB of RAM available).
Why did increasing system network buffer sizes not work in this case?

For some versions of tc, if you do not specify a buffer count limit, tc will default to 1000 buffers.
You can check how many buffers tc is currently using by running:
tc -s qdisc ls dev <device>
For example on my system, where I’ve simulated a 0.1s delay on the eth0 interface I get:
$ tc -s qdisc ls dev eth0
qdisc netem 8024: root refcnt 2 limit 1000 delay 0.1s
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
This shows that I have a limit of 1000 buffers available to fill during my 0.1s delay period. If I go over this many buffers in my delay timeframe, the system will start dropping packets. This means I have a packet per second (pps) limit of:
pps = buffers / delay
pps = 1000 / 0.1
pps = 10000
If I go beyond this limit, the system will be forced to either drop the incoming packet right away or replace a queued packet, dropping it instead.
Since we don’t normally think of network flows in pps, it’s useful to convert from pps to Bps, KBps, or GBps. This can be done by multiplying by either the network MTU (generally 1500 bytes), the buffer size (varies by system), or ideally by the observed average number of bytes per packet seen by your system on the given interface. Since we don’t know the average bytes per packet or the buffer size of your system at the moment, we’ll fall back to using the typical MTU.
byte rate = pps * bytes per packet
byte rate = 10000pps * 1500 bytes per packet
byte rate = 15000000 Bytes per second
byte rate = 15 MBps
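The same arithmetic as a small Python sketch, in case you want to plug in your own numbers. The function names are made up for illustration, and the 1500-byte packet size is the same MTU assumption used above.

```python
def netem_pps_limit(limit_packets, delay_s):
    """Packets per second a netem queue of `limit_packets` can sustain."""
    return limit_packets / delay_s


def byte_rate(pps, bytes_per_packet=1500):
    """Convert packets per second to bytes per second."""
    return pps * bytes_per_packet


pps = netem_pps_limit(1000, 0.1)   # limit 1000, delay 0.1s
print(f"{pps:.0f} pps, {byte_rate(pps) / 1e6:.1f} MBps")
```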
If we are talking about a loopback interface that normally runs at an average of, say, ~5 GBps, such as what iperf3 reports for the loopback interface on this MacBook, we can see the problem right away: our tc limit of 15 MBps is far less than the interface’s practical limit of ~5 GBps.
So if we were transferring a 1GB file over the loopback interface of this system, it should take:
time = file size / byte rate
time = 1 GB / 5 GBps
time = 0.2 seconds
To transfer the file across the loopback interface. And the loss, assuming packet size matches buffer size, would be:
packets lost = packets - ((packets that fit in buffers) + (drain rate of buffers * timeframe))
packets lost = (file size / MTU) - ((buffer count) + (drain rate * timeframe))
packets lost = (1 GB / 1500 bytes) - ((1000) + (10000Hz * 0.2 seconds))
packets lost = 663667
And that’s out of:
packets = (file size / MTU)
packets = (1 GB / 1500 bytes)
packets = 666667
So in all that would be a loss percentage of:
loss % = 100 * (lost) / (total)
loss % = 100 * 663667 / 666667
loss % = 99.55%
Which happens to be roughly in line with what you are seeing.
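For experimenting with other file sizes or limits, here is a minimal Python sketch of this estimate. It uses the `limit 1000` and `delay 0.1s` reported by tc above as the buffer count and delay, and it carries over the same simplifying assumptions: 1500-byte packets and a sender running at a constant ~5 GBps. The function name is hypothetical.

```python
def estimated_loss(file_bytes, packet_bytes, limit, delay_s, link_bytes_per_s):
    """Estimate packets dropped when a sender outruns a netem queue."""
    total = file_bytes / packet_bytes            # packets the sender emits
    send_time = file_bytes / link_bytes_per_s    # seconds spent sending
    drained = (limit / delay_s) * send_time      # packets released meanwhile
    lost = max(0.0, total - (limit + drained))   # whatever did not fit or drain
    return lost, total, 100 * lost / total


lost, total, pct = estimated_loss(1e9, 1500, 1000, 0.1, 5e9)
print(f"{lost:.0f} of {total:.0f} packets lost ({pct:.2f}%)")
```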
So why didn’t increasing the system buffer size impact your losses? After all the buffer size is part of the computation.
The answer is that the method you are using to transmit your file is likely chunking according to its best guess at the MTU (likely 1500 bytes), and the packets only make use of the first 1500 bytes of your extra large buffers.
Thus the solution should probably be to increase the number of buffers available to tc instead of increasing the system buffer size. But how many buffers do you need for this link? Based off of this answer the recommendation is to use 150% of the expected number of packets for your delay, so that’s:
buffers = (network rate / avg packet size) * delay * 150%
buffers = (5GBps / 1500B) * 0.1s * 150%
buffers = 333333 * 150%
buffers = 500000
You can see right away that that’s 500 times as many buffers as tc uses by default. To put it another way, you only had 0.2% of the buffers you needed, so nearly every packet was dropped.
Thus to fix your problem, try changing your tc command from something like:
sudo tc qdisc add dev <device> root netem delay 0.1s
To something like:
sudo tc qdisc add dev <device> root netem delay 0.1s limit 500000
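If your link rate or delay differs, the sizing rule above can be recomputed with a short Python sketch like this one. The helper name is made up, and the 1.5 headroom factor and 1500-byte packet size are the same assumptions used in the calculation above. It only prints the command; it does not run tc.

```python
def netem_limit(link_bytes_per_s, delay_s, packet_bytes=1500, headroom=1.5):
    """Packets in flight during the delay, padded by a headroom factor."""
    return round(link_bytes_per_s / packet_bytes * delay_s * headroom)


limit = netem_limit(5e9, 0.1)
print(f"sudo tc qdisc add dev <device> root netem delay 0.1s limit {limit}")
```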

To my knowledge, even though it's not what you are trying to achieve, you should probably throttle the speed at which you are sending UDP packets because, as pointed out by @user3878723, the buffers will quickly fill up and packets will be lost. Said differently, much like @Ron Maupin put it: when the delay is applied, the interface gets congested. I don't think the emitting process is aware of the 100ms delay, so it might overwhelm all available resources quickly.
Instead you may have to tweak something like a Token Bucket Filter (TBF) if you want to go further with your use case. Also consider "Rate control".
UPDATE
It could be worth modifying these parameters and making them persistent:
net.core.rmem_default
net.core.wmem_default
And/or make sure you are correctly using these options in your emitter/receiver:
SO_SNDBUF
SO_RCVBUF
So that the whole chain has enough buffer.
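As an illustration of the socket options mentioned above, here is a minimal Python sketch that requests larger buffers on a UDP socket and then reads back what the kernel actually granted. The 4 MiB request is an arbitrary example value, and the helper name is made up. On Linux the request is capped by net.core.rmem_max / net.core.wmem_max, which is why reading the value back is worthwhile.

```python
import socket


def open_udp_socket(rcvbuf_bytes, sndbuf_bytes):
    """Create a UDP socket and request larger kernel buffers for it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    return sock


sock = open_udp_socket(4 * 1024 * 1024, 4 * 1024 * 1024)
# The kernel may adjust or cap the request, so check the effective sizes.
effective_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
effective_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("SO_RCVBUF:", effective_rcv, "SO_SNDBUF:", effective_snd)
sock.close()
```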

How to emulate jitter WITHOUT packet reordering, using TC and NETEM?

Apparently NETEM uses tfifo, which queues packets based on time to send. This results in jitter causing packet reordering. For example, the following line will cause packet reordering:
tc qdisc add dev eth0 root handle 1: netem delay 10ms 100ms
The NETEM manual suggests that if you don't want reordering, you replace the internal queue discipline tfifo with a pure packet fifo (pfifo), and gives the following example to add lots of jitter without reordering:
tc qdisc add dev eth0 root handle 1: netem delay 10ms 100ms
tc qdisc add dev eth0 parent 1:1 pfifo limit 1000
But it doesn't work! Packets still get reordered! (and it looks like it's kernel dependent according to this)
So, does anyone know how to add jitter WITHOUT reordering packets?

One hacky option is to use constant delay (no jitter) and have a loop and change the delay value in the loop.
Say you want a 50ms delay with 5ms variance. You first add the base delay:
tc qdisc add dev eth0 root handle 1: netem delay 50ms
You can then have a loop that picks a random delay between 45ms and 55ms and changes the delay as below:
tc qdisc change dev eth0 root handle 1: netem delay 53ms
There are two things to keep in mind though:
1- It takes some ticks to change the delay. I found a sleep of 0.1s in the loop is reasonable. This means you're limited in how often the jitter can change.
2- When you decrease the delay, new packets get queued with a smaller delay (i.e. earlier send time) than packets already in the queue, which can cause reordering! You can mitigate this by decreasing the delay in a few steps, if the decrease is significant.
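A rough Python sketch of such a loop follows. It assumes the 50ms base and 5ms variance from the example above and caps each decrease at 1ms per step, which is an arbitrary choice meant to reduce (not eliminate) the reordering risk from point 2. The helper names are hypothetical, and the sketch only prints the tc commands; running them for real needs root and something like `subprocess.run(cmd, check=True)` followed by `time.sleep(0.1)`.

```python
import random


def next_delay_ms(current_ms, base_ms=50, variance_ms=5,
                  max_step_down_ms=1, rng=random):
    """Pick a random delay in [base - variance, base + variance], but never
    drop more than max_step_down_ms below the current delay in one step."""
    target = rng.randint(base_ms - variance_ms, base_ms + variance_ms)
    return max(target, current_ms - max_step_down_ms)


def tc_change_command(dev, delay_ms):
    """Build the `tc qdisc change` command shown above as an argument list."""
    return ["tc", "qdisc", "change", "dev", dev, "root", "handle", "1:",
            "netem", "delay", f"{delay_ms}ms"]


delay = 50  # matches the base delay added first
for _ in range(5):
    delay = next_delay_ms(delay)
    print(" ".join(tc_change_command("eth0", delay)))
```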

In my case (Linux 4.17), I got the same out-of-order (ofo) issue if the variance > mean. With the variance < mean, ofo no longer happens. Of course you still need to use the pfifo qdisc:
tc qdisc add dev ethBr2 root handle 1:0 netem delay 50ms 40ms 25%
tc qdisc add dev ethBr2 parent 1:1 pfifo limit 1000

Using Unix TC to shape high bandwidth traffic

We currently have 10Gb/s servers and 1Gb/s servers that coexist (a temporary migration setup) [UDP traffic]. We would like to shape the traffic coming from the 10Gb/s servers in order to avoid big bursts that the 1G servers could not handle.
It seems that "tc" cannot do the job with a tbf (or maybe we use it the wrong way). For instance on our 10G servers we tried the following:
sudo tc qdisc add dev eth5 root tbf rate 950mbit latency 1s burst 50mbit peakrate 1000mbit mtu 1500
Here we normally set the peakrate at 1mb (which normally can't generate burst > 1mb/s).
Unfortunately, that does not work: after applying this tc config, our bandwidth drops to at most 2Mb/s.
Our only clue for this strange behavior is that sentence in the tc manual:
"To achieve perfection, the second bucket may contain only a single packet, which leads to the earlier mentioned 1mbit/s limit.
This limit is caused by the fact that the kernel can only throttle for at minimum 1 'jiffy', which depends on HZ as 1/HZ. For perfect shaping, only a single packet can get sent per jiffy - for HZ=100, this means 100 packets of on average 1000 bytes each, which roughly corresponds to 1mbit/s. "
So, is it certain that we can't have a peakrate > 1Mbit/s?
Maybe there is a completely different way to achieve our goal. Any suggestion would help. =)
Kind regards

Why do you have a 1s latency? Seems WAY too high for a 1 Gbit link
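The arithmetic behind the manual sentence quoted in the question can be checked with a short Python sketch. It assumes the model described there, exactly one packet released per timer tick of 1/HZ seconds, and uses the manual's figure of 1000 bytes per packet. The function name is made up, and whether a given kernel is actually bound by this limit is not something the sketch can tell you.

```python
def jiffy_limited_rate_bps(hz, avg_packet_bytes=1000):
    """Rate in bit/s when exactly one packet is sent per tick of 1/HZ seconds."""
    return hz * avg_packet_bytes * 8


for hz in (100, 250, 1000):
    rate = jiffy_limited_rate_bps(hz)
    print(f"HZ={hz}: {rate / 1e6:.1f} Mbit/s")
```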

Introduce delay between each packet

So I know I can delay all the packets of a stream for a given delay using Linux tc and netem.
What is presented here http://www.linuxfoundation.org/collaborate/workgroups/networking/netem#Delay_distribution
just delays all of the packets for a given amount of time, not changing the intervals between the actual packets.
What I want to do is set the minimal interval time between each consecutive pair of packets to be say 100ms. And I don't want any reordering.
Any thought much appreciated.
Regards,
kravvcu

So, if I understood your requirement correctly, you want a constant inter-packet delay of 100ms and no reordering. The command in the link you mentioned (Linux Foundation) introduces a delay of 100ms and a jitter of 20ms. This jitter creates reordering.
There are 2 approaches to meet your requirement.
If jitter is not required:
tc qdisc add/change/replace dev eth0 root netem delay 100ms
If jitter is required:
The trick is to use a high rate parameter in your netem command. netem internally maintains a tfifo queue. With the rate parameter, netem calculates the delay of the next packet based on the time-to-send of the last packet in its tfifo queue. You thus get delay and jitter but no reordering.
The command for this is:
tc qdisc add/change/replace dev eth0 root netem rate 1000mbit delay 100ms
rate 1000mbit or any rate which is very high does the job!
This feature is not documented anywhere. However, it was discussed between 2011 and 2013 on the Linux netdev mailing list. At the moment I cannot find the link, but I can point to the Linux source code that implements it:
http://lxr.free-electrons.com/source/net/sched/sch_netem.c#L495
Please vote if the answer was useful!
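To get a feel for why a very high rate adds almost nothing to the delay itself, here is a small Python sketch of the per-packet transmission time implied by a rate. It assumes the simple model of packet size divided by rate and ignores any per-packet overhead settings; the 1500-byte packet size is an assumption, and the function name is made up.

```python
def serialization_time_s(packet_bytes, rate_bps):
    """Time to 'transmit' one packet at the configured rate."""
    return packet_bytes * 8 / rate_bps


# 1000mbit as in the command above; tc's "mbit" means 10^6 bits per second.
t = serialization_time_s(1500, 1_000_000_000)
print(f"{t * 1e6:.1f} microseconds per 1500-byte packet")
```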

Simulate delayed and dropped packets on Linux

I would like to simulate packet delay and loss for UDP and TCP on Linux to measure the performance of an application. Is there a simple way to do this?

netem leverages functionality already built into Linux and userspace utilities to simulate networks. This is actually what Mark's answer refers to, by a different name.
The examples on their homepage already show how you can achieve what you've asked for:
Examples
Emulating wide area network delays
This is the simplest example, it just adds a fixed amount of delay to all packets going out of the local Ethernet.
# tc qdisc add dev eth0 root netem delay 100ms
Now a simple ping test to host on the local network should show an increase of 100 milliseconds. The delay is limited by the clock resolution of the kernel (Hz). On most 2.4 systems, the system clock runs at 100 Hz which allows delays in increments of 10 ms. On 2.6, the value is a configuration parameter from 1000 to 100 Hz.
Later examples just change parameters without reloading the qdisc
Real wide area networks show variability so it is possible to add random variation.
# tc qdisc change dev eth0 root netem delay 100ms 10ms
This causes the added delay to be 100 ± 10 ms. Network delay variation isn't purely random, so to emulate that there is a correlation value as well.
# tc qdisc change dev eth0 root netem delay 100ms 10ms 25%
This causes the added delay to be 100 ± 10 ms with the next random element depending 25% on the last one. This isn't true statistical correlation, but an approximation.
Delay distribution
Typically, the delay in a network is not uniform. It is more common to use something like a normal distribution to describe the variation in delay. The netem discipline can take a table to specify a non-uniform distribution.
# tc qdisc change dev eth0 root netem delay 100ms 20ms distribution normal
The actual tables (normal, pareto, paretonormal) are generated as part of the iproute2 compilation and placed in /usr/lib/tc; so it is possible with some effort to make your own distribution based on experimental data.
Packet loss
Random packet loss is specified in the 'tc' command in percent. The smallest possible non-zero value is:
2^-32 ≈ 0.0000000232%
# tc qdisc change dev eth0 root netem loss 0.1%
This causes 1/10th of a percent (i.e. 1 out of 1000) packets to be randomly dropped.
An optional correlation may also be added. This causes the random number generator to be less random and can be used to emulate packet burst losses.
# tc qdisc change dev eth0 root netem loss 0.3% 25%
This will cause 0.3% of packets to be lost, and each successive probability depends by a quarter on the last one.
Prob(n) = 0.25 × Prob(n-1) + 0.75 × Random
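The recurrence quoted above can be tried out with a few lines of Python. This is only a sketch of the formula as written, not of netem's actual kernel implementation; the function name and the fixed seed are made up for illustration.

```python
import random


def correlated_values(n, correlation=0.25, seed=0):
    """Generate n values where each one blends the previous value with a
    fresh uniform random number, following the recurrence quoted above."""
    rng = random.Random(seed)
    prob = rng.random()  # arbitrary starting value
    values = []
    for _ in range(n):
        prob = correlation * prob + (1 - correlation) * rng.random()
        values.append(prob)
    return values


for value in correlated_values(5):
    print(f"{value:.3f}")
```

With correlation set to 0 the values are plain independent random numbers; higher values make each one track its predecessor more closely, which is what produces bursts.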
Note that you should use tc qdisc add if you have no rules for that interface or tc qdisc change if you already have rules for that interface. Attempting to use tc qdisc change on an interface with no rules will give the error RTNETLINK answers: No such file or directory.

For dropped packets I would simply use iptables and the statistic module.
iptables -A INPUT -m statistic --mode random --probability 0.01 -j DROP
The above will drop an incoming packet with a 1% probability. Be careful: with anything above about 0.14, most of your TCP connections will most likely stall completely.
Undo with -D:
iptables -D INPUT -m statistic --mode random --probability 0.01 -j DROP
Take a look at man iptables and search for "statistic" for more information.

iptables(8) has a statistic match module that can be used to match every nth packet. To drop this packet, just append -j DROP.

One of the most used tools in the scientific community for that purpose is DummyNet. Once you have installed the ipfw kernel module, in order to introduce 50ms propagation delay between 2 machines simply run these commands:
./ipfw pipe 1 config delay 50ms
./ipfw add 1000 pipe 1 ip from $IP_MACHINE_1 to $IP_MACHINE_2
In order to also introduce 50% of packet losses you have to run:
./ipfw pipe 1 config plr 0.5
Here more details.

An easy to use network fault injection tool is Saboteur. It can simulate:
Total network partition
Remote service dead (not listening on the expected port)
Delays
Packet loss
TCP connection timeout (as often happens when two systems are separated by a stateful firewall)

Haven't tried it myself, but this page has a list of plugin modules that run in Linux's built-in iptables IP filtering system. One of the modules is called "nth", and allows you to set up a rule that will drop a configurable rate of the packets. Might be a good place to start, at least.
