Using Unix TC to shape high bandwidth traffic - linux

We currently have 10 Gb/s servers and 1 Gb/s servers coexisting (a temporary migration setup), carrying UDP traffic. We would like to shape the traffic coming from the 10 Gb/s servers to avoid big bursts that the 1 Gb/s servers cannot handle.
It seems that "tc" cannot do the job with a tbf (or maybe we are using it the wrong way). For instance, on our 10G servers we tried the following:
sudo tc qdisc add dev eth5 root tbf rate 950mbit latency 1s burst 50mbit peakrate 1000mbit mtu 1500
Here we normally set the peakrate at 1mb (which normally can't generate burst > 1mb/s).
Unfortunately, that does not work: after applying this tc configuration, our bandwidth drops to at most 2 Mb/s.
Our only clue for this strange behavior is that sentence in the tc manual:
"To achieve perfection, the second bucket may contain only a single packet, which leads to the earlier mentioned 1mbit/s limit.
This limit is caused by the fact that the kernel can only throttle for at minimum 1 'jiffy', which depends on HZ as 1/HZ. For perfect shaping, only a single packet can get sent per jiffy - for HZ=100, this means 100 packets of on average 1000 bytes each, which roughly corresponds to 1mbit/s. "
So, is it certain that we can't have a peakrate > 1Mbit/s?
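The arithmetic in the quoted manual passage can be checked with a short sketch. It assumes the simplified model the manual describes (exactly one packet released per jiffy); the function name and the HZ values other than 100 are only illustrative, and real kernels with high-resolution timers may behave differently.

```python
def max_single_packet_rate_bps(hz: int, avg_packet_bytes: int) -> int:
    """Bits per second when exactly one packet is released per jiffy (1/HZ s)."""
    return hz * avg_packet_bytes * 8


# HZ=100 with 1000-byte packets is the case from the manual.
for hz in (100, 250, 1000):
    rate = max_single_packet_rate_bps(hz, 1000)
    print(f"HZ={hz}: {rate / 1e6:.1f} Mbit/s")
```

For HZ=100 this gives 0.8 Mbit/s, which is the "roughly 1mbit/s" figure in the manual.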
Maybe there is a completely different way to achieve our goal. Any suggestion would be welcome. =)
Kind regards

Why do you have a 1s latency? Seems WAY too high for a 1 Gbit link

Related

Need to measure pairwise (IP-based) bandwidth used over time when using TC

I need to measure the datarates of packets between multiple servers. I need pairwise bandwidths between the servers (if possible even the ports), not the overall datarate per interface on each server.
Example output
Timestamp   Server A to B   Server B to A   Server A to C   Server C to A
0           1               2               1               5
1           5               3               7               1
What I tried or thought of
tcpdump - I was capturing all the packets and looking at ip.len for getting the datarates. It worked quite well till I started testing along with TC.
Turns out tcpdump captures packets at a lower layer than TC. So, the bandwidths I measure using this can't see the limit set by TC.
netstat - I tried grepping the output and looking at the Recv-Q and Send-Q columns. But later I found out that these report the bytes that have been received and are buffered, waiting for the local process using the connection to read and consume them. So I can't use them to get the bandwidth being used.
iftop - Amazing interface and has all the things I need. But there is no way to get the output in a form that is easy to process. It might also overwhelm storage because of the amount of extra text it writes along with the numbers.
bwm-ng - Gives overall datarate per interface on each server but not pairwise.
Please let me know if there are any other ways to achieve what I need.
Thanks in advance for your help.
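Whatever tool ends up producing the per-packet records, the pairwise aggregation itself is small. Below is a minimal sketch, assuming records of the form (timestamp, source IP, destination IP, length in bytes) are already available from some capture point that sees the traffic after shaping; the record layout, the function name, and the sample addresses are all hypothetical.

```python
from collections import defaultdict


def pairwise_rates(records, interval=1.0):
    """Sum bytes per (time bucket, src, dst) and convert to bits per second.

    records: iterable of (timestamp_seconds, src_ip, dst_ip, length_bytes)
    """
    totals = defaultdict(int)
    for ts, src, dst, length in records:
        bucket = int(ts // interval)
        totals[(bucket, src, dst)] += length
    return {key: nbytes * 8 / interval for key, nbytes in totals.items()}


# Made-up sample records, only to show the shape of the output.
sample = [
    (0.10, "10.0.0.1", "10.0.0.2", 1500),
    (0.50, "10.0.0.1", "10.0.0.2", 1500),
    (0.70, "10.0.0.2", "10.0.0.1", 64),
    (1.20, "10.0.0.1", "10.0.0.3", 1000),
]
rates = pairwise_rates(sample)
for (bucket, src, dst), bps in sorted(rates.items()):
    print(f"t={bucket}s {src} -> {dst}: {bps:.0f} bit/s")
```

Adding the port numbers to the dictionary key would give per-port rates in the same way.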

Why does TCP/IP speed depend on the size of the data sent?

When I send small payloads (16 bytes and 128 bytes) continuously (a 100-iteration loop with no inserted delay), the throughput with TCP_NODELAY set seems not as good as with the default setting. Additionally, TCP slow start appeared to affect the transmission at the beginning.
The reason I ask is that I want to control a device from a PC via Ethernet. The processing time of this device is around several microseconds, but the large latency of sending commands affects the entire system. Could you share some ways to solve this problem? Thanks in advance.
Last time, I measured the transfer performance between a Windows-PC and a Linux embedded board. To verify the TCP_NODELAY, I setup a system with two Linux PCs connecting directly with each other, i.e. Linux PC <--> Router <--> Linux PC. The router was only used for two PCs.
The performance without TCP_NODELAY is shown as follows. It is easy to see that the throughput increased significantly when data size >= 64 KB. Additionally, when data size = 16 B, the receive time sometimes dropped to as low as 4.2 us. Do you have any explanation for this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share with me your thinking. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket libraries are winsock2.h and sys/socket.h for Windows and Linux, respectively. The binary file is copied to an array in Windows before sending, and the received data are stored in an array in Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all data properly. However, it seems to me that the transfer time depends on the size of sending data. When data size is small, the received throughput is quite low, as shown below.
Could you please point me to some documents explaining this problem? Thank you in advance.
To establish a TCP connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender will start to send some data. How much depends on the initial congestion window (configurable on Linux, don't know on Windows). As long as the sender receives timely ACKs, it will continue to send, as long as the receiver's advertised window has space (use socket option SO_RCVBUF to set it). Finally, closing the connection also requires a FIN, FIN-ACK, ACK.
So my best guess without more information is that the overhead of setting up and tearing down the TCP connection has a huge effect on the cost of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much effect as long as the writer is writing quickly. It only prevents sending less than full MSS segments, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less than full MSS segment might need to wait for an ACK, which again would have more impact on the short transfers than on the longer transfers.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2KB/second. I think you can guess that if I sent 2 bytes, it would take essentially the same amount of time for an effective rate of 8.4KB/second, a 100% improvement!
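The same point as a small calculation: when the elapsed time is dominated by a fixed cost, the effective rate scales almost linearly with the payload. This is only a sketch that assumes the 237 microseconds stays constant for small payloads; the function name is made up for illustration.

```python
def effective_rate_bytes_per_s(payload_bytes: int, elapsed_us: float) -> float:
    """Payload divided by total elapsed time, in bytes per second."""
    return payload_bytes / (elapsed_us * 1e-6)


# Same fixed 237 us cost, growing payload.
for n in (1, 2, 1000):
    rate = effective_rate_bytes_per_s(n, 237)
    print(f"{n} bytes in 237 us -> {rate / 1000:.1f} KB/s")
```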
The best way to diagnose performance problems in networks is to get a network capture and analyze it.
When you run your test with a significant amount of data, for example your largest test (512 MiB, about 536 million bytes), the following happens.
The TCP layer sends the data by breaking it into segments of a certain length. Let's assume segments of 1460 bytes, so there will be about 367,000 segments.
For every segment transmitted there is overhead (control and management data added to ensure good transmission): in your setup, there are 20 bytes for TCP, 20 for IP, and 16 for Ethernet, for a total of 56 bytes per segment. Please note that this number is the minimum, not accounting for the Ethernet preamble, for example; moreover, the IP and TCP overhead can sometimes be bigger because of optional fields.
So 56 bytes for every segment (367,000 segments!) means that when you transmit 512 MiB, you also transmit 56 × 367,000 ≈ 20 million bytes of overhead on the line. The total becomes 536 + 20 = 556 million bytes, or 4,448 million bits. If you divide this number of bits by the elapsed time, 4.6 seconds, you get a bitrate of about 967 megabits per second, which is higher than what you calculated without taking the overhead into account.
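For anyone who wants to redo this calculation with other sizes, here is a rough sketch. It uses the same assumptions as above (1460-byte segments, 56 bytes of overhead per segment, 4.6 seconds elapsed); the function name is hypothetical, and because it does not round the intermediate values the result comes out slightly higher than the hand calculation.

```python
import math


def line_rate_mbps(payload_bytes, elapsed_s, mss=1460, overhead_per_segment=56):
    """Return (segments, total bytes on the wire, bitrate in Mbit/s)."""
    segments = math.ceil(payload_bytes / mss)
    total_bytes = payload_bytes + segments * overhead_per_segment
    return segments, total_bytes, total_bytes * 8 / elapsed_s / 1e6


# 512 MiB transferred in 4.6 s, as in the question.
segments, total_bytes, mbps = line_rate_mbps(512 * 1024 * 1024, 4.6)
print(f"{segments} segments, {total_bytes} bytes on the wire, {mbps:.1f} Mbit/s")
```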
From the above calculation, it seems that your Ethernet link is gigabit. Its maximum transfer rate should be 1,000 megabits per second, and you are getting really close to it. The rest of the time is due to more overhead we didn't account for, and to some latencies that are always present and become less significant as more data is transferred (but they never disappear completely).
I would say that your setup is ok. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, the latencies of the protocol and other nice things become more and more important. For example, if you transmit 16 bytes in 165 microseconds (the first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bitrate would be about 31 Mbps (40 times bigger). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16+56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also bigger. But, you see, transmitting 16 bytes on a TCP/IP link is the same as measuring the flow rate of an empty aqueduct by dropping a few tears of water into it: the tears get lost before they reach the other end. This is because TCP/IP and Ethernet are designed to carry much more data, with reliability.
Comments and answers in this page point out many of those mechanisms that trade bitrate and reactivity for reliability: the 3-way TCP handshake, the Nagle algorithm, checksums and other overhead, and so on.
Given the design of TCP/IP and Ethernet, it is quite normal that performance is not optimal for small amounts of data. From your tests you see that the transfer rate climbs steeply when the data size reaches 64 KB. This is not a coincidence.
From a comment you left above, it seems that you are looking for low-latency communication rather than high bandwidth. It is a common mistake to confuse different kinds of performance. Moreover, in this respect, I must say that TCP/IP and Ethernet are completely non-deterministic. They are quick, of course, but nobody can say how quick because there are too many layers in between. Even in your simple setup, if a single packet gets lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want something with low latency, you should use something else, for example a CAN bus. Its design is exactly what you want: it transmits small amounts of data with high speed, low latency, and deterministic timing. Just microseconds after you transmit a packet, you know whether it has been received or not; to be more precise, exactly at the end of the transmission of a packet you know whether it reached the destination.
TCP sockets typically have a buffer size internally. In many implementations, it will wait a little bit of time before sending a packet to see if it can fill up the remaining space in the buffer before sending. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but due to the fact that the TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or function called something like TcpNoDelay which can be false (default) or true. I would try messing with that and seeing if that affects your throughput. Essentially these flags will enable/disable Nagle's algorithm.
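As an illustration of what that flag looks like in practice, here is a minimal sketch using Python's standard socket module. The question's own code is in C, where the equivalent is a setsockopt call with IPPROTO_TCP and TCP_NODELAY; the socket below is never connected, it only shows setting and reading back the option.

```python
import socket

# Create a TCP socket; no connection is needed to set the option.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A non-zero value disables Nagle's algorithm on this socket.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))
sock.close()
```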

Why tc cannot do ingress shaping? Does ingress shaping make sense?

In my work, I found that tc can do egress shaping but can only do ingress policing. I wonder why tc doesn't implement ingress shaping.
Code sample:
#ingress
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip prio 50 \
u32 match ip src 0.0.0.0/0 police rate 256kbit \
burst 10k drop flowid :1
#egress
tc qdisc add dev eth0 root tbf \
rate 256kbit latency 25ms burst 10k
But I can't do this:
#ingress shaping, using tbf
tc qdisc add dev eth0 ingress tbf \
rate 256kbit latency 25ms burst 10k
I found a solution called IFB (an updated IMQ) that can redirect the traffic to egress. But it does not seem like a good solution because it wastes CPU, so I don't want to use it.
Does ingress shaping make sense? And why doesn't tc support it?
Although tc shaping rules for ingress are very limited, you can create a virtual interface and apply egress rules to it, as described here:
https://serverfault.com/questions/350023/tc-ingress-policing-and-ifb-mirroring
(You may not need the virtual interface if your VMs already use virtual interfaces and you can apply tc to them.)
The caveat with ingress shaping is that it may take a long time for an incoming stream to respond to your shaping actions, due to all the buffers in routers between the stream source and your interface. And until the stream does respond to a reduced limit, it will continue to flood your downstream! Meanwhile you will be throwing away good packets, reducing your throughput.
Likewise when a high-priority stream ends or drops off, it will take some time for the low-priority stream to grow back to its full rate. This can be quite disruptive if it happens often!
The result of this is that dynamic shaping may work as desired for groups of steady rate long-lived streams, but will offer little advantage to short-lived or varying rate high-priority streams when your downstream is flooded: the low-priority streams will simply take too long to back off. However classifying and limiting low and medium-priority packets to a static rate somewhere below your maximum downrate could be helpful, to guarantee at least some space for high-priority data.
I don't have any figures on this, and latency has improved a lot since the ADSL days. So I think it may be worth testing, if low latency or high throughput of high-priority packets is something you desire more than overall throughput, and you can live with the limitations above.
As Janoszen and the ADSL HOWTO mention, streams could respond much more quickly if we could adjust the TCP window size as part of the shaping.
Search TLDP for further research.
Shaping works on the send buffer. Ingress shaping would require control over the remote send buffer.
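To make the policing versus shaping distinction concrete, here is a simplified token-bucket sketch. It is not how tc is implemented: the shaper below has an unbounded queue (tbf bounds it with latency or limit), the burst is taken as 10,000 bytes for round numbers, and all function names are made up. The policer can only pass or drop what has already arrived, while the shaper needs a queue in front of the link, which on ingress lives at the remote sender.

```python
def police(arrivals, rate_bps, burst_bytes):
    """Token-bucket policer: packets that find too few tokens are dropped."""
    tokens = burst_bytes
    last = 0.0
    passed, dropped = [], []
    for ts, size in arrivals:
        # Refill tokens for the time elapsed since the previous packet.
        tokens = min(burst_bytes, tokens + (ts - last) * rate_bps / 8)
        last = ts
        if size <= tokens:
            tokens -= size
            passed.append((ts, size))
        else:
            dropped.append((ts, size))
    return passed, dropped


def shape(arrivals, rate_bps, burst_bytes):
    """Token-bucket shaper: packets wait in a FIFO until enough tokens exist."""
    tokens = burst_bytes
    clock = 0.0  # time of the last departure / token update
    departures = []
    for ts, size in arrivals:
        now = max(ts, clock)
        tokens = min(burst_bytes, tokens + (now - clock) * rate_bps / 8)
        if size > tokens:
            # Delay the packet until the missing tokens have accumulated.
            now += (size - tokens) * 8 / rate_bps
            tokens = size
        tokens -= size
        clock = now
        departures.append((now, size))
    return departures


# A burst of twenty 1000-byte packets arriving at t=0, limit 256 kbit/s.
burst = [(0.0, 1000)] * 20
passed, dropped = police(burst, 256_000, 10_000)
departures = shape(burst, 256_000, 10_000)
print(f"policer: {len(passed)} passed, {len(dropped)} dropped")
print(f"shaper: last packet leaves at t={departures[-1][0]} s")
```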

Traffic shaping with tc is inaccurate with high bandwidth and delay

I'm using tc with kernel 2.6.38.8 for traffic shaping. Limiting bandwidth works and adding delay works, but when shaping both bandwidth and delay, the achieved bandwidth is always much lower than the limit if the limit is >1.5 Mbps or so.
Example:
tc qdisc del dev usb0 root
tc qdisc add dev usb0 root handle 1: tbf rate 2Mbit burst 100kb latency 300ms
tc qdisc add dev usb0 parent 1:1 handle 10: netem limit 2000 delay 200ms
Yields a delay (from ping) of 201 ms, but a capacity of just 1.66 Mbps (from iperf). If I eliminate the delay, the bandwidth is precisely 2 Mbps. If I specify a bandwidth of 1 Mbps and 200 ms RTT, everything works. I've also tried ipfw + dummynet, which yields similar results.
I've tried rebuilding the kernel with HZ=1000 in Kconfig, but that didn't fix the problem. Other ideas?
It's actually not a problem, it behaves just as it should. Because you've added a 200ms latency, the 2Mbps pipe isn't used to its full potential. I would suggest you study the TCP/IP protocol in more detail, but here is a short summary of what is happening with iperf: your default window size is maybe 3 packets (likely 1500 bytes each). You fill your pipe with 3 packets, but now have to wait until you get an acknowledgement back (this is part of the congestion control mechanism). Since you delay the sending by 200ms, this will take a while. Now your window size will double and you can next send 6 packets, but will again have to wait 200ms. Then the window size doubles again, but by the time your window is completely open, the default 10 second iperf test is close to over and your average bandwidth will obviously be smaller.
Think of it like this:
Suppose you set your latency to 1 hour, and your speed to 2 Mbit/s.
2 Mbit/s requires (for example) 50 Kbit/s for TCP ACKs. Because the ACKs take over an hour to reach the source, the source can't continue sending at 2 Mbit/s: the TCP window is still stuck waiting on the first acknowledgement.
Latency and bandwidth are more related than you think (in TCP at least. UDP is a different story)
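The relationship can be put into numbers with the usual bandwidth-delay product. This is only a back-of-the-envelope sketch: it assumes a fixed window and a 200 ms round trip, ignores slow start and losses, and the function names are made up.

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of rate_bps full."""
    return rate_bps * rtt_s / 8


def window_limited_bps(window_bytes: float, rtt_s: float, link_bps: float) -> float:
    """Throughput when at most window_bytes can be sent per round trip."""
    return min(link_bps, window_bytes * 8 / rtt_s)


# 2 Mbit/s link with 200 ms round-trip time, as in the question.
needed = bdp_bytes(2e6, 0.2)
# A window of 3 packets of 1500 bytes, as in the answer above.
small = window_limited_bps(3 * 1500, 0.2, 2e6)
print(f"window needed to fill the link: {needed:.0f} bytes")
print(f"throughput with a 4500-byte window: {small / 1e6:.2f} Mbit/s")
```

So the sender needs roughly 50,000 bytes in flight before the 2 Mbit/s limit, rather than the window, becomes the bottleneck.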

Simulate delayed and dropped packets on Linux

I would like to simulate packet delay and loss for UDP and TCP on Linux to measure the performance of an application. Is there a simple way to do this?
netem leverages functionality already built into Linux and userspace utilities to simulate networks. This is actually what Mark's answer refers to, by a different name.
The examples on their homepage already show how you can achieve what you've asked for:
Examples
Emulating wide area network delays
This is the simplest example, it just adds a fixed amount of delay to all packets going out of the local Ethernet.
# tc qdisc add dev eth0 root netem delay 100ms
Now a simple ping test to host on the local network should show an increase of 100 milliseconds. The delay is limited by the clock resolution of the kernel (Hz). On most 2.4 systems, the system clock runs at 100 Hz which allows delays in increments of 10 ms. On 2.6, the value is a configuration parameter from 1000 to 100 Hz.
Later examples just change parameters without reloading the qdisc
Real wide area networks show variability so it is possible to add random variation.
# tc qdisc change dev eth0 root netem delay 100ms 10ms
This causes the added delay to be 100 ± 10 ms. Network delay variation isn't purely random, so to emulate that there is a correlation value as well.
# tc qdisc change dev eth0 root netem delay 100ms 10ms 25%
This causes the added delay to be 100 ± 10 ms with the next random element depending 25% on the last one. This isn't true statistical correlation, but an approximation.
Delay distribution
Typically, the delay in a network is not uniform. It is more common to use something like a normal distribution to describe the variation in delay. The netem discipline can take a table to specify a non-uniform distribution.
# tc qdisc change dev eth0 root netem delay 100ms 20ms distribution normal
The actual tables (normal, pareto, paretonormal) are generated as part of the iproute2 compilation and placed in /usr/lib/tc; so it is possible with some effort to make your own distribution based on experimental data.
Packet loss
Random packet loss is specified in the 'tc' command in percent. The smallest possible non-zero value is:
2^-32 = 0.0000000232%
# tc qdisc change dev eth0 root netem loss 0.1%
This causes 1/10th of a percent (i.e. 1 out of 1000) packets to be randomly dropped.
An optional correlation may also be added. This causes the random number generator to be less random and can be used to emulate packet burst losses.
# tc qdisc change dev eth0 root netem loss 0.3% 25%
This will cause 0.3% of packets to be lost, and each successive probability depends by a quarter on the last one.
Prob(n) = 0.25 × Prob(n-1) + 0.75 × Random
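To get a feel for what that formula does, here is a small simulation that implements it literally as written above. It is not taken from the netem source, and the function names and seed are arbitrary. Note that mixing in the previous value pulls the numbers toward the middle of the range, so with a non-zero correlation the observed loss rate in this sketch can differ from the configured percentage.

```python
import random


def correlated_uniform(n, rho, seed=0):
    """Generate n values where each depends by a fraction rho on the last one."""
    rng = random.Random(seed)
    prev = rng.random()
    out = []
    for _ in range(n):
        # Prob(n) = rho * Prob(n-1) + (1 - rho) * Random
        prev = rho * prev + (1 - rho) * rng.random()
        out.append(prev)
    return out


def simulate_loss(n, loss_prob, rho, seed=0):
    """Mark a packet as lost when its correlated value falls below loss_prob."""
    return [value < loss_prob for value in correlated_uniform(n, rho, seed)]


uncorrelated = simulate_loss(100_000, 0.1, 0.0)
correlated = simulate_loss(100_000, 0.1, 0.25)
print("loss fraction, no correlation: ", sum(uncorrelated) / len(uncorrelated))
print("loss fraction, 25% correlation:", sum(correlated) / len(correlated))
```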
Note that you should use tc qdisc add if you have no rules for that interface or tc qdisc change if you already have rules for that interface. Attempting to use tc qdisc change on an interface with no rules will give the error RTNETLINK answers: No such file or directory.
For dropped packets I would simply use iptables and the statistic module.
iptables -A INPUT -m statistic --mode random --probability 0.01 -j DROP
The above will drop an incoming packet with a 1% probability. Be careful: with anything above about 0.14, most of your TCP connections will most likely stall completely.
Undo with -D:
iptables -D INPUT -m statistic --mode random --probability 0.01 -j DROP
Take a look at man iptables and search for "statistic" for more information.
iptables(8) has a statistic match module that can be used to match every nth packet. To drop this packet, just append -j DROP.
One of the most widely used tools in the scientific community for that purpose is DummyNet. Once you have installed the ipfw kernel module, to introduce a 50 ms propagation delay between 2 machines, simply run these commands:
./ipfw pipe 1 config delay 50ms
./ipfw add 1000 pipe 1 ip from $IP_MACHINE_1 to $IP_MACHINE_2
To also introduce 50% packet loss, run:
./ipfw pipe 1 config plr 0.5
An easy to use network fault injection tool is Saboteur. It can simulate:
Total network partition
Remote service dead (not listening on the expected port)
Delays
Packet loss
TCP connection timeout (as often happens when two systems are separated by a stateful firewall)
Haven't tried it myself, but this page has a list of plugin modules that run in Linux' built in iptables IP filtering system. One of the modules is called "nth", and allows you to set up a rule that will drop a configurable rate of the packets. Might be a good place to start, at least.
