Why can't tc do ingress shaping? Does ingress shaping make sense?

In my work, I found that tc can do egress shaping but only ingress policing. Why doesn't tc implement ingress shaping?
Code sample:
#ingress
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip prio 50 \
u32 match ip src 0.0.0.0/0 police rate 256kbit \
burst 10k drop flowid :1
#egress
tc qdisc add dev eth0 root tbf \
rate 256kbit latency 25ms burst 10k
But I can't do this:
#ingress shaping, using tbf
tc qdisc add dev eth0 ingress tbf \
rate 256kbit latency 25ms burst 10k
I found a solution called IFB (an updated IMQ) that can redirect the ingress traffic to an egress queue. But it doesn't seem like a good solution because it wastes CPU, so I don't want to use it.
Does ingress shaping make sense? And why doesn't tc support it?

Although tc shaping rules for ingress are very limited, you can create a virtual interface and apply egress rules to it, as described here:
https://serverfault.com/questions/350023/tc-ingress-policing-and-ifb-mirroring
(You may not need the virtual interface if your VMs already use virtual interfaces and you can apply tc to them.)
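As a minimal sketch of the virtual-interface approach, the following Python helper only builds the command strings for redirecting ingress traffic to an IFB device and shaping it there with tbf. The device names `eth0` and `ifb0` and the helper name `ifb_shaping_commands` are assumptions for illustration; the rate, latency and burst values are copied from the question. Running the commands needs root and the ifb kernel module, so the sketch prints them instead of executing them.

```python
def ifb_shaping_commands(dev="eth0", ifb="ifb0", rate="256kbit",
                         latency="25ms", burst="10k"):
    """Return shell commands that shape the ingress traffic of `dev` via `ifb`.

    Device names are illustrative assumptions; adjust them to your system.
    """
    return [
        # Load the IFB module and bring the virtual interface up.
        "modprobe ifb numifbs=1",
        f"ip link set dev {ifb} up",
        # Attach the ingress qdisc to the real interface.
        f"tc qdisc add dev {dev} handle ffff: ingress",
        # Redirect every incoming IP packet to the IFB device.
        f"tc filter add dev {dev} parent ffff: protocol ip u32 "
        f"match u32 0 0 action mirred egress redirect dev {ifb}",
        # Traffic leaves the IFB device through an ordinary egress qdisc.
        f"tc qdisc add dev {ifb} root tbf rate {rate} latency {latency} burst {burst}",
    ]


if __name__ == "__main__":
    for command in ifb_shaping_commands():
        print(command)
```

Keeping the commands as data makes it easy to review them before running anything as root.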
The caveat with ingress shaping is that it may take a long time for an incoming stream to respond to your shaping actions, due to all the buffers in routers between the stream source and your interface. And until the stream does respond to a reduced limit, it will continue to flood your downstream! Meanwhile you will be throwing away good packets, reducing your throughput.
Likewise when a high-priority stream ends or drops off, it will take some time for the low-priority stream to grow back to its full rate. This can be quite disruptive if it happens often!
The result of this is that dynamic shaping may work as desired for groups of steady rate long-lived streams, but will offer little advantage to short-lived or varying rate high-priority streams when your downstream is flooded: the low-priority streams will simply take too long to back off. However classifying and limiting low and medium-priority packets to a static rate somewhere below your maximum downrate could be helpful, to guarantee at least some space for high-priority data.
I don't have any figures on this, and latency has improved a lot since the ADSL days. So I think it may be worth testing, if low latency or high throughput of high-priority packets is something you desire more than overall throughput, and you can live with the limitations above.
As Janoszen and the ADSL HOWTO mention, streams could respond much more quickly if we could adjust the TCP window size as part of the shaping.
Search TLDP for further research.

Shaping works on the send buffer. Ingress shaping would require control over the remote send buffer.

Related

Using Unix TC to shape high bandwidth traffic

We currently have 10 Gb/s servers and 1 Gb/s servers coexisting (a temporary migration setup), carrying UDP traffic. We would like to shape the traffic coming from the 10 Gb/s servers in order to avoid big bursts that the 1 Gb/s servers cannot handle.
It seems that "tc" cannot do the job with a tbf (or maybe we use it the wrong way). For instance on our 10G servers we tried the following:
sudo tc qdisc add dev eth5 root tbf rate 950mbit latency 1s burst 50mbit peakrate 1000mbit mtu 1500
Here we normally set the peakrate at 1mb (which normally can't generate burst > 1mb/s).
Unfortunately, that does not work: with this tc configuration, our bandwidth drops to at most 2 Mb/s.
Our only clue for this strange behavior is that sentence in the tc manual:
"To achieve perfection, the second bucket may contain only a single packet, which leads to the earlier mentioned 1mbit/s limit.
This limit is caused by the fact that the kernel can only throttle for at minimum 1 'jiffy', which depends on HZ as 1/HZ. For perfect shaping, only a single packet can get sent per jiffy - for HZ=100, this means 100 packets of on average 1000 bytes each, which roughly corresponds to 1mbit/s. "
So is it certain that we can't have a peakrate > 1 Mbit/s?
Maybe there is a completely different way to achieve our goal? Any suggestion would be appreciated. =)
Kind regards
Why do you have a 1s latency? Seems WAY too high for a 1 Gbit link
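The arithmetic behind the quoted manual passage can be checked with a short sketch. It assumes exactly one packet is released per jiffy, as the manual describes; the function name `per_jiffy_rate_bits` is made up for this example.

```python
def per_jiffy_rate_bits(hz, avg_packet_bytes):
    """Bits per second when exactly one packet is sent per jiffy (1/HZ seconds)."""
    return hz * avg_packet_bytes * 8


# HZ=100 with 1000-byte packets, the case quoted from the manual.
print(per_jiffy_rate_bits(100, 1000))    # 800000, i.e. roughly 1 mbit/s
# The same calculation for a kernel built with HZ=1000.
print(per_jiffy_rate_bits(1000, 1000))   # 8000000
```

Under that model the ceiling scales with HZ and packet size, which is why a single-packet second bucket sits far below gigabit rates.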

Introduce delay between each packet

So I know I can delay all the packets of a stream for a given delay using Linux tc and netem.
What is presented here http://www.linuxfoundation.org/collaborate/workgroups/networking/netem#Delay_distribution
just delays all of the packets for a given amount of time, not changing the intervals between the actual packets.
What I want to do is set the minimal interval time between each consecutive pair of packets to be say 100ms. And I don't want any reordering.
Any thought much appreciated.
Regards,
kravvcu
So, if I understood your requirement correctly, you want a constant inter-packet delay of 100ms and no reordering. The command in the link you mentioned (Linux Foundation) introduces a delay of 100ms and a jitter of 20ms. This jitter creates reordering.
There are 2 approaches to meet your requirement.
If jitter is not required:
tc qdisc add/change/replace dev eth0 root netem delay 100ms
If jitter is required:
The trick is to use a high rate parameter in your netem command. netem internally maintains a tfifo queue. With the rate parameter, netem calculates the packet delay of the next packet based on the time-to-send of the last packet in its tfifo queue. This gives you delay and jitter but no reordering.
The command for this is:
tc qdisc add/change/replace dev eth0 root netem rate 1000mbit delay 100ms
rate 1000mbit, or any other very high rate, does the job!
This feature is not documented anywhere. However, it was discussed in 2011/2012/2013 on the Linux netdev mailing list. At the moment I cannot find the link to that discussion, but I can point to the Linux source code that implements the behaviour described above.
http://lxr.free-electrons.com/source/net/sched/sch_netem.c#L495
Please vote if the answer was useful!
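To get a feel for the rate parameter used above, the following sketch computes how long one packet takes to send at a given rate, and conversely the rate at which one packet of a given size occupies a chosen interval. This is plain arithmetic, not netem's implementation; the 1500-byte packet size and both function names are assumptions.

```python
def time_to_send(packet_bytes, rate_bits_per_s):
    """Seconds needed to serialize one packet at the given rate."""
    return packet_bytes * 8 / rate_bits_per_s


def rate_for_gap(packet_bytes, gap_s):
    """Rate (bit/s) at which one packet of this size takes `gap_s` seconds to send."""
    return packet_bytes * 8 / gap_s


if __name__ == "__main__":
    # At 1000mbit a 1500-byte packet adds only microseconds of spacing.
    print(f"{time_to_send(1500, 1_000_000_000) * 1e6:.1f} microseconds")
    # Rate at which 1500-byte packets would be spaced about 100 ms apart.
    print(f"{rate_for_gap(1500, 0.1) / 1000:.0f} kbit/s")
```

Because the spacing depends on packet size, a rate-based gap is only constant when all packets have the same length.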

Traffic shaping with tc is inaccurate with high bandwidth and delay

I'm using tc with kernel 2.6.38.8 for traffic shaping. Limiting bandwidth works and adding delay works, but when shaping both bandwidth and delay, the achieved bandwidth is always much lower than the limit if the limit is >1.5 Mbps or so.
Example:
tc qdisc del dev usb0 root
tc qdisc add dev usb0 root handle 1: tbf rate 2Mbit burst 100kb latency 300ms
tc qdisc add dev usb0 parent 1:1 handle 10: netem limit 2000 delay 200ms
Yields a delay (from ping) of 201 ms, but a capacity of just 1.66 Mbps (from iperf). If I eliminate the delay, the bandwidth is precisely 2 Mbps. If I specify a bandwidth of 1 Mbps and 200 ms RTT, everything works. I've also tried ipfw + dummynet, which yields similar results.
I've tried rebuilding the kernel with HZ=1000 in Kconfig, but that didn't fix the problem. Other ideas?
It's actually not a problem; it behaves just as it should. Because you've added a 200ms latency, the 2Mbps pipe isn't used to its full potential. I would suggest you study the TCP/IP protocol in more detail, but here is a short summary of what is happening with iperf: your default window size is maybe 3 packets (likely 1500 bytes each). You fill your pipe with 3 packets, but now have to wait until you get an acknowledgement back (this is part of the congestion control mechanism). Since you delay the sending by 200ms, this will take a while. Now your window size doubles and you can send 6 packets next, but you again have to wait 200ms. Then the window size doubles again, but by the time your window is completely open, the default 10 second iperf test is close to over, and your average bandwidth will obviously be smaller.
Think of it like this:
Suppose you set your latency to 1 hour, and your speed to 2 Mbit/s.
2 Mbit/s requires (for example) 50 Kbit/s for TCP ACKs. Because the ACKs take over an hour to reach the source, the source can't continue sending at 2 Mbit/s: the TCP window is still stuck waiting on the first acknowledgement.
Latency and bandwidth are more related than you think (in TCP at least; UDP is a different story).
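The relationship between window size, delay and throughput described above can be sketched numerically. This is an idealized model that ignores loss, delayed ACKs and header overhead; the 3-packet initial window and 1500-byte packets come from the answer's example, and the function names are made up for illustration.

```python
def bdp_bytes(rate_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return rate_bits_per_s * rtt_s / 8


def window_limited_rate(window_bytes, rtt_s):
    """Throughput ceiling (bit/s) when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


def rtts_to_fill(bdp, packet_bytes=1500, initial_packets=3):
    """Round trips of window doubling before the window covers the BDP."""
    window = initial_packets * packet_bytes
    rtts = 0
    while window < bdp:
        window *= 2
        rtts += 1
    return rtts


if __name__ == "__main__":
    bdp = bdp_bytes(2_000_000, 0.2)            # 2 Mbit/s shaped link, 200 ms RTT
    print(round(bdp), "bytes in flight needed")
    print(rtts_to_fill(bdp), "doublings from a 3-packet window")
    print(round(window_limited_rate(32_768, 0.2)), "bit/s with a 32 KiB window")
```

If the sender's or receiver's window is smaller than the bandwidth-delay product, throughput stays below the shaped rate no matter how long the test runs.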

Simulate delayed and dropped packets on Linux

I would like to simulate packet delay and loss for UDP and TCP on Linux to measure the performance of an application. Is there a simple way to do this?
netem leverages functionality already built into Linux and userspace utilities to simulate networks. This is actually what Mark's answer refers to, by a different name.
The examples on their homepage already show how you can achieve what you've asked for:
Examples
Emulating wide area network delays
This is the simplest example, it just adds a fixed amount of delay to all packets going out of the local Ethernet.
# tc qdisc add dev eth0 root netem delay 100ms
Now a simple ping test to host on the local network should show an increase of 100 milliseconds. The delay is limited by the clock resolution of the kernel (Hz). On most 2.4 systems, the system clock runs at 100 Hz which allows delays in increments of 10 ms. On 2.6, the value is a configuration parameter from 1000 to 100 Hz.
Later examples just change parameters without reloading the qdisc
Real wide area networks show variability so it is possible to add random variation.
# tc qdisc change dev eth0 root netem delay 100ms 10ms
This causes the added delay to be 100 ± 10 ms. Network delay variation isn't purely random, so to emulate that there is a correlation value as well.
# tc qdisc change dev eth0 root netem delay 100ms 10ms 25%
This causes the added delay to be 100 ± 10 ms with the next random element depending 25% on the last one. This isn't true statistical correlation, but an approximation.
Delay distribution
Typically, the delay in a network is not uniform. It is more common to use something like a normal distribution to describe the variation in delay. The netem discipline can take a table to specify a non-uniform distribution.
# tc qdisc change dev eth0 root netem delay 100ms 20ms distribution normal
The actual tables (normal, pareto, paretonormal) are generated as part of the iproute2 compilation and placed in /usr/lib/tc; so it is possible with some effort to make your own distribution based on experimental data.
Packet loss
Random packet loss is specified in the 'tc' command in percent. The smallest possible non-zero value is:
2^-32 = 0.0000000232%
# tc qdisc change dev eth0 root netem loss 0.1%
This causes 1/10th of a percent (i.e. 1 out of 1000) packets to be randomly dropped.
An optional correlation may also be added. This causes the random number generator to be less random and can be used to emulate packet burst losses.
# tc qdisc change dev eth0 root netem loss 0.3% 25%
This will cause 0.3% of packets to be lost, and each successive probability depends by a quarter on the last one.
Prob_n = 0.25 × Prob_(n-1) + 0.75 × Random
Note that you should use tc qdisc add if you have no rules for that interface or tc qdisc change if you already have rules for that interface. Attempting to use tc qdisc change on an interface with no rules will give the error RTNETLINK answers: No such file or directory.
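The correlation formula quoted above can be tried out in a few lines of Python. This is only a sketch of that formula as written, not netem's kernel code; the function names, the seed handling and the way the first value is initialised are assumptions.

```python
import random


def correlated_sequence(n, correlation, seed=0):
    """Generate n values with Prob_n = c * Prob_(n-1) + (1 - c) * Random."""
    rng = random.Random(seed)
    prev = rng.random()          # assumed starting value for Prob_0
    values = []
    for _ in range(n):
        prev = correlation * prev + (1.0 - correlation) * rng.random()
        values.append(prev)
    return values


def loss_decisions(n, loss_probability, correlation, seed=0):
    """Mark a packet as lost when its blended value falls below the loss probability."""
    return [v < loss_probability
            for v in correlated_sequence(n, correlation, seed)]


if __name__ == "__main__":
    drops = loss_decisions(100_000, 0.003, 0.25, seed=1)
    print("dropped", sum(drops), "of", len(drops))
```

Note that the blended value is no longer uniformly distributed, so in this model the observed drop rate need not match the nominal percentage.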
For dropped packets I would simply use iptables and the statistic module.
iptables -A INPUT -m statistic --mode random --probability 0.01 -j DROP
The above will drop an incoming packet with a 1% probability. Be careful: with anything above about 0.14, most of your TCP connections will most likely stall completely.
Undo with -D:
iptables -D INPUT -m statistic --mode random --probability 0.01 -j DROP
Take a look at man iptables and search for "statistic" for more information.
iptables(8) has a statistic match module that can be used to match every nth packet. To drop this packet, just append -j DROP.
One of the most used tools in the scientific community for that purpose is DummyNet. Once you have installed the ipfw kernel module, in order to introduce 50ms propagation delay between 2 machines simply run these commands:
./ipfw pipe 1 config delay 50ms
./ipfw add 1000 pipe 1 ip from $IP_MACHINE_1 to $IP_MACHINE_2
In order to also introduce 50% of packet losses you have to run:
./ipfw pipe 1 config plr 0.5
Here more details.
An easy to use network fault injection tool is Saboteur. It can simulate:
Total network partition
Remote service dead (not listening on the expected port)
Delays
Packet loss
TCP connection timeout (as often happens when two systems are separated by a stateful firewall)
Haven't tried it myself, but this page has a list of plugin modules that run in Linux' built in iptables IP filtering system. One of the modules is called "nth", and allows you to set up a rule that will drop a configurable rate of the packets. Might be a good place to start, at least.

Increasing the maximum number of TCP/IP connections in Linux

I am programming a server and it seems like my number of connections is being limited since my bandwidth isn't being saturated even when I've set the number of connections to "unlimited".
How can I increase or eliminate a maximum number of connections that my Ubuntu Linux box can open at a time? Does the OS limit this, or is it the router or the ISP? Or is it something else?
Maximum number of connections are impacted by certain limits on both client & server sides, albeit a little differently.
On the client side:
Increase the ephemeral port range, and decrease the tcp_fin_timeout
To find out the default values:
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
The ephemeral port range defines the maximum number of outbound sockets a host can create from a particular IP address. The fin_timeout defines the minimum time these sockets will stay in TIME_WAIT state (unusable after being used once).
Usual system defaults are:
net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.tcp_fin_timeout = 60
This basically means your system cannot consistently guarantee more than (61000 - 32768) / 60 = 470 sockets per second. If you are not happy with that, you could begin with increasing the port_range. Setting the range to 15000 61000 is pretty common these days. You could further increase the availability by decreasing the fin_timeout. Suppose you do both, you should see over 1500 outbound connections per second, more readily.
To change the values:
sysctl net.ipv4.ip_local_port_range="15000 61000"
sysctl net.ipv4.tcp_fin_timeout=30
The above should not be interpreted as the factors impacting system capability for making outbound connections per second. But rather these factors affect system's ability to handle concurrent connections in a sustainable manner for large periods of "activity."
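The per-second figures quoted above follow from simple arithmetic, sketched here under the answer's own model in which each ephemeral port stays unavailable for the timeout period after one use. The function name is hypothetical, and `timeout_s` should be treated as a model parameter.

```python
def sustained_connections_per_second(port_low, port_high, timeout_s):
    """Outbound connections per second that the port range can sustain
    if every port is unusable for `timeout_s` seconds after use."""
    return (port_high - port_low) // timeout_s


# Defaults quoted in the answer: ports 32768-61000, timeout 60 s.
print(sustained_connections_per_second(32768, 61000, 60))   # 470
# Tuned values from the answer: ports 15000-61000, timeout 30 s.
print(sustained_connections_per_second(15000, 61000, 30))   # 1533
```

Widening the range and halving the timeout each raise the ceiling, which is where the "over 1500" figure comes from.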
Default Sysctl values on a typical Linux box for tcp_tw_recycle & tcp_tw_reuse would be
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=0
These do not allow a connection from a "used" socket (in wait state) and force the sockets to last the complete time_wait cycle. I recommend setting:
sysctl net.ipv4.tcp_tw_recycle=1
sysctl net.ipv4.tcp_tw_reuse=1
This allows fast cycling of sockets in time_wait state and re-using them. But before you do this change make sure that this does not conflict with the protocols that you would use for the application that needs these sockets. Make sure to read post "Coping with the TCP TIME-WAIT" from Vincent Bernat to understand the implications. The net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you. Note that net.ipv4.tcp_tw_recycle has been removed from Linux 4.12.
On the Server Side:
The net.core.somaxconn value has an important role. It limits the maximum number of requests queued to a listen socket. If you are sure of your server application's capability, bump it up from the default of 128 to something like 1024. You can then take advantage of this increase by raising the listen backlog argument in your application's listen call to an equal or higher integer.
sysctl net.core.somaxconn=1024
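On the application side, the backlog is the argument passed to listen. The sketch below, using Python's standard socket module, reads the current somaxconn value and opens a listener that requests a backlog of 1024. The helper names, the loopback address and the ephemeral port are assumptions for illustration.

```python
import socket


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Return the kernel's somaxconn value, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def open_listener(host="127.0.0.1", port=0, backlog=1024):
    """Open a TCP listening socket that asks for the given accept backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    # The kernel silently caps this value at net.core.somaxconn.
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    listener = open_listener()
    print("listening on", listener.getsockname(), "somaxconn =", read_somaxconn())
    listener.close()
```

Requesting a backlog larger than somaxconn has no effect, so the sysctl and the listen argument need to be raised together.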
The txqueuelen parameter of your Ethernet cards also has a role to play. The default value is 1000, so bump it up to 5000 or even more if your system can handle it.
ifconfig eth0 txqueuelen 5000
echo "/sbin/ifconfig eth0 txqueuelen 5000" >> /etc/rc.local
Similarly bump up the values for net.core.netdev_max_backlog and net.ipv4.tcp_max_syn_backlog. Their default values are 1000 and 1024 respectively.
sysctl net.core.netdev_max_backlog=2000
sysctl net.ipv4.tcp_max_syn_backlog=2048
Now remember to increase the FD ulimits in the shell before starting both your client and server side applications.
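A process can also inspect and raise its own file descriptor limit. The following sketch uses Python's standard resource module to lift the soft limit up to the hard limit, which is roughly what raising `ulimit -n` in the shell does for child processes. The function name is an assumption, and raising the hard limit itself still requires privileges or system configuration.

```python
import resource


def raise_nofile_limit():
    """Raise the soft open-files limit to the hard limit and return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some systems refuse the request; keep the existing soft limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open file limits (soft, hard):", raise_nofile_limit())
```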
Besides the above, one more popular technique used by programmers is to reduce the number of TCP write calls. My own preference is to use a buffer into which I push the data I wish to send to the client, and then at appropriate points I write out the buffered data to the actual socket. This technique allows me to use large data packets, reduce fragmentation, and reduce my CPU utilization both in userland and at kernel level.
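A minimal sketch of that buffering idea in Python follows. The class name `BufferedSender`, the 4096-byte threshold and the `send_calls` counter are assumptions chosen for illustration, not part of any library.

```python
import socket


class BufferedSender:
    """Collect small writes and push them to the socket in larger chunks."""

    def __init__(self, sock, flush_threshold=4096):
        self.sock = sock
        self.flush_threshold = flush_threshold
        self.buffer = bytearray()
        self.send_calls = 0      # counts how often the socket was actually written

    def write(self, data):
        """Queue data; flush automatically once the threshold is reached."""
        self.buffer.extend(data)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Send everything queued so far with a single sendall call."""
        if self.buffer:
            self.sock.sendall(bytes(self.buffer))
            self.send_calls += 1
            self.buffer.clear()


if __name__ == "__main__":
    left, right = socket.socketpair()
    sender = BufferedSender(left)
    for _ in range(100):
        sender.write(b"0123456789")   # 100 small application-level writes
    sender.flush()
    print("socket writes:", sender.send_calls)
    left.close()
    right.close()
```

The application still has to call flush at message boundaries, otherwise data below the threshold stays in the buffer.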
There are a couple of variables to set the max number of connections. Most likely, you're running out of file numbers first. Check ulimit -n. After that, there are settings in /proc, but those default to the tens of thousands.
More importantly, it sounds like you're doing something wrong. A single TCP connection ought to be able to use all of the bandwidth between two parties; if it isn't:
Check if your TCP window setting is large enough. Linux defaults are good for everything except really fast inet link (hundreds of mbps) or fast satellite links. What is your bandwidth*delay product?
Check for packet loss using ping with large packets (ping -s 1472 ...)
Check for rate limiting. On Linux, this is configured with tc
Confirm that the bandwidth you think exists actually exists using e.g., iperf
Confirm that your protocol is sane. Remember latency.
If this is a gigabit+ LAN, can you use jumbo packets? Are you?
Possibly I have misunderstood. Maybe you're doing something like Bittorrent, where you need lots of connections. If so, you need to figure out how many connections you're actually using (try netstat or lsof). If that number is substantial, you might:
Have a lot of bandwidth, e.g., 100mbps+. In this case, you may actually need to up the ulimit -n. Still, ~1000 connections (default on my system) is quite a few.
Have network problems which are slowing down your connections (e.g., packet loss)
Have something else slowing you down, e.g., IO bandwidth, especially if you're seeking. Have you checked iostat -x?
Also, if you are using a consumer-grade NAT router (Linksys, Netgear, DLink, etc.), beware that you may exceed its abilities with thousands of connections.
I hope this provides some help. You're really asking a networking question.
To improve upon the answer given by @derobert:
You can check the limit of the netfilter connection-tracking table, which caps the number of tracked connections when conntrack is in use, by catting nf_conntrack_max. For example:
cat /proc/sys/net/netfilter/nf_conntrack_max
You can use the following script to count the number of TCP connections to a given range of tcp ports. By default 1-65535.
This will confirm whether or not you are maxing out your OS connection limit.
Here's the script.
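#!/bin/sh
# Count TCP connections (ESTABLISHED or TIME_WAIT) whose local port lies in [start, end].
OS=$(uname)
case "$OS" in
  'SunOS')
    AWK=/usr/bin/nawk
    ;;
  'AIX')
    AWK=/usr/bin/awk
    ;;
  *)
    # Linux and anything else: rely on the awk found in PATH.
    AWK=awk
    ;;
esac
netstat -an | $AWK -v start=1 -v end=65535 ' $NF ~ /TIME_WAIT|ESTABLISHED/ && $4 !~ /127\.0\.0\.1/ {
  # Use column 1 when it looks like an address (contains a dot), otherwise column 4.
  if ($1 ~ /\./)
    {sip=$1}
  else {sip=$4}
  # The port is the last piece after splitting the address on ":" or ".".
  n = split( sip, a, /:|\./ )
  if ( a[n] >= start && a[n] <= end ) {
    ++connections;
  }
}
END {print connections + 0}'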
At the application level, here are some things a developer can do:
On the server side:
Check that the load balancer (if you have one) works correctly.
Turn slow TCP timeouts into a fast, immediate 503 response. If your load balancer works correctly, it should pick a working resource to serve the request, which is better than hanging there with unexpected error messages.
E.g., if you are using a Node server, you can use toobusy from npm.
An implementation looks something like:
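var express = require('express');
var toobusy = require('toobusy');
var app = express();
// Answer with 503 right away while the event loop is lagging (Express 4 style).
app.use(function(req, res, next) {
  if (toobusy()) res.status(503).send("I'm busy right now, sorry.");
  else next();
});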
Why 503? Here are some good insights for overload:
http://ferd.ca/queues-don-t-fix-overload.html
We can do some work on the client side too:
Try to group calls into batches to reduce the traffic and the total number of requests between client and server.
Try to build a cache mid-layer to handle unnecessary duplicate requests.
I'm trying to resolve this in 2022 on load balancers, and one way I found is to attach another IPv4 (or eventually IPv6) address to the NIC, so the limit is now doubled. Of course you need to configure the second IP in the service that connects to the machine (in my case, another DNS entry).
