I am currently facing a problem in my code.
I emulate the connection between two computers that are connected via an Ethernet bridge (Raspberry Pi, Raspbian), so I can influence parameters of this connection (bandwidth, latency, and more) via tc qdisc.
This works fine, as you can see in the code below.
But now to my problem:
I am also trying to exclude specific port ranges, meaning ports that aren't influenced by my given parameters (latency etc.).
For that I created two prio bands. Prio band 0 (higher priority) handles my port exclusion (already in the parent root).
Afterwards, in prio band 1 (lower priority), I apply a latency via netem.
All data traffic passes through the influenced prio band 1; the remaining (excluded) data passes uninfluenced through prio band 0.
I don't get kernel errors while executing my code, but after typing sudo tc filter show dev eth1 I only receive filter parent 1: protocol ip pref 1 basic.
My match is not even mentioned. What did I do wrong?
Can you explain why I don't get my expected output?
THIS IS MY CODE (in the right order of execution):
PARENT ROOT
sudo tc qdisc add dev eth1 root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
This creates two priobands (1:1 and 1:2)
BAND 0 [PORT EXCLUSION | port 100 - 800]
sudo tc qdisc add dev eth1 parent 1:1 handle 10: tbf rate 512kbit buffer 1600 limit 3000
Creates a tbf (Token Bucket Filter) to set bandwidth
sudo tc filter add dev eth1 parent 1: protocol ip prio 1 handle 0x10 basic match "cmp(u16 at 0 layer transport lt 100) and cmp(u16 at 0 layer transport gt 800)" flowid 1:1
Creates a filter with a specific handle that excludes ports 100 to 800 from prio band 1 (the influenced data packets)
BAND 1 [NET EMULATION]
sudo tc qdisc add dev eth1 parent 1:2 handle 20: tbf rate 1024kbit buffer 1600 limit 3000
Compare with tbf above
sudo tc qdisc add dev eth1 parent 20:1 handle 21: netem delay 200ms
Creates via netem a delay of 200ms
Here you can see my hierarchy as an image
The question again:
My filter match is not even mentioned. What did I do wrong?
Can you explain why I don't get my expected output?
I appreciate any kind of help! Thanks for your efforts!
~rotsechs
It seems I can simply ignore the missing output! Nevertheless, it works perfectly.
I established an SSH connection to my Ethernet bridge (via MobaXterm). Afterwards I applied a delay of 400 ms to it. The console input slowed down as expected.
Finally I created the filter and excluded the port range from 20 to 24 (SSH uses port 22). The delay on my SSH connection disappeared immediately!
I have an Apache server running on Ubuntu. A client connects and downloads an image. I need to extract the RTT estimates for the underlying TCP connection. Is there a way to do this? Maybe something like running my TCP stack in debug mode so it logs this info somewhere?
Note that I don't want to run tcpdump and extract RTTs from the recorded trace! I need the TCP stack's own RTT estimates (apparently this is part of the info you can get with the TCP_INFO socket option). Basically I need something like tcpprobe (kprobe) to insert a hook and record the estimated RTT of the TCP connection on every incoming packet (or on every change).
UPDATE:
I found a solution: RTT, congestion window, and more can be logged using tcpprobe. I posted an answer below.
This can be done without any additional kernel modules using the ss command (part of the iproute package), which can provide detailed info on open sockets. It won't show the values for every packet, but most of this info is calculated over a number of packets anyway. E.g. to list the currently open TCP sockets (the t option) and the associated internal TCP info (the i option) - including congestion control algorithm, rtt, cwnd, etc.:
ss -ti
Here's some example output:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.168.56.102:ssh 192.168.56.1:46327
cubic wscale:6,7 rto:201 rtt:0.242/0.055 ato:40 mss:1448 rcvmss:1392
advmss:1448 cwnd:10 bytes_acked:33169 bytes_received:6069 segs_out:134
segs_in:214 send 478.7Mbps lastsnd:5 lastrcv:6 lastack:5
pacing_rate 955.4Mbps rcv_rtt:3 rcv_space:28960
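If you control the client (or server) code, the same kind of per-socket data, including the kernel's smoothed RTT estimate, can also be read directly with the TCP_INFO socket option the question mentions. A minimal sketch (struct tcp_info fields come via <netinet/tcp.h>; the RTT values are in microseconds):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>    /* TCP_INFO, struct tcp_info */
#include <sys/socket.h>

/* Print the kernel's current RTT estimate for a connected TCP socket.
   tcpi_rtt and tcpi_rttvar are reported in microseconds. */
static void print_rtt(int sockfd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        printf("srtt=%u us rttvar=%u us cwnd=%u\n",
               info.tcpi_rtt, info.tcpi_rttvar, info.tcpi_snd_cwnd);
}

Calling this after each read on the socket gives roughly the "on every change" behaviour asked for, without any kernel module.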
This can be done using tcpprobe, a module that inserts a hook into the tcp_recv processing path using kprobe and records the state of a TCP connection in response to incoming packets.
Let's say you want to probe a TCP connection on port 443; you need to do the following:
sudo modprobe tcp_probe port=443 full=1
sudo chmod 444 /proc/net/tcpprobe
cat /proc/net/tcpprobe > /tmp/output.out &
pid=$!
full=1: log on every ACK packet received
full=0: log only on cwnd changes (if you use this, your output might be empty)
Now pid is the process which is logging the probe. To stop, simply kill this process:
kill $pid
The format of output.out (according to the source at line 198):
[time][src][dst][length][snd_nxt][snd_una][snd_cwnd][ssthresh][snd_wnd][srtt][rcv_wnd]
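If you only want the srtt column out of output.out, a rough parser along these lines should work. This is just a sketch; it assumes the columns are whitespace-separated and appear exactly in the order listed above:

#include <stdio.h>

/* Read tcpprobe lines from stdin and print time, endpoints, and srtt.
   The column order below is assumed from the format string above. */
int main(void)
{
    double t;
    char src[64], dst[64];
    unsigned long length, snd_nxt, snd_una, cwnd, ssthresh, snd_wnd, srtt, rcv_wnd;

    while (scanf("%lf %63s %63s %lu %lx %lx %lu %lu %lu %lu %lu",
                 &t, src, dst, &length, &snd_nxt, &snd_una,
                 &cwnd, &ssthresh, &snd_wnd, &srtt, &rcv_wnd) == 11)
        printf("%.6f %s -> %s srtt=%lu\n", t, src, dst, srtt);

    return 0;
}

Compile it and run it as, for example, ./parse_srtt < /tmp/output.out (the program name is just an example).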
I'm writing a network bridge that reassembles and analyzes TCP flows on the fly. I have a pair of multi-queue NICs, and I use netmap to capture packets from each rx queue on different threads and then pass them on to the other NIC for transmission. The problem is that packets from the same flow do not land on the same queue on the two NICs, because the source and destination addresses and ports are reversed.
I tried ethtool to change the tuple whose hash is used for distributing packets. Running this:
# ethtool --show-nfc p1p1 rx-flow-hash tcp4
results in:
TCP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]
Repeating the above command for the other NIC gives the same output. I thought I could change the order for one of the rings by running ethtool --config-nfc p1p1 rx-flow-hash tcp4 dsfn, but that doesn't change the order of the fields used in the hash. It seems that both sdfn and dsfn result in the same tuple. The same is true for ds and sd.
Is there a way to make the packets of one flow always land on the same queue (and the same thread) on both NICs, whether by using ethtool to configure the NICs or by some other method or tool?
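One software-side workaround, in case the hash really can't be made symmetric through ethtool on this hardware, is to compute a direction-independent flow key in the capture threads and hand each packet to the worker that owns the flow. A rough sketch (the struct and the FNV-1a hash are only illustrative, not from any particular driver):

#include <stddef.h>
#include <stdint.h>

/* Direction-independent flow key: order the (addr, port) endpoints so that
   A->B and B->A produce the same key, then hash the key to pick a worker. */
struct flow_key {
    uint32_t addr_lo, addr_hi;
    uint16_t port_lo, port_hi;
};

static struct flow_key make_key(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport)
{
    struct flow_key k;

    /* Put the numerically smaller endpoint first, regardless of direction. */
    if (saddr < daddr || (saddr == daddr && sport < dport)) {
        k.addr_lo = saddr; k.port_lo = sport;
        k.addr_hi = daddr; k.port_hi = dport;
    } else {
        k.addr_lo = daddr; k.port_lo = dport;
        k.addr_hi = saddr; k.port_hi = sport;
    }
    return k;
}

/* FNV-1a over the key bytes; worker index = hash % number of workers. */
static unsigned pick_worker(const struct flow_key *k, unsigned nworkers)
{
    const unsigned char *p = (const unsigned char *)k;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(*k); i++)
        h = (h ^ p[i]) * 16777619u;
    return h % nworkers;
}

The downside is an extra hand-off from the receiving thread to the owning worker, so a driver/ethtool solution is still preferable if one exists for these NICs.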
I wrote a simple UDP Server program to understand more about possible network bottlenecks.
UDP Server: Creates a UDP socket, binds it to a specified port and address, and adds the socket file descriptor to the epoll interest list. Then it waits in epoll for incoming packets. On reception of an incoming packet (EPOLLIN), it reads the packet and just prints the received packet length. Pretty simple, right :)
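For reference, a minimal sketch of the receiver described above (port 9996 matches the hping3 command below; error handling mostly trimmed):

#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* UDP socket bound to 0.0.0.0:9996. */
    int sockfd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9996);
    if (bind(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* Add the socket to the epoll interest list and wait for EPOLLIN. */
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

    char buf[2048];
    struct epoll_event events[16];
    for (;;) {
        int n = epoll_wait(epfd, events, 16, -1);
        for (int i = 0; i < n; i++) {
            ssize_t len = recvfrom(events[i].data.fd, buf, sizeof(buf), 0, NULL, NULL);
            if (len >= 0)
                printf("received %zd bytes\n", len);
        }
    }
}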
UDP Client: I used hping as shown below:
hping3 192.168.1.2 --udp -p 9996 --flood -d 100
When I send UDP packets at 100 packets per second, I don't see any UDP packet loss. But when I flood UDP packets (as shown in the command above), I see significant packet loss.
Test1:
When 26356 packets are flooded from the UDP client, my sample program receives ONLY 12127 packets; the remaining 14230 packets are dropped by the kernel, as shown in the /proc/net/snmp output.
cat /proc/net/snmp | grep Udp:
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 12372 0 14230 218 14230 0
For Test1 the packet loss percentage is ~53%.
I verified there is NOT much loss at the hardware level using the "ethtool -S ethX" command on both the client and server side, while at the application level I see a loss of 53% as said above.
Hence to reduce packet loss I tried these:
- Increased the priority of my sample program using renice command.
- Increased Receive Buffer size (both at system level and process level)
Bump up the priority to -20:
renice -20 2022
2022 (process ID) old priority 0, new priority -20
Bump up the receive buf size to 16MB:
At Process Level:
int sockbufsize = 16777216;
setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &sockbufsize, sizeof(sockbufsize));
At Kernel Level:
cat /proc/sys/net/core/rmem_default
16777216
cat /proc/sys/net/core/rmem_max
16777216
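It is also worth confirming what the kernel actually granted, since setsockopt(SO_RCVBUF) is capped by rmem_max and Linux reports back double the requested value (the extra half is reserved for kernel bookkeeping). A small sketch:

#include <stdio.h>
#include <sys/socket.h>

/* Report the receive buffer actually applied to the socket.
   Linux returns twice the value passed to setsockopt(SO_RCVBUF),
   and the requested value is capped by net.core.rmem_max. */
static void show_rcvbuf(int sockfd)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        printf("effective SO_RCVBUF: %d bytes\n", actual);
}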
After these changes, I performed Test2.
Test2:
When 1985076 packets are flooded from the UDP client, my sample program receives 1848791 packets; the remaining 136286 packets are dropped by the kernel, as shown in the /proc/net/snmp output.
cat /proc/net/snmp | grep Udp:
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 1849064 0 136286 236 0 0
For Test2 packet loss percentage is 6%.
Packet loss is reduced significantly. But I have the following questions:
Can the packet loss be further reduced? I know I am greedy here :) But I am just trying to find out if it's possible to reduce packet loss further.
Unlike in Test1, in Test2 InErrors doesn't match RcvbufErrors, and RcvbufErrors is always zero. Can someone explain the reason behind this, please? What exactly is the difference between InErrors and RcvbufErrors? I understand RcvbufErrors but NOT InErrors.
Thanks for your help and time!!!
Tuning the Linux kernel's networking stack to reduce packet drops is a bit involved as there are a lot of tuning options from the driver all the way up through the networking stack.
I wrote a long blog post explaining all the tuning parameters from top to bottom and explaining what each of the fields in /proc/net/snmp mean so you can figure out why those errors are happening. Take a look, I think it should help you get your network drops down to 0.
If there aren't drops at the hardware level, then it should mostly be a question of memory; you should be able to tweak the kernel configuration parameters to reach 0 drops (obviously you need reasonably balanced hardware for the network traffic you're receiving).
I think you're missing netdev_max_backlog which is important for incoming packets:
Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them.
InErrors is composed of:
corrupted packets (incorrect headers or checksum)
a full receive buffer
So my guess is you have fixed the buffer overflow problem (RcvbufErrors is 0) and what is left are packets with incorrect checksums.
ifconfig 1.2.3.4 mtu 1492
Will this set the MTU to 1492 for incoming packets, outgoing packets, or both? I think it is only for incoming.
TLDR: Both. It will only transmit packets with a payload length less than or equal to that size. Similarly, it will only accept packets with a payload length within your MTU. If a device sends a larger packet, it should respond with an ICMP unreachable (oversized) message.
The nitty gritty:
Tuning the MTU for your device is useful because other hops between you and your destination may encapsulate your packet in another form (for example, a VPN or PPPoE.) This layer around your packet results in a bigger packet being sent along the wire. If this new, larger packet exceeds the maximum size of the layer, then the packet will be split into multiple packets (in a perfect world) or will be dropped entirely (in the real world.)
As a practical example, consider having a computer connected over ethernet to an ADSL modem that speaks PPPoE to an ISP. Ethernet allows for a 1500 byte payload, of which 8 bytes will be used by PPPoE. Now we're down to 1492 bytes that can be delivered in a single packet to your ISP. If you were to send a full-size ethernet payload of 1500 bytes, it would get "fragmented" by your router and split into two packets (one with a 1492 byte payload, the other with an 8 byte payload.)
The problem comes when you want to send more data over this connection - let's say you wanted to send 3000 bytes: your computer would split this up based on your MTU - in this case, two packets of 1500 bytes each - and send them to your ADSL modem, which would then split them up so that it can fulfill its own MTU. Now your 3000 bytes of data has been fragmented into four packets: two with a payload of 1492 bytes and two with a payload of 8 bytes. This is obviously inefficient; we really only need three packets to send this data. Had your computer been configured with the correct MTU for the network, it would have sent this as three packets in the first place (two 1492 byte packets and one 16 byte packet.)
To avoid this inefficiency, many IP stacks flip a bit in the IP header called "Don't Fragment." In this case, we would have sent our first 1500 byte packet to the ADSL modem and it would have rejected the packet, replying with an Internet Control Message Protocol (ICMP) message informing us that our packet is too large. We then would have retried the transmission with a smaller packet. This is called Path MTU discovery. Similarly, one layer up, at the TCP layer, another factor in avoiding fragmentation is the MSS (Maximum Segment Size) option, where both hosts announce the largest segment they can accept without fragmenting. This is typically computed from the MTU.
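If you ever need to force a smaller MSS from the application rather than relying on the interface MTU, Linux also exposes this as the TCP_MAXSEG socket option, which has to be set before connect(). A minimal sketch; the 1452 here just mirrors the PPPoE example above (1492 minus 20 bytes of IP header and 20 bytes of TCP header):

#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_MAXSEG */
#include <sys/socket.h>

/* Create a TCP socket that will announce at most a 1452-byte MSS in its SYN.
   TCP_MAXSEG must be set before the connection is established. */
static int socket_with_capped_mss(void)
{
    int mss = 1452;   /* 1492 (PPPoE MTU) - 20 (IP) - 20 (TCP) */
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);

    if (sockfd >= 0)
        setsockopt(sockfd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
    return sockfd;
}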
The problem here arises when misconfigured firewalls drop all ICMP traffic. When you connect to (say) a web server, you build a TCP session and announce that you're willing to accept TCP packets based on your 1500 byte MTU (since you're connected over ethernet to your router.) If the foreign web server wanted to send you a lot of data, it would split this into chunks that (when combined with the TCP and IP headers) came out to 1500 byte payloads and send them to you. Your ISP would receive one of these and then try to wrap it into a PPPoE packet to send to your ADSL modem, but it would be too large to send. So it would reply with an ICMP unreachable, which would (in a perfect world) cause the remote computer to downsize its MSS for the connection and retransmit. If there was a broken firewall in the way, however, this ICMP message would never reach the foreign web server and this packet would never make it to you.
Ultimately setting your MTU on your ethernet device is desirable to send the right size frames to your ADSL modem (to avoid it asking you to retransmit with a smaller frame), but it's critical to influence the MSS size you send to remote hosts when building TCP connections.
ifconfig ... mtu <value> sets the MTU for layer2 payloads sent out the interface, and will reject larger layer2 payloads received on this interface. You must ensure your MTU matches on both sides of an ethernet link; you should not have mismatched mtu values anywhere in the same ethernet broadcast domain. Note that the ethernet headers are not included in the MTU you are setting.
Also, ifconfig has not been maintained in Linux for ages and is old and deprecated; sadly, Linux distributions still include it because they're afraid of breaking old scripts. This has the very negative effect of encouraging people to continue using it. You should be using the iproute2 family of commands:
[mpenning@hotcoffee ~]$ sudo ip link set mtu 1492 eth0
[mpenning@hotcoffee ~]$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1492 qdisc mq state UP qlen 1000
    link/ether 00:1e:c9:cd:46:c8 brd ff:ff:ff:ff:ff:ff
[mpenning@hotcoffee ~]$
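For completeness, a program can set the same value without shelling out to ip by using the SIOCSIFMTU ioctl. A minimal sketch (requires CAP_NET_ADMIN; error handling trimmed):

#include <string.h>
#include <net/if.h>        /* struct ifreq, IFNAMSIZ */
#include <sys/ioctl.h>     /* SIOCSIFMTU */
#include <sys/socket.h>
#include <unistd.h>

/* Set the MTU of a network interface, e.g. set_mtu("eth0", 1492). */
static int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works as a control handle */

    if (fd < 0)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;

    int rc = ioctl(fd, SIOCSIFMTU, &ifr);
    close(fd);
    return rc;   /* 0 on success, -1 on error */
}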
Large incoming packets may be dropped based on the interface MTU size. For example, the default MTU of 1500 on Linux 2.6 CentOS (tested with Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)) drops jumbo packets >1504. The errors appear in ifconfig, and there are rx_long_length_errors indications for this in the ethtool -S output. Increasing the MTU indicates that jumbo packets should be supported. The threshold for when packets are dropped for being too large appears to depend on the MTU (-4096, -8192, etc.)
Oren
It's the Maximum Transmission Unit, so it definitely sets the maximum outgoing packet size. I'm not sure if it will reject incoming packets larger than the MTU.
There is no doubt that the MTU configured by ifconfig affects Tx IP fragmentation; I have no further comments on that.
But for the Rx direction, I found that whether the parameter affects incoming IP packets depends on the device. Different manufacturers behave differently.
I tested all the devices I had on hand and found the 3 cases below.
Test case:
Device0 eth0 (192.168.225.1, mtu 2000) <--ETH cable--> Device1 eth0 (192.168.225.34, mtu MTU_SIZE)
On Device0, run ping 192.168.225.34 -s ICMP_SIZE and check how MTU_SIZE affects the Rx side of Device1.
case 1:
Device1 = Linux 4.4.0 with Intel I218-LM:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1476 and fails at ICMP_SIZE=1477 and above. It seems that there is a PRACTICAL MTU of 1504 (20B (IP header) + 8B (ICMP header) + 1476B (ICMP data)).
When MTU_SIZE=1490, ping succeeds at ICMP_SIZE=1476 and fails at ICMP_SIZE=1477 and above, behaving the same as MTU_SIZE=1500.
When MTU_SIZE=1501, ping succeeds at ICMP_SIZE=1476, 1478, 1600, and 1900. It seems that jumbo frames are switched on once MTU_SIZE is set >1500, and the 1504 restriction no longer applies.
case 2:
Device1 = Linux 3.18.31 with Qualcomm Atheros AR8151 v2.0 Gigabit Ethernet:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1476, fails at ICMP_SIZE=1477 and above.
When MTU_SIZE=1490, ping succeeds at ICMP_SIZE=1466, fails at ICMP_SIZE=1467 and above.
When MTU_SIZE=1501, ping succeeds at ICMP_SIZE=1477, fails at ICMP_SIZE=1478 and above.
When MTU_SIZE=500, ping succeeds at ICMP_SIZE=476, fails at ICMP_SIZE=477 and above.
When MTU_SIZE=1900, ping succeeds at ICMP_SIZE=1876, fails at ICMP_SIZE=1877 and above.
This case behaves exactly as Edward Thomson said, except that in my test the PRACTICAL MTU=MTU_SIZE+4.
case 3:
Device1 = Linux 4.4.50 with Raspberry Pi 2 Module B ETH:
When MTU_SIZE=1500, ping succeeds at ICMP_SIZE=1472, fails at ICMP_SIZE=1473 and above. So there is a PRACTICAL MTU=1500 (20B(IP header)+8B(ICMP header)+1472B(ICMP data)) working there.
When MTU_SIZE=1490, it behaves the same as MTU_SIZE=1500.
When MTU_SIZE=1501, it behaves the same as MTU_SIZE=1500.
When MTU_SIZE=2000, it behaves the same as MTU_SIZE=1500.
When MTU_SIZE=500, it behaves the same as MTU_SIZE=1500.
This case behaves exactly as Ron Maupin said in Why MTU configuration doesn't take effect on receiving direction?.
To sum it all up: in the real world, after you set ifconfig mtu,
sometimes the Rx IP packets get dropped when they exceed 1504, no matter what MTU value you set (unless jumbo frames are enabled);
sometimes the Rx IP packets get dropped when they exceed the MTU+4 you set on the receiving device;
sometimes the Rx IP packets get dropped when they exceed 1500, no matter what MTU value you set;
... ...