Ingress/egress confusion in tc - linux

Can someone please explain this to me? I don't understand the following concept.
In tc you can add a dummy qdisc which processes a fraction of the traffic according to some specific rules.
For example, here you create an explicit ingress qdisc for eth0.
(By the way, I have no idea what the point of this is, as if an ingress qdisc isn't included by default.)
$TC qdisc add dev eth0 ingress handle ffff:0
Then you apply a filter which calls an action to redirect incoming traffic
matching some rule (u32 0 0, which matches everything) to a dummy device (ifb0).
But the filtered traffic is marked as "egress"!
Why is that so? Shouldn't this traffic also appear as ingress on ifb0?
$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 \
action mirred egress redirect dev ifb0
Or does ingress mean any traffic queued inside a qdisc (both incoming and outgoing traffic)?
So let's say the network card received some data, and before starting to work with it, the kernel queued it in some qdisc. That data is ingress. The moment this data is dequeued for processing by the system, it becomes egress?
And vice versa: an application sends some data to some IP address, so before handing this data to the network card, the kernel puts it into the appropriate qdisc. At that moment the data becomes ingress. Then, after the data has been processed by the appropriate class and dequeued to be passed to the network card, it becomes egress?
Or maybe ingress is all traffic coming from the network card to the kernel?
In this case, why is there egress in
action mirred egress redirect dev ifb0
Is it because the traffic is taken from the "ingress" part of the root qdisc owned by the network card, so that the moment it is taken for redirection it becomes "egress"?
Why "egress"?
I don't understand.

Indeed, but consider this:
The TC qdisc direction pertains to the actual direction of traffic. Ingress means network port -> interface, according to this reference:
https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.adv-qdisc.ingress.html
The TC filter action mirror/redirect direction is relative to the interface. Ingress means mirror/redirect the packet as it comes into the filter; egress means mirror/redirect the packet as it goes out of the filter. The difference is that other actions can potentially transform the packet on a match, so what goes into the filter might be different from what goes out of it. The command basically lets the user decide whether the original packet or the modified packet is to be mirrored/redirected.
Check this out: https://man7.org/linux/man-pages/man8/tc-mirred.8.html
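As a concrete sketch of the two direction keywords (device names reused from the question; ifb0 must exist and be up before the redirect can work):
# make sure the redirect target exists and is up (modprobe ifb creates ifb0/ifb1)
modprobe ifb
ip link set ifb0 up
# redirect with the egress direction, as in the question above
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match u32 0 0 \
    action mirred egress redirect dev ifb0
# more recent kernels also accept the ingress direction keyword:
# tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match u32 0 0 \
#     action mirred ingress redirect dev ifb0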

Related

Receive and TC-redirect traffic with any VLAN tag

I have an interface eth0, from which I want to mirror all incoming traffic to, say, eth1, so I use the following commands:
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: prio 1 u32 match u32 0 0 action mirred ingress mirror dev eth1
It works perfectly with everything except VLAN-tagged packets - they can be seen in Wireshark on eth0, but they do not appear on eth1. If I do:
vconfig add eth0 $TAG
Where $TAG is some tag from my input traffic, the corresponding packets start to appear on eth1.
But, as I have written, I want to capture all incoming traffic, and that means I want to capture all VLAN tags as well. I'm pretty sure it would work if I added all tags from 2 up to 4094, creating all those sub-interfaces, but I wonder if there is a smarter way to do this? Also, I'm concerned about performance issues that may appear when having so many interfaces. Thanks!
After researching more, I discovered that the bridge utility, which is part of the iproute2 collection, can be used to solve this issue. I have no idea why the following approach works, but you can try the same if you have a similar problem.
So, suppose we already have the setup I described earlier (without vconfig-adding the VLAN tags), then do:
ip link add bridge_dev type bridge
ip link set bridge_dev up
ip link set eth0 master bridge_dev
These commands create a bridge device, bring it up, and enslave eth0 to bridge_dev. Right after the last command I start seeing the VLAN-tagged packets on eth1, as expected.
Note that you can also add VLAN tags to bridge devices, like:
bridge vlan add vid 2-4094 dev bridge_dev self
I expected to only see VLAN packets after that command, but apparently it's not necessary. I would welcome anyone who could explain why this works.
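To verify that tagged frames actually reach the mirror target, a quick tcpdump check helps (a sketch; the interface name is from the question):
# -e prints link-level headers so the 802.1Q tag is visible; 'vlan' matches only tagged frames
tcpdump -e -n -i eth1 vlan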

Netem and virtual interfaces

I need to emulate a network, introducing for instance random delay, and I need help using NetEm.
The scenario consists in two Ubuntu 14.04 machines: A and B.
A and B have ip addresses 192.168.0.1 and 192.168.0.2 on eth1.
To avoid messing with the physical NIC eth1, I have set up a virtual interface eth1:1:
sudo ifconfig eth1:1 192.168.1.x/24 up
At this point, only on B, I add the delay as follows:
sudo tc qdisc add dev eth1:1 root netem delay 50ms 10ms 25%
The problem is that this delay is also experienced on the physical NIC eth1. I mean, if I ping the addresses on eth1 (192.168.0.1 pings 192.168.0.2), the packets are delayed as if they were heading to eth1:1. Instead, I would expect the delay only on eth1:1.
What happened? How can I solve this problem?
Moreover, I read that in this way the network impairments affect only the egress traffic. How can I introduce delays for both the egress and the ingress traffic?
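For the ingress direction, a commonly used workaround (a sketch only, reusing the ifb redirection shown in the first question; device names are assumptions) is to redirect incoming traffic to an ifb device and attach netem there:
# egress delay goes directly on the real device
tc qdisc add dev eth1 root netem delay 50ms 10ms 25%
# ingress delay: steal incoming packets into ifb0 and delay them there
modprobe ifb
ip link set ifb0 up
tc qdisc add dev eth1 ingress
tc filter add dev eth1 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root netem delay 50ms 10ms 25%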

How can I drop a packet in the kernel after catching it with SOCK_RAW in user mode?

I use a SOCK_RAW socket to get all IP packets from the kernel:
socket(PF_PACKET, SOCK_RAW, htons(protocol)); /* e.g. protocol = ETH_P_ALL */
But the packet is still alive in the kernel; how can I drop it?
You cannot. When you receive the packet on a raw socket, the kernel has created a copy and delivered it to your receiving process. The packet will continue being processed in the meantime according to the usual semantics. It's likely this will have completed (i.e. whatever the stack would normally do with it will already be done) by the time your process receives it.
However, if the packet is not actually destined to your box (e.g. you're receiving it only because you have the network interface in promiscuous mode), or if there is no local process [or in-kernel component] interested in receiving it, the packet will just be discarded anyway.
If you simply wish to receive all packets that arrive on an interface without processing them, you can simply bring the interface up in promiscuous mode without giving it an IP address. Then packets will be delivered to your raw socket but will then be discarded by the stack.
Old question, but others might find this answer useful.
It depends on the use case, but you can actually drop ingress packets after you get them from AF_PACKET SOCK_RAW. To do that, attach an ingress qdisc with a drop action. Example:
sudo tc qdisc add dev eth0 ingress
sudo tc filter add dev eth0 parent ffff: matchall action drop
Explanation: this works because AF_PACKET sniffs a copy of the packet from the per-device tap, which sits a little earlier than the ingress qdisc in the kernel network stack's packet-processing pipeline. That way you can implement a simple userspace switch.
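On older kernels without the matchall classifier, a catch-all u32 match should do the same job (a sketch, mirroring the u32 usage from the first question):
sudo tc filter add dev eth0 parent ffff: protocol all prio 1 u32 match u32 0 0 action drop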

Is my current approach to force outgoing packets through a bonded interface as the gw device appropriate?

I have two NIC cards with 4 ports each on Redhat 6.1.
When the application comes up, it creates a bonded interface with one port from each NIC (example: eth1 and eth4), and assigns a virtual IP to that interface. Once this interface is up, all the packets from this machine should go through the bonded interface.
To achieve this currently, I'm changing the default gw device name to the bonded interface using the ip route command: ip route replace default via 10.3.2.1 dev INT-BOND.
When stopping the application, we bring down the bonded interface and change the default gw device name back to eth0.
The problem with my approach is that if someone brings down the bonded interface (ifdown), it removes the default gw.
I need confirmation that my currently working approach is fine to proceed with going forward; otherwise, should I go with modifications to the iptables/ip rules, or are there better suggestions?

How to set the maximum TCP Maximum Segment Size on Linux?

In Linux, how do you set the maximum segment size that is allowed on a TCP connection? I need to set this for an application I did not write (so I cannot use setsockopt to do it). I need to set this ABOVE the mtu in the network stack.
I have two streams sharing the same network connection. One sends small packets periodically, which need absolute minimum latency. The other sends tons of data--I am using SCP to simulate that link.
I have set up traffic control (tc) to give the minimum-latency traffic high priority. The problem I am running into, though, is that the TCP packets coming down from SCP end up with sizes up to 64K bytes. Yes, these are broken into smaller packets based on the MTU, but this unfortunately occurs AFTER tc prioritizes the packets. Thus, my low-latency packet gets stuck behind up to 64K bytes of SCP traffic.
This article indicates that on Windows you can set this value.
Is there something on Linux I can set? I've tried ip route and iptables, but these are applied too low in the network stack. I need to limit the TCP packet size before tc, so it can prioritize the high priority packets appropriately.
Are you using TCP segmentation offload to the NIC? (You can use "ethtool -k $your_network_device" to see the offload settings.) This is the only way, as far as I know, that you would see 64K TCP packets with a device MTU of 1500. Not that this answers the question, but it might help avoid misdiagnosis.
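For example (a sketch; the interface name is an assumption):
# list the offload settings; look for tcp-segmentation-offload / generic-segmentation-offload
ethtool -k eth0
# turn segmentation offload off so tc sees wire-sized packets (note the capital -K)
ethtool -K eth0 tso off gso off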
The ip route command with the advmss option helps to set the MSS value (1460 here, matching a 1500-byte MTU per the arithmetic below):
ip route add 192.168.1.0/24 dev eth0 advmss 1460
The upper bound of the advertised TCP MSS is the MTU of the first hop route. If you're seeing 64k segments, that tends to indicate that the first hop route MTU is excessively large - are you using loopback or something for testing?
MSS = MTU - 40 bytes (standard TCP/IP overhead of 40 bytes [20 + 20])
If the MTU is 1500 bytes then the MSS will be 1460 bytes.
You are definitely misdiagnosing the problem; as someone else pointed out, tc doesn't see TCP packets, it sees IP packets, and they'd already be in chunks at that point.
You are probably just experiencing bufferbloat: you're overloading your outbound queue in a totally separate device (probably a DSL modem or cable modem). The only fix is to tell tc to limit your outbound bandwidth to less than the modem's bandwidth, e.g. using TBF.
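A minimal TBF sketch (the device name and rate are assumptions; pick a rate just under the modem's uplink so the queue builds where tc can manage it):
tc qdisc add dev eth0 root tbf rate 900kbit burst 5kb latency 70ms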
