Tune MTU on a per-socket basis? - linux

I was wondering whether there is any way, on a Linux system, to tune the MTU for a given socket (to make the IP layer fragment into chunks smaller than the actual device MTU).
When I say for a given socket, I don't mean programmatically in the code of the application owning the socket, but rather externally, for example via a sysfs entry.
If there is currently no way to do that, do you have any ideas about where to hook or patch the Linux kernel to implement such a possibility?
Thanks.
EDIT: why the hell do I want to do that?
I'm doing some Layer3-in-Layer4 tunneling (e.g. tunneling IP and above through a TCP tunnel). Unlike VPN-like solutions, I'm not using a virtual interface to achieve that. I'm capturing packets using iptables, dropping them from their normal path and writing them to the tunnel socket.
Think about the case of a big file transfer, where all packets are filled up to MTU size. When I tunnel them, I add some overhead, so every original packet produces two tunneled packets, which is suboptimal.

If the socket is created such that DF is set on outgoing packets, you might have some luck spoofing (injecting) an ICMP "fragmentation needed" message back at yourself until you end up with the desired MTU. Rather ugly, but depending on how desperate you are it might be appropriate.
You could for example generate these packets with iptables rules, so the matching and sending is simple and external to your application. It looks like the REJECT target for iptables doesn't have a reject-with of "fragmentation needed", though it probably wouldn't be too tricky to add one.
The other approach, if it's only TCP packets you care about, is the socket option TCP_MAXSEG or the TCPMSS target, if that's appropriate to your problem.
For UDP or raw sockets you're free to send() packets as small as you fancy!
Update:
Based on the "why would I want to do that?" answer, it seems like fragmenting packets if DF isn't set or raising ICMP "fragmentation needed" and dropping would actually be the correct solution.
It's what a more "normal" router would do and provided firewalls don't eat the ICMP packet then it will behave sanely in all scenarios, whereas retrospectively changing things is a recipe for odd behaviour.
The iptables clamp mss is quite a good fix for TCP over this "VPN" though, especially as you're already making extensive use of iptables it seems.
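To put rough numbers on the overhead problem from the question and on the MSS clamp suggested here, a small sketch follows. It assumes IPv4 and TCP headers without options (20 bytes each) and a purely hypothetical 8-byte framing header per tunneled packet; TUNNEL_OVERHEAD, segments_needed and clamped_inner_mss are illustrative names, not something taken from the question. It also assumes the tunnel writes one captured packet at a time and the stack does not coalesce writes.

```python
import math

LINK_MTU = 1500          # device MTU in the scenario from the question
IP_HEADER = 20           # IPv4 header without options
TCP_HEADER = 20          # TCP header without options
TUNNEL_OVERHEAD = 8      # hypothetical framing added per tunneled packet


def outer_mss(link_mtu=LINK_MTU):
    """Payload bytes one segment of the tunnel's TCP connection can carry."""
    return link_mtu - IP_HEADER - TCP_HEADER


def segments_needed(inner_packet_len, overhead=TUNNEL_OVERHEAD, link_mtu=LINK_MTU):
    """Number of tunnel segments needed to carry one captured packet."""
    return math.ceil((inner_packet_len + overhead) / outer_mss(link_mtu))


def clamped_inner_mss(overhead=TUNNEL_OVERHEAD, link_mtu=LINK_MTU):
    """Largest inner TCP MSS whose packet still fits in one tunnel segment."""
    largest_inner_packet = outer_mss(link_mtu) - overhead
    return largest_inner_packet - IP_HEADER - TCP_HEADER


if __name__ == "__main__":
    print(segments_needed(1500))                    # full-size inner packet
    print(clamped_inner_mss())                      # candidate --set-mss value
    print(segments_needed(clamped_inner_mss() + 40))
```

Under these assumptions a full 1500-byte packet needs two tunnel segments, while clamping the inner MSS keeps every tunneled packet within one. The real value to clamp to depends on the actual framing overhead of the tunnel.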

MTU is a property of a link, not of a socket. They belong to different layers of the stack. That said, TCP negotiates its MSS during the three-way handshake, performs Path MTU discovery, and tries very hard to avoid fragmentation. You'll have a hard time making TCP send fragments. With UDP, the easiest approach is to force a smallish MTU on an interface with ifconfig(8) and then send packets larger than that value.

Related

Linux Raw Sockets: Block Packets?

I've written my own packet sniffer in Linux.
I open a socket with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) and then process the Ethernet packets - unpacking ARP packets, IP packets (and ICMP / TCP / UDP packets inside those).
This is all working fine so far.
Now I can read packets like this - and I can also inject packets by wrapping up a suitable Ethernet packet and sending it.
But what I'd like is a means to block packets - to consume them, as it were, so that they don't get further delivered into the system.
That is, if a TCP packet is being sent to port 80, then I can see the packet with my packet sniffer and it'll get delivered to the web server in the usual fashion.
But, basically, I'd like it that if I spot something wrong with the packet - not coming from the right MAC address, malformed in some way, or just breaking security policy - that I can just "consume" the packet, and it won't get further delivered onto the web server.
Because I can read packets and write packets - if I can also just block packets as well, then I'll have all I need.
Basically, I don't just want to monitor network traffic, but sometimes have control over it. E.g. "re-route" a packet by consuming the original incoming packet and then writing out a new slightly-altered packet to a different address. Or just plain block packets that shouldn't be being delivered at all.
My application is to be a general "network traffic management" program. Monitors and logs traffic. But also controls it too - blocking packets as a firewall, re-routing packets as a load balancer.
In other words, I've got a packet sniffer - but if it sniffs something that smells bad, then I'd like it to be able to stop that packet. Discard it early, so it's not further delivered anywhere.
(Being able to alter packets on the way through might be handy too - but if I can block, then there's always the possibility to just block the original packet completely, but then write out a new altered packet in its place.)
What you are looking for is libnetfilter_queue. The documentation is still incredibly bad, but the code in this example should get you started.
I used this library to develop a project that queued network packets and replayed them at a later time.
A bit of a tangent, but it was relevant when I was resolving my problem. Blocking raw packets is relatively complicated, so it might make sense to consider doing that at a different layer. In other words, does your cloud provider let you set up firewall rules to drop specific kind of traffic?
In my case it was easier to do, which is why I'm suggesting such a lateral solution.

TCP_MAXSEG inaccurate? (Was: Linux path MTU probing not working on accept():ed socket if requested using setsockopt())

Question update: In addition to the problem below, it seems our client/server application using the Linux PLPMTUD mechanism gets a path MTU that is too large. Has anyone seen this, i.e. the actual path MTU being 1500, but getsockopt() with TCP_MAXSEG returning the MTUs of the endpoints, in our case 3000? I have tried turning off GRO, GSO and TSO with ethtool but the error persists. Normal ping only manages to push through packets of 1472 bytes or smaller. Also worth mentioning is that PLPMTUD works perfectly for smaller MTUs. For example, with endpoints at 1500 MTU and one interface of the intermediate router set to e.g. 1200 MTU, the kernel TCP probes and reports the correct TCP_MAXSEG (1200 - headers).
I am using the Linux RFC4821-compliant packetization layer path MTU discovery in an application. Basically, the client does a setsockopt on a TCP socket:
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &sopt, sizeof(sopt));
with option value set to IP_PMTUDISC_PROBE. The setsockopt() does not return an error.
The client sends large tcp packets to a discard server, and the path MTU is calibrated by Linux kernel - tcpdump shows tcp packets with DF bit set being sent, packet size varies until the kernel knows the path MTU. However, to get this to work in the other direction (listening server accept:ing connections from clients, sending data and calibrating PMTU in direction from server to client) I have to set the global option for tcp path mtu discovery, /proc/sys/net/ipv4/tcp_mtu_probing. If I do not, server will stupidly continue to send too large packets, which get discarded by an intermediate router without ICMP sent back. Both endpoints have an MTU set to 3000, while the intermediate hops have MTU 1500.
I hope someone has an idea of what goes wrong. If more info is needed, let me know and I will edit the question. The problems exist on both Linux kernel 4.2.0 and 3.19.0; both are stock Kubuntu LTS kernels (x86/x86-64).
I do set the same socket option server-side as well, on all accept:ed sockets, before sending data in reverse direction.
FWIW, I have found workarounds/solutions for the problems, will do more testing but shortly describe my findings here, in case it helps someone else.
The problem with not being able to set path mtu discovery per socket was fixed, brute force, by enabling it system-wide during execution of my program, then disabling again.
The second problem, of incorrect path mtu returned by getsockopt TCP_MAXSEG, was fixed by waiting for TCP ACK of sent TCP data, also using getsockopt (tcp_info.tcpi_unacked). That way, I can be sure that probing has finished before I get TCP_MAXSEG.
Finally, there was a patchset for improving path mtu probing accuracy merged to mainline Linux kernel in March 2015. Without those patches, the probing is very imprecise. Patchset is part of 4.1.y-series kernels and onward.
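The second workaround above (wait until sent data has been acknowledged, then read TCP_MAXSEG) could look roughly like the sketch below. The offset of tcpi_unacked is taken from the layout of struct tcp_info in the Linux headers (eight one-byte fields followed by 32-bit counters); the helper names, the polling interval and the timeout are assumptions made for illustration, not the poster's actual code.

```python
import socket
import struct
import time

# Start of Linux's struct tcp_info: eight one-byte fields, then 32-bit
# counters in the order rto, ato, snd_mss, rcv_mss, unacked, ...
_TCP_INFO_PREFIX = struct.Struct("=8B5I")


def parse_unacked(tcp_info_bytes):
    """Return tcpi_unacked from a raw struct tcp_info buffer."""
    fields = _TCP_INFO_PREFIX.unpack_from(tcp_info_bytes)
    return fields[12]   # 8 byte-sized fields, then the fifth 32-bit field


def settled_mss(sock, timeout=5.0, poll_interval=0.05):
    """Wait until no sent segments are unacknowledged, then read TCP_MAXSEG.

    Returns None if data is still outstanding when the timeout expires.
    Linux-only: relies on the TCP_INFO socket option.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
        if parse_unacked(info) == 0:
            return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
        time.sleep(poll_interval)
    return None
```

A caller would send its probe data on a connected socket and then call settled_mss(sock). Whether the value is accurate still depends on the kernel having finished probing, as described above.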

Packet crafting and iptables

I want to test how the netfilter/ip6tables firewall handles some IPv6-related stuff like tiny/overlapped fragments, type 0 routing headers, excessive HPH options etc. For this I wanted to use Scapy to craft my own packets, but apparently Scapy using raw sockets means bypassing iptables. Is there another way of achieving my goal and how would I go about it? Some library I could use to make my own packets, which iptables can act on?
Run your packet injection program from a VM, and inspect the network connected to that VM.
Scapy is useful for such odd tasks. Sometimes what you want to do is just as easily done by writing small programs using the normal C APIs (including raw sockets in some cases, or TCP connections with odd options set). In many cases, a trivial TCP or UDP client in any high level language such as Python will do.

Why doesn't Linux IPSec implementation support fragmentation before encryption?

I am trying to address an issue where ESP packets are getting fragmented as after adding the ESP header the MTU size is exceeded. The solution (which is what everybody does) is to do the fragmentation before doing the ESP encryption.
My question is: if this is so useful, why doesn't the Linux IPSec implementation support it natively? I understand there would be certain limitations, such as L4 traffic selectors not working. But not everybody makes use of those.
In addition, if you can share any pointers as to what's the best way to add this support, it would really help.
Thanks.
To complete the circle (and hopefully help someone who may be looking for a similar solution), we solved our problem by making use of libnetfilter_queue. The challenge we had was, we did not have access to source code of the application, else we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but don't think you should have any issues in understanding.
Final Solution
NetFilter Queues is the userspace library that provides APIs for processing packets queued by the kernel packet filter.
An application that wants to use this functionality should link dynamically against netfilter_queue and nfnetlink and include the necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and sysroot-target/usr/include/libnfnetlink/.
An iptables rule with NFQUEUE as the target must be added.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to userspace software. For example, the following rule asks a listening userspace program for a decision on all packets queued up.
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a program must use the libnetfilter_queue APIs to connect to queue 0 (the default one) and receive the messages from the kernel. It must then issue a verdict on each packet.
When a packet reaches an NFQUEUE target, it is enqueued to the queue corresponding to the number given by the --queue-num option. The packet queue is implemented as a linked list whose elements are the packet and its metadata (a Linux kernel skb):
It is a fixed-length queue implemented as a linked list of packets
Packets are stored and indexed by an integer
A packet is released when userspace issues a verdict for the corresponding index
When the queue is full, no packet can be enqueued to it
Userspace can read multiple packets and delay giving verdicts. As long as the queue is not full, this has no impact
Verdicts can be issued out of order. Userspace can read packets 1, 2, 3, 4 and issue verdicts in the order 4, 2, 3, 1
Verdicts that come too slowly result in a full queue. The kernel then drops incoming packets instead of enqueuing them.
The protocol used between kernel and userspace is nfnetlink. This is a message-based protocol which does not involve any shared memory. When a packet is enqueued, the kernel sends an nfnetlink-formatted message containing the packet data and related information to a socket. Userspace reads this message and issues a verdict.
Prefragmentation logic is implemented in AvPreFragApp (new application) as well as Security Broker (existing controller application).
In Security Broker, as soon as the tunnel is established, the following two rules are added to the raw table.
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
The above rule negotiates a proper MSS during the three-way handshake.
It is safe to assume that 1360+TCPH+IPH+ESP+IPH <= 1500, so that fragmentation won't happen after encryption.
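The inequality above can be checked with a few lines of arithmetic. The sketch assumes 20-byte IPv4 and TCP headers without options and, purely as an example, an ESP configuration with a 16-byte IV, a 16-byte cipher block and a 12-byte ICV; the document does not say which cipher suite is used, so those three values and all function names are hypothetical.

```python
PATH_MTU = 1500
IP_HEADER = 20    # IPv4 header without options
TCP_HEADER = 20   # TCP header without options


def esp_budget(mss, path_mtu=PATH_MTU):
    """Bytes left for ESP after the inner packet and the outer IP header."""
    return path_mtu - (mss + TCP_HEADER + IP_HEADER) - IP_HEADER


def esp_overhead(inner_len, iv_len=16, block=16, icv_len=12):
    """ESP bytes added around inner_len bytes (example parameters only).

    SPI and sequence number (8 bytes), IV, padding so that payload plus the
    2-byte trailer fills whole cipher blocks, the trailer itself, and the ICV.
    """
    padding = -(inner_len + 2) % block
    return 8 + iv_len + padding + 2 + icv_len


def outer_packet_len(mss):
    """Size of the encrypted packet on the wire in tunnel mode."""
    inner_len = mss + TCP_HEADER + IP_HEADER
    return inner_len + esp_overhead(inner_len) + IP_HEADER


if __name__ == "__main__":
    print(esp_budget(1360))
    print(outer_packet_len(1360), outer_packet_len(1360) <= PATH_MTU)
```

With an MSS of 1360 there are 80 bytes left for ESP, which the example parameters fit into comfortably. A different cipher suite or NAT traversal encapsulation changes the overhead, so the budget should be rechecked for the real configuration.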
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
Above rule queues all the udp packets with src ip as TIA (tunnel address) and mark not equal to 0xbeef0000 to the netfilter queue to be processed by application. 0xbeef0000 will be marked by AvPreFragApp on all the udp packets that are queued. This is done to avoid repeated queuing of packets.
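The mark test in that rule compares only the bits selected by the mask. A minimal sketch of the same check, with hypothetical function names, shows which packets would be queued:

```python
MARK_VALUE = 0xBEEF0000   # mark applied to packets that were already handled
MARK_MASK = 0xFFFF0000    # only the upper 16 bits take part in the comparison


def already_processed(packet_mark):
    """Same test as iptables '-m mark --mark value/mask'."""
    return (packet_mark & MARK_MASK) == MARK_VALUE


def should_queue(packet_mark):
    """The rule uses '! --mark', so packets without the mark are queued."""
    return not already_processed(packet_mark)


if __name__ == "__main__":
    for mark in (0x00000000, 0xBEEF0000, 0xBEEF1234, 0xBEEE0000):
        print(hex(mark), should_queue(mark))
```

Because of the mask, the lower 16 bits of the mark remain free for other uses without affecting this rule.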
AvPreFragApp
AvPreFragApp uses netfilter queues to process the packets queued by the NFQUEUE target.
As mentioned above, the iptables rule that queues UDP packets with the TIA as source IP is added in Security Broker. This rule is added when the tunnel is established and updated with the new TIA when the tunnel bounces. So all packets with the TIA as source IP are queued for processing by AvPreFragApp.
AvPreFragApp calls a set of APIs from libnetfilter_queue to set up the queue and copy packets from the kernel to the application:
When creating the queue, it passes the address of a callback function, which is called once a packet is queued for processing
NFQNL_COPY_PACKET mode must be set; it copies the whole packet from the kernel to the application
A file descriptor can be obtained from the netfilter queue handle, and the packet buffer can be read from it with recv
While processing a packet, AvPreFragApp checks its size. If the packet size is <= 1376, an ACCEPT verdict is given. An ACCEPT verdict is also given if the DF bit is set
If the packet size is > 1376 and the DF bit is not set, a DROP verdict is given, which means the original packet is dropped. Before that, however, the packet has already been copied to an application buffer. AvPreFragApp then does the fragmentation in the application. All fragments are written to a raw socket with the mark 0xbeef0000, using sendmsg
These prefragmented packets are encrypted and ESP encapsulated in kernel.
Note: TIA: Tunnel Internal Address, the logical IPSec interface.
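The accept-or-fragment decision and the fragmentation step described above can be sketched as follows. This is not the AvPreFragApp code (which is not shown in the excerpt); the function names are made up, the sketch handles plain IPv4 only, and it copies the whole header into every fragment, which is only correct when the packet carries no IP options. Receiving packets from the queue and sending the fragments through a marked raw socket are left out.

```python
import struct

MAX_PACKET = 1376   # size threshold from the description above
DF_FLAG = 0x4000    # "don't fragment" bit in the flags/offset field
MF_FLAG = 0x2000    # "more fragments" bit


def ipv4_checksum(header):
    """Internet checksum over an IPv4 header whose checksum field is zero."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verdict(packet, max_packet=MAX_PACKET):
    """'ACCEPT' for small or DF packets, otherwise 'FRAGMENT' (drop, resend)."""
    flags_frag = struct.unpack_from("!H", packet, 6)[0]
    if len(packet) <= max_packet or flags_frag & DF_FLAG:
        return "ACCEPT"
    return "FRAGMENT"


def fragment(packet, max_packet=MAX_PACKET):
    """Split one IPv4 packet into fragments of at most max_packet bytes."""
    ihl = (packet[0] & 0x0F) * 4
    header, payload = packet[:ihl], packet[ihl:]
    flags_frag = struct.unpack_from("!H", header, 6)[0]
    base_offset = flags_frag & 0x1FFF        # existing offset, 8-byte units
    more_after = flags_frag & MF_FLAG        # input was itself a fragment
    chunk = (max_packet - ihl) // 8 * 8      # offsets count 8-byte units
    fragments = []
    for start in range(0, len(payload), chunk):
        data = payload[start:start + chunk]
        is_last = start + chunk >= len(payload)
        field = base_offset + start // 8
        if not is_last or more_after:
            field |= MF_FLAG
        hdr = bytearray(header)
        struct.pack_into("!H", hdr, 2, ihl + len(data))   # total length
        struct.pack_into("!H", hdr, 6, field)             # flags and offset
        struct.pack_into("!H", hdr, 10, 0)                # clear old checksum
        struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
        fragments.append(bytes(hdr) + data)
    return fragments
```

In an NFQUEUE callback, verdict() would decide between accepting the packet as is and dropping it; in the second case the output of fragment() would be written to the raw socket with the mark set, so that the queueing rule skips the fragments.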

How to set the maximum TCP Maximum Segment Size on Linux?

In Linux, how do you set the maximum segment size that is allowed on a TCP connection? I need to set this for an application I did not write (so I cannot use setsockopt to do it). I need to set this ABOVE the mtu in the network stack.
I have two streams sharing the same network connection. One sends small packets periodically, which need absolute minimum latency. The other sends tons of data--I am using SCP to simulate that link.
I have setup traffic control (tc) to give the minimum latency traffic high priority. The problem I am running into, though, is that the TCP packets that are coming down from SCP end up with sizes up to 64K bytes. Yes, these are broken into smaller packets based on mtu, but this unfortunately occurs AFTER tc prioritizes the packets. Thus, my low latency packet gets stuck behind up to 64K bytes of SCP traffic.
This article indicates that on Windows you can set this value.
Is there something on Linux I can set? I've tried ip route and iptables, but these are applied too low in the network stack. I need to limit the TCP packet size before tc, so it can prioritize the high priority packets appropriately.
Are you using tcp segmentation offload to the nic? (You can use "ethtool -k $your_network_device" to see the offload settings.) This is the only way as far as I know that you would see 64k tcp packets with a device MTU of 1500. Not that this answers the question, but it might help avoid misdiagnosis.
The ip route command's advmss option sets the advertised MSS value.
ip route add 192.168.1.0/24 dev eth0 advmss 1500
The upper bound of the advertised TCP MSS is the MTU of the first hop route. If you're seeing 64k segments, that tends to indicate that the first hop route MTU is excessively large - are you using loopback or something for testing?
MSS = MTU – 40bytes (standard TCP/IP overhead of 40 bytes [20+20])
If the MTU is 1500 bytes then the MSS will be 1460 bytes.
You are definitely misdiagnosing the problem; as someone else pointed out, tc doesn't see TCP packets, it sees IP packets, and they'd already be in chunks at that point.
You are probably just experiencing bufferbloat: you're overloading your outbound queue in a totally separate device (probably a DSL modem or cable modem). The only fix is to tell tc to limit your outbound bandwidth to less than the modem's bandwidth, eg. using TBF.
