Why doesn't the Linux IPSec implementation support fragmentation before encryption? - linux

I am trying to address an issue where ESP packets get fragmented because adding the ESP header pushes them past the MTU. The usual solution is to do the fragmentation before the ESP encryption.
My question is: if this is so useful, why doesn't the Linux IPSec implementation support it natively? I understand there would be certain limitations, e.g. L4 traffic selectors would no longer work, but not everybody makes use of those.
In addition, if you can share any pointers as to what's the best way to add this support, it would really help.
Thanks.

To complete the circle (and hopefully help someone who may be looking for a similar solution), we solved our problem by making use of libnetfilter_queue. The challenge was that we did not have access to the application's source code; otherwise we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but I don't think you should have any issues understanding it.
Final Solution
libnetfilter_queue (netfilter queues) is the userspace library providing APIs to process packets that have been queued by the kernel packet filter.
An application that wants to use this functionality should link dynamically against libnetfilter_queue and libnfnetlink and include the necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and
sysroot-target/usr/include/libnfnetlink/.
An iptables rule with NFQUEUE as the target is required.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to userspace software. For example, the following rule asks a listening userspace program to decide the fate of all packets it queues:
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a program must use the libnetfilter_queue APIs to connect to queue 0 (the default one) and read the messages from the kernel. It must then issue a verdict on each packet.
When a packet reaches an NFQUEUE target it is enqueued to the queue corresponding to the number given by the --queue-num option. The packet queue is implemented as a linked list whose elements are the packet and its metadata (a Linux kernel skb):
It is a fixed-length queue implemented as a linked list of packets
Packets in the queue are indexed by an integer
A packet is released when userspace issues a verdict for the corresponding index
When the queue is full, no new packets can be enqueued
Userspace can read multiple packets before issuing verdicts; if the queue is not full this has no impact
Verdicts do not have to be issued in order: userspace can read packets 1, 2, 3, 4 and issue verdicts in the order 4, 2, 3, 1
If verdicts are issued too slowly the queue fills up, and the kernel then drops incoming packets instead of enqueuing them
The protocol used between the kernel and userspace is nfnetlink, a message-based protocol that does not involve any shared memory. When a packet is enqueued, the kernel sends an nfnetlink-formatted message containing the packet data and related information to a socket. Userspace reads this message and issues a verdict. A minimal sketch of such a userspace program follows.
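What follows is only a minimal sketch, not the actual application: it shows how a userspace program typically binds to queue 0 with libnetfilter_queue and accepts every packet. Error handling and the nfq_unbind_pf()/nfq_bind_pf() calls that some older kernels require are omitted; compile with -lnetfilter_queue.

#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>                              /* ntohl() */
#include <linux/netfilter.h>                        /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called once per queued packet; must return a verdict. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload);       /* payload starts at the IP header */
    printf("packet id=%u len=%d\n", id, len);

    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);

    /* Copy the whole packet from kernel to userspace. */
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);

    char buf[4096];
    int fd = nfq_fd(h);
    for (;;) {
        int n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0)
            break;
        nfq_handle_packet(h, buf, n);               /* dispatches to cb() */
    }

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}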
Prefragmentation logic is implemented in AvPreFragApp (new application) as well as Security Broker (existing controller application).
In Security Broker, as soon as the tunnel is established, the following two rules are added to the raw table:
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
The above rule clamps the MSS to a proper value during the three-way handshake.
It is safe to assume that 1360 + TCP header + IP header + ESP overhead + outer IP header <= 1500, so that fragmentation won't happen after encryption.
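As a rough, illustrative budget (the exact ESP overhead depends on the cipher, IV size, padding, and whether NAT-T UDP encapsulation is in use): 1360 + 20 (TCP) + 20 (inner IP) + roughly 40-60 bytes of ESP header, IV, padding and ICV + 20 (outer IP) comes to about 1460-1480 bytes, which stays within a 1500-byte MTU.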
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
The above rule queues all UDP packets whose source IP is the TIA (tunnel address) and whose mark is not 0xbeef0000 to the netfilter queue, to be processed by the application. AvPreFragApp sets the mark 0xbeef0000 on all the UDP packets it sends out after processing; this is done to avoid queuing the same packets repeatedly.
AvPreFragApp
The AvPreFragApp application uses netfilter queues to process the packets queued by the NFQUEUE target.
As mentioned above, the iptables rule that queues UDP packets whose source IP is the TIA is added by Security Broker. The rule is added upon tunnel establishment and updated with the new TIA when the tunnel bounces, so all packets with the TIA as source IP are queued for processing by AvPreFragApp.
AvPreFragApp calls a set of libnetfilter_queue APIs to set up the queue and copy each packet from the kernel to the application
While creating the queue, it passes the address of a callback function, which is called once a packet is queued for processing
The NFQNL_COPY_PACKET mode needs to be set; it copies the whole packet from the kernel to the application
A file descriptor can be obtained from the netfilter queue handle; the packet buffer is then read with recv()
While processing a packet, AvPreFragApp checks its size. If the packet size is <= 1376, an ACCEPT verdict is given. An ACCEPT verdict is also given if the DF bit is set
If the packet size is > 1376 and the DF bit is not set, a DROP verdict is given, i.e. the original packet is dropped; by then, however, the packet has already been copied into the application's buffer. AvPreFragApp then does the fragmentation in the application and writes all the fragments to a raw socket with the mark 0xbeef0000, using sendmsg() (a sketch of this logic follows the note below)
These pre-fragmented packets are then ESP-encapsulated and encrypted in the kernel.
Note: TIA = Tunnel Internal Address, the address of the logical IPSec interface.
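The sketch below is not the original implementation, only an illustration, under assumptions, of what the verdict logic inside the callback could look like. The threshold 1376 and the mark 0xbeef0000 come from the description above; prefrag_and_send() is a hypothetical helper standing in for the in-application fragmentation and the sendmsg() calls on a raw socket whose mark is set with setsockopt(SO_MARK).

#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/ip.h>                             /* struct iphdr, IP_DF */
#include <linux/netfilter.h>                        /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

#define PREFRAG_THRESHOLD 1376
#define PREFRAG_MARK      0xbeef0000                /* matches the iptables rule */

/* Hypothetical helper (not shown): splits the packet into fragments no larger
 * than the threshold and re-injects them with sendmsg() on a raw socket that
 * carries PREFRAG_MARK (set via setsockopt(SO_MARK)). */
static void prefrag_and_send(const unsigned char *pkt, int len);

static int prefrag_cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                      struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    unsigned char *pkt;
    int len = nfq_get_payload(nfa, &pkt);
    struct iphdr *iph = (struct iphdr *)pkt;

    /* Small packets, or packets with DF set, pass through untouched. */
    if (len <= PREFRAG_THRESHOLD || (ntohs(iph->frag_off) & IP_DF))
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);

    /* Fragment in userspace, re-inject the marked fragments... */
    prefrag_and_send(pkt, len);

    /* ...and drop the original oversized packet. */
    return nfq_set_verdict(qh, id, NF_DROP, 0, NULL);
}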

Related

Linux Raw Sockets: Block Packets?

I've written my own packet sniffer in Linux.
I open a socket with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) and then process the Ethernet packets - unpacking ARP packets, IP packets (and ICMP / TCP / UDP packets inside those).
This is all working fine so far.
Now I can read packets like this - and I can also inject packets by wrapping up a suitable Ethernet packet and sending it.
But what I'd like is a means to block packets - to consume them, as it were, so that they don't get further delivered into the system.
That is, if a TCP packet is being sent to port 80, then I can see the packet with my packet sniffer and it'll get delivered to the web server in the usual fashion.
But, basically, I'd like it that if I spot something wrong with the packet - not coming from the right MAC address, malformed in some way, or just breaking security policy - that I can just "consume" the packet, and it won't get further delivered onto the web server.
Because I can read packets and write packets - if I can also just block packets as well, then I'll have all I need.
Basically, I don't just want to monitor network traffic, but sometimes have control over it. E.g. "re-route" a packet by consuming the original incoming packet and then writing out a new slightly-altered packet to a different address. Or just plain block packets that shouldn't be being delivered at all.
My application is to be a general "network traffic management" program. Monitors and logs traffic. But also controls it too - blocking packets as a firewall, re-routing packets as a load balancer.
In other words, I've got a packet sniffer - but if it sniffs something that smells bad, then I'd like it to be able to stop that packet. Discard it early, so it's not further delivered anywhere.
(Being able to alter packets on the way through might be handy too - but if I can block, then there's always the possibility to just block the original packet completely, but then write out a new altered packet in its place.)
What you are looking for is libnetfilter_queue. The documentation is still incredibly bad, but the code in this example should get you started.
I used this library to develop a project that queued network packets and replayed them at a later time.
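Not the linked example, just a hedged sketch of the part that does the blocking, assuming an iptables rule such as iptables -A INPUT -p tcp --dport 80 -j NFQUEUE --queue-num 0 is in place and the usual nfq_open()/nfq_create_queue()/recv() loop surrounds it; looks_bad() is a hypothetical placeholder for your own policy check.

#include <stdint.h>
#include <arpa/inet.h>
#include <linux/netfilter.h>                        /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

static int looks_bad(const unsigned char *pkt, int len);  /* your policy, not shown */

static int filter_cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                     struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    unsigned char *pkt;
    int len = nfq_get_payload(nfa, &pkt);

    if (looks_bad(pkt, len))
        return nfq_set_verdict(qh, id, NF_DROP, 0, NULL);  /* packet is consumed */

    /* To rewrite a packet instead of dropping it, pass a modified buffer
     * back with the verdict: nfq_set_verdict(qh, id, NF_ACCEPT, new_len, new_buf); */
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}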
A bit of a tangent, but it was relevant when I was resolving my problem. Blocking raw packets is relatively complicated, so it might make sense to consider doing that at a different layer. In other words, does your cloud provider let you set up firewall rules to drop specific kinds of traffic?
In my case it was easier to do, which is why I'm suggesting such a lateral solution.

Raw ICMP socket interaction with internal stack windows/linux

In regard to using an ICMP raw socket like in this example:
sd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
There's an important question I couldn't find an answer to anywhere in the documentation. As far as I understand, the ICMP protocol is implemented
at kernel level (on Windows, by the %SystemRoot%\System32\Drivers\Tcpip.sys driver).
So how does this kernel logic interact with a userspace raw socket that wants to send and receive ICMP packets, as defined in the example above?
Is the ICMP logic cancelled once the raw socket is open, with the OS giving the application full control of ICMP? Or do they work in parallel (inevitably creating a mess on the network)? Can I tell the OS exactly which ICMP packets I would like to handle?
Answers for both Linux and Windows are welcome.
By using a raw socket with IPPROTO_ICMP you only get copies of the ICMP packets which arrive at your host (see How to receive ICMP request in C with raw sockets). The ICMP logic in the network stack is still alive and will handle ICMP messages.
So you just need to pick out the ICMP packets of interest after you receive them (e.g. by the corresponding ID in the ICMP header). The receive buffer filled by recv() also contains the complete IP header.
Under Linux there is even a socket option (ICMP_FILTER) with which you can set a receive filter for different ICMP packet types, as sketched below.
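For illustration only, a minimal sketch (assuming Linux and a process with root or CAP_NET_RAW): per raw(7), ICMP_FILTER takes a bitmask of ICMP types to filter out, so to see only echo replies you block every other type.

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>        /* IPPROTO_ICMP */
#include <linux/icmp.h>        /* struct icmp_filter, ICMP_FILTER, ICMP_ECHOREPLY */

int main(void)
{
    int sd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);    /* needs CAP_NET_RAW */

    struct icmp_filter filt;
    filt.data = ~(1U << ICMP_ECHOREPLY);                  /* filter out every type but echo reply */
    setsockopt(sd, SOL_RAW, ICMP_FILTER, &filt, sizeof(filt));

    unsigned char buf[1500];
    ssize_t n = recv(sd, buf, sizeof(buf), 0);            /* buffer begins with the IP header */
    printf("received %zd bytes\n", n);
    return 0;
}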

how does linux kernel process out-of-order tcp segment?

I'm developing a multi-RX-thread Ethernet driver, but this may lead to out-of-order packets being delivered to the Linux network stack. The issue has been verified on a PPTP connection, because GRE carries a sequence number and will drop out-of-order packets.
So, does TCP have a reassembly queue or a similar mechanism to process out-of-order segments?
TCP has a window buffer. As packets arrive they are cached until the segment with the next expected sequence number is received. When the next expected segment is received (and it's valid), the data is passed to the application in order.
see https://www.quora.com/How-does-TCP-handle-the-duplicate-segments-and-out-of-order-segments

How can I drop a packet in the kernel after catching it with SOCK_RAW in userspace?

I use SOCK_RAW to get all IP packets from the kernel:
socket(PF_PACKET, SOCK_RAW, htons(protocol));
But the packet is still processed by the kernel; how can I drop it?
You cannot. When you receive the packet on a raw socket, the kernel has created a copy and delivered it to your receiving process. The packet will continue being processed in the meantime according to the usual semantics. It's likely this will have completed (i.e. whatever the stack would normally do with it will already be done) by the time your process receives it.
However, if the packet is not actually destined to your box (e.g. you're receiving it only because you have the network interface in promiscuous mode), or if there is no local process [or in-kernel component] interested in receiving it, the packet will just be discarded anyway.
If you simply wish to receive all packets that arrive on an interface without the stack processing them, you can bring the interface up in promiscuous mode without giving it an IP address. Packets will then be delivered to your raw socket but discarded by the stack.
Old question, but others might find this answer useful.
It depends on the use case, but you can actually drop ingress packets after you get them from an AF_PACKET SOCK_RAW socket. To do that, attach an ingress qdisc with a drop action. Example:
sudo tc qdisc add dev eth0 ingress
sudo tc filter add dev eth0 parent ffff: matchall action drop
Explanation: this works because AF_PACKET sniffs a copy of the packet from the per-device tap, which sits slightly earlier than the ingress qdisc in the kernel network stack's packet-processing pipeline. That way you can implement a simple userspace switch.

Tune MTU on a per-socket basis?

I was wondering if there is any way to tune, on a Linux system, the MTU for a given socket (to make the IP layer fragment into chunks smaller than the actual device MTU).
When I say for a given socket, I don't mean programmatically in the code of the application owning the socket, but rather externally, for example via a sysfs entry.
If there is currently no way to do that, do you have any ideas about where to hook/patch the Linux kernel to implement such a possibility?
Thanks.
EDIT: why the hell do I want to do that?
I'm doing some Layer3-in-Layer4 tunneling (e.g. tunneling IP and above through a TCP tunnel). Unlike VPN-like solutions, I'm not using a virtual interface to achieve this. I'm capturing packets using iptables, dropping them from their normal path and writing them to the tunnel socket.
Think about the case of a big file transfer: all packets are filled up to the MTU size. When I tunnel them I add some overhead, so every original packet produces two tunneled packets, which is suboptimal.
If the socket is created such that DF is set on outgoing packets, you might have some luck spoofing (injecting) an ICMP "fragmentation needed" message back at yourself until you end up with the desired MTU. Rather ugly, but depending on how desperate you are it might be appropriate.
You could, for example, generate these packets with iptables rules, so the matching and sending is simple and external to your application. It looks like the REJECT target for iptables doesn't have a reject-with of fragmentation needed, though; it probably wouldn't be too tricky to add one.
The other approach, if it's only TCP packets you care about, is that you might have some luck with the socket option TCP_MAXSEG or the TCPMSS target, if that's appropriate to your problem.
For UDP or raw you're free to send() packets as small as you fancy!
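If the TCP_MAXSEG route fits, here is a minimal sketch (the value 1360 is only an illustrative budget, not something from the question; the option has to be set before connect()):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>       /* TCP_MAXSEG */

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);

    int mss = 1360;            /* illustrative value; derive it from your tunnel overhead */
    if (setsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)) < 0)
        perror("TCP_MAXSEG");

    /* ... connect() and use the socket as usual; the kernel will not send
     * segments larger than the clamped MSS on this connection. */
    return 0;
}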
Update:
Based on the "why would I want to do that?" answer, it seems like fragmenting packets when DF isn't set, or raising ICMP "fragmentation needed" and dropping otherwise, would actually be the correct solution.
It's what a more "normal" router would do, and provided firewalls don't eat the ICMP packet it will behave sanely in all scenarios, whereas retrospectively changing things is a recipe for odd behaviour.
The iptables MSS clamping is quite a good fix for TCP over this "VPN" though, especially as you're already making extensive use of iptables.
MTU is a property of a link, not of a socket; they belong to different layers of the stack. That said, TCP performs path MTU discovery during the three-way handshake and tries very hard to avoid fragmentation, so you'll have a hard time making TCP send fragments. With UDP the easiest approach is to force a smallish MTU on an interface with ifconfig(8) and then send packets larger than that value.
