Linux Raw Sockets: Block Packets?

I've written my own packet sniffer in Linux.
I open a socket with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) and then process the Ethernet packets - unpacking ARP packets, IP packets (and ICMP / TCP / UDP packets inside those).
This is all working fine so far.
Now I can read packets like this - and I can also inject packets by wrapping up a suitable Ethernet packet and sending it.
But what I'd like is a means to block packets - to consume them, as it were, so that they don't get further delivered into the system.
That is, if a TCP packet is being sent to port 80, then I can see the packet with my packet sniffer and it'll get delivered to the web server in the usual fashion.
But, basically, I'd like it that if I spot something wrong with the packet - not coming from the right MAC address, malformed in some way, or just breaking security policy - that I can just "consume" the packet, and it won't get further delivered onto the web server.
Because I can read packets and write packets - if I can also just block packets as well, then I'll have all I need.
Basically, I don't just want to monitor network traffic, but sometimes have control over it. E.g. "re-route" a packet by consuming the original incoming packet and then writing out a new slightly-altered packet to a different address. Or just plain block packets that shouldn't be being delivered at all.
My application is to be a general "network traffic management" program. Monitors and logs traffic. But also controls it too - blocking packets as a firewall, re-routing packets as a load balancer.
In other words, I've got a packet sniffer - but if it sniffs something that smells bad, then I'd like it to be able to stop that packet. Discard it early, so it's not further delivered anywhere.
(Being able to alter packets on the way through might be handy too - but if I can block, then there's always the possibility to just block the original packet completely, but then write out a new altered packet in its place.)

What you are looking for is libnetfilter_queue. The documentation is still incredibly bad, but the code in this example should get you started.
I used this library to develop a project that queued network packets and replayed them at a later time.
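To give a feel for the API, here is a minimal sketch of the usual accept/drop loop (error handling omitted; on older library versions nfq_get_payload() takes a char ** rather than unsigned char **). Note that NFQUEUE hands you IP packets, not the Ethernet frames an AF_PACKET socket sees. Build with gcc nfq.c -lnetfilter_queue and steer traffic into it with an iptables rule like the one shown further down (queue number 0 below is just the default).

#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/netfilter.h>    /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called once per queued packet; the verdict we return releases it. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;
    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload);   /* the raw IP packet */
    /* Inspect payload/len here; return NF_DROP to consume the packet
       so it is never delivered further up the stack. */
    (void)len;
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    nfq_unbind_pf(h, AF_INET);                   /* clear any stale binding */
    nfq_bind_pf(h, AF_INET);
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL); /* queue 0 */
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff); /* copy whole packets up */

    char buf[65536];
    int fd = nfq_fd(h), rv;
    while ((rv = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, rv);           /* dispatches to cb() */

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}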

A bit of a tangent, but it was relevant when I was resolving my problem. Blocking raw packets is relatively complicated, so it might make sense to do that at a different layer. In other words, does your cloud provider let you set up firewall rules to drop specific kinds of traffic?
In my case it was easier to do, which is why I'm suggesting such a lateral solution.

Related

Why doesn't Linux IPSec implementation support fragmentation before encryption?

I am trying to address an issue where ESP packets are getting fragmented because adding the ESP header pushes them over the MTU. The solution (which is what everybody does) is to do the fragmentation before doing the ESP encryption.
My question is: if this is so useful, why doesn't the Linux IPSec implementation support it natively? I understand there would be certain limitations, e.g. L4 traffic selectors would not work. But not everybody makes use of those.
In addition, if you can share any pointers as to the best way to add this support, it would really help.
Thanks.
To complete the circle (and hopefully help someone who may be looking for a similar solution), we solved our problem by making use of libnetfilter_queue. The challenge we had was that we did not have access to the source code of the application, else we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document, prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but I don't think you'll have any issues understanding it.
Final Solution
NetFilter Queues is the userspace library providing APIs to process packets that have been queued by the kernel packet filter.
An application wishing to use this functionality should link dynamically to netfilter_queue and nfnetlink and include the necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and sysroot-target/usr/include/libnfnetlink/.
An iptables rule with NFQUEUE as the target must be added.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to userspace software. For example, the following rule asks a listening userspace program for a decision on every queued packet.
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a program uses the libnetfilter_queue APIs to connect to queue 0 (the default one) and receive the messages from the kernel. It must then issue a verdict on each packet.
When a packet reaches an NFQUEUE target it is enqueued to the queue corresponding to the number given by the --queue-num option. The packet queue is implemented as a chained list whose elements are packets plus metadata (a Linux kernel skb):
It is a fixed-length queue implemented as a linked list of packets
Packets are stored in the queue indexed by an integer
A packet is released when userspace issues a verdict for the corresponding index
When the queue is full, no more packets can be enqueued
Userspace can read multiple packets before giving verdicts; as long as the queue is not full, this has no impact
Verdicts can be issued out of order: userspace can read packets 1, 2, 3, 4 and issue verdicts in the order 4, 2, 3, 1
Issuing verdicts too slowly will fill the queue; the kernel will then drop incoming packets instead of enqueuing them
The protocol used between the kernel and userspace is nfnetlink. This is a message-based protocol which does not involve any shared memory. When a packet is enqueued, the kernel sends an nfnetlink-formatted message containing the packet data and related information to a socket; userspace reads this message and issues a verdict.
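If your verdicts can lag behind arrivals, the queue length can be raised from userspace. A one-line sketch, assuming qh is the handle returned by nfq_create_queue() as in the example in the first answer above (the value 4096 is an arbitrary choice):

nfq_set_queue_maxlen(qh, 4096);  /* packets held in the kernel awaiting a verdict */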
The prefragmentation logic is implemented in AvPreFragApp (a new application) as well as in Security Broker (an existing controller application).
In Security Broker, as soon as the tunnel is established, the following two rules are added to the raw table.
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
The rule above negotiates a proper MSS size during the three-way handshake.
It is safe to assume that 1360+TCPH+IPH+ESP+IPH <= 1500, so that fragmentation won't happen after encryption.
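For instance, assuming 20-byte inner IP and TCP headers, a 20-byte outer IP header and roughly 60 bytes of ESP overhead (SPI, sequence number, IV, padding and ICV; the exact figure depends on the cipher): 1360 + 20 + 20 + 60 + 20 = 1480 <= 1500.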
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
The rule above queues all UDP packets whose source IP is the TIA (tunnel address) and whose mark is not 0xbeef0000 to the netfilter queue, to be processed by the application. AvPreFragApp sets the mark 0xbeef0000 on all the UDP packets it writes back out; this is done to avoid queuing the same packets repeatedly.
AvPreFragApp
The AvPreFragApp application makes use of netfilter queues to process the packets queued by the NFQUEUE target.
As mentioned above, the iptables rule to queue UDP packets having the TIA as the source IP is added in Security Broker. This rule is added upon tunnel establishment and updated upon tunnel bounce with the new TIA. So all packets with the TIA as the source IP are queued up for processing by AvPreFragApp.
AvPreFragApp calls a set of APIs from libnetfilter_queue to set up the queue and copy packets from the kernel to the application:
While creating the queue, pass the address of a callback function, which is called once a packet is queued for processing
NFQNL_COPY_PACKET mode needs to be set; it copies the whole packet from the kernel to the application
A file descriptor can be obtained from the netfilter queue handle; the packet buffer can then be read with recv()
While processing a packet, AvPreFragApp checks its size. If the packet size is <= 1376, an ACCEPT verdict is given. An ACCEPT verdict is also given if the DF bit is set
If the packet size is > 1376 and the DF bit is not set, a DROP verdict is given. That means the original packet is dropped, but by then it has already been copied into the application's buffer. AvPreFragApp then does the fragmentation in the application, and all the fragments are written to a raw socket with the mark 0xbeef0000, using sendmsg() (see the sketch after the note below)
These prefragmented packets are then encrypted and ESP-encapsulated in the kernel.
Note: TIA: Tunnel Internal Address, the logical IPSec interface.
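As an illustration of the re-injection step, a small sketch of how a raw socket can be given the 0xbeef0000 mark so that the raw-table rule above skips the fragments written to it (socket setup only; building the IP headers and the sendmsg() plumbing are omitted):

#include <sys/socket.h>
#include <netinet/in.h>

static int open_marked_raw_socket(void)
{
    int s = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); /* IP_HDRINCL implied: caller supplies the IP header */
    unsigned int mark = 0xbeef0000;                 /* must match the iptables mark test */
    setsockopt(s, SOL_SOCKET, SO_MARK, &mark, sizeof(mark)); /* requires CAP_NET_ADMIN */
    return s; /* fragments sent on s via sendmsg() are not queued again */
}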

Networking with Python: No response from IP Phone

I'm an Automation Developer and lately I've taken it upon myself to control an IP Phone on my desk (Cisco 7940).
I have a third party application that can control the IP phone with SCCP (Skinny) packets. Through Wireshark, I see that the application will send 4 unique SCCP packets and then receives a TCP ACK message.
SCCP is not very well known, but it looks like this:
Ethernet( IP( TCP( SCCP( ))))
Using a Python packet builder, Scapy, I've been able to send the same 4 packets to the IP phone, however I never get the ACK. In my packets, I have correctly set the sequence, port and acknowledgment values in the TCP header. The ID field in the IP header is also correct.
The only thing I can imagine being wrong is that it takes Python a little more than a full second to send the four packets, whereas the application takes significantly less time. I've tried raising the priority of the Python shell with no luck.
Does anyone have an idea why I may not be receiving the ACK back?
This website may be helpful in debugging why on your machine you aren't seeing the traffic you expect, and taking steps to modify your environment to produce the desired output.
Normally, the Linux kernel takes care of setting up and sending and receiving network traffic. It automatically sets appropriate header values and even knows how to complete a TCP 3-way handshake. Using the kernel services in this way is using a "cooked" socket.
Scapy does not use these kernel services. It creates a "raw" socket. The entire TCP/IP stack of the OS is circumvented. Because of this, Scapy gives us complete control over the traffic. Traffic to and from Scapy will not be filtered by iptables. Also, we will have to take care of the TCP 3-way handshake ourselves.
http://www.packetlevel.ch/html/scapy/scapy3way.html
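One classic gotcha worth ruling out (an assumption about your setup, not something visible from the question): because Scapy bypasses the kernel's TCP stack, the kernel has no record of the connection Scapy is using, and it will answer the phone's replies with RST segments of its own, tearing the session down before your ACK ever arrives. A common workaround while testing is to suppress those kernel resets with a rule like:
iptables -A OUTPUT -p tcp --tcp-flags RST RST -d <phone-ip> -j DROP
(<phone-ip> is a placeholder; scope the rule as narrowly as you can and remove it afterwards.)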

How is TCP's checksum calculated when we use tcpdump to capture packets we send out?

I am trying to generate a series of packets to simulate the TCP 3-way handshake procedure. My first step is to capture the real connection packets and try to re-send the same packets from the same machine, but it didn't work at first.
Finally I found out that the packet I captured with tcpdump is not exactly what my computer sent out: the TCP checksum field is changed. This led me to think that I can establish a TCP connection even when the TCP checksum is incorrect.
So my question is: how is the checksum field calculated? Is it modified by tcpdump or by hardware? Why is it changed? Is it a bug in tcpdump, or is the calculation simply omitted at that point?
The following is the screenshot I captured from my host machine and a virtual machine; you can see that the same packet captured on the two machines is identical except for the TCP checksum.
The small window is my virtual machine; I used the command "ssh 10.82.25.138" from the host to generate these packets.
What you are seeing may be the result of checksum offloading. To quote from the wireshark wiki (http://wiki.wireshark.org/CaptureSetup/Offloading):
Most modern operating systems support some form of network offloading, where some network processing happens on the NIC instead of the CPU. Normally this is a great thing. It can free up resources on the rest of the system and let it handle more connections. If you're trying to capture traffic it can result in false errors and strange or even missing traffic.
On systems that support checksum offloading, IP, TCP, and UDP checksums are calculated on the NIC just before they're transmitted on the wire. In Wireshark these show up as outgoing packets marked black with red text and the note [incorrect, should be xxxx (maybe caused by "TCP checksum offload"?)].
Wireshark captures packets before they are sent to the network adapter. It won't see the correct checksum because it has not been calculated yet. Even worse, most OSes don't bother to initialize this data so you're probably seeing little chunks of memory that you shouldn't.
Although this is about Wireshark, the same principle applies. On your host machine you see the wrong checksum because it just hasn't been filled in yet. It looks right on the guest because, before it's sent out on the "wire", it has been filled in. Try disabling checksum offloading on the interface which is handling this traffic, e.g.:
ethtool -K eth0 rx off tx off
if it's eth0.
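As for how the checksum field is calculated: it is the standard Internet checksum (RFC 1071) taken over a pseudo-header (source and destination IP, protocol, TCP length) followed by the TCP header and payload, with the checksum field itself zeroed during the computation. A sketch for IPv4 (function and struct names here are illustrative, not from any particular library):

#include <stdint.h>
#include <stddef.h>
#include <arpa/inet.h>

/* One's-complement sum of a buffer, accumulated into 'sum' (RFC 1071). */
static uint32_t sum16(const void *data, size_t len, uint32_t sum)
{
    const uint16_t *p = data;
    while (len > 1) { sum += *p++; len -= 2; }
    if (len) sum += *(const uint8_t *)p;    /* odd trailing byte, zero-padded */
    return sum;
}

/* saddr/daddr in network byte order; seg points at the TCP header plus
   payload, with the checksum field set to 0. */
uint16_t tcp_checksum(uint32_t saddr, uint32_t daddr,
                      const void *seg, uint16_t seglen)
{
    struct {                                /* IPv4 pseudo-header (RFC 793) */
        uint32_t src, dst;
        uint8_t  zero, proto;
        uint16_t len;
    } ph = { saddr, daddr, 0, 6 /* IPPROTO_TCP */, htons(seglen) };

    uint32_t sum = sum16(&ph, sizeof ph, 0);
    sum = sum16(seg, seglen, sum);
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

With transmit offloading enabled, the kernel skips this work and the NIC does it on the way out, which is exactly why a capture taken before the NIC shows a bogus value.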

Where are the missing TCP packets?

I observed a surprising thing: when there are both UDP-based and TCP-based applications sending packets, if the UDP-based application sends packets so fast that the bandwidth is nearly filled with UDP packets, then the TCP packets become very hard to send out.
The surprising thing is that although the TCP-based application is able to send a few packets out (observed via the return value of write()), the receiver of the TCP packets never receives them. Why? Is that because the TCP packets are never actually sent out by the network card? Or are the TCP packets dropped by routers?
Thanks,
Steve
First, the return value of write() is not an indicator of whether packets were sent. It just indicates that the data was buffered.
Second, if you are saturating the network with UDP packets there will be a lot of packet loss, and TCP, being adaptive, will respond by sending packets out more slowly. If the packet loss gets too high, TCP can basically stop altogether. The solution is not to saturate the network with UDP packets.
This is a simplified answer. There are many articles you can read up on.
UDP is a layer built upon IP. Ditto for TCP. The network card just sends out IP packets. You can look up the various structures for these packets.
TCP is a protocol that uses IP packets but uses a mechanism to try to ensure delivery and rearranges packets in the correct order. See the article on Wikipedia.
Routers are free to drop packets. This can occur when the network is overloaded, network connections are down or the IP packet is corrupted.
So, to answer your question: there is no preference between UDP and TCP; both travel as IP packets, and either can be dropped on the way from one end to the other.

Tune MTU on a per-socket basis?

I was wondering if there is any way to tune (on a Linux system) the MTU for a given socket, to make the IP layer fragment into chunks smaller than the actual device MTU.
When I say for a given socket, I don't mean programmatically in the code of the application owning the socket, but rather externally, for example via a sysfs entry.
If there is currently no way to do that, do you have any ideas about where to hook/patch the Linux kernel to implement such a possibility?
Thanks.
EDIT: why the hell do I want to do that?
I'm doing some Layer3-in-Layer4 tunneling (e.g. tunneling IP and above through a TCP tunnel). Unlike VPN-like solutions, I'm not using a virtual interface to achieve that. I'm capturing packets using iptables, dropping them from their normal path and writing them to the tunnel socket.
Think about the case of a big file transfer, where all packets are filled up to MTU size. When I tunnel them, I add some overhead, so every original packet produces two tunneled packets, which is suboptimal.
If the socket is created such that DF is set on outgoing packets, you might have some luck spoofing (injecting) an ICMP fragmentation-needed message back at yourself until you end up with the desired MTU. Rather ugly, but depending on how desperate you are it might be appropriate.
You could, for example, generate these packets with iptables rules, so the matching and sending is simple and external to your application. It looks like the REJECT target for iptables doesn't have a reject-with of fragmentation-needed though; it probably wouldn't be too tricky to add one.
The other approach, if it's only TCP packets you care about, is the socket option TCP_MAXSEG or the TCPMSS target, if that's appropriate to your problem (see the sketch at the end of this answer).
For UDP or raw sockets you're free to send() packets as small as you fancy!
Update:
Based on the "why would I want to do that?" answer, it seems like fragmenting packets if DF isn't set or raising ICMP "fragmentation needed" and dropping would actually be the correct solution.
It's what a more "normal" router would do, and provided firewalls don't eat the ICMP packet it will behave sanely in all scenarios, whereas retrospectively changing things is a recipe for odd behaviour.
The iptables MSS clamp is quite a good fix for TCP over this "VPN" though, especially as you already seem to be making extensive use of iptables.
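For completeness, a sketch of the programmatic TCP_MAXSEG route mentioned above, even though the question asks for an external knob (the value 1400 is an assumed example; the option must be set before connect()):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int tcp_socket_with_small_mss(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int mss = 1400;                         /* example value; the kernel may clamp it */
    setsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)); /* before connect() */
    return s;                               /* outgoing segments respect the lower MSS */
}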
MTU is a property of a link, not of a socket; they belong to different layers of the stack. That said, TCP performs path MTU discovery during the three-way handshake and tries very hard to avoid fragmentation, so you'll have a hard time making TCP send fragments. With UDP the easiest approach is to force some smallish MTU on an interface with ifconfig(8) and then send packets larger than that value.
