Raw ICMP socket interaction with the internal stack (Windows/Linux)

In regard to using an ICMP raw socket, as in this example:
sd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
There's an important question I couldn't find an answer to anywhere in the documentation. As far as I understand, the ICMP protocol is implemented
at kernel level (on Windows by the %SystemRoot%\System32\Drivers\Tcpip.sys driver).
So how does this kernel logic interact with a raw user-space socket that wants to send and receive ICMP packets, as defined in the example above?
Is the ICMP logic disabled once a raw socket is open, with the OS giving the application full control of ICMP? Or do the two work in parallel (inevitably creating a mess on the network)? Can I tell the OS exactly which ICMP packets I would like to handle?
Answers for both Linux and Windows are welcome.

By using a raw socket with IPPROTO_ICMP you only get copies of the ICMP packets which arrive at your host (see How to receive ICMP request in C with raw sockets). The ICMP logic in the network stack is still alive and will keep handling ICMP messages.
So you just need to pick out the ICMP packets of interest after you receive them (e.g. by the corresponding ID in the ICMP header). Note that the receive buffer filled by recv() also contains the complete IP header.
Under Linux there is even a socket option (ICMP_FILTER) with which you can set a receive filter for the different ICMP types.
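A minimal sketch of both points (Linux only; error handling mostly omitted, and matching on the echo ID is just one plausible way to pick out your own packets):

/* Raw ICMP socket that receives only echo replies (Linux).
 * ICMP_FILTER takes a bitmask of ICMP types to *discard*: a set bit
 * means "filter this type out" (see raw(7)). */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/ip.h>     /* struct iphdr */
#include <linux/icmp.h>     /* struct icmphdr, struct icmp_filter, ICMP_FILTER */

int main(void)
{
    int sd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);   /* needs CAP_NET_RAW */
    if (sd < 0) { perror("socket"); return 1; }

    /* drop everything except echo replies */
    struct icmp_filter filt = { .data = ~(1U << ICMP_ECHOREPLY) };
    setsockopt(sd, SOL_RAW, ICMP_FILTER, &filt, sizeof(filt));

    unsigned char buf[1500];
    ssize_t n = recv(sd, buf, sizeof(buf), 0);
    if (n > 0) {
        struct iphdr *ip = (struct iphdr *)buf;           /* IP header included */
        struct icmphdr *icmp = (struct icmphdr *)(buf + ip->ihl * 4);
        if (icmp->un.echo.id == htons(getpid() & 0xffff)) /* our ping ID? */
            printf("echo reply for our ID\n");
    }
    close(sd);
    return 0;
}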

Related

Linux Raw Sockets: Block Packets?

I've written my own packet sniffer in Linux.
I open a socket with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) and then process the Ethernet packets - unpacking ARP packets, IP packets (and ICMP / TCP / UDP packets inside those).
This is all working fine so far.
Now I can read packets like this - and I can also inject packets by wrapping up a suitable Ethernet packet and sending it.
But what I'd like is a means to block packets - to consume them, as it were, so that they don't get further delivered into the system.
That is, if a TCP packet is being sent to port 80, then I can see the packet with my packet sniffer and it'll get delivered to the web server in the usual fashion.
But, basically, I'd like it that if I spot something wrong with the packet - not coming from the right MAC address, malformed in some way, or just breaking security policy - that I can just "consume" the packet, and it won't get further delivered onto the web server.
Because I can read packets and write packets - if I can also just block packets as well, then I'll have all I need.
Basically, I don't just want to monitor network traffic, but sometimes have control over it. E.g. "re-route" a packet by consuming the original incoming packet and then writing out a new slightly-altered packet to a different address. Or just plain block packets that shouldn't be being delivered at all.
My application is to be a general "network traffic management" program. Monitors and logs traffic. But also controls it too - blocking packets as a firewall, re-routing packets as a load balancer.
In other words, I've got a packet sniffer - but if it sniffs something that smells bad, then I'd like it to be able to stop that packet. Discard it early, so it's not further delivered anywhere.
(Being able to alter packets on the way through might be handy too - but if I can block, then there's always the possibility to just block the original packet completely, but then write out a new altered packet in its place.)
What you are looking for is libnetfilter_queue. The documentation is still incredibly bad, but the code in this example should get you started.
I used this library to develop a project that queued network packets and replayed them at a later time.
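For orientation, here is a minimal sketch of how the library is typically used (my own condensed version, not the example linked above; link with -lnetfilter_queue and add an iptables rule such as iptables -A INPUT -j NFQUEUE --queue-num 0). The callback returns a verdict per packet; NF_DROP consumes the packet, which is exactly the "block" behaviour asked about:

/* Minimal libnetfilter_queue skeleton: inspect queued packets and
 * accept or drop them. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/netfilter.h>                     /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    unsigned char *payload;
    uint32_t id = ntohl(nfq_get_msg_packet_hdr(nfa)->packet_id);
    int len = nfq_get_payload(nfa, &payload);    /* starts at the IP header */

    /* inspect payload/len here; return NF_DROP to consume the packet */
    (void)len;
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    nfq_unbind_pf(h, AF_INET);                   /* harmless on newer kernels */
    nfq_bind_pf(h, AF_INET);
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff); /* copy whole packets */

    char buf[4096];
    int fd = nfq_fd(h), n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, n);            /* dispatches to cb() */

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}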
A bit of a tangent, but it was relevant when I was solving my own problem: blocking raw packets is relatively complicated, so it may make sense to do the blocking at a different layer. In other words, does your cloud provider let you set up firewall rules to drop specific kinds of traffic?
In my case that was easier to do, which is why I'm suggesting such a lateral solution.

Why doesn't Linux IPSec implementation support fragmentation before encryption?

I am trying to address an issue where ESP packets get fragmented because adding the ESP header pushes them past the MTU. The solution (which is what everybody does) is to do the fragmentation before the ESP encryption.
My question is: if this is so useful, why doesn't the Linux IPsec implementation support it natively? I understand there would be certain limitations, e.g. that L4 traffic selectors would not work. But not everybody makes use of those.
In addition, if you can share any pointers as to what's the best way to add this support, it would really help.
Thanks.
To complete the circle (and hopefully help someone looking for a similar solution): we solved our problem by making use of libnetfilter_queue. The challenge was that we did not have access to the application's source code; otherwise we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document, prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but I don't think you will have any trouble following them.
Final Solution
libnetfilter_queue is the user-space library providing APIs to process packets that have been queued by the kernel packet filter.
An application willing to make use of this functionality should link dynamically against netfilter_queue and nfnetlink and include the necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and
sysroot-target/usr/include/libnfnetlink/.
An iptables rule with NFQUEUE as the target is also required.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to userspace software. For example, the following rule asks a listening userspace program for a decision on every packet it queues:
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a program must use the libnetfilter_queue APIs to connect to queue 0 (the default one) and read the messages from the kernel. It must then issue a verdict on each packet.
When a packet reaches an NFQUEUE target it is enqueued to the queue corresponding to the number given by the --queue-num option. The packet queue is implemented as a chained list whose elements are the packets plus metadata (a Linux kernel skb):
It is a fixed-length queue implemented as a linked list of packets
Packets are indexed by an integer
A packet is released when userspace issues a verdict for the corresponding index
When the queue is full, no new packets can be enqueued
Userspace can read several packets before issuing verdicts; as long as the queue is not full, this behavior has no impact
Verdicts can be issued out of order: userspace can read packets 1, 2, 3, 4 and issue verdicts in the order 4, 2, 3, 1
Issuing verdicts too slowly will fill the queue; the kernel will then drop incoming packets instead of enqueuing them
The protocol used between kernel and userspace is nfnetlink, a message-based protocol which does not involve any shared memory. When a packet is enqueued, the kernel sends an nfnetlink-formatted message containing the packet data and related information to a socket; userspace reads this message and issues a verdict.
The prefragmentation logic is implemented in AvPreFragApp (a new application) as well as in Security Broker (an existing controller application).
In Security Broker, as soon as the tunnel is established, the following two rules are added to the raw table.
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
The rule above negotiates a proper MSS during the three-way handshake.
It is safe to assume that 1360 + TCPH + IPH + ESP + IPH <= 1500: with 20-byte TCP and inner IP headers, that leaves roughly 100 bytes for the ESP overhead plus the outer IP header, so fragmentation won't happen after encryption.
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
The rule above queues every UDP packet whose source IP is the TIA (tunnel address) and whose mark is not 0xbeef0000 to the netfilter queue, to be processed by the application. AvPreFragApp marks all the UDP packets it re-emits with 0xbeef0000; this is done to avoid queuing the same packets repeatedly.
AvPreFragApp
The AvPreFragApp application makes use of netfilter queues to process the packets queued by the NFQUEUE target.
As mentioned above, the iptables rule that queues UDP packets with the TIA as source IP is added in Security Broker. The rule is added upon tunnel establishment and updated with the new TIA when the tunnel bounces. So all packets with the TIA as source IP are queued for processing by AvPreFragApp.
AvPreFragApp calls a set of APIs from libnetfilter_queue to set up the queue and copy each packet from the kernel to the application:
While creating the queue, pass the address of a callback function, which is invoked once a packet is queued for processing
NFQNL_COPY_PACKET mode needs to be set; it copies the whole packet from kernel to application
A file descriptor can be obtained from the netfilter queue handle; the packet buffer is then read with recv()
While processing a packet, AvPreFragApp checks its size. If the packet size is <= 1376, an ACCEPT verdict is given. An ACCEPT verdict is also given if the DF bit is set
If the packet size is > 1376 and the DF bit is not set, a DROP verdict is given. This means the original packet is dropped, but by then it has already been copied to the application's buffer. AvPreFragApp then does the fragmentation in the application, and all the fragments are written to a raw socket with the mark 0xbeef0000, using sendmsg() (see the sketch after the note below)
These prefragmented packets are then encrypted and ESP-encapsulated in the kernel.
Note: TIA = Tunnel Internal Address, the address of the logical IPsec interface.
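A sketch of that verdict logic (the 1376-byte threshold, the mark value, and the ACCEPT/DROP rules are from the excerpt above; refragment_and_send() is a hypothetical helper that would fragment the datagram and sendmsg() the pieces out of a raw socket whose SO_MARK is set to 0xbeef0000):

/* Hypothetical shape of AvPreFragApp's netfilter_queue callback. */
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/ip.h>                      /* struct iphdr, IP_DF */
#include <linux/netfilter.h>                 /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* hypothetical: fragment pkt and write the pieces to a raw socket
 * carrying mark 0xbeef0000 so they bypass the NFQUEUE rule */
extern void refragment_and_send(unsigned char *pkt, int len);

static int prefrag_cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                      struct nfq_data *nfa, void *data)
{
    unsigned char *pkt;
    int len = nfq_get_payload(nfa, &pkt);
    uint32_t id = ntohl(nfq_get_msg_packet_hdr(nfa)->packet_id);
    struct iphdr *ip = (struct iphdr *)pkt;

    /* small enough, or DF set: give the packet back to the kernel */
    if (len <= 1376 || (ntohs(ip->frag_off) & IP_DF))
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);

    refragment_and_send(pkt, len);           /* emit marked fragments */
    return nfq_set_verdict(qh, id, NF_DROP, 0, NULL); /* drop the original */
}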

how to send/inject packet into local network interface (linux)

I am working on a C program on Linux (kernel 2.6.18). I need to send/inject IP packets (e.g., over a socket) on my Linux system, but make that same Linux box "think" these packets are coming in from another host. I create a datalink socket and use a faked source MAC/IP for the packets sent over this socket; the destination MAC/IP are set to those of my local Linux box.
However, whether I send these packets from a user-space program or from a kernel module, my local Linux just doesn't treat them as coming from outside. For example, if I create a datalink socket to send an ICMP request destined to my local Linux box, I expect it to think the ICMP request came from outside and to respond with an ICMP reply, but it does not. (With the same program, however, I can send a faked ICMP request to another host, and that host does respond with an ICMP reply.)
I did some research on this topic online, and it seems all related solutions suggest using TAP. But as this VirtualBox article says:
... TAP is no longer necessary on Linux with bridged networking, ...
I am very interested to know how this is possible. Thanks.
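For reference, the classic TAP technique looks roughly like this (a sketch; "tap0" is an assumed, pre-configured device - frames written to it are handed to the kernel as if they had been received on that interface):

/* Inject an Ethernet frame via a TAP device (Linux). */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>                 /* struct ifreq, IFNAMSIZ */
#include <linux/if_tun.h>           /* TUNSETIFF, IFF_TAP, IFF_NO_PI */

int main(void)
{
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct ifreq ifr = { 0 };
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* raw frames, no extra header */
    strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); return 1; }

    unsigned char frame[64] = { 0 };        /* build a real frame here */
    write(fd, frame, sizeof(frame));        /* kernel sees it as inbound */
    close(fd);
    return 0;
}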

Minimum requirements for custom networking stack to send UDP packets?

(edit: solved -- see below)
This is my situation:
TL-MR3020 -----ethernet----- mbed
OpenWRT                      C/C++ custom networking stack
192.168.2.1                  192.168.2.16
TL-MR3020 is a Linux embedded router
mbed is an ARM microcontroller.
On the network I just want them to exchange messages using UDP packets on port 2225. In particular, TL-MR3020 has to periodically send packets every second to 192.168.2.16:2225, while mbed has to periodically send packets every 50ms to 192.168.2.1:2225.
Everything was fine until I removed the network stack library (lwIP, not so lightweight for me) from the mbed and wrote a new minimal stack.
My new stack sends 5 gratuitous ARP replies just after the Ethernet link comes up, then starts sending and receiving UDP packets.
Now the TL-MR3020 doesn't receive any UDP packets. In particular, with ifconfig I can see packets arriving, but my application can't get them.
Also, if I connect my laptop instead of the TL-MR3020, I can see the UDP packets arriving in Wireshark; there's nothing visibly wrong with them.
I have a node.js script that should receive the packets, and it receives nothing; yet if I send UDP packets from local to local, the script does receive them.
I think my application is fine, also because not even socat can receive the UDP packets, using socat - UDP-LISTEN:2225.
I've already checked on the TL-MR3020:
the ARP table has the correct IP-MAC association
the destination MAC address matches the incoming interface
the destination IP address matches the incoming interface
IP checksum: Wireshark says good=false, bad=false
UDP checksum: Wireshark says good=false, bad=false
So, I'm asking... what are the minimum requirements for a custom networking stack to send UDP packets?
SOLVED:
You need a good checksum in the IP header.
The UDP checksum, in my case, can be set to zero.
tcpdump is very helpful (thanks to AndrewMcDonnell).
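For anyone hitting the same wall: the IP header checksum is the standard RFC 1071 one's-complement sum, computed with the checksum field zeroed first (and a UDP checksum of zero over IPv4 legitimately means "not computed"). A minimal sketch:

/* RFC 1071 Internet checksum, as used in the IPv4 header.
 * Call with the header's checksum field set to 0 beforehand. */
#include <stdint.h>
#include <stddef.h>

uint16_t ip_checksum(const void *hdr, size_t len)
{
    const uint16_t *p = hdr;
    uint32_t sum = 0;

    while (len > 1) {                  /* sum 16-bit words */
        sum += *p++;
        len -= 2;
    }
    if (len)                           /* odd trailing byte */
        sum += *(const uint8_t *)p;
    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}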

How do I prevent Linux kernel from responding to incoming TCP packets?

For my application, I need to intercept certain TCP/IP packets and route them to a different device over a custom communications link (not Ethernet). I need all the TCP control packets and the full headers. I have figured out how to obtain these using a raw socket via socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)); This works well and allows me to attach filters to just see the TCP port I'm interested in.
However, Linux also sees these packets. By default, it sends a RST when it receives a packet for a TCP port number it doesn't know about. That's no good, as I plan to send back a response myself later. If I open a second "normal" socket on that same port using socket(PF_INET, SOCK_STREAM, 0); and listen() on it, Linux instead ACKs incoming TCP packets. Neither option is what I want: I want the kernel to do nothing with these packets so that I can handle everything myself. How can I accomplish this?
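One technique often suggested for this situation (my addition, not from the thread; treat it as an assumption to verify) is to have netfilter drop the kernel's own outgoing RSTs for the intercepted port, so that only your program answers:

iptables -A OUTPUT -p tcp --sport 8080 --tcp-flags RST RST -j DROP

Here 8080 stands in for whatever port you are intercepting.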
I would like to do the same thing. My reason is security-related: I want to construct a tarpit application. I intend to forward TCP traffic from certain source IPs to the tarpit, which must receive the ACK and will reply with a SYN/ACK of its own. I do not want the kernel to respond. Hence a raw socket alone will not work (because the supplied TCP packets are teed); I would also need to implement a divert socket. That's about all I know so far; I have not yet implemented it.
