How does the Linux kernel process out-of-order TCP segments?

I'm developing a multi-RX-thread Ethernet driver, but this may lead to a potential issue: out-of-order packets being delivered to the Linux network stack. The issue has been verified over a PPTP connection, because GRE carries a sequence number and drops out-of-order packets.
So, does TCP have a reassembly queue or a similar mechanism to process out-of-order segments?

TCP keeps a receive window buffer. As segments arrive out of order they are held (in Linux, on the socket's out-of-order queue) until the segment carrying the next expected sequence number arrives. Once that gap is filled and the data is valid, the now-contiguous data is passed on to the application in order.
see https://www.quora.com/How-does-TCP-handle-the-duplicate-segments-and-out-of-order-segments
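For illustration only, here is a minimal userspace sketch of that idea. It is not kernel code: the real implementation keeps an skb-based out-of-order queue per socket with coalescing and SACK bookkeeping, and the names below (rcv_nxt, the 1460-byte toy segment size) are only borrowed from TCP terminology.
/*
 * Toy model of out-of-order segment handling (userspace sketch, not
 * kernel code). Segments are buffered until the byte stream becomes
 * contiguous again, then delivered in order.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_OOO  64          /* capacity of the toy out-of-order buffer */
#define SEG_DATA 1460        /* toy MSS-sized payload                   */

struct segment {
    uint32_t seq;            /* sequence number of the first byte       */
    uint32_t len;            /* payload length (assumed <= SEG_DATA)    */
    char     data[SEG_DATA];
    int      used;
};

static struct segment ooo[MAX_OOO];  /* parked out-of-order segments    */
static uint32_t rcv_nxt;             /* next in-order byte we expect    */

static void deliver(const char *data, uint32_t len)
{
    (void)data;
    /* In a real stack this is where data becomes readable by the app. */
    printf("delivered %u bytes in order\n", (unsigned)len);
}

static void drain_ooo(void)
{
    /* Keep releasing parked segments that have become contiguous. */
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < MAX_OOO; i++) {
            if (ooo[i].used && ooo[i].seq == rcv_nxt) {
                deliver(ooo[i].data, ooo[i].len);
                rcv_nxt += ooo[i].len;
                ooo[i].used = 0;
                progress = 1;
            }
        }
    }
}

void on_segment(uint32_t seq, const char *data, uint32_t len)
{
    if (seq == rcv_nxt) {
        /* In order: deliver now, then check whether parked segments
         * have become contiguous behind it. */
        deliver(data, len);
        rcv_nxt += len;
        drain_ooo();
    } else if (seq > rcv_nxt) {
        /* Out of order: park it until the hole is filled. */
        for (int i = 0; i < MAX_OOO; i++) {
            if (!ooo[i].used) {
                ooo[i].seq = seq;
                ooo[i].len = len;
                memcpy(ooo[i].data, data, len);
                ooo[i].used = 1;
                break;
            }
        }
    }
    /* seq < rcv_nxt would be a duplicate/retransmission: ignored here. */
}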

Related

Linux drops UDP packets

I have written a C++ tool for my Linux machine which receives UDP (OSC) packets and sends them back immediately (that's the only thing it does). But it seems that some packets get dropped. When I send 100 packets to my Linux machine (from another machine), usually only 64 packets come back. I have looked at the incoming packets with tcpdump. It tells me the following:
64 packets captured
64 packets received by filter
0 packets dropped by kernel
So where are they?
UDP, by design, does not guarantee that packets arrive at the destination. The missing packets might not have reached your machine at all, and thus will not appear among the incoming packets.
UDP is mostly used for streams and games, where losing a few packets does not really matter.
If you want to be sure that all the packets arrive, you should use TCP.
Let me know if this helps.
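For context, a stripped-down version of the kind of echo tool described above might look like this (a hypothetical reconstruction, not the asker's code; the port number is arbitrary):
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);            /* arbitrary example port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];
    for (;;) {
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n < 0) { perror("recvfrom"); break; }
        /* Echo the datagram straight back. Nothing here (or anywhere in
         * UDP) guarantees that either direction actually arrives. */
        if (sendto(fd, buf, (size_t)n, 0,
                   (struct sockaddr *)&peer, plen) < 0)
            perror("sendto");
    }
    close(fd);
    return 0;
}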

Linux TCP: packet segmentation?

I am working in a virtualization environment (Linux over Hyper-V). The Linux driver for the virtual NIC supports TSO and GSO (TCP segmentation offload is ON and generic segmentation offload is ON).
Now, I create a TCP socket and set the send buffer to 128 KB.
But based on ifconfig data (TX bytes and TX packets), the average packet size is about 11 KB.
So my question is: where do my packets get segmented (from 128 KB down to 11 KB)? How do I control/configure this via socket options or TCP options?
thanks!
===========EDIT==================
I have an application which can reach 8 Gbps throughput on a 10G network with 32 TCP connections; in that case the average packet size is about 20 KB, which is pretty good. But when I increase the number of TCP connections to 256, the throughput drops to about 1 Gbps and the packet size on the NIC falls to about 3 KB.
I know the packet size is critical to performance, because the cost of processing traffic is per packet, not per byte, so bigger packets on the NIC are better.
SO, MY QUESTION IS: how do I increase the TCP packet size? Are there any TCP settings that control this?
Your question seems a little bit confusing, but there are a number of settings that you need to play with to get 10GigE to work right on Linux.
See here:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe/
Setting the socket options SO_SNDBUF and SO_RCVBUF might help, but TCP/IP does not guarantee the chunk sizes in which data is sent or received.
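As a rough sketch of what setting those options looks like (the 4 MB values are arbitrary, and Linux clamps requests to net.core.wmem_max / net.core.rmem_max, so always read back the effective value):
#include <stdio.h>
#include <sys/socket.h>

/* Request larger per-socket buffers. This does not control how TCP
 * segments the byte stream; it only affects how much data the kernel
 * will buffer for this socket. */
int tune_socket_buffers(int fd)
{
    int sndbuf = 4 * 1024 * 1024;
    int rcvbuf = 4 * 1024 * 1024;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        perror("setsockopt(SO_SNDBUF)");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    /* Linux reports back the doubled, bookkeeping-included value here. */
    socklen_t len = sizeof(sndbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
        printf("effective SO_SNDBUF: %d bytes\n", sndbuf);
    return 0;
}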

Why doesn't the Linux IPSec implementation support fragmentation before encryption?

I am trying to address an issue where ESP packets get fragmented because adding the ESP header pushes the packet size over the MTU. The solution (which is what everybody does) is to perform the fragmentation before doing the ESP encryption.
My question is: if this is so useful, why doesn't the Linux IPSec implementation support it natively? I understand there would be certain limitations, for example that L4 traffic selectors would not work, but not everybody makes use of those.
In addition, if you can share any pointers on the best way to add this support, that would really help.
Thanks.
To complete the circle (and hopefully help someone who may be looking for a similar solution): we solved our problem by making use of libnetfilter_queue. The challenge we had was that we did not have access to the application's source code; otherwise we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document, prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but I don't think you'll have any trouble understanding them.
Final Solution
libnetfilter_queue is the userspace library providing APIs to process packets that have been queued by the kernel packet filter.
An application that wants to use this functionality should link dynamically against libnetfilter_queue and libnfnetlink and include the necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and sysroot-target/usr/include/libnfnetlink/.
An iptables rule with NFQUEUE as the target also needs to be added.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to a userspace program. For example, the following rule asks a listening userspace program to decide on every packet it matches:
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a program must use the libnetfilter_queue APIs to connect to queue 0 (the default) and read the messages from the kernel. It must then issue a verdict on each packet.
When a packet reaches an NFQUEUE target it is enqueued to the queue corresponding to the number given by the --queue-num option. The packet queue is implemented as a linked list whose elements are packets plus metadata (a Linux kernel skb):
It is a fixed-length queue implemented as a linked list of packets
Packets are stored in it, indexed by an integer
A packet is released when userspace issues a verdict for the corresponding index
When the queue is full, no further packets can be enqueued
Userspace can read several packets before issuing verdicts; as long as the queue does not fill up, this has no impact
Packets can be verdicted out of order: userspace can read packets 1, 2, 3, 4 and issue verdicts for 4, 2, 3, 1 in that order
If verdicts are issued too slowly the queue fills up, and the kernel then drops incoming packets instead of enqueuing them.
The protocol used between the kernel and userspace is nfnetlink. This is a message-based protocol which does not involve any shared memory. When a packet is enqueued, the kernel sends an nfnetlink-formatted message containing the packet data and related information to a socket; userspace reads this message and issues a verdict.
Prefragmentation logic is implemented in AvPreFragApp (a new application) as well as in Security Broker (an existing controller application).
In Security Broker, as soon as the tunnel is established, the following two rules are added to the raw table.
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
The above rule negotiates a proper MSS during the three-way handshake.
It is safe to assume that 1360 + TCPH + IPH + ESP + IPH <= 1500, so fragmentation won't happen after encryption (for example, with 20-byte inner TCP and IP headers the inner packet is at most 1400 bytes, leaving roughly 100 bytes of headroom for the ESP overhead and the outer IP header).
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
The above rule queues all UDP packets whose source IP is the TIA (tunnel address) and whose mark is not 0xbeef0000 to the netfilter queue, to be processed by the application. AvPreFragApp sets the mark 0xbeef0000 on all the UDP packets it sends out after queuing; this is done to avoid queuing the same packets repeatedly.
AvPreFragApp
The AvPreFragApp application makes use of netfilter queues to process the packets queued by the NFQUEUE target.
As mentioned above, the iptables rule that queues UDP packets having the TIA as their source IP is added by Security Broker. The rule is added upon tunnel establishment and updated with the new TIA upon tunnel bounce, so all packets with the TIA as source IP are queued for processing by AvPreFragApp.
AvPreFragApp calls a set of libnetfilter_queue APIs to set up the queue and copy packets from the kernel into the application (see the sketch after this list)
While creating the queue, the address of a callback function is passed; it is invoked once a packet has been queued for processing
NFQNL_COPY_PACKET mode needs to be set; it copies the whole packet from the kernel to the application
A file descriptor can be obtained from the netfilter queue handle, and the packet buffer can then be read from it with the recv function
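A minimal sketch of those steps, assuming queue 0 as in the rule above (this is not the AvPreFragApp source; error handling is trimmed and the verdict logic is reduced to plain ACCEPT; build roughly with gcc nfq_sketch.c -lnetfilter_queue):
#include <stdint.h>
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/netfilter.h>                       /* NF_ACCEPT, NF_DROP */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called once per queued packet; must end with a verdict. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    (void)nfmsg; (void)data;
    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload);      /* whole packet, thanks to NFQNL_COPY_PACKET */
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    printf("packet id %u, %d bytes\n", (unsigned)id, len);
    /* AvPreFragApp would check the size and DF bit here and DROP the
     * original after building fragments; this sketch accepts everything. */
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    if (!h) { fprintf(stderr, "nfq_open failed\n"); return 1; }

    /* Historically required; harmless on recent kernels. */
    nfq_unbind_pf(h, AF_INET);
    nfq_bind_pf(h, AF_INET);

    /* Connect to queue 0 and register the callback. */
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh) { fprintf(stderr, "nfq_create_queue failed\n"); return 1; }

    /* Copy the whole packet from kernel to userspace. */
    if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
        fprintf(stderr, "nfq_set_mode failed\n");
        return 1;
    }

    int fd = nfq_fd(h);
    char buf[65536];
    for (;;) {
        int rv = recv(fd, buf, sizeof(buf), 0);
        if (rv < 0)
            break;
        nfq_handle_packet(h, buf, rv);             /* dispatches to cb() */
    }

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}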
While processing a packet, AvPreFragApp checks its size. If the packet size is <= 1376, an ACCEPT verdict is given. An ACCEPT verdict is also given if the DF bit is set.
If the packet size is > 1376 and the DF bit is not set, a DROP verdict is given, meaning the original packet is dropped; by then, however, the packet has already been copied into the application's buffer. AvPreFragApp then performs the fragmentation in the application and writes all the fragments to a raw socket with the mark 0xbeef0000; sendmsg is used to write the packets to the raw socket.
These prefragmented packets are then encrypted and ESP-encapsulated in the kernel.
Note: TIA: Tunnel Internal Address, the logical IPSec interface.
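A sketch of that re-injection step might look roughly like the following. send_fragment() is a hypothetical helper, not AvPreFragApp code; building the IP header of each fragment is left out, and both raw sockets and SO_MARK require elevated privileges (CAP_NET_RAW / CAP_NET_ADMIN).
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write one application-built fragment (complete IP header included)
 * to a raw socket, marked so the NFQUEUE rule skips it. */
int send_fragment(const void *pkt, size_t len, struct sockaddr_in *dst)
{
    /* IPPROTO_RAW implies IP_HDRINCL: 'pkt' must begin with the IP header. */
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
    if (fd < 0) { perror("socket"); return -1; }

    /* Mark the packets 0xbeef0000 so "-m mark ! --mark 0xbeef0000/0xffff0000"
     * does not queue them a second time. */
    unsigned int mark = 0xbeef0000;
    if (setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark)) < 0)
        perror("setsockopt(SO_MARK)");

    struct iovec iov = { .iov_base = (void *)pkt, .iov_len = len };
    struct msghdr msg = {
        .msg_name    = dst,
        .msg_namelen = sizeof(*dst),
        .msg_iov     = &iov,
        .msg_iovlen  = 1,
    };
    ssize_t n = sendmsg(fd, &msg, 0);    /* sendmsg, as described above */
    if (n < 0)
        perror("sendmsg");
    close(fd);                           /* a real app would reuse the socket */
    return (int)n;
}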

Making a TCP dump without packet loss

How to make a TCP dump where it is guaranteed that all the packets that really pass through the network are captured, and nothing is missed?
Details:
We have an issue with 3rd party vendor who provides a solution on top of SCTP stack, which he also implements.
Under quite high throughput (52 000 messages/sec, average message size is 500 bytes) the SCTP link breaks.
We believe that the bug is in the vendor SCTP stack.
But the vendor says this happens because the SCTP stack sends a message, doesn't receive an ACK for it, sends a number of retransmits, doesn't receive ACKs for them either, and closes the SCTP link.
So the vendor claims the network is at fault, because it loses packets.
In the TCP dumps on both sides, client and server, we see that the original messages reach the server and that the server doesn't answer with an ACK. But the vendor says that TCP dumps are not reliable: when capturing, some packets may not be recorded, because the libpcap library works within a single hardware thread, and that may not be enough power to log all the packets.
Technical data:
52 000 messages/sec, average message size is 500 bytes, so 26 MB/sec in total, 4 SCTP links are used.
Hardware: CPU E5-2670, 2.6 GHz, 8 HW threads
Network: 10 GBit, the traffic is between HP blades, which are located in one rack.
RHEL 6.
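One thing worth checking is libpcap's own drop counters: tcpdump prints them on exit (the "dropped by kernel" line), and they can also be read programmatically with pcap_stats(). A minimal sketch, with the interface name, snaplen and packet count chosen purely for illustration (link with -lpcap):
#include <pcap/pcap.h>
#include <stdio.h>

static void noop(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    (void)user; (void)h; (void)bytes;   /* no per-packet work, just count */
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* Full snaplen, promiscuous mode, 1000 ms read timeout. */
    pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (!p) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* Capture a batch of packets. */
    pcap_loop(p, 100000, noop, NULL);

    struct pcap_stat st;
    if (pcap_stats(p, &st) == 0)
        printf("received %u, dropped by kernel %u, dropped by interface %u\n",
               st.ps_recv, st.ps_drop, st.ps_ifdrop);

    pcap_close(p);
    return 0;
}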

Where are the missing TCP packets?

I observed a surprising thing: when both UDP-based and TCP-based applications are sending packets, and the UDP-based application sends so fast that the bandwidth is nearly filled with UDP packets, the TCP packets become very hard to send out.
The surprising thing is that although the TCP-based application is able to send a few packets out (observed via the return value of write()), the receiver of the TCP packets never receives them. Why? Is that because the TCP packets are never actually sent out by the network card? Or are the TCP packets dropped by routers?
Thanks,
Steve
First, the return value of write() is not an indicator of whether packets were sent. It just indicates that the data was buffered.
Second, if you are saturating the network with UDP packets there will be a lot of packet loss, and TCP, being adaptive, will adapt to that by sending packets out more slowly. If the packet loss gets too high, TCP can basically stop altogether. The solution is not to saturate the network with UDP packets.
This is a simplified answer. There are many articles you can read up on.
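To illustrate the first point: on Linux, the SIOCOUTQ ioctl reports how much of the written data is still sitting in the socket's send buffer (for TCP, data not yet acknowledged by the peer). A small sketch, assuming tcp_fd is a connected TCP socket:
#include <linux/sockios.h>   /* SIOCOUTQ (Linux-specific) */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* A "successful" write() only means the data was copied into the
 * kernel send buffer; SIOCOUTQ shows how much of it is still there. */
ssize_t write_and_report(int tcp_fd, const char *buf, size_t len)
{
    ssize_t n = write(tcp_fd, buf, len);
    if (n < 0) {
        perror("write");
        return n;
    }

    int queued = 0;
    if (ioctl(tcp_fd, SIOCOUTQ, &queued) == 0)
        printf("write() accepted %zd bytes, %d bytes still queued in the kernel\n",
               n, queued);
    return n;
}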
UDP is a layer built upon IP. Ditto for TCP. The network card just sends out IP packets. You can look up the various structures for these packets.
TCP is a protocol that uses IP packets but uses a mechanism to try to ensure delivery and rearranges packets in the correct order. See the article on Wikipedia.
Routers are free to drop packets. This can occur when the network is overloaded, network connections are down or the IP packet is corrupted.
So, to answer your question: at the IP layer there is no preference given to UDP or TCP packets as they are transmitted from one end to the other.
