How can I make a tcpdump capture where it is guaranteed that every packet that actually passes through the network is captured, and nothing is missed?
Details:
We have an issue with a third-party vendor who provides a solution on top of an SCTP stack, which they also implement.
Under fairly high throughput (52,000 messages/sec, average message size 500 bytes) the SCTP link breaks.
We believe the bug is in the vendor's SCTP stack.
But the vendor says this happens because the SCTP stack sends a message, receives no ACK for it, sends a number of retransmits, receives no ACKs for those either, and then closes the SCTP link.
So the vendor's position is that the network is at fault because it loses packets.
In the packet captures on both sides, client and server, we see that the original message reaches the server and that the server does not answer with an ACK. But the vendor says such captures are not reliable: when capturing, some packets may be missed, because the libpcap library works within a single hardware thread, whose capacity may not be enough to log all the packets.
Technical data:
52,000 messages/sec with an average message size of 500 bytes, so 26 MB/sec in total; 4 SCTP links are used.
Hardware: CPU E5-2670, 2.6 GHz, 8 HW threads
Network: 10 GBit; the traffic is between HP blades located in the same rack.
RHEL 6.
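As a partial answer to the capture question (a hedged suggestion, not a guarantee): tcpdump itself reports at the end of a capture how many packets were dropped by the kernel, and the libpcap capture buffer can be enlarged with the -B option (size in KiB), for example:
tcpdump -i eth0 -B 262144 -w capture.pcap
If "packets dropped by kernel" stays at 0 and the NIC's own drop counters do not increase during the test (check with ethtool -S eth0 and ip -s link show eth0), that is strong evidence the capture is complete: packets the NIC dropped before libpcap ever saw them would show up in those counters.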
Related
I'm developing a multi-RX-thread Ethernet driver, but this may lead to a potential issue of delivering out-of-order packets to the Linux network stack. I have verified this issue on a PPTP connection, because GRE has a sequence number and will drop out-of-order packets.
So, does TCP have a reassembly queue or a similar mechanism to process out-of-order segments?
TCP has a receive window buffer. As segments arrive out of order they are cached until the segment with the next expected sequence number is received. When that segment arrives (and is valid), the contiguous data is passed on to the application in order.
see https://www.quora.com/How-does-TCP-handle-the-duplicate-segments-and-out-of-order-segments
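To illustrate the mechanism, here is a simplified sketch in C (not the kernel's actual implementation; real TCP additionally handles sequence-number wraparound, overlapping segments, and SACK): out-of-order segments are held in a queue sorted by sequence number, and contiguous data is delivered as soon as the next expected byte arrives.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct segment {
    uint32_t seq;               /* sequence number of the first byte */
    size_t len;                 /* payload length */
    char data[64];              /* payload (fixed size for the sketch) */
    struct segment *next;
};

static struct segment *ooo_queue = NULL;  /* kept sorted by seq */
static uint32_t rcv_nxt = 0;              /* next byte we expect */

/* Insert a segment into the queue, keeping it sorted by seq. */
static void enqueue(uint32_t seq, const char *data, size_t len) {
    struct segment *s = malloc(sizeof *s);
    s->seq = seq;
    s->len = len;
    memcpy(s->data, data, len);
    struct segment **p = &ooo_queue;
    while (*p && (*p)->seq < seq)
        p = &(*p)->next;
    s->next = *p;
    *p = s;
}

/* Deliver every segment that is now contiguous with rcv_nxt. */
static void deliver_in_order(void) {
    while (ooo_queue && ooo_queue->seq == rcv_nxt) {
        struct segment *s = ooo_queue;
        fwrite(s->data, 1, s->len, stdout);  /* "pass to application" */
        rcv_nxt += s->len;
        ooo_queue = s->next;
        free(s);
    }
}

int main(void) {
    enqueue(7, "world\n", 6);   /* arrives early: held in the queue */
    deliver_in_order();         /* nothing contiguous yet, delivers nothing */
    enqueue(0, "hello, ", 7);   /* the missing segment arrives */
    deliver_in_order();         /* prints "hello, world" in order */
    return 0;
}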
I have written a C++ tool for my Linux machine which receives UDP (OSC) packets and sends them back immediately (that's the only thing it does). But it seems that some packets are dropped. When I send 100 packets to the Linux machine (from another machine), mostly only 64 packets are returned. I have looked at the incoming packets with tcpdump. It tells me the following:
64 packets captured
64 packets received by filter
0 packets dropped by kernel
So where are they?
UDP, by design, does not guarantee that packets arrive at the destination. The missing packets might not have reached your machine at all, and thus will not appear among the incoming packets.
UDP is mostly used for streams and games, where losing a few packets does not really matter.
If you want to be sure that all the packets arrive, you should use TCP.
Let me know if this helps.
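One way to see where the packets are being lost (a sketch with an assumed port and address, not the OP's actual tool): stamp each datagram with a sequence number and have the receiver log the numbers it sees. Gaps that tcpdump on the receiving machine never captures were lost in the network; gaps that tcpdump sees but the application misses were dropped locally.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);                      /* assumed port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* assumed receiver */

    for (uint32_t seq = 0; seq < 100; seq++) {
        char buf[64] = {0};
        uint32_t n = htonl(seq);
        memcpy(buf, &n, sizeof n);   /* first 4 bytes carry the sequence */
        sendto(fd, buf, sizeof buf, 0, (struct sockaddr *)&dst, sizeof dst);
    }
    close(fd);
    return 0;
}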
I am working in a virtualization environment (Linux over Hyper-V). The Linux driver for the virtual NIC supports TSO and GSO (TCP segmentation offload is ON and generic segmentation offload is ON).
Now I create a TCP socket and set the send buffer to 128 KB.
But based on ifconfig data (TX bytes and TX packets), the average packet size is about 11 KB.
So my question is: where does my packet get segmented (from 128 KB down to 11 KB)? How do I control/configure this via socket options or TCP options?
thanks!
===========EDIT==================
I have an application which can reach 8 Gbps throughput on a 10G network with 32 TCP connections; in this case the average packet size is about 20 KB, which is pretty good. But when I increased the number of TCP connections to 256, the throughput dropped to about 1 Gbps, as the packet size on the NIC went down to about 3 KB.
I know packet size is critical to performance, since the cost of processing traffic is per packet, not per byte, so the bigger the packets on the NIC, the better.
SO, MY QUESTION IS: how do I increase the TCP packet size? Are there any TCP settings that control this?
Your question is a little confusing, but there are a number of settings you need to play with to get 10GigE working right on Linux.
See here:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe/
Setting the socket options SO_SNDBUF and SO_RCVBUF might help, but TCP/IP does not guarantee the chunk sizes that are received or sent.
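As a minimal sketch of that tuning (the 4 MB value is an arbitrary starting point for experiments, not a recommendation): enlarge the send buffer with setsockopt and read back what the kernel actually granted, since Linux doubles the requested value and caps it at net.core.wmem_max.

#include <stdio.h>
#include <sys/socket.h>

int tune_sndbuf(int fd) {
    int sndbuf = 4 * 1024 * 1024;   /* 4 MB: an assumed starting point */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf) < 0) {
        perror("setsockopt(SO_SNDBUF)");
        return -1;
    }
    /* Linux doubles the requested value and caps it at net.core.wmem_max;
       read it back to see what was actually applied. */
    socklen_t len = sizeof sndbuf;
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    printf("effective SO_SNDBUF: %d bytes\n", sndbuf);
    return 0;
}

Note that this only influences how much data the stack can batch per send; the on-wire segment size is still decided by the MSS and by TSO/GSO in the driver.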
I am trying to address an issue where ESP packets are getting fragmented because adding the ESP header pushes them over the MTU. The solution (which is what everybody does) is to do the fragmentation before the ESP encryption.
My question is: if this is so useful, why doesn't the Linux IPSec implementation support it natively? I understand there would be certain limitations, e.g. L4 traffic selectors would not work, but not everybody makes use of those.
In addition, if you can share any pointers on the best way to add this support, it would really help.
Thanks.
To complete the circle (and hopefully help someone looking for a similar solution), we solved our problem using libnetfilter_queue. The challenge we had was that we did not have access to the application's source code; otherwise we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document, prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but I don't think you will have any trouble following it.
Final Solution
libnetfilter_queue is a userspace library providing APIs to process packets that have been queued by the kernel packet filter.
An application wishing to use this functionality should link dynamically against netfilter_queue and nfnetlink and include the necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and sysroot-target/usr/include/libnfnetlink/.
An iptables rule with NFQUEUE as the target needs to be added.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to userspace software. For example, the following rule will ask a listening userspace program for a decision on every queued packet:
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a program must use the libnetfilter_queue APIs to connect to queue 0 (the default one) and read the messages from the kernel. It must then issue a verdict on each packet.
When a packet reaches an NFQUEUE target it is enqueued in the queue corresponding to the number given by the --queue-num option. The packet queue is implemented as a linked list whose elements are the packet and its metadata (a Linux kernel skb):
It is a fixed-length queue implemented as a linked list of packets
Packets are indexed by an integer
A packet is released when userspace issues a verdict for the corresponding index
When the queue is full, no new packets can be enqueued
Userspace can read multiple packets and wait before issuing verdicts; as long as the queue is not full, this behavior has no impact
Verdicts can be issued out of order: userspace can read packets 1, 2, 3, 4 and issue verdicts in the order 4, 2, 3, 1
If verdicts are issued too slowly the queue fills up, and the kernel then drops incoming packets instead of enqueuing them
The protocol used between kernel and userspace is nfnetlink, a message-based protocol which does not involve any shared memory. When a packet is enqueued, the kernel sends an nfnetlink-formatted message containing the packet data and related information to a socket; userspace reads this message and issues a verdict, as sketched below.
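Here is a minimal sketch of that userspace side (not the actual AvPreFragApp code): it binds to queue 0, asks for full packet copies with NFQNL_COPY_PACKET, and simply accepts every packet. Build with -lnetfilter_queue.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/netfilter.h>                        /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called once per queued packet; must end by issuing a verdict. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data) {
    (void)nfmsg; (void)data;
    uint32_t id = 0;
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    if (ph)
        id = ntohl(ph->packet_id);
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void) {
    struct nfq_handle *h = nfq_open();
    if (!h) { perror("nfq_open"); exit(1); }
    nfq_unbind_pf(h, AF_INET);          /* harmless on recent kernels */
    nfq_bind_pf(h, AF_INET);

    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh) { perror("nfq_create_queue"); exit(1); }
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);  /* copy whole packet */

    int fd = nfq_fd(h);
    char buf[65536];
    int rv;
    while ((rv = recv(fd, buf, sizeof buf, 0)) >= 0)
        nfq_handle_packet(h, buf, rv);  /* dispatches to cb() */

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}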
The prefragmentation logic is implemented in AvPreFragApp (a new application) as well as in Security Broker (an existing controller application).
In Security Broker, as soon as the tunnel is established, the following two rules are added to the raw table.
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
The rule above negotiates a proper MSS during the three-way handshake.
It is safe to assume that 1360 + TCPH + IPH + ESP + IPH <= 1500, so that fragmentation won't happen after encryption.
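To make that budget concrete (sample numbers; the real ESP overhead depends on the negotiated algorithms, here assumed to be AES-CBC with HMAC-SHA1-96 in tunnel mode):
1360 (MSS) + 20 (TCP header) + 20 (inner IP header) = 1400 bytes
ESP overhead: 8 (SPI + sequence) + 16 (IV) + up to 15 (padding) + 2 (pad length + next header) + 12 (ICV) = at most 53 bytes
Outer IP header: 20 bytes
Worst case: 1400 + 53 + 20 = 1473 <= 1500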
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
The rule above queues all UDP packets with source IP equal to the TIA (tunnel address) and a mark not equal to 0xbeef0000 to the netfilter queue, to be processed by the application. AvPreFragApp marks every UDP packet it re-injects with 0xbeef0000; this is done to avoid queuing the same packets repeatedly.
AvPreFragApp
The AvPreFragApp application makes use of netfilter queues to process the packets queued by the NFQUEUE target.
As mentioned above, the iptables rule that queues UDP packets having the TIA as source IP is added by Security Broker. This rule is added upon tunnel establishment and updated with the new TIA upon tunnel bounce, so all packets with the TIA as the source IP are queued up for processing by AvPreFragApp.
AvPreFragApp calls a set of APIs from libnetfilter_queue to set up the queue and copy each packet from the kernel to the application:
While creating the queue, it passes the address of a callback function, which is invoked once a packet is queued for processing
NFQNL_COPY_PACKET mode needs to be set; it copies the whole packet from the kernel to the application
A file descriptor can be obtained from the netfilter queue handle, and the packet buffer can be read with the recv function
While processing a packet, AvPreFragApp checks its size. If the packet size is <= 1376, an ACCEPT verdict is given; an ACCEPT verdict is also given if the DF bit is set
If the packet size is > 1376 and the DF bit is not set, a DROP verdict is given. This means the original packet is dropped, but by then it has already been copied into the application's buffer. AvPreFragApp then does the fragmentation in the application, and all the fragments are written to a raw socket with the mark 0xbeef0000; sendmsg is used to write the packets to the raw socket (a sketch of this verdict logic follows the note below)
These prefragmented packets are encrypted and ESP-encapsulated in the kernel.
Note: TIA: Tunnel Internal Address, the logical IPSec interface.
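Here is a hedged sketch of the verdict logic described above (the names PREFRAG_THRESHOLD, REINJECT_MARK, and fragment_and_send are illustrative, not from the real AvPreFragApp code): small packets and packets with DF set are accepted; anything else is dropped and re-injected as fragments through a raw socket marked 0xbeef0000, so the iptables rule does not queue them again.

#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/ip.h>                             /* struct iphdr, IP_DF */
#include <sys/socket.h>
#include <linux/netfilter.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

#define PREFRAG_THRESHOLD 1376
#define REINJECT_MARK     0xbeef0000u

/* Raw socket for re-injecting fragments; SO_MARK makes them skip the
   NFQUEUE rule, which matches only unmarked packets. */
int open_marked_raw_socket(void) {
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
    unsigned int mark = REINJECT_MARK;
    setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof mark);
    return fd;
}

int give_verdict(struct nfq_q_handle *qh, uint32_t id,
                 unsigned char *pkt, int len) {
    struct iphdr *iph = (struct iphdr *)pkt;
    int df_set = ntohs(iph->frag_off) & IP_DF;

    if (len <= PREFRAG_THRESHOLD || df_set)
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);

    /* fragment_and_send() is a placeholder for the application-level
       fragmentation plus sendmsg() on the marked raw socket. */
    /* fragment_and_send(open_marked_raw_socket(), pkt, len); */
    return nfq_set_verdict(qh, id, NF_DROP, 0, NULL);
}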
I am trying to generate a series of packets to simulate the TCP 3-way handshake. My first step was to capture the real connection packets and try to re-send the same packets from the same machine, but at first it didn't work.
I finally found out that the packet I captured with tcpdump is not exactly what my computer sent out: the TCP checksum field is different. This led me to think that I could establish a TCP connection even with an incorrect TCP checksum.
So my question is: how is the checksum field calculated? Is it modified by tcpdump or by the hardware? Why is it changed? Is it a bug in tcpdump, or is the calculation simply omitted?
The following is a screenshot captured from my host machine and a virtual machine; you can see that the same packet captured on the two machines is identical except for the TCP checksum.
The small window is my virtual machine; I used the command "ssh 10.82.25.138" from the host to generate these packets.
What you are seeing may be the result of checksum offloading. To quote from the Wireshark wiki (http://wiki.wireshark.org/CaptureSetup/Offloading):
Most modern operating systems support some form of network offloading,
where some network processing happens on the NIC instead of the CPU.
Normally this is a great thing. It can free up resources on the rest
of the system and let it handle more connections. If you're trying to
capture traffic it can result in false errors and strange or even
missing traffic.
On systems that support checksum offloading, IP, TCP, and UDP
checksums are calculated on the NIC just before they're transmitted on
the wire. In Wireshark these show up as outgoing packets marked black
with red Text and the note [incorrect, should be xxxx (maybe caused by
"TCP checksum offload"?)].
Wireshark captures packets before they are sent to the network
adapter. It won't see the correct checksum because it has not been
calculated yet. Even worse, most OSes don't bother initializing this
data so you're probably seeing little chunks of memory that you
shouldn't.
Although this is written for Wireshark, the same principle applies. On your host machine you see the wrong checksum because it just hasn't been filled in yet; it looks right on the guest because it is filled in before being sent out on the "wire". Try disabling checksum offloading on the interface handling this traffic, e.g.:
ethtool -K eth0 rx off tx off
if it's eth0.
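For the "how is the checksum calculated" part of the question, here is a minimal sketch (a straightforward RFC 1071 implementation, not the kernel's optimized one): the checksum is the 16-bit one's complement of the one's-complement sum over a pseudo-header (source and destination IP, a zero byte, the protocol number, and the TCP length) followed by the TCP header and payload, with the checksum field itself zeroed during the computation. With offloading enabled, this sum is computed on the NIC after the capture point, which is exactly why the captured outgoing packet shows a stale value.

#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>

struct pseudo_header {
    uint32_t src_addr;    /* from the IP header, network byte order */
    uint32_t dst_addr;
    uint8_t  zero;
    uint8_t  protocol;    /* 6 for TCP */
    uint16_t tcp_length;  /* TCP header + payload, in bytes */
};

/* RFC 1071 running one's-complement sum; 'sum' carries state across calls. */
static uint32_t csum_add(uint32_t sum, const void *data, size_t len) {
    const uint16_t *p = data;
    while (len > 1) { sum += *p++; len -= 2; }
    if (len)                      /* odd trailing byte */
        sum += *(const uint8_t *)p;
    return sum;
}

/* 'tcp_segment' must have its checksum field zeroed before calling. */
uint16_t tcp_checksum(uint32_t src, uint32_t dst,
                      const void *tcp_segment, uint16_t tcp_len) {
    struct pseudo_header ph = { src, dst, 0, 6, htons(tcp_len) };
    uint32_t sum = csum_add(0, &ph, sizeof ph);
    sum = csum_add(sum, tcp_segment, tcp_len);
    while (sum >> 16)             /* fold carries into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;        /* stored as-is in the TCP header */
}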