Is it worthwhile to increase the TCP MTU, and how does the Don't Fragment flag work? - linux

I have a custom protocol layered on top of TCP that can be described as follows:
Client sends a packet A to the server. The server ACKS the packet A.
Client sends a packet B.
In other words, at any point in time there is only one unacknowledged packet. Hence, the factors that determine how fast messages can be sent are:
How soon a packet can arrive at the destination. This implies the least amount of segmentation done by TCP: if a packet arrives in a single segment as opposed to 5 segments, the server can respond to it more quickly.
The unit of work done by the server for that packet. At present, I am not focused on this point, though eventually I will address it as well.
Also assume, the rate of loss is negligible.
Nagle is disabled.
Typical packet sizes vary from 1KB to 3KB.
Bandwidth is 1Gb/sec
I am thinking that if I configure the MTU to equal the biggest message size (3KB + headers), this should increase the number of messages I can send per second. My question is: are there any negative consequences to changing the MTU? This application runs inside a LAN in a managed environment.
Alternatively, if I set the Don't Fragment flag, would that be equivalent to the above change?

First, let's clarify the difference between MTU and MSS. These belong to different layers of the stack: the MTU is a property of the link/IP layer, while the MSS is a TCP (transport-layer) parameter.
TCP/IP is a rather unfortunate layer cake: both layers can split data into smaller units, but they do it differently and do not cooperate on the matter.
IP fragmentation is something TCP is unaware of. If one of the IP fragments is lost, the whole series is declared lost. Not so for TCP: if one of the IP datagrams belonging to the same TCP stream is lost, and the data was segmented by TCP, only the lost segments need to be retransmitted.
The core reason for this mess is that a router must be able to impedance-match between two physical networks with different MTUs without understanding the higher (TCP) protocol.
Now, all modern networks support "jumbo frames" (you have to configure your NIC to be able to send jumbo frames; all modern NICs will always be able to receive frames up to 90xx bytes).
As usual, increasing the MTU:
is not useful unless you also increase the MSS,
improves performance (bandwidth), and
hurts performance (zero-load latency to the first byte).
In some applications, such as the Gigalinx implementation of GigE Vision, increasing the MTU is a requirement: over fast networks the overhead of a 1500-byte MTU is intolerable.
As an architect, the thing to ask yourself is what your application is actually doing. If there is a "relevant packet size", in the sense that "until the first 3kB of data are received there's nothing to do with the rest", and you really need this tiny performance edge, increase the MTU. Before doing that, consider dropping TCP altogether in favor of a more Ethernet-friendly protocol, and of course do not implement it yourself but choose something like ZeroMQ, which works well.
Second question: Don't Fragment is a flag in the IP header, set by the sender. Routers, which are expected to match networks of different MTUs, treat it as "discard the packet instead of fragmenting it if it does not fit on the next network" (and they should report the drop back via ICMP). If such drops happen on the path and the ICMP feedback never arrives, TCP cannot work over that path: it will try to retransmit and fail again and again, eventually disconnect, and further behavior will depend on what the application does. This is a typical situation on the internet, with misconfigured public Wi-Fi networks and home networks: you can sometimes browse Facebook but cannot practically watch anything on YouTube. This is why, and network administrators would never know the reason.
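For reference, on Linux the DF bit on outgoing packets is normally controlled per socket through the IP_MTU_DISCOVER option. A minimal sketch (fd is assumed to be an already-created socket; error handling is omitted):

#include <netinet/in.h>     /* IP_MTU_DISCOVER, IP_PMTUDISC_* */
#include <sys/socket.h>

/* Turn the DF bit on (classic Path MTU Discovery) or off for this socket. */
static int set_df(int fd, int on)
{
    int val = on ? IP_PMTUDISC_DO : IP_PMTUDISC_DONT;
    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}

With IP_PMTUDISC_DO the kernel sets DF and relies on ICMP "fragmentation needed" messages coming back; if those are filtered somewhere along the path, you get exactly the black-hole behavior described above.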

MSS = Maximum Segment Size = the amount of data sent in one TCP segment.
Decreasing the MSS will reduce performance, as the data will be split across more TCP segments.
Increasing the MSS past its correct value will cause IP fragmentation, because the resulting segments no longer fit into a single link-layer (Ethernet) frame.
TCP already tries to find, per connection, the largest possible MSS that doesn't cause fragmentation. Unless this fails (it rarely does), there's no need to override this value. Fragmentation should be avoided: overriding the MSS can save very little and can easily hurt performance as well.
Don't touch the MSS unless you know what you're doing. It has its value for a good reason.
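If you just want to see what the kernel settled on, here is a small sketch (assuming fd is an already-connected TCP socket) of reading the effective MSS via TCP_MAXSEG:

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_MAXSEG */
#include <sys/socket.h>

/* Print the MSS the kernel is using for this connection. */
static void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("effective MSS: %d bytes\n", mss);
}

On a connected socket this reflects the negotiated value (clamped by the interface MTU), which is usually a better sanity check than setting it yourself.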

Related

How intrusive is tcpdump?

I have looked around for documentation on tcpdump internals, but I have not found anything yet. So my question is: how intrusive is tcpdump on a computer? How can I evaluate the amount of resources (memory or CPU) dedicated to the analysis of the traffic?
tcpdump is a very simple tool which basically opens a special type of socket,
socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL))
and writes everything it gets to disk.
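To illustrate the idea, a stripped-down sketch of that kind of capture loop (root privileges assumed, error handling mostly omitted; tcpdump itself goes through libpcap and does much more):

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */

int main(void)
{
    /* Same socket type as above: delivers every frame on every interface. */
    int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[65536];
    for (;;) {
        /* Each read pulls one link-layer frame out of the kernel's capture
         * buffer; tcpdump would dissect it or append it to a pcap file. */
        ssize_t n = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
        if (n < 0) break;
        printf("%zd bytes\n", n);
    }
    close(fd);
    return 0;
}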
The kernel does all the capturing and manages a special buffer to store packets for tcpdump. If the buffer is full, packets are simply dropped. The buffer size is regulated with the -B option; most systems have an upper limit for the buffer, ~2GB or something like that.
From a CPU standpoint you need enough computing power to copy all the data 2 or 3 times; this usually is not a problem. If you are unable to keep up with a 1Gb link you should most probably blame disk speed, not the CPU. For a 10Gb link it could be a CPU problem, and a memory-bus bandwidth problem, and you may need some optimisations for this.
As far as I have read, tcpdump's consumption varies depending on what you ask it to capture.
To see how many resources your tcpdump process consumes, just watch a system monitor such as top (see the top manual).
tcpdump output can be considerable if the network traffic your expression defines is high bandwidth, particularly if you are capturing more than the default 68 bytes of packet content. Capturing packets, for example, related to a large file transfer or a web server being actively used by hundreds or thousands of clients will produce an overwhelming amount of output. If writing this output to stdout you will probably be unable to enter commands in your terminal; if writing to a file you may exhaust the host's disk space. In either case tcpdump is also likely to consume a great deal of CPU and memory resources.
To avoid these issues:
Be very careful when specifying expressions and try to make them as specific as possible.
Don't capture during times of heavy traffic/load.
If you wish to capture entire packet contents, do a test capture only capturing the default 68 bytes first and make a judgement on whether the system will cope with full packet content capture.
Where writing to disk, carefully monitor the size of the file and make sure the host in question has the disk resources likely to be required available, or use the -c parameter to limit the number of packets captured.
Never use an expression that would capture traffic to or from your remote telnet/SSH/whatever terminal/shell. tcpdump output would generate traffic to your terminal, resulting in further output, resulting in more traffic to your terminal and so on in an infinite and potentially harmful feedback loop.
Origin : Tcpdump - Basics

How to cope with 320 million 272-byte UDP packets?

So, I have an incoming UDP stream composed of 272 byte packets at a data rate of about 5.12Gb/s (around 320e6 packets per second). This data is being sent by an FPGA-based custom board. The packet size is a limit of the digital design being run, so although theoretically it could be possible to increase it to make things more efficient, it would require a large amount of work. At the receiving end these packets are read and interpreted by a network thread and placed in a circular buffer shared with a buffering thread, which will copy this data to a GPU for processing.
The above setup at the receiving end could cope with 5.12Gb/s for 4096-byte packets (used on a different design) using simple recv calls; however, with the current packet size I'm having a hard time keeping up with the packet flow: too much time is being "wasted" in context switching and copying small data segments from kernel space to user space. I did a quick test implementation which uses recvmmsg, however things didn't improve by much. On average I can process about 40% of the incoming packets.
So I was wondering whether it is possible to get a handle on the kernel's UDP data buffer for my application (mmap style), or use some sort of zero-copy from kernel to user space?
Alternatively, do you know of any other method which would reduce this overhead and be capable of performing the required processing?
This is running on a Linux machine (kernel 3.2.0-40) using C code.
There is support for mmap packet receiving in Linux.
It's not as easy to use as UDP sockets, because you will receive packets as from a RAW socket.
See this for more information.
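For a rough idea of what this looks like, here is a minimal sketch of an mmap'ed receive ring (TPACKET_V1, arbitrary ring dimensions, hardly any error handling; a real implementation would also bind to a specific interface and likely use a newer TPACKET version):

#include <stdio.h>
#include <unistd.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>   /* struct tpacket_req, tpacket_hdr */
#include <linux/if_ether.h>    /* ETH_P_ALL */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    /* 64 blocks of 4096 bytes, one frame per block. */
    struct tpacket_req req = {
        .tp_block_size = 4096, .tp_block_nr = 64,
        .tp_frame_size = 4096, .tp_frame_nr = 64,
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* The ring is shared with the kernel; packets land here without a
     * per-recv copy into a user buffer. */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);

    for (unsigned i = 0; ; i = (i + 1) % req.tp_frame_nr) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + i * req.tp_frame_size);

        /* Block until the kernel hands this frame to user space. */
        while (!(hdr->tp_status & TP_STATUS_USER)) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
        }

        unsigned char *pkt = (unsigned char *)hdr + hdr->tp_mac;
        printf("captured %u bytes (first byte %02x)\n", hdr->tp_len, pkt[0]);

        hdr->tp_status = TP_STATUS_KERNEL;   /* give the frame back */
    }
}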

Can you optimize/configure TCP inside your controlled network so it becomes as fast as UDP?

I am considering writing my own implementation of reliable UDP (packet ordering and retransmission for packet drops). This is for internal systems inside my controlled network. I wonder if it is possible on a Linux system to optimize TCP so much that it becomes as fast as UDP? If it is I will just use super-optimized TCP and not worry about implementing reliable UDP.
There are a few things that you can do to adapt TCP to your specific needs. You can increase the maximum buffer size, change the congestion algorithm and much more. I think it's really worth trying before re-inventing the wheel, especially since you seem to have good control over your internal network. This article describes a few of those things. Another good source of information for all those parameters is Documentation/networking/ip-sysctl.txt in your Linux source code.
You can also check that TCP Selective Acknowledgement (SACK) is enabled on your Linux system by looking at /proc/sys/net/ipv4/tcp_sack.
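Some of these knobs can also be set per socket. A small sketch, with the caveats that fd is assumed to be a not-yet-connected TCP socket, the 4 MB figure is arbitrary, and TCP_CONGESTION needs a reasonably recent kernel with the named module available:

#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void tune_tcp_socket(int fd)
{
    int buf = 4 * 1024 * 1024;                 /* bigger send/receive buffers,
                                                  capped by net.core.*mem_max */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf));

    int one = 1;                               /* disable Nagle for low latency */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    const char *cc = "cubic";                  /* pick a congestion algorithm */
    setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));
}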
There are different measures of performance (throughput, latency, jitter, robustness to packet drop).
You can't get packet delivery and ordering guarantees without introducing arbitrary delay. In that sense, reliable UDP cannot be "faster" than TCP.
It's possible to use FEC to reduce the number of retransmits, at the expense of additional bandwidth. That would only improve throughput over a very lossy link though. It will reduce jitter and maximum latency over a link with any drops.
Or are you talking about TCP fairness vs UDP greediness?
In any case, your OS's TCP layer is already very highly optimized; it's unlikely you can do better without changing the rules of the game (e.g. sacrificing fairness).

What happens after a packet is captured?

I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in kernel space, then to user space for whatever application then works on the packet data. Then I read about DMA, where the NIC copies the packet directly into memory, bypassing the CPU. So is the NIC -> kernel memory -> user space memory flow still valid? Also, do most NICs (e.g. Myricom) use DMA to improve packet capture rates?
Secondly, does RSS (Receive Side Scaling) work similarly on both Windows and Linux systems? I can only find detailed explanations of how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concepts of RSS and MSI-X should still apply to Linux systems, right?
Thank you.
Regards,
Rayne
How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
At driver initialization, it will allocate some number of buffers and give these to the NIC.
When a packet is received by the NIC, it pulls the next address off its list of buffers, DMAs the data directly into it, and notifies the driver via an interrupt.
The driver gets the interrupt and either turns the buffer over to the kernel or allocates a new kernel buffer and copies the data into it. "Zero-copy networking" is the former and obviously requires support from the operating system (more on this below).
The driver then either allocates a new buffer (in the zero-copy case) or re-uses the old one. In either case a buffer is given back to the NIC for future packets; a simplified sketch of this ring follows below.
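A purely illustrative sketch of that hand-off (this is not any real driver's API, just the shape of the descriptor ring described above; alloc_buf and pass_up are hypothetical placeholders, and the setup that fills the ring and marks all slots NIC-owned is omitted):

#include <stddef.h>

#define RING_SIZE 256

struct rx_slot {
    void  *buf;          /* kernel buffer the NIC will DMA into */
    size_t len;          /* filled in by the NIC when a packet arrives */
    int    owned_by_nic; /* 1 = NIC may write here, 0 = driver owns it */
};

static struct rx_slot rx_ring[RING_SIZE];

/* Conceptual interrupt path: for every slot the NIC has filled and handed
 * back, pass the buffer up the stack (zero-copy) and refill the slot. */
static void rx_interrupt(void *(*alloc_buf)(size_t),
                         void (*pass_up)(void *, size_t))
{
    for (int i = 0; i < RING_SIZE; i++) {
        if (rx_ring[i].owned_by_nic)
            continue;                            /* not filled yet */
        pass_up(rx_ring[i].buf, rx_ring[i].len); /* hand the data to the stack */
        rx_ring[i].buf = alloc_buf(2048);        /* replacement buffer */
        rx_ring[i].owned_by_nic = 1;             /* return the slot to the NIC */
    }
}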
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it may as well copy the data too: looking at the data incurs cache misses, and copying it elsewhere can be almost free with tuned code).
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and both processes are able to recv() incoming data.
TCP packets on the Internet usually carry 1460 bytes of payload (MTU of 1500 = 20-byte IP header + 20-byte TCP header + 1460 bytes of data). 1460 is not a power of 2 and won't match the page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented: there is no boundary between sender writes, and two 1000-byte writes waiting at the receiver will be consumed entirely in a single 2000-byte read.
Taking this further, consider the user buffers. These are allocated by the application. In order to be used for zero-copy all the way down, the buffer needs to be page-aligned and not share that memory page with anything else. At recv() time, the kernel could theoretically remap the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above since successive packets will be on separate pages. The kernel could limit the data it hands back to each packet's payload, but this will mean a lot of additional system calls, page remapping and likely lower throughput overall.
I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.
Take a look at this paper, http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf; it might help clear up some of the memory-management questions.

Where are possible locations of queueing/buffering delays in Linux multicast?

We make heavy use of multicast messaging across many Linux servers on a LAN. We are seeing a lot of delays. We basically send an enormous number of small packets. We are more concerned with latency than throughput. The machines are all modern, multi-core (at least four cores, generally eight, 16 if you count hyperthreading), always with a load of 2.0 or less, usually with a load less than 1.0. The networking hardware is also under 50% capacity.
The delays we see look like queueing delays: the packets will quickly start increasing in latency, until it looks like they jam up, then return back to normal.
The messaging structure is basically this: in the "sending thread", pull messages from a queue, add a timestamp (using gettimeofday()), then call send(). The receiving program receives the message, timestamps the receive time, and pushes it in a queue. In a separate thread, the queue is processed, analyzing the difference between sending and receiving timestamps. (Note that our internal queues are not part of the problem, since the timestamps are added outside of our internal queuing.)
We don't really know where to start looking for an answer to this problem. We're not familiar with Linux internals. Our suspicion is that the kernel is queuing or buffering the packets, either on the send side or the receive side (or both). But we don't know how to track this down and trace it.
For what it's worth, we're using CentOS 4.x (RHEL kernel 2.6.9).
This is a great question. On CentOS, like most flavors of *nix, there is a UDP receive/send buffer for every multicast socket. The size of this buffer is controlled via sysctl.conf; you can view the size of your buffers by running /sbin/sysctl -a.
The items below show my default and max UDP receive/send buffer sizes in bytes. The larger these numbers, the more buffering, and therefore latency, the network/kernel can introduce if your application is too slow in consuming the data. If you have built in good tolerance for data loss, you can make these buffers very small and you will not see the latency build-up and recovery you described above. The trade-off is data loss as the buffer overflows, something you may be seeing already.
[~]$ /sbin/sysctl -a | grep mem
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
In most cases you need to set the default equal to your max, unless you are controlling this yourself when you create your socket.
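Controlling it at socket-creation time looks roughly like this (a sketch; the 16 MB request is arbitrary and will be capped by net.core.rmem_max unless you use SO_RCVBUFFORCE with CAP_NET_ADMIN):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    int want = 16 * 1024 * 1024;               /* ask for a 16 MB receive buffer */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

    int got = 0;
    socklen_t len = sizeof(got);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("kernel granted %d bytes\n", got);  /* Linux reports double the value */
    return 0;
}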
The last thing you can do (depending on your kernel version) is view the UDP stats for your process's PID, or at the very least for the box overall.
cat /proc/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
cat /proc/PID/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
If it wasn't clear from my post, the latency is due to your application not consuming the data fast enough and forcing the kernel to buffer traffic in the structure above. The network, the kernel, and even your network card's ring buffers can play a role in latency, but all those items typically only add a few milliseconds.
Let me know your thoughts and I can give you more information on where to look in your app to squeeze some more performance.
Packets can queue up in the send and receive side kernel, the NIC and the networking infrastructure. You will find a plethora of items you can test and tweak.
For the NIC you can usually find interrupt coalescing parameters - how long the NIC will wait before notifying the kernel or sending to the wire whilst waiting to batch packets.
For Linux you have the send and receive "buffers", the larger they are the more likely you are to experience higher latency as packets get handled in batched operations.
For the architecture and Linux version you have to be aware of how expensive context switches are and whether there are locks or pre-emptive scheduling enabled. Consider minimizing the number of applications running, using process affinity to lock processes to particular cores.
Don't forget timing: the Linux kernel version you are using has pretty poor accuracy for the gettimeofday() clock (2-4 ms), and it is quite an expensive call. Consider using alternatives such as reading from the core TSC or an external HPET device.
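On newer kernels/glibc, clock_gettime() with CLOCK_MONOTONIC is the usual cheaper and finer-grained alternative (a sketch; on the 2.6.9-era kernel in the question the resolution may still be limited by the timer tick, and older glibc needs -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);       /* unaffected by wall-clock steps */
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}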
Diagram from Intel: http://www.theinquirer.net/IMG/142/96142/latency-580x358.png?1272514422
If you decide you need to capture packets in the production environment, it may be worth looking at using monitor ports on your switches and capture the packets using non-production machines. That'll also allow you to capture the packets on multiple points across the transmission path and compare what you're seeing.
