Less throughput with small packets

Less throughput with small packets - linux

I have a question why the throughput of my machine is very bad with a SMALL sized packet (i.e 64bytes) when compared with the packet sized 1500bytes?
I am having a GIGABIT NIC card and able to transmit at 80MB/s for 1500bytes sized packets but in the case 64bytes sized packet I can hardly make out around 25MB/s.
I know that in the case of 1500byte packets I need to send around 80k PPS to reach line rate and for 64bytes its around 1.4 million PPS.
But why there is a huge variation in throughput for small sized packets ??
EDIT: I am using memory mapping to transmit the packets from user-space to kernel-space in linux and then directly writing into the network driver to transmit. And I see my CPU utilization is very less and same when compared between 64bytes and 1500bytes packets.

But why there is a huge variation in throughput for small sized
packets ??
CPU strain. Independent of its size, each packet that gets out passes through a lot of processing before reaching the interface. Put another way, the "costs" of transmitting a small packet and a large packet are comparable.
If you're interested in this you might want to look into "GSO" and "UFO" in the Linux kernel - it was developed specifically for this.

It takes time to send packet headers. It takes time to setup DMA buffers, process packet headers, etc. All that extra work reduces the amount of actual payload that can be sent.

Think about this: each packet has its header contains the size of payload(data) and some general data. lets say the header are 16 bytes.
If you send 1000 packets of 64 bytes you send 1000 * (64 + 16) = 64000 + 16000 bytes.
If you send it in one shot it is only 64000+16 bytes.

Related

Why TCP/IP speed depends on the size of sending data?

When I sent small data (16 bytes and 128 bytes) continuously (use a 100-time loop without any inserted delay), the throughput of TCP_NODELAY setting seems not as good as normal setting. Additionally, TCP-slow-start appeared to affect the transmission in the beginning.
The reason is that I want to control a device from PC via Ethernet. The processing time of this device is around several microseconds, but the huge latency of sending command affected the entire system. Could you share me some ways to solve this problem? Thanks in advance.
Last time, I measured the transfer performance between a Windows-PC and a Linux embedded board. To verify the TCP_NODELAY, I setup a system with two Linux PCs connecting directly with each other, i.e. Linux PC <--> Router <--> Linux PC. The router was only used for two PCs.
The performance without TCP_NODELAY is shown as follows. It is easy to see that the throughput increased significantly when data size >= 64 KB. Additionally, when data size = 16 B, sometimes the received time dropped until 4.2 us. Do you have any idea of this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share with me your thinking. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket library are winsock2.h and sys/socket.h for Windows and Linux, respectively. The binary file is copied to an array in Windows before sending, and the received data are stored in an array in Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all data properly. However, it seems to me that the transfer time depends on the size of sending data. When data size is small, the received throughput is quite low, as shown below.
Could you please shown me some documents explaining this problem? Thank you in advance.

To establish a tcp connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender will start to send some data. How much depends on the initial congestion window (configurable on linux, don't know on windows). As long as the sender receives timely ACKs, it will continue to send, as long as the receivers advertised window has the space (use socket option SO_RCVBUF to set). Finally, to close the connection also requires a FIN, FIN-ACK, ACK.
So my best guess without more information is that the overhead of setting up and tearing down the TCP connection has a huge affect on the overhead of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much affect as long as the writer is effectively writing quickly. It only prevents sending less than full MSS segements, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less than full MSS segment might need to wait for an ACK, which again would have more impact on the short transfers as compared to the longer transfers.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2KB/second. I think you can guess that if I sent 2 bytes, it would take essentially the same amount of time for an effective rate of 8.2KB/second, a 100% improvement!
The best way to diagnose performance problems in networks is to get a network capture and analyze it.

When you make your test with a significative amount of data, for example your bigger test (512Mib, 536 millions bytes), the following happens.
The data is sent by TCP layer, breaking them in segments of a certain length. Let assume segments of 1460 bytes, so there will be about 367,000 segments.
For every segment transmitted there is a overhead (control and management added data to ensure good transmission): in your setup, there are 20 bytes for TCP, 20 for IP, and 16 for ethernet, for a total of 56 bytes every segment. Please note that this number is the minimum, not accounting the ethernet preamble for example; moreover sometimes IP and TCP overhead can be bigger because optional fields.
Well, 56 bytes for every segment (367,000 segments!) means that when you transmit 512Mib, you also transmit 56*367,000 = 20M bytes on the line. The total number of bytes becomes 536+20 = 556 millions of bytes, or 4.448 millions of bits. If you divide this number of bits by the time elapsed, 4.6 seconds, you get a bitrate of 966 megabits per second, which is higher than what you calculated not taking in account the overhead.
From the above calculus, it seems that your ethernet is a gigabit. It's maximum transfer rate should be 1,000 megabits per second and you are getting really near to it. The rest of the time is due to more overhead we didn't account for, and some latencies that are always present and tend to be cancelled as more data is transferred (but they will never be defeated completely).
I would say that your setup is ok. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, latencies of the protocol and other nice things get more and more important. For example, if you transmit 16 bytes in 165 microseconds (first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bitrate would be about 31 Mbps (40 times bigger). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16+56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also bigger. But, you see, transmitting 16 bytes on a TCP/IP link is the same as measuring the flow rate of an empty acqueduct by dropping some tears of water in it: the tears get lost before they reach the other end. This is because TCP/IP and ethernet are designed to carry much more data, with reliability.
Comments and answers in this page point out many of those mechanisms that trade bitrate and reactivity for reliability: the 3-way TCP handshake, the Nagle algorithm, checksums and other overhead, and so on.
Given the design of TCP+IP and ethernet, it is very normal that, for little data, performances are not optimal. From your tests you see that the transfer rate climbs steeply when the data size reaches 64Kbytes. This is not a coincidence.
From a comment you leaved above, it seems that you are looking for a low-latency communication, instead than one with big bandwidth. It is a common mistake to confuse different kind of performances. Moreover, in respect to this, I must say that TCP/IP and ethernet are completely non-deterministic. They are quick, of course, but nobody can say how much because there are too many layers in between. Even in your simple setup, if a single packet get lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want something with low latency, you should use something else, for example a CAN. Its design is exactly what you want: it transmits little data with high speed, low latency, deterministic time (just microseconds after you transmitted a packet, you know if it has been received or not. To be more precise: exactly at the end of the transmission of a packet you know if it reached the destination or not).

TCP sockets typically have a buffer size internally. In many implementations, it will wait a little bit of time before sending a packet to see if it can fill up the remaining space in the buffer before sending. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but due to the fact that the TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or function called something like TcpNoDelay which can be false (default) or true. I would try messing with that and seeing if that affects your throughput. Essentially these flags will enable/disable Nagle's algorithm.

TCP Sockets send buffer size efficiency

When working with WinSock or POSIX TCP sockets (in C/C++, so no extra Java/Python/etc. wrapping), is there any efficiency pro/cons to building up a larger buffer (e.g. say upto 4KB) in user space then making as few calls to send as possible to send that buffer vs making multiple smaller calls directly with the bits of data (say 1-1000 bytes), other the the fact that for non-blocking/asynchronous sockets the single buffer is potentially easier for me to manage.
I know with recv small buffers are not recommended, but I couldn't find anything for sending.
e.g. does each send call on common platforms go to into kernel mode? Could a 1 byte send actually result in a 1 byte packet being transmitted under normal conditions?

As explained on TCP Illustrated Vol I, by Richard Stevens, TCP divides the send buffer in near to optimum segments to fit in the maximum packet size along the path to the other TCP peer. That means that it will never try to send segments that will be fragmented by ip along the route to destination (when a packet is fragmented at some ip router, it sends back an IP fragmentation ICMP packet and TCP will take it into account to reduce the MSS for this connection). That said, there is no need for larger buffer than the maximum packet size of the link level interfaces you'll have along the path. Having one, let's say, twice or thrice longer, makes you sure that TCP will not stop sending as soon as it receives some acknowledge of remote peer, because of not having its buffer filled with data.
Think that the normal interface type is ethernet and it has a maximum packet size of 1500 bytes, so normally TCP doesn't send a segment greater than this size. And it normally has an internall buffer of 8Kb per connection, so there's little sense in adding buffer size at kernel space for that (if this is the only reason to have a buffer in kernel space).
Of course, there are other factors that force you to use a buffer in user space (for example, you want to store the data to send to your peer process somewhere, as there's only 8Kb data in kernel space to buffer, and you will need more space to be able to do some other processes) An example: ircd (the Internet Relay Chat daemon) uses write buffers of up to 100Kb before dropping a connection because the other side is not receiving/acknowledging that data. If you only write(2) to the connection, you'll be put on wait once the kernel buffer is full, and perhaps that's not what you want.
The reason to have buffers in user space is because TCP makes also flow control, so when it's not able to send data, it has to be put somewhere to cope with it. You'll have to decide if you need your process to save that data up to a limit or you can block sending data until the receiver is able to receive again. The buffer size in kernel space is limited and normally out of control for the user/developer. Buffer size in user space is limited only by the resources allowable to it.
Receiving/sending small chunks of data in a TCP connection is not recommendable because of the increased overhead of TCP handshaking and headers impose. Suppose a telnet connection in which for each character sent, a header for TCP and other for IP is added (20 bytes min for TCP, 20 bytes min for IP, 14 bytes for ethernet frame and 4 for the ethernet CRC) makes up to 60 bytes+ to transmit only one character. And normally each tcp segment is acknowledged individually, so that makes a full roundtrip time to send a segment and get the acknowledge (just to be able to free the buffer resources and assume this character as transmitted)
So, finally, what's the limit? It depends on your application. If you can cope with the kernel resources available and don't need more buffers, you can pass without havin buffers in user space. If you need more, you'll need to implement buffers and be able to feed the kernel buffer with your buffer data when available.

Yes, a one byte send can - under very normal conditions - result in sending a TCP packet with only a single byte payload. Send coalescing in TCP is normally done by use of Nagle's algorithm. With Nagle's algorithm, sending data is delayed iff there is data that has already been sent but not yet acknowledged.
Conversely data will be sent immediately if there is no unacknowledged data. Which is usually true in the following situations:
The connection has just been opened
The connection has been idle for some time
The connection only received data but nothing was sent for some time
In that case the first send call that your application performs will cause a packet to be sent immediately, no matter how small. So starting communication with two or more small sends is usually a bad idea because it increases overhead and delay.
The infamous "send send recv" pattern can also cause really large delays (e.g. on Windows typically 200ms). This happens if the local TCP stack uses Nagle's algorithm (which will usually delay the second send) and the remote stack uses delayed acknowledgment (which can delay the acknowledgment of the first packet).
Since most TCP stack implementations use both, Nagle's algorithm and delayed acknowledgment, this pattern should best be avoided.

Is zero-copy UDP packing receiving possibly on Linux?

I would like to have UDP packets copied directly from the ethernet adapter into my userspace buffer
Some details on my setup:
I am receiving data from a pair of gigabit ethernet cameras. Combined I am receiving 28800 UDP packets per second (1 packet per line * 30FPS * 2 cameras * 480 lines). There is no way for me to switch to jumbo frames, and I am already looking into tuning driver level interrupts for reduced CPU utilization. What I am after here is reducing the number of times I am copying this ~40MB/s data stream.
This is the best source I have found on this, but I was hoping there was a more complete reference or proof that such an approach worked out in practice.

This article may be useful:
http://yusufonlinux.blogspot.com/2010/11/data-link-access-and-zero-copy.html

Your best avenues are recvmmsg and increasing RX interrupt coalescing.
http://lwn.net/Articles/334532/
You can move lower and match how Wireshark/tcpdump operate but it becomes futile to attempt any serious processing above it having to decode everything yourself.
At only 30,000 packets per second I wouldn't worry too much about copying packets, those problems arise when dealing with 3,000,000 messages per second.

UDP IP Fragmentation and MTU

I'm trying to understand some behavior I'm seeing in the context of sending UDP packets.
I have two little Java programs: one that transmits UDP packets, and the other that receives them. I'm running them locally on my network between two computers that are connected via a single switch.
The MTU setting (reported by /sbin/ifconfig) is 1500 on both network adapters.
If I send packets with a size < 1500, I receive them. Expected.
If I send packets with 1500 < size < 24258 I receive them. Expected. I have confirmed via wireshark that the IP layer is fragmenting them.
If I send packets with size > 24258, they are lost. Not Expected. When I run wireshark on the receiving side, I don't see any of these packets.
I was able to see similar behavior with ping -s.
ping -s 24258 hostA works but
ping -s 24259 hostA fails.
Does anyone understand what may be happening, or have ideas of what I should be looking for?
Both computers are running CentOS 5 64-bit. I'm using a 1.6 JDK, but I don't really think it's a programming problem, it's a networking or maybe OS problem.

Implementations of the IP protocol are not required to be capable of handling arbitrarily large packets. In theory, the maximum possible IP packet size is 65,535 octets, but the standard only requires that implementations support at least 576 octets.
It would appear that your host's implementation supports a maximum size much greater than 576, but still significantly smaller than the maximum theoretical size of 65,535. (I don't think the switch should be a problem, because it shouldn't need to do any defragmentation -- it's not even operating at the IP layer).
The IP standard further recommends that hosts not send packets larger than 576 bytes, unless they are certain that the receiving host can handle the larger packet size. You should maybe consider whether or not it would be better for your program to send a smaller packet size. 24,529 seems awfully large to me. I think there may be a possibility that a lot of hosts won't handle packets that large.
Note that these packet size limits are entirely separate from MTU (the maximum frame size supported by the data link layer protocol).

I found the following which may be of interest:
Determine the maximum size of a UDP datagram packet on Linux
Set the DF bit in the IP header and send continually larger packets to determine at what point a packet is fragmented as per Path MTU Discovery. Packet fragmentation should then result in a ICMP type 3 packet with code 4 indicating that the packet was too large to be sent without being fragmented.
Dan's answer is useful but note that after headers you're really limited to 65507 bytes.

What are the units of UDP buffers, and where are docs for sysctl params?

I'm running x86_64 RedHat 5.3 (kernel 2.6.18) and looking specifically at net.core.rmem_max from sysctl -a in the context of trying to set UDP buffers. The receiver application misses packets sometimes, but I think the buffer is already plenty large, depending upon what it means:
What are the units of this setting -- bits, bytes, packets, or pages? If bits or bytes, is it from the datagram/ payload (such as 100 bytes) or the network MTU size (~1500 bytes)? If pages, what's the page size in bytes?
And is this the max per system, per physical device (NIC), per virtual device (VLAN), per process, per thread, per socket/ per multicast group?
For example, suppose my data is 100 bytes per message, and each network packet holds 2 messages, and I want to be able to buffer 50,000 messages per socket, and I open 3 sockets per thread on each of 4 threads. How big should net.core.rmem_max be? Likewise, when I set socket options inside the application, are the units payload bytes, so 5000000 on each socket in this case?
Finally, in general how would I find details of the units for the parameters I see via sysctl -a? I have similar units and per X questions about other parameters such as net.core.netdev_max_backlog and net.ipv4.igmp_max_memberships.
Thank you.

You'd look at these docs. That said, many of these parameters really are quite poorly documented, so do expect do do som googling to dig out the gory details from blogs and mailinglists.
rmem_max is the per socket maximum buffer, in bytes. Digging around, this appears to be the memory where whole packets are received, so the size have to include the sizes of whatever/ip/udp headers as well - though this area is quite fuzzy to me.
Keep in mind though, UDP is unreliable. There's a lot of sources for loss, not the least inbetween switches and routers - these have buffers as well.

It is fully documented in the socket(7) man page (it is in bytes).
Moreover, the limit may be set on a per-socket basis with SO_RCVBUF (as documented in the same page).
Read the socket(7), ip(7) and udp(7) man pages for information on how these things actually work. The sysctls are documented there.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string