TCP_NODELAY parameter implication on MSS and CPU utilisation

TCP_NODELAY parameter implication on MSS and CPU utilisation - linux

While I was trying to compose test-suite using netperf I had to go through the manual where I came across the below line for option -D
If setting TCP_NODELAY with -D affects throughput and/or service demand for tests where the send size (-m) is larger than the MSS it suggests the TCP/IP stack’s implementation of the Nagle Algorithm may be broken, perhaps interpreting the Nagle Algorithm on a segment by segment basis rather than the proper user send by user send basis
My question is - how can setting TCP_DELAY and it's effect on throughput determines nagle implementation is broken? Can someone help me with a logical explanation on the same?

In broad terms Nagle goes something like:
Is the amount of data in this send, added to any data queued waiting
to be sent, larger than the Maximum Segment Size (MSS) of this
connection? If the answer is yes, then send the data now, so long
as there are no other constraints on sending such as the congestion
window or the receiver’s advertised window. If the answer is no,
then consider the next step:
Is the connection “idle?” In other
words, is there no un-ACKnowledged data out on the network. If so,
then send the data in the user’s send now. If the connection is not
idle, queue the data and wait. We know that one or more of three
things will happen to cause the data to be sent:
The application will continue making send calls and we get to an MSS-worth of data to send.
An ACKnowledgement will arrive from the remote for the currently un-ACKed data making the connection “idle” again.
The retransmission timer for the un-ACKed data will expire and we might piggy-back this data with the retransmission.
The main point being Nagle is supposed to be evaluated on a user-send by user-send basis. If disabling it results in higher throughput with a send size greater than the MSS, it implies the Nagle implementation of that stack is going TCP segment by TCP segment instead.

Related

maximizing linux raw socket throughput

I have a simple application which receives packets of fixed ethertype via raw socket (the transport is ethernet), and sends two duplicates over another interface (via raw socket):
recvfrom() //blocking
//make duplicate
//add tail
sendto(packet1);
sendto(packet2);
I want two increase throughput. I need at least 4000 frames/second, can't change packet size. How can I achieve these? The system is embedded (AM335x SoC), kernel is 4.14.40... How can I encrease the performance?

A few considerations:
I know you said you can't change "packet size", but using buffered writes and infrequent flushes might really help performance
You could enable jumbo frames
You could disable Nagle
Usually people don't want to change their packet sizes, because they expect a 1-to-1 correspondence between their send()'s and recv()'s. This is not a good thing, because TCP specifically does not ensure that your send()'s and recv()'s will have a 1-1 correspondence. They usually will be 1-1, but they are not guaranteed to do so. Transmitting data with Nagle enabled or over many router hops or without Path MTU Discovery enabled makes the 1-1 relationship less likely.
So if you use buffering, and frame your data somehow (EG 1: terminate messages with a nul byte, if your data cannot otherwise have a nul or EG 2: transfer lengths as network shorts or something, so you know how much to read), you'll likely be killing two birds with one stone - that is, you'll get better speed and better reliability.

Why TCP/IP speed depends on the size of sending data?

When I sent small data (16 bytes and 128 bytes) continuously (use a 100-time loop without any inserted delay), the throughput of TCP_NODELAY setting seems not as good as normal setting. Additionally, TCP-slow-start appeared to affect the transmission in the beginning.
The reason is that I want to control a device from PC via Ethernet. The processing time of this device is around several microseconds, but the huge latency of sending command affected the entire system. Could you share me some ways to solve this problem? Thanks in advance.
Last time, I measured the transfer performance between a Windows-PC and a Linux embedded board. To verify the TCP_NODELAY, I setup a system with two Linux PCs connecting directly with each other, i.e. Linux PC <--> Router <--> Linux PC. The router was only used for two PCs.
The performance without TCP_NODELAY is shown as follows. It is easy to see that the throughput increased significantly when data size >= 64 KB. Additionally, when data size = 16 B, sometimes the received time dropped until 4.2 us. Do you have any idea of this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share with me your thinking. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket library are winsock2.h and sys/socket.h for Windows and Linux, respectively. The binary file is copied to an array in Windows before sending, and the received data are stored in an array in Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all data properly. However, it seems to me that the transfer time depends on the size of sending data. When data size is small, the received throughput is quite low, as shown below.
Could you please shown me some documents explaining this problem? Thank you in advance.

To establish a tcp connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender will start to send some data. How much depends on the initial congestion window (configurable on linux, don't know on windows). As long as the sender receives timely ACKs, it will continue to send, as long as the receivers advertised window has the space (use socket option SO_RCVBUF to set). Finally, to close the connection also requires a FIN, FIN-ACK, ACK.
So my best guess without more information is that the overhead of setting up and tearing down the TCP connection has a huge affect on the overhead of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much affect as long as the writer is effectively writing quickly. It only prevents sending less than full MSS segements, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less than full MSS segment might need to wait for an ACK, which again would have more impact on the short transfers as compared to the longer transfers.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2KB/second. I think you can guess that if I sent 2 bytes, it would take essentially the same amount of time for an effective rate of 8.2KB/second, a 100% improvement!
The best way to diagnose performance problems in networks is to get a network capture and analyze it.

When you make your test with a significative amount of data, for example your bigger test (512Mib, 536 millions bytes), the following happens.
The data is sent by TCP layer, breaking them in segments of a certain length. Let assume segments of 1460 bytes, so there will be about 367,000 segments.
For every segment transmitted there is a overhead (control and management added data to ensure good transmission): in your setup, there are 20 bytes for TCP, 20 for IP, and 16 for ethernet, for a total of 56 bytes every segment. Please note that this number is the minimum, not accounting the ethernet preamble for example; moreover sometimes IP and TCP overhead can be bigger because optional fields.
Well, 56 bytes for every segment (367,000 segments!) means that when you transmit 512Mib, you also transmit 56*367,000 = 20M bytes on the line. The total number of bytes becomes 536+20 = 556 millions of bytes, or 4.448 millions of bits. If you divide this number of bits by the time elapsed, 4.6 seconds, you get a bitrate of 966 megabits per second, which is higher than what you calculated not taking in account the overhead.
From the above calculus, it seems that your ethernet is a gigabit. It's maximum transfer rate should be 1,000 megabits per second and you are getting really near to it. The rest of the time is due to more overhead we didn't account for, and some latencies that are always present and tend to be cancelled as more data is transferred (but they will never be defeated completely).
I would say that your setup is ok. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, latencies of the protocol and other nice things get more and more important. For example, if you transmit 16 bytes in 165 microseconds (first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bitrate would be about 31 Mbps (40 times bigger). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16+56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also bigger. But, you see, transmitting 16 bytes on a TCP/IP link is the same as measuring the flow rate of an empty acqueduct by dropping some tears of water in it: the tears get lost before they reach the other end. This is because TCP/IP and ethernet are designed to carry much more data, with reliability.
Comments and answers in this page point out many of those mechanisms that trade bitrate and reactivity for reliability: the 3-way TCP handshake, the Nagle algorithm, checksums and other overhead, and so on.
Given the design of TCP+IP and ethernet, it is very normal that, for little data, performances are not optimal. From your tests you see that the transfer rate climbs steeply when the data size reaches 64Kbytes. This is not a coincidence.
From a comment you leaved above, it seems that you are looking for a low-latency communication, instead than one with big bandwidth. It is a common mistake to confuse different kind of performances. Moreover, in respect to this, I must say that TCP/IP and ethernet are completely non-deterministic. They are quick, of course, but nobody can say how much because there are too many layers in between. Even in your simple setup, if a single packet get lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want something with low latency, you should use something else, for example a CAN. Its design is exactly what you want: it transmits little data with high speed, low latency, deterministic time (just microseconds after you transmitted a packet, you know if it has been received or not. To be more precise: exactly at the end of the transmission of a packet you know if it reached the destination or not).

TCP sockets typically have a buffer size internally. In many implementations, it will wait a little bit of time before sending a packet to see if it can fill up the remaining space in the buffer before sending. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but due to the fact that the TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or function called something like TcpNoDelay which can be false (default) or true. I would try messing with that and seeing if that affects your throughput. Essentially these flags will enable/disable Nagle's algorithm.

TCP Sockets send buffer size efficiency

When working with WinSock or POSIX TCP sockets (in C/C++, so no extra Java/Python/etc. wrapping), is there any efficiency pro/cons to building up a larger buffer (e.g. say upto 4KB) in user space then making as few calls to send as possible to send that buffer vs making multiple smaller calls directly with the bits of data (say 1-1000 bytes), other the the fact that for non-blocking/asynchronous sockets the single buffer is potentially easier for me to manage.
I know with recv small buffers are not recommended, but I couldn't find anything for sending.
e.g. does each send call on common platforms go to into kernel mode? Could a 1 byte send actually result in a 1 byte packet being transmitted under normal conditions?

As explained on TCP Illustrated Vol I, by Richard Stevens, TCP divides the send buffer in near to optimum segments to fit in the maximum packet size along the path to the other TCP peer. That means that it will never try to send segments that will be fragmented by ip along the route to destination (when a packet is fragmented at some ip router, it sends back an IP fragmentation ICMP packet and TCP will take it into account to reduce the MSS for this connection). That said, there is no need for larger buffer than the maximum packet size of the link level interfaces you'll have along the path. Having one, let's say, twice or thrice longer, makes you sure that TCP will not stop sending as soon as it receives some acknowledge of remote peer, because of not having its buffer filled with data.
Think that the normal interface type is ethernet and it has a maximum packet size of 1500 bytes, so normally TCP doesn't send a segment greater than this size. And it normally has an internall buffer of 8Kb per connection, so there's little sense in adding buffer size at kernel space for that (if this is the only reason to have a buffer in kernel space).
Of course, there are other factors that force you to use a buffer in user space (for example, you want to store the data to send to your peer process somewhere, as there's only 8Kb data in kernel space to buffer, and you will need more space to be able to do some other processes) An example: ircd (the Internet Relay Chat daemon) uses write buffers of up to 100Kb before dropping a connection because the other side is not receiving/acknowledging that data. If you only write(2) to the connection, you'll be put on wait once the kernel buffer is full, and perhaps that's not what you want.
The reason to have buffers in user space is because TCP makes also flow control, so when it's not able to send data, it has to be put somewhere to cope with it. You'll have to decide if you need your process to save that data up to a limit or you can block sending data until the receiver is able to receive again. The buffer size in kernel space is limited and normally out of control for the user/developer. Buffer size in user space is limited only by the resources allowable to it.
Receiving/sending small chunks of data in a TCP connection is not recommendable because of the increased overhead of TCP handshaking and headers impose. Suppose a telnet connection in which for each character sent, a header for TCP and other for IP is added (20 bytes min for TCP, 20 bytes min for IP, 14 bytes for ethernet frame and 4 for the ethernet CRC) makes up to 60 bytes+ to transmit only one character. And normally each tcp segment is acknowledged individually, so that makes a full roundtrip time to send a segment and get the acknowledge (just to be able to free the buffer resources and assume this character as transmitted)
So, finally, what's the limit? It depends on your application. If you can cope with the kernel resources available and don't need more buffers, you can pass without havin buffers in user space. If you need more, you'll need to implement buffers and be able to feed the kernel buffer with your buffer data when available.

Yes, a one byte send can - under very normal conditions - result in sending a TCP packet with only a single byte payload. Send coalescing in TCP is normally done by use of Nagle's algorithm. With Nagle's algorithm, sending data is delayed iff there is data that has already been sent but not yet acknowledged.
Conversely data will be sent immediately if there is no unacknowledged data. Which is usually true in the following situations:
The connection has just been opened
The connection has been idle for some time
The connection only received data but nothing was sent for some time
In that case the first send call that your application performs will cause a packet to be sent immediately, no matter how small. So starting communication with two or more small sends is usually a bad idea because it increases overhead and delay.
The infamous "send send recv" pattern can also cause really large delays (e.g. on Windows typically 200ms). This happens if the local TCP stack uses Nagle's algorithm (which will usually delay the second send) and the remote stack uses delayed acknowledgment (which can delay the acknowledgment of the first packet).
Since most TCP stack implementations use both, Nagle's algorithm and delayed acknowledgment, this pattern should best be avoided.

Application control of TCP retransmission on Linux

For the impatient:
How to change the value of /proc/sys/net/ipv4/tcp_retries2 for a single connection in Linux, using setsockopt(), ioctl() or such, or is it possible?
Longer decription:
I'm developing an application that uses long polling HTTP requests. On the server side, it needs to be known when the client has closed the connection. The accuracy is not critical, but it certainly cannot be 15 minutes. Closer to a minute would do fine.
For those not familiar with the concept, a long polling HTTP request works like this:
The client sends a request
The server responds with HTTP headers, but leaves the response open. Chunked transfer encoding is used, allowing the server to sends bits of data as they become available.
When all the data is sent, the server sends a "closing chunk" to signal that the response is complete.
In my application, the server sends "heartbeats" to the client every now an then (30 seconds by default). A heartbeat is just a newline character that is sent as a response chunk. This is meant to keep the line busy so that we notify the connection loss.
There's no problem when the client shuts down correctly. But when it's shut down with force (the client machine loses power, for example), a TCP reset is not sent. In this case, the server sends a heartbeat, which the client doesn't ACK. After this, the server keeps retransmitting the packet for roughly 15 minutes after giving up and reporting the failure to the application layer (our HTTP server). And 15 minutes is too long a wait in my case.
I can control the retransmission time by writing to the following files in /proc/sys/net/ipv4/:
tcp_retries1 - INTEGER
This value influences the time, after which TCP decides, that
something is wrong due to unacknowledged RTO retransmissions,
and reports this suspicion to the network layer.
See tcp_retries2 for more details.
RFC 1122 recommends at least 3 retransmissions, which is the
default.
tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection,
when RTO retransmissions remain unacknowledged.
Given a value of N, a hypothetical TCP connection following
exponential backoff with an initial RTO of TCP_RTO_MIN would
retransmit N times before killing the connection at the (N+1)th RTO.
The default value of 15 yields a hypothetical timeout of 924.6
seconds and is a lower bound for the effective timeout.
TCP will effectively time out at the first RTO which exceeds the
hypothetical timeout.
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.
The default value of tcp_retries2 is indeed 8, and my experience of 15 minutes (900 seconds) of retransmission is in line with the kernel documentation quoted above.
If I change the value of tcp_retries2 to 5 for example, the connection dies much more quicker. But setting it like this affects all the connections in the system, and I'd really like to set it for this one long polling connection only.
A quote from RFC 1122:
4.2.3.5 TCP Connection Failures
Excessive retransmission of the same segment by TCP
indicates some failure of the remote host or the Internet
path. This failure may be of short or long duration. The
following procedure MUST be used to handle excessive
retransmissions of data segments [IP:11]:
(a) There are two thresholds R1 and R2 measuring the amount
of retransmission that has occurred for the same
segment. R1 and R2 might be measured in time units or
as a count of retransmissions.
(b) When the number of transmissions of the same segment
reaches or exceeds threshold R1, pass negative advice
(see Section 3.3.1.4) to the IP layer, to trigger
dead-gateway diagnosis.
(c) When the number of transmissions of the same segment
reaches a threshold R2 greater than R1, close the
connection.
(d) An application MUST be able to set the value for R2 for
a particular connection. For example, an interactive
application might set R2 to "infinity," giving the user
control over when to disconnect.
(e) TCP SHOULD inform the application of the delivery
problem (unless such information has been disabled by
the application; see Section 4.2.4.1), when R1 is
reached and before R2. This will allow a remote login
(User Telnet) application program to inform the user,
for example.
It seems to me that tcp_retries1 and tcp_retries2 in Linux correspond to R1 and R2 in the RFC. The RFC clearly states (in item d) that a conforming implementation MUST allow setting the value of R2, but I have found no way to do it using setsockopt(), ioctl() or such.
Another option would be to get a notification when R1 is exceeded (item e). This is not as good as setting R2, though, as I think R1 is hit pretty soon (in a few seconds), and the value of R1 cannot be set per connection, or at least the RFC doesn't require it.

Looks like this was added in Kernel 2.6.37.
Commit diff from kernel Git and Excerpt from change log below;
commit
dca43c75e7e545694a9dd6288553f55c53e2a3a3
Author: Jerry Chu
Date: Fri Aug 27 19:13:28 2010 +0000
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
upto 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in anyway when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu#google.com>
Signed-off-by: David S. Miller <davem#davemloft.net>

I suggest that if the TCP_USER_TIMEOUT socket option described by Kimvais is available, you use that. On older kernels where that socket option is not present, you could repeatedly call the SIOCOUTQ ioctl() to determine the size of the socket send queue - if the send queue doesn't decrease over your timeout period, that indicates that no ACKs have been received and you can close the socket.

After some thinking (and googling), I came to the conclusion that you can't change tcp_retries1 and tcp_retries2 value for a single socket unless you apply some sort of patch to the kernel. Is that feasible for you?
Otherways, you could use TCP_KEEPALIVE socket option whose purpose is to check if a connection is still active (it seems to me that that's exactly what you are trying to achieve, so it has sense). Pay attention to the fact that you need to tweak its default parameter a little, because the default is to disconnect after about 2 hrs!!!

This is for my understanding.
tcp_retries2 is the number of retransmission that is permitted by the system before droping the conection.So if we want to change the default value of tcp_retries2 using TCP_USER_TIMEOUT which specifies the maximum amount of time transmitted data may remain unacknowledged, we have to increase the value of TCP_USER_TIMEOUT right?
In that case the conction will wait for a longer time and will not retransmit the data packet.
Please correct me, if something is wrong.

int name[] = {CTL_NET, NET_IPV4, NET_IPV4_TCP_RETRIES2};
long value = 0;
size_t size = sizeof(value);
if(!sysctl(name, sizeof(name)/sizeof(name[0]), &value, &size, NULL, 0) {
value // It contains current value from /proc/sys/net/ipv4/tcp_retries2
}
value = ... // Change value if it needed
if(!sysctl(name, sizeof(name)/sizeof(name[0]), NULL, NULL, &value, size) {
// Value in /proc/sys/net/ipv4/tcp_retries2 changed successfully
}
Programmatically way using C. It works at least on Ubuntu. But (according with code and system variables) looks like it influences on all TCP connections in system, not only one single connection.

I need a TCP option (ioctl) to send data immediately

I've got an unusual situation: I'm using a Linux system in an embedded situation (Intel box, currently using a 2.6.20 kernel.) which has to communicate with an embedded system that has a partially broken TCP implementation. As near as I can tell right now they expect each message from us to come in a separate Ethernet frame! They seem to have problems when messages are split across Ethernet frames.
We are on the local network with the device, and there are no routers between us (just a switch).
We are, of course, trying to force them to fix their system, but that may not end up being feasible.
I've already set TCP_NODELAY on my sockets (I connect to them), but that only helps if I don't try to send more than one message at a time. If I have several outgoing messages in a row, those messages tend to end up in one or two Ethernet frames, which causes trouble on the other system.
I can generally avoid the problem by using a timer to avoid sending messages too close together, but that obviously limits our throughput. Further, if I turn the time down too low, I risk network congestion holding up packet transmits and ending up allowing more than one of my messages into the same packet.
Is there any way that I can tell whether the driver has data queued or not? Is there some way I can force the driver to send independent write calls in independent transport layer packets? I've had a look through the socket(7) and tcp(7) man pages and I didn't find anything. It may just be that I don't know what I'm looking for.
Obviously, UDP would be one way out, but again, I don't think we can make the other end change anything much at this point.
Any help greatly appreciated.

IIUC, setting the TCP_NODELAY option should flush all packets (i.e. tcp.c implements setting of NODELAY with a call to tcp_push_pending_frames). So if you set the socket option after each send call, you should get what you want.

You cannot work around a problem unless you're sure what the problem is.
If they've done the newbie mistake of assuming that recv() receives exactly one message then I don't see a way to solve it completely. Sending only one message per Ethernet frame is one thing, but if multiple Ethernet frames arrive before the receiver calls recv() it will still get multiple messages in one call.
Network congestion makes it practically impossible to prevent this (while maintaining decent throughput) even if they can tell you how often they call recv().

Maybe, set TCP_NODELAY and set your MTU low enough so that there would be at most 1 message per frame? Oh, and add "dont-fragment" flag on outgoing packets

Have you tried opening a new socket for each message and closing it immediately? The overhead may be nauseating,but this should delimit your messages.

In the worst case scenario you could go one level lower (raw sockets), where you have better control over the packets sent, but then you'd have to deal with all the nitty-gritty of TCP.

Maybe you could try putting the tcp stack into low-latency mode:
echo 1 > /proc/sys/net/ipv4/tcp_low_latency
That should favor emitting packets as quickly as possible over combining data. Read the man on tcp(7) for more information.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string