Application control of TCP retransmission on Linux

Application control of TCP retransmission on Linux - linux

For the impatient:
How to change the value of /proc/sys/net/ipv4/tcp_retries2 for a single connection in Linux, using setsockopt(), ioctl() or such, or is it possible?
Longer decription:
I'm developing an application that uses long polling HTTP requests. On the server side, it needs to be known when the client has closed the connection. The accuracy is not critical, but it certainly cannot be 15 minutes. Closer to a minute would do fine.
For those not familiar with the concept, a long polling HTTP request works like this:
The client sends a request
The server responds with HTTP headers, but leaves the response open. Chunked transfer encoding is used, allowing the server to sends bits of data as they become available.
When all the data is sent, the server sends a "closing chunk" to signal that the response is complete.
In my application, the server sends "heartbeats" to the client every now an then (30 seconds by default). A heartbeat is just a newline character that is sent as a response chunk. This is meant to keep the line busy so that we notify the connection loss.
There's no problem when the client shuts down correctly. But when it's shut down with force (the client machine loses power, for example), a TCP reset is not sent. In this case, the server sends a heartbeat, which the client doesn't ACK. After this, the server keeps retransmitting the packet for roughly 15 minutes after giving up and reporting the failure to the application layer (our HTTP server). And 15 minutes is too long a wait in my case.
I can control the retransmission time by writing to the following files in /proc/sys/net/ipv4/:
tcp_retries1 - INTEGER
This value influences the time, after which TCP decides, that
something is wrong due to unacknowledged RTO retransmissions,
and reports this suspicion to the network layer.
See tcp_retries2 for more details.
RFC 1122 recommends at least 3 retransmissions, which is the
default.
tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection,
when RTO retransmissions remain unacknowledged.
Given a value of N, a hypothetical TCP connection following
exponential backoff with an initial RTO of TCP_RTO_MIN would
retransmit N times before killing the connection at the (N+1)th RTO.
The default value of 15 yields a hypothetical timeout of 924.6
seconds and is a lower bound for the effective timeout.
TCP will effectively time out at the first RTO which exceeds the
hypothetical timeout.
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.
The default value of tcp_retries2 is indeed 8, and my experience of 15 minutes (900 seconds) of retransmission is in line with the kernel documentation quoted above.
If I change the value of tcp_retries2 to 5 for example, the connection dies much more quicker. But setting it like this affects all the connections in the system, and I'd really like to set it for this one long polling connection only.
A quote from RFC 1122:
4.2.3.5 TCP Connection Failures
Excessive retransmission of the same segment by TCP
indicates some failure of the remote host or the Internet
path. This failure may be of short or long duration. The
following procedure MUST be used to handle excessive
retransmissions of data segments [IP:11]:
(a) There are two thresholds R1 and R2 measuring the amount
of retransmission that has occurred for the same
segment. R1 and R2 might be measured in time units or
as a count of retransmissions.
(b) When the number of transmissions of the same segment
reaches or exceeds threshold R1, pass negative advice
(see Section 3.3.1.4) to the IP layer, to trigger
dead-gateway diagnosis.
(c) When the number of transmissions of the same segment
reaches a threshold R2 greater than R1, close the
connection.
(d) An application MUST be able to set the value for R2 for
a particular connection. For example, an interactive
application might set R2 to "infinity," giving the user
control over when to disconnect.
(e) TCP SHOULD inform the application of the delivery
problem (unless such information has been disabled by
the application; see Section 4.2.4.1), when R1 is
reached and before R2. This will allow a remote login
(User Telnet) application program to inform the user,
for example.
It seems to me that tcp_retries1 and tcp_retries2 in Linux correspond to R1 and R2 in the RFC. The RFC clearly states (in item d) that a conforming implementation MUST allow setting the value of R2, but I have found no way to do it using setsockopt(), ioctl() or such.
Another option would be to get a notification when R1 is exceeded (item e). This is not as good as setting R2, though, as I think R1 is hit pretty soon (in a few seconds), and the value of R1 cannot be set per connection, or at least the RFC doesn't require it.

Looks like this was added in Kernel 2.6.37.
Commit diff from kernel Git and Excerpt from change log below;
commit
dca43c75e7e545694a9dd6288553f55c53e2a3a3
Author: Jerry Chu
Date: Fri Aug 27 19:13:28 2010 +0000
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
upto 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in anyway when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu#google.com>
Signed-off-by: David S. Miller <davem#davemloft.net>

I suggest that if the TCP_USER_TIMEOUT socket option described by Kimvais is available, you use that. On older kernels where that socket option is not present, you could repeatedly call the SIOCOUTQ ioctl() to determine the size of the socket send queue - if the send queue doesn't decrease over your timeout period, that indicates that no ACKs have been received and you can close the socket.

After some thinking (and googling), I came to the conclusion that you can't change tcp_retries1 and tcp_retries2 value for a single socket unless you apply some sort of patch to the kernel. Is that feasible for you?
Otherways, you could use TCP_KEEPALIVE socket option whose purpose is to check if a connection is still active (it seems to me that that's exactly what you are trying to achieve, so it has sense). Pay attention to the fact that you need to tweak its default parameter a little, because the default is to disconnect after about 2 hrs!!!

This is for my understanding.
tcp_retries2 is the number of retransmission that is permitted by the system before droping the conection.So if we want to change the default value of tcp_retries2 using TCP_USER_TIMEOUT which specifies the maximum amount of time transmitted data may remain unacknowledged, we have to increase the value of TCP_USER_TIMEOUT right?
In that case the conction will wait for a longer time and will not retransmit the data packet.
Please correct me, if something is wrong.

int name[] = {CTL_NET, NET_IPV4, NET_IPV4_TCP_RETRIES2};
long value = 0;
size_t size = sizeof(value);
if(!sysctl(name, sizeof(name)/sizeof(name[0]), &value, &size, NULL, 0) {
value // It contains current value from /proc/sys/net/ipv4/tcp_retries2
}
value = ... // Change value if it needed
if(!sysctl(name, sizeof(name)/sizeof(name[0]), NULL, NULL, &value, size) {
// Value in /proc/sys/net/ipv4/tcp_retries2 changed successfully
}
Programmatically way using C. It works at least on Ubuntu. But (according with code and system variables) looks like it influences on all TCP connections in system, not only one single connection.

Related

TCP_NODELAY parameter implication on MSS and CPU utilisation

While I was trying to compose test-suite using netperf I had to go through the manual where I came across the below line for option -D
If setting TCP_NODELAY with -D affects throughput and/or service demand for tests where the send size (-m) is larger than the MSS it suggests the TCP/IP stack’s implementation of the Nagle Algorithm may be broken, perhaps interpreting the Nagle Algorithm on a segment by segment basis rather than the proper user send by user send basis
My question is - how can setting TCP_DELAY and it's effect on throughput determines nagle implementation is broken? Can someone help me with a logical explanation on the same?

In broad terms Nagle goes something like:
Is the amount of data in this send, added to any data queued waiting
to be sent, larger than the Maximum Segment Size (MSS) of this
connection? If the answer is yes, then send the data now, so long
as there are no other constraints on sending such as the congestion
window or the receiver’s advertised window. If the answer is no,
then consider the next step:
Is the connection “idle?” In other
words, is there no un-ACKnowledged data out on the network. If so,
then send the data in the user’s send now. If the connection is not
idle, queue the data and wait. We know that one or more of three
things will happen to cause the data to be sent:
The application will continue making send calls and we get to an MSS-worth of data to send.
An ACKnowledgement will arrive from the remote for the currently un-ACKed data making the connection “idle” again.
The retransmission timer for the un-ACKed data will expire and we might piggy-back this data with the retransmission.
The main point being Nagle is supposed to be evaluated on a user-send by user-send basis. If disabling it results in higher throughput with a send size greater than the MSS, it implies the Nagle implementation of that stack is going TCP segment by TCP segment instead.

TCP backlog works not as expected in Linux [duplicate]

I have the following problem:
I have sockfd = socket(AF_INET, SOCK_STREAM, 0)
After I set up and bind the socket (let's say with sockfd.sin_port = htons(666)), I immediately do:
listen(sockfd, 3);
sleep(50); // for test purposes
I'm sleeping for 50 seconds to test the backlog argument, which seems to be ignored because I can establish a connection* more than 3 times on port 666.
*: What I mean is that I get a syn/ack for each Nth SYN (n>3) sent from the client and placed in the listen queue, instead of being dropped. What could be wrong? I've read the man pages of listen(2) and tcp(7) and found:
The behavior of the backlog argument on TCP sockets changed with Linux 2.2.
Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can
be set using
/proc/sys/net/ipv4/tcp_max_syn_backlog.
When syncookies are enabled there is
no logical maximum length and this
setting is ignored. See tcp(7) for
more
information.
, but even with sysctl -w sys.net.ipv4.tcp_max_syn_backlog=2 and sysctl -w net.ipv4.tcp_syncookies=0, I still get the same results! I must be missing something or completely missunderstand listen()'s backlog purpose.

The backlog argument to listen() is only advisory.
POSIX says:
The backlog argument provides a hint
to the implementation which the
implementation shall use to limit the
number of outstanding connections in
the socket's listen queue.
Current versions of the Linux kernel round it up to the next highest power of two, with a minimum of 16. The revelant code is in reqsk_queue_alloc().

Different operating systems provide different numbers of queued connections with different backlog numbers. FreeBSD appears to be one of the few OSes that actually has a 1-to-1 mapping. (source: http://books.google.com/books?id=ptSC4LpwGA0C&lpg=PA108&ots=Kq9FQogkTr&dq=berkeley%20listen%20backlog%20ack&pg=PA108#v=onepage&q=berkeley%20listen%20backlog%20ack&f=false )

BLE: Max number of packets in Connection interval

Is there a limit on maximum number of packets (LE_DATA) that could be send by either slave or master during one connection interval?
If this limit exists, are there any specific conditions for this limit (e.g. only x number of ATT data packets)?
Are master/slave required or allowed to impose such a limit by specification?

(I hope I'm not reviving a dead post. But I think the section 4.5.1 is better suited to answer this than 4.5.6.)
The spec doesn't define a limit of packets. It just states the following:
4.5.1 Connection Events - BLUETOOTH SPECIFICATION Version 4.2 [Vol 6, Part B]
(...)
The start of a connection event is called an anchor point. At the anchor point, a master shall start to transmit a Data Channel PDU to the slave. The start of connection events are spaced regularly with an interval of connInterval and shall not overlap. The master shall ensure that a connection event closes at least T_IFS before the anchor point of the next connection event. The slave listens for the packet sent by its master at the anchor point.
T_IFS is the "Inter Frame Space" time and shall be 150 μs. Simply put it's the job of the master to solve this problem. As far as I know iOS limits the packet number to 4 per connection event for instance. Android may have other hard coded limits depending on the OS version.

There is max data rate that can be achieved both on BT and BLE. You can tweak this data rate by changing MTU (maximum transmission unit - packet size) up to max MTU both ends of transmission can handle. But AFAIK there is no straight constraint on number of packets, besides physical ones imposed by the data rate.
You can find more in the spec

I could find the following in Bluetooth Spec v4.2:
4.5.6 Closing Connection Events
The MD bit of the Header of the Data Channel PDU is used to indicate
that the device has more data to send. If neither device has set the
MD bit in their packets, the packet from the slave closes the
connection event. If either or both of the devices have set the MD
bit, the master may continue the connection event by sending another
packet, and the slave should listen after sending its packet. If a
packet is not received from the slave by the master, the master will
close the connection event. If a packet is not received from the
master by the slave, the slave will close the connection event.
Two consecutive packets received with an invalid CRC match within a
connection event shall close the event.
This means both slave and masters have self-imposed limit on number of packets they want to transmit during a CI. When either party doesn't wish to send more data, they just set this bit to 0 and other one understands. This should usually be driven by number of pending packets on either side.
Since I was looking for logical limits due to spec or protocol, this probably answers my question.
Physical limits to number packets per CI would be governed by data rate, and as #morynicz mentioned, on MTU etc.

From my understanding, the limit is: min{max master event length, max slave event length, connection interval}.
To clarify, both the master and slave devices (specifically, the BLE stack thereof) typically have event length or "GAP event length" times. This time limit may be used to allow a central and/or advertiser and/or broadcaster to schedule the "phase offset" of more than one BLE radio activity, and/or limit the CPU usage of the BLE stack for application processing needs. E.g. a Nordic SoftDevice stack may have a default event length of 3.75ms that is indefinitely extendable (up to the connection interval) based on other demands on the SoftDevice's scheduler. In Android and iOS BLE implementations, this value may be opaque or not specified (e.g. Android may set this value to "0", which leaves the decision up to the controller implementation associated with the BLE chip of that device).
Note also that either the master or the slave may effectively "drop out" of a connection event earlier than these times if their TX/RX buffers are filled (e.g. Nordic SoftDevice stack may have a buffer size of 6 packets/frames). This may be implemented by not setting the MD bit (if TX buffer is exhausted) and/or "nacking" with the NESN bit (if RX buffer is full). However, while the master device can actually "drop out" by ending the connection event (not sending any more packets), the slave device must listen for master packets as long as at least one of master and slave have the MD bit set and the master continues to transmit packets (e.g. the slave could keep telling the master that it has no more data and also keep NACKing master packets because it has no more buffer space for the current connection event, but the master may keep trying to send for as long as it wants during the connection interval; not sure how/if the controller stack implements any "smarts" regarding this).
If there are no limits from either device in terms of stack-specified event length or buffer size, then presumably packets could be transmitted back and forth the entire connection interval (assuming at least one side had data to send and therefore set the MD bit). Just note for throughput calculation purposes that there is a T_IFS spacing (currently specified at 150us) between each packet and before the end of the connection interval.

__connect_no_cancel blocks and server gets data out of order

I have a TCP server using select to get data from a client through TCP socket.
The Server is slow in consuming data while the client is much faster. My client sends 8 bytes of data and each time it
-open a new connection
-write data
-disconnect
Because of this ( the server socket must accept many connection ) I increased the backlock value of listen to 500.
Despite this setting, at some point I can see that
-my client blocks in a pthread function called __connect_nocancel and this happens many times.
-after a while my server starts receiving data out of orders. The first data messed up is the one where the client blocks ( followed by other ).
I thought that increasing the backlog may fix this but this issue but this is not the case.
Can You help me? I am in Linux 2.6.32
Cheers
AFG

The backlog parameter of listen(2) is usually capped to some value inside the OS network stack. On Linux the default is 128.
The real problem though is, as #EJP is saying, you are totally mis-using TCP.

If ordering is important, your client must just keep a single connection open and write everything via that single connection. There are no two ways about this. TCP guarantees byte ordering withing the stream. Nothing guarantees the ordering of server-side processing of distinct connections.
It's also considerably more efficient. At present you are exchanging about eight packets for every eight bytes, which implies an overhead of up to 160 bytes.

Linux - TCP connect() failure with ETIMEDOUT

For a TCP client connect() call to a TCP server..
UNIX® Network Programming book by Richard Stevens says the following..
If the client TCP receives no response to its SYN segment, ETIMEDOUT is returned. 4.4BSD,
for example, sends one SYN when connect is called, another 6 seconds later, and another
24 seconds later (p. 828 of TCPv2). If no response is received after a total of 75 seconds, the
error is returned.
In Linux I would like know what is the retry mechanism (how many times and how far apart). Asking because for a TCP client connect() call I am getting ETIMEDOUT error. This socket has O_NONBLOCK option and monitored by epoll() for the events.
If someone can point to me where in the code this retry logic is implemented that would be helpful too. I tried following a bit starting with tcp_v4_connect() from net/ipv4/tcp_ipv4.c, but lost my way pretty soon..

The timeout is scaled based on the measured round-trip time.
tcp_connect() sets up a timer:
/* Timer for repeating the SYN until an answer. */
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
The icsk_rto will use a per-destination re-transmission timeout; if previous metrics from the destination is available from previous connections, it is re-used. (See the tcp_no_metrics_save discussion in tcp(7) for details.) If no metrics are saved, then the kernel will fall back to a default RTO value:
#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))
#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ)) /* RFC2988bis initial RTO value */
#define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value, now
* used as a fallback RTO for the
* initial data transmission if no
* valid RTT sample has been acquired,
* most likely due to retrans in 3WHS.
*/
tcp_retransmit_timer() has some code near the bottom for recalculating the delay:
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
__sk_dst_reset(sk);
retransmits_timed_out() will first perform a linear backoff then an exponential backoff.
I think the long and the short of it is that you can reasonably expect roughly 120 seconds before getting ETIMEDOUT error returns from connect(2) unless the kernel has good reason to suspect that the remote peer should have replied sooner.

A typical reason for ETIMEOUT is a firewall which simply swallows the packets instead of replying with ICMP Destination Unreachable.
This is a common setup to prevent hackers from probing a network for hosts.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string