TCP backlog not working as expected in Linux [duplicate]

I have the following problem:
I have sockfd = socket(AF_INET, SOCK_STREAM, 0)
After I set up and bind the socket (let's say with sin_port = htons(666) in the address I bind to), I immediately do:
listen(sockfd, 3);
sleep(50); // for test purposes
I'm sleeping for 50 seconds to test the backlog argument, which seems to be ignored because I can establish a connection* more than 3 times on port 666.
*: What I mean is that I get a SYN/ACK for every Nth SYN (N > 3) sent from the client, and the connection is placed in the listen queue instead of being dropped. What could be wrong? I've read the man pages of listen(2) and tcp(7) and found:
The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled there is no logical maximum length and this setting is ignored. See tcp(7) for more information.
But even with sysctl -w net.ipv4.tcp_max_syn_backlog=2 and sysctl -w net.ipv4.tcp_syncookies=0, I still get the same results! I must be missing something or completely misunderstand the purpose of listen()'s backlog.
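For reference, a minimal sketch of the test described above (assuming port 666 and a backlog of 3, with error handling omitted):
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in addr;
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(666);          /* port from the question */

    bind(sockfd, (struct sockaddr *)&addr, sizeof(addr));

    listen(sockfd, 3);                   /* backlog of 3 */
    sleep(50);                           /* never accept(), so connections queue up */

    close(sockfd);
    return 0;
}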

The backlog argument to listen() is only advisory.
POSIX says:
The backlog argument provides a hint to the implementation which the implementation shall use to limit the number of outstanding connections in the socket's listen queue.
Current versions of the Linux kernel round it up to the next highest power of two, with a minimum of 16. The relevant code is in reqsk_queue_alloc().

Different operating systems provide different numbers of queued connections with different backlog numbers. FreeBSD appears to be one of the few OSes that actually has a 1-to-1 mapping. (source: http://books.google.com/books?id=ptSC4LpwGA0C&lpg=PA108&ots=Kq9FQogkTr&dq=berkeley%20listen%20backlog%20ack&pg=PA108#v=onepage&q=berkeley%20listen%20backlog%20ack&f=false )

Related

What is the difference between tcp_max_syn_backlog and somaxconn?

I have been reading some articles about the TCP implementation on Linux and I got confused: what is the difference between net.ipv4.tcp_max_syn_backlog, net.core.somaxconn, and the backlog passed as a parameter to the listen() system call, and what is the relation between them?
P.S. I want an explanation for kernel 4.15, because I found that there are some differences between older and newer kernels on this subject.
sysctl is an API, so you can just read the Linux kernel documentation for the appropriate version:
tcp_max_syn_backlog - INTEGER
Maximal number of remembered connection requests, which have not
received an acknowledgment from connecting client.
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
Defaults to 128. See also tcp_max_syn_backlog for additional tuning
for TCP sockets.
Let's consider the TCP handshake. tcp_max_syn_backlog represents the maximal number of connections in the SYN_RECV queue, i.e. where your server has received a SYN, sent a SYN-ACK, and has not yet received the ACK. This is a separate queue of so-called "request sockets" (reqsk in the code); these are not fully-fledged sockets and occupy less memory. In this state we can save some memory and not yet allocate a full socket, because the full connection may never materialize if the ACK never arrives. The size of this queue is affected (see this post) by listen()'s backlog argument and limited by tcp_max_syn_backlog in the kernel.
somaxconn represents the maximal size of the ESTABLISHED queue. This is another queue.
Recall the previously mentioned SYN_RECV queue: your server is waiting for the ACK from the client. When the ACK arrives, the kernel, roughly speaking, makes a big full-fledged socket from the "request socket" and moves it to the ESTABLISHED queue. Then you can accept() this socket. This queue is also affected by listen()'s backlog argument and limited by somaxconn in the kernel.
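As an illustration of the accept-queue side, here is a minimal sketch (not from the original answer) that reads net.core.somaxconn from /proc and requests a larger backlog; the effective accept-queue limit ends up being roughly min(backlog, somaxconn):
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    long somaxconn = 128;                    /* documented default */
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f) {
        if (fscanf(f, "%ld", &somaxconn) != 1)
            somaxconn = 128;
        fclose(f);
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int backlog = 4096;                      /* what the application asks for */

    /* ... bind() omitted ... */
    listen(fd, backlog);                     /* kernel clamps the accept queue to somaxconn */

    printf("requested backlog %d, somaxconn %ld, effective limit ~%ld\n",
           backlog, somaxconn,
           backlog < somaxconn ? (long)backlog : somaxconn);
    return 0;
}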

How do I get amount of queued data for UDP socket?

To see how well I'm doing in processing incoming data, I'd like to measure the queue length at my TCP and UDP sockets.
I know that I can get the queue size via SO_RCVBUF socket option, and that ioctl(<sockfd>, SIOCINQ, &<some_int>) tells me the information for TCP sockets. But for UDP the SIOCINQ/FIONREAD ioctl returns only the size of next pending datagram. Is there a way how to get queue size for UDP, without having to parse system tables such as /proc/net/udp?
FWIW, I did some experiments to map out the behavior of FIONREAD on different platforms.
Platforms where FIONREAD returns all the data pending in a SOCK_DGRAM socket:
Mac OS X, NetBSD, FreeBSD, Solaris, HP-UX, AIX, Windows
Platforms where FIONREAD returns only the bytes for the first pending datagram:
Linux
It might also be worth noting that some implementations include headers or other overhead bytes in the count, while others only count the payload bytes. Linux appears to return the payload size, not including IP headers.
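For illustration, a minimal sketch of issuing FIONREAD on a UDP socket; on Linux, per the notes above, the returned value covers only the next pending datagram:
#include <netinet/in.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int pending = 0;

    /* ... bind() and incoming traffic omitted ... */

    if (ioctl(fd, FIONREAD, &pending) == 0)
        printf("FIONREAD: %d bytes pending (next datagram only on Linux)\n", pending);

    return 0;
}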
As ldx mentioned, it is not supported through ioctl or getsockopt.
It seems to me that the current implementation of SIOCINQ was aimed at determining how large a buffer is needed to read the pending data (though I guess it is not very useful even for that, since the value can change between the ioctl and the actual read).
There are many other telemetries which are not exposed through such system calls; I guess there is no real need for them in normal production usage.
You can check the drops/errors through "netstat -su" , or better using SNMP (udpInErrors) if you just want to monitor the machine state.
BTW: you always have the option of hacking the kernel code and adding this value (or others).

Application control of TCP retransmission on Linux

For the impatient:
How do I change the value of /proc/sys/net/ipv4/tcp_retries2 for a single connection in Linux, using setsockopt(), ioctl() or the like, or is it even possible?
Longer description:
I'm developing an application that uses long polling HTTP requests. On the server side, it needs to be known when the client has closed the connection. The accuracy is not critical, but it certainly cannot be 15 minutes. Closer to a minute would do fine.
For those not familiar with the concept, a long polling HTTP request works like this:
The client sends a request
The server responds with HTTP headers, but leaves the response open. Chunked transfer encoding is used, allowing the server to send bits of data as they become available.
When all the data is sent, the server sends a "closing chunk" to signal that the response is complete.
In my application, the server sends "heartbeats" to the client every now and then (30 seconds by default). A heartbeat is just a newline character that is sent as a response chunk. This is meant to keep the line busy so that we notice a connection loss.
There's no problem when the client shuts down correctly. But when it's shut down with force (the client machine loses power, for example), a TCP reset is not sent. In this case, the server sends a heartbeat, which the client doesn't ACK. After this, the server keeps retransmitting the packet for roughly 15 minutes before giving up and reporting the failure to the application layer (our HTTP server). And 15 minutes is too long a wait in my case.
I can control the retransmission time by writing to the following files in /proc/sys/net/ipv4/:
tcp_retries1 - INTEGER
This value influences the time, after which TCP decides, that
something is wrong due to unacknowledged RTO retransmissions,
and reports this suspicion to the network layer.
See tcp_retries2 for more details.
RFC 1122 recommends at least 3 retransmissions, which is the
default.
tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection,
when RTO retransmissions remain unacknowledged.
Given a value of N, a hypothetical TCP connection following
exponential backoff with an initial RTO of TCP_RTO_MIN would
retransmit N times before killing the connection at the (N+1)th RTO.
The default value of 15 yields a hypothetical timeout of 924.6
seconds and is a lower bound for the effective timeout.
TCP will effectively time out at the first RTO which exceeds the
hypothetical timeout.
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.
The default value of tcp_retries2 is indeed 15, and my experience of 15 minutes (900 seconds) of retransmission is in line with the kernel documentation quoted above.
If I change the value of tcp_retries2 to 5 for example, the connection dies much quicker. But setting it like this affects all the connections in the system, and I'd really like to set it for this one long polling connection only.
A quote from RFC 1122:
4.2.3.5 TCP Connection Failures
Excessive retransmission of the same segment by TCP
indicates some failure of the remote host or the Internet
path. This failure may be of short or long duration. The
following procedure MUST be used to handle excessive
retransmissions of data segments [IP:11]:
(a) There are two thresholds R1 and R2 measuring the amount
of retransmission that has occurred for the same
segment. R1 and R2 might be measured in time units or
as a count of retransmissions.
(b) When the number of transmissions of the same segment
reaches or exceeds threshold R1, pass negative advice
(see Section 3.3.1.4) to the IP layer, to trigger
dead-gateway diagnosis.
(c) When the number of transmissions of the same segment
reaches a threshold R2 greater than R1, close the
connection.
(d) An application MUST be able to set the value for R2 for
a particular connection. For example, an interactive
application might set R2 to "infinity," giving the user
control over when to disconnect.
(e) TCP SHOULD inform the application of the delivery
problem (unless such information has been disabled by
the application; see Section 4.2.4.1), when R1 is
reached and before R2. This will allow a remote login
(User Telnet) application program to inform the user,
for example.
It seems to me that tcp_retries1 and tcp_retries2 in Linux correspond to R1 and R2 in the RFC. The RFC clearly states (in item d) that a conforming implementation MUST allow setting the value of R2, but I have found no way to do it using setsockopt(), ioctl() or such.
Another option would be to get a notification when R1 is exceeded (item e). This is not as good as setting R2, though, as I think R1 is hit pretty soon (in a few seconds), and the value of R1 cannot be set per connection, or at least the RFC doesn't require it.
Looks like this was added in Kernel 2.6.37.
Commit diff from the kernel Git and an excerpt from the change log below:
commit
dca43c75e7e545694a9dd6288553f55c53e2a3a3
Author: Jerry Chu
Date: Fri Aug 27 19:13:28 2010 +0000
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
upto 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in anyway when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
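A minimal sketch of setting this option on an existing socket (the helper name is mine; the value is in milliseconds):
#include <netinet/in.h>
#include <netinet/tcp.h>     /* TCP_USER_TIMEOUT (needs kernel >= 2.6.37 and reasonably recent headers) */
#include <sys/socket.h>

/* Give up on unacknowledged data after timeout_ms milliseconds
 * instead of the ~15 minute default. Returns 0 on success. */
static int set_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}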
I suggest that if the TCP_USER_TIMEOUT socket option described by Kimvais is available, you use that. On older kernels where that socket option is not present, you could repeatedly call the SIOCOUTQ ioctl() to determine the size of the socket send queue - if the send queue doesn't decrease over your timeout period, that indicates that no ACKs have been received and you can close the socket.
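On older kernels, that SIOCOUTQ approach might look roughly like this (hypothetical helper; the polling interval and give-up policy are up to the application):
#include <linux/sockios.h>   /* SIOCOUTQ */
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if the send queue shrank (ACKs are arriving), 0 if it never
 * shrank within timeout_s seconds (peer is probably gone), -1 on error. */
static int peer_is_acking(int fd, int timeout_s)
{
    int prev = -1;
    for (int i = 0; i < timeout_s; i++) {
        int outq = 0;
        if (ioctl(fd, SIOCOUTQ, &outq) < 0)
            return -1;
        if (outq == 0 || (prev >= 0 && outq < prev))
            return 1;
        prev = outq;
        sleep(1);
    }
    return 0;
}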
After some thinking (and googling), I came to the conclusion that you can't change the tcp_retries1 and tcp_retries2 values for a single socket unless you apply some sort of patch to the kernel. Is that feasible for you?
Otherwise, you could use the TCP_KEEPALIVE socket option, whose purpose is to check whether a connection is still active (it seems to me that that's exactly what you are trying to achieve, so it makes sense). Pay attention to the fact that you need to tweak its default parameters a little, because the default is to disconnect after about 2 hours!
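A minimal sketch of enabling keepalive with tightened per-socket timers on Linux (the enable switch is SO_KEEPALIVE; TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific, and the values below are just assumptions):
#include <netinet/in.h>
#include <netinet/tcp.h>     /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT (Linux-specific) */
#include <sys/socket.h>

static int enable_fast_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 5;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Start probing after 60 s of idleness instead of the ~2 hour default, */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    /* probe every 10 s, */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    /* and declare the peer dead after 5 unanswered probes. */
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}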
This is for my understanding.
tcp_retries2 is the number of retransmissions permitted by the system before dropping the connection. So if we want to change the default behaviour of tcp_retries2 using TCP_USER_TIMEOUT, which specifies the maximum amount of time transmitted data may remain unacknowledged, we have to increase the value of TCP_USER_TIMEOUT, right?
In that case the connection will wait for a longer time and will not retransmit the data packet.
Please correct me, if something is wrong.
#include <sys/sysctl.h>      /* glibc sysctl(2) wrapper; also pulls in CTL_NET, NET_IPV4, ... */

int name[] = {CTL_NET, NET_IPV4, NET_IPV4_TCP_RETRIES2};
long value = 0;
size_t size = sizeof(value);

/* Read the current value of /proc/sys/net/ipv4/tcp_retries2 */
if (!sysctl(name, sizeof(name)/sizeof(name[0]), &value, &size, NULL, 0)) {
    /* value now contains the current setting */
}

value = ...; /* change the value if needed */

/* Write the new value back */
if (!sysctl(name, sizeof(name)/sizeof(name[0]), NULL, NULL, &value, size)) {
    /* value in /proc/sys/net/ipv4/tcp_retries2 changed successfully */
}
A programmatic way using C. It works at least on Ubuntu, but (judging from the code and the system variables involved) it looks like it influences all TCP connections in the system, not only a single connection.

sendto on Tru64 is returning ENOBUFS

I am currently running an old system on Tru64 which involves lots of UDP sockets using the sendto() function. The sockets are used in our code to send messages to/from various processes and then eventually on to a thick client app that is connected remotely. Occasionally the socket to the thick client gets stuck, which can cause some of these messages to build up. My question is: how can I determine the current buffer size, and how do I determine the maximum message buffer size? The code below gives a snippet of how I set up the port and use the sendto function.
/* need to adjust the maximum size we can send on this */
/* as it needs to be able to cope with the biggest */
/* messages we send */
int lenlen, len;

/* allow double for when the system is under load */
len = 2 * 32000;
lenlen = sizeof(len);

msg_socket = socket(AF_UNIX, SOCK_DGRAM, 0);
result = setsockopt(msg_socket, SOL_SOCKET, SO_SNDBUF, (char *)&len, lenlen);

result = sendto(msg_socket,
                (char *)message,
                (int)message_len,
                flags,
                dest_addr,
                addrlen);
Note. We have ported this application to Linux and the problem does not seem to appear there.
Any help would be greatly appreciated.
Regards
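For the "current buffer size" part of the question, a small sketch of reading SO_SNDBUF back with getsockopt(); note that this reports the configured limit, not how much data is currently queued:
#include <stdio.h>
#include <sys/socket.h>

/* Print the send buffer limit the kernel actually applied after setsockopt(). */
static void print_sndbuf(int sock)
{
    int sndbuf = 0;
    socklen_t optlen = sizeof(sndbuf);

    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen) == 0)
        printf("effective SO_SNDBUF: %d bytes\n", sndbuf);
}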
UDP send buffer size is different from TCP - it just limits the size of the datagram. Quoting Stevens UNP Vol. 1:
...
A UDP socket has a send buffer size (which we can change with SO_SNDBUF socket option, Section 7.5), but this is simply an upper limit on the maximum-sized UDP datagram that can be written to the socket. If an application writes a datagram larger than the socket send buffer size, EMSGSIZE is returned. Since UDP is unreliable, it does not need to keep a copy of the application's data and does not need an actual send buffer. (The application data is normally copied into a kernel buffer of some form as it passes down the protocol stack, but this copy is discarded by the datalink layer after the data is transmitted.)
UDP simply prepends an 8-byte header and passes the datagram to IP. IPv4 or IPv6 prepends its header, determines the outgoing interface by performing the routing function, and then either adds the datagram to the datalink output queue (if it fits within the MTU) or fragments the datagram and adds each fragment to the datalink output queue. If a UDP application sends large datagrams (say 2,000-byte datagrams), there's a much higher probability of fragmentation than with TCP, because TCP breaks the application data into MSS-sized chunks, something that has no counterpart in UDP.
The successful return from write to a UDP socket tells us that either the datagram or all fragments of the datagram have been added to the datalink output queue. If there is no room on the queue for the datagram or one of its fragments, ENOBUFS is often returned to the application.
Unfortunately, some implementations do not return this error, giving the application no indication that the datagram was discarded without even being transmitted.
The last footnote needs attention - but it looks like Tru64 has this error code listed in the manual page.
The proper way of doing it though is to queue your outstanding messages in the application itself and to carefully check return values and the errno after each system call. This still does not guarantee delivery (since UDP receivers might drop the packets without any notice to the senders). Check the UDP packet discard counters with netstat -s on both/all sides, see if they are growing. There is really no way around this besides switching to TCP or implementing your own timeout/ack and re-transmission logic.
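A sketch of that kind of return-value checking (hypothetical wrapper; on a transient failure the caller would queue the message and retry later):
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns 0 on success, 1 if the caller should queue the message and retry
 * later (transient shortage such as ENOBUFS), -1 on a hard error. */
static int try_send(int fd, const void *msg, size_t len,
                    const struct sockaddr *dest, socklen_t destlen)
{
    ssize_t n = sendto(fd, msg, len, 0, dest, destlen);
    if (n >= 0)
        return 0;
    if (errno == ENOBUFS || errno == EAGAIN || errno == EWOULDBLOCK)
        return 1;
    perror("sendto");            /* e.g. EMSGSIZE: datagram larger than SO_SNDBUF */
    return -1;
}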
You should probably be using some sort of congestion control to avoid overloading the network. By far the easiest way to do this is to use TCP instead of UDP.
It fails less often on Linux because there, UDP sockets wait for space in the local network interface queue (unless you set them non-blocking). However, with any operating system, if the overfull queue is not in the local system, the packet will be dropped silently.

Increasing the maximum number of TCP/IP connections in Linux

I am programming a server and it seems like my number of connections is being limited since my bandwidth isn't being saturated even when I've set the number of connections to "unlimited".
How can I increase or eliminate a maximum number of connections that my Ubuntu Linux box can open at a time? Does the OS limit this, or is it the router or the ISP? Or is it something else?
The maximum number of connections is impacted by certain limits on both the client and server sides, albeit a little differently.
On the client side:
Increase the ephemeral port range, and decrease the tcp_fin_timeout
To find out the default values:
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
The ephemeral port range defines the maximum number of outbound sockets a host can create from a particular IP address. The fin_timeout defines the minimum time these sockets will stay in TIME_WAIT state (unusable after being used once).
Usual system defaults are:
net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.tcp_fin_timeout = 60
This basically means your system cannot consistently guarantee more than (61000 - 32768) / 60 = 470 sockets per second. If you are not happy with that, you could begin with increasing the port_range. Setting the range to 15000 61000 is pretty common these days. You could further increase the availability by decreasing the fin_timeout. Suppose you do both, you should see over 1500 outbound connections per second, more readily.
To change the values:
sysctl net.ipv4.ip_local_port_range="15000 61000"
sysctl net.ipv4.tcp_fin_timeout=30
The above should not be interpreted as the factors limiting the system's capability for making outbound connections per second. Rather, these factors affect the system's ability to handle concurrent connections in a sustainable manner over long periods of "activity".
Default Sysctl values on a typical Linux box for tcp_tw_recycle & tcp_tw_reuse would be
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=0
These do not allow a connection from a "used" socket (in wait state) and force the sockets to last the complete time_wait cycle. I recommend setting:
sysctl net.ipv4.tcp_tw_recycle=1
sysctl net.ipv4.tcp_tw_reuse=1
This allows fast cycling of sockets in time_wait state and re-using them. But before you do this change make sure that this does not conflict with the protocols that you would use for the application that needs these sockets. Make sure to read post "Coping with the TCP TIME-WAIT" from Vincent Bernat to understand the implications. The net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you. Note that net.ipv4.tcp_tw_recycle has been removed from Linux 4.12.
On the Server Side:
The net.core.somaxconn value has an important role. It limits the maximum number of requests queued to a listen socket. If you are sure of your server application's capability, bump it up from the default of 128 to something between 128 and 1024. Now you can take advantage of this increase by modifying the listen backlog variable in your application's listen call to an equal or higher integer.
sysctl net.core.somaxconn=1024
The txqueuelen parameter of your Ethernet cards also has a role to play. The default value is 1000, so bump it up to 5000 or even more if your system can handle it.
ifconfig eth0 txqueuelen 5000
echo "/sbin/ifconfig eth0 txqueuelen 5000" >> /etc/rc.local
Similarly bump up the values for net.core.netdev_max_backlog and net.ipv4.tcp_max_syn_backlog. Their default values are 1000 and 1024 respectively.
sysctl net.core.netdev_max_backlog=2000
sysctl net.ipv4.tcp_max_syn_backlog=2048
Now remember to start both your client and server side applications after increasing the FD ulimits in the shell.
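If adjusting the shell ulimit is awkward, a process can also raise its own descriptor limit up to the hard limit; a minimal sketch:
#include <sys/resource.h>

/* Raise this process's open-file limit, capped at the hard limit. */
static int raise_nofile_limit(rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -1;
    rl.rlim_cur = (want < rl.rlim_max) ? want : rl.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &rl);
}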
Besides the above, one more popular technique used by programmers is to reduce the number of TCP write calls. My own preference is to use a buffer into which I push the data I wish to send to the client, and then at appropriate points I write out the buffered data to the actual socket. This technique allows me to use large data packets, reduces fragmentation, and reduces my CPU utilization both in user land and at the kernel level.
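A minimal sketch of such a write buffer (hypothetical helper names; partial writes and non-blocking sockets are not handled here):
#include <string.h>
#include <unistd.h>

#define OUTBUF_SIZE 16384        /* arbitrary flush threshold */

struct outbuf {
    char   data[OUTBUF_SIZE];
    size_t used;
};

/* Write out whatever has been buffered so far with a single write(). */
static int outbuf_flush(int fd, struct outbuf *b)
{
    if (b->used && write(fd, b->data, b->used) < 0)
        return -1;
    b->used = 0;
    return 0;
}

/* Append to the buffer, flushing first if the new data would not fit. */
static int outbuf_push(int fd, struct outbuf *b, const void *p, size_t n)
{
    if (b->used + n > sizeof(b->data) && outbuf_flush(fd, b) < 0)
        return -1;
    if (n > sizeof(b->data))                 /* oversized message: send it directly */
        return write(fd, p, n) < 0 ? -1 : 0;
    memcpy(b->data + b->used, p, n);
    b->used += n;
    return 0;
}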
There are a couple of variables to set the max number of connections. Most likely, you're running out of file descriptors first. Check ulimit -n. After that, there are settings in /proc, but those default to the tens of thousands.
More importantly, it sounds like you're doing something wrong. A single TCP connection ought to be able to use all of the bandwidth between two parties; if it isn't:
Check if your TCP window setting is large enough. Linux defaults are good for everything except really fast inet link (hundreds of mbps) or fast satellite links. What is your bandwidth*delay product?
Check for packet loss using ping with large packets (ping -s 1472 ...)
Check for rate limiting. On Linux, this is configured with tc
Confirm that the bandwidth you think exists actually exists using e.g., iperf
Confirm that your protocol is sane. Remember latency.
If this is a gigabit+ LAN, can you use jumbo packets? Are you?
Possibly I have misunderstood. Maybe you're doing something like Bittorrent, where you need lots of connections. If so, you need to figure out how many connections you're actually using (try netstat or lsof). If that number is substantial, you might:
Have a lot of bandwidth, e.g., 100mbps+. In this case, you may actually need to up the ulimit -n. Still, ~1000 connections (default on my system) is quite a few.
Have network problems which are slowing down your connections (e.g., packet loss)
Have something else slowing you down, e.g., IO bandwidth, especially if you're seeking. Have you checked iostat -x?
Also, if you are using a consumer-grade NAT router (Linksys, Netgear, DLink, etc.), beware that you may exceed its abilities with thousands of connections.
I hope this provides some help. You're really asking a networking question.
To improve upon the answer given by @derobert,
You can determine what your OS connection limit is by catting nf_conntrack_max. For example:
cat /proc/sys/net/netfilter/nf_conntrack_max
You can use the following script to count the number of TCP connections to a given range of tcp ports. By default 1-65535.
This will confirm whether or not you are maxing out your OS connection limit.
Here's the script.
#!/bin/sh
OS=$(uname)
case "$OS" in
'SunOS')
AWK=/usr/bin/nawk
;;
'Linux')
AWK=/bin/awk
;;
'AIX')
AWK=/usr/bin/awk
;;
esac
netstat -an | $AWK -v start=1 -v end=65535 ' $NF ~ /TIME_WAIT|ESTABLISHED/ && $4 !~ /127\.0\.0\.1/ {
if ($1 ~ /\./)
{sip=$1}
else {sip=$4}
if ( sip ~ /:/ )
{d=2}
else {d=5}
split( sip, a, /:|\./ )
if ( a[d] >= start && a[d] <= end ) {
++connections;
}
}
END {print connections}'
At the application level, here are some things a developer can do:
From the server side:
Check whether your load balancer (if you have one) works correctly.
Turn slow TCP timeouts into an immediate 503 response; if your load balancer works correctly, it should pick a working resource to serve, which is better than hanging there with unexpected error messages.
E.g., if you are using a Node server, you can use toobusy from npm.
Implement something like:
var toobusy = require('toobusy');
app.use(function(req, res, next) {
if (toobusy()) res.send(503, "I'm busy right now, sorry.");
else next();
});
Why 503? Here are some good insights for overload:
http://ferd.ca/queues-don-t-fix-overload.html
We can do some work on the client side too:
Try to group calls into batches, reducing the traffic and total number of requests between client and server.
Try to build a caching mid-layer to handle unnecessary duplicate requests.
I'm trying to resolve this in 2022 on load balancers, and one way I found is to attach another IPv4 address (or eventually IPv6) to the NIC, so the limit is now doubled. Of course, you need to configure the second IP for the service that is trying to connect to the machine (in my case, another DNS entry).
