Dropping of connections with tcp_tw_recycle - linux

Summary of the problem
We have a setup with a lot of incoming connections (800 to 2400 per second) to a Linux box, and there is a NAT device between the clients and the server.
As a result, many TIME_WAIT sockets are left on the system.
To overcome that we set tcp_tw_recycle to 1, but that led to incoming connections being dropped.
After browsing the net, we found references explaining why frames are dropped when tcp_tw_recycle is combined with a NAT device.
Resolution tried
We then tried setting tcp_tw_reuse to 1, and it worked fine without any issues with the same setup and configuration.
But the documentation says that tcp_tw_recycle and tcp_tw_reuse should not be used when connections go through TCP state-aware nodes, such as firewalls, NAT devices or load balancers, because those connections may see dropped frames. The more connections there are, the more likely you are to see this issue.
Queries
1) Can tcp_tw_reuse be used in this type of scenario?
2) If not, which part of the Linux code prevents tcp_tw_reuse from being used in such a scenario?
3) Generally, what is the difference between tcp_tw_recycle and tcp_tw_reuse?

By default, when both tcp_tw_reuse and tcp_tw_recycle are disabled, the kernel will make sure that sockets in TIME_WAIT state will remain in that state long enough -- long enough to be sure that packets belonging to future connections will not be mistaken for late packets of the old connection.
When you enable tcp_tw_reuse, sockets in TIME_WAIT state can be used before they expire, and the kernel will try to make sure that there is no collision regarding TCP sequence numbers. If you enable tcp_timestamps (which is what makes PAWS, Protection Against Wrapped Sequence Numbers, possible), it will make sure that those collisions cannot happen. However, you need TCP timestamps to be enabled on both ends (at least, that's my understanding). See the definition of tcp_twsk_unique for the gory details.
When you enable tcp_tw_recycle, the kernel becomes much more aggressive, and will make assumptions on the timestamps used by remote hosts. It will track the last timestamp used by each remote host that has a connection in TIME_WAIT state, and allow a socket to be re-used if the timestamp has correctly increased. However, if the timestamp used by the host changes (i.e. warps back in time), the SYN packet will be silently dropped, and the connection won't establish (you will see an error similar to "connect timeout"). If you want to dive into kernel code, the definition of tcp_timewait_state_process might be a good starting point.
Now, timestamps should never go back in time; unless:
the host is rebooted (but then, by the time it comes back up, TIME_WAIT sockets will probably have expired, so it will be a non-issue);
the IP address is quickly reused by something else (TIME_WAIT connections will stay a bit, but other connections will probably be struck by TCP RST and that will free up some space);
network address translation (or a smarty-pants firewall) is involved in the middle of the connection.
In the latter case, you can have multiple hosts behind the same IP address, and therefore, different sequences of timestamps (or, said timestamps are randomized at each connection by the firewall). In that case, some hosts will be randomly unable to connect, because they are mapped to a port for which the TIME_WAIT bucket of the server has a newer timestamp. That's why the docs tell you that "NAT devices or load balancers may start drop frames because of the setting".
Some people recommend leaving tcp_tw_recycle alone, but enabling tcp_tw_reuse and lowering tcp_fin_timeout. I concur :-)
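To make the NAT failure mode concrete, here is a toy model of the per-host timestamp check described above. It is not kernel code: the class name, the IP address and the timestamp values are all made up for illustration, and the real check has more conditions (for example a time window) that this sketch ignores.

```python
# Toy model only, NOT kernel code. It mimics the idea that, with
# tcp_tw_recycle, the server remembers the last TCP timestamp seen per
# remote IP address and drops SYNs whose timestamp goes backwards.


class RecycleModel:
    def __init__(self):
        self.last_ts = {}  # remote IP -> most recent timestamp seen

    def accept_syn(self, remote_ip, tsval):
        """Return True if the SYN would be accepted, False if dropped."""
        last = self.last_ts.get(remote_ip)
        if last is not None and tsval < last:
            return False  # timestamp went back in time: silently dropped
        self.last_ts[remote_ip] = tsval
        return True


# Two hosts behind one NAT address (hypothetical values): each host has
# its own timestamp clock, but the server only sees the shared public IP.
server = RecycleModel()
nat_ip = "203.0.113.10"
results = [
    server.accept_syn(nat_ip, 500_000),  # host A, clock far ahead
    server.accept_syn(nat_ip, 1_200),    # host B, clock far behind
    server.accept_syn(nat_ip, 500_100),  # host A again
]
print(results)  # [True, False, True]
```

Host B is the one that sees "connect timeout": from the server's point of view its timestamp looks like a stale packet from host A.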

Related

ArangoDB Could not connect

arangod is running for some time without any problems, but at some point no more connections can be made.
arangosh then shows the following error message:
Error message 'Could not connect to 'tcp://127.0.0.1:8529' 'connect() failed with #99 - Cannot assign requested address''
In the log file arangod still writes more trace information.
After restarting arangod it runs without problems again, until the problem suddenly reoccurs.
Why is this happening?
Since this question was sort of answered by time, I'll use this answer to elaborate on how to dig into such a situation and to get a useful analysis of which operating system parameters to look at. I'll base this on Linux targets.
First we need to find out what's currently going on, using the netstat tool as the root user (we only care about TCP ports):
netstat -alnpt
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
...
tcp 0 0 0.0.0.0:8529 0.0.0.0:* LISTEN 3478/arangod
tcp 0 0 127.0.0.1:45218 127.0.0.1:8529 ESTABLISHED 6902/arangosh
tcp 1 0 127.0.0.1:46985 127.0.0.1:8529 CLOSE_WAIT 485/arangosh
We see an overview of the 3 possible value groups:
LISTEN: These are daemon processes offering tcp services to remote ends, in this case the arangod process with its server socket. It binds port 8529 on all available ipv4 addresses of the system (0.0.0.0) and accepts connections from any remote location (0.0.0.0:*)
ESTABLISHED: this is an active tcp connection in this case between arangosh and arangod; Arangosh has its client port (45218) in the higher range connecting arangod on port 8529.
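CLOSE_WAIT: this is a connection in termination state: the remote end has already closed it, but the local process (here arangosh) has not closed its socket yet. It's normal to have a few of them. Note that the state in which the TCP stack of the operating system keeps closed connections around for a while, so that it knows where to sort stray TCP packets that were sent but did not arrive on time, is called TIME_WAIT.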
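When the netstat listing gets long, counting sockets per state is often more telling than reading individual rows. A minimal sketch, assuming the column layout shown above (the function name is my own, not part of any tool):

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP sockets per state in `netstat -alnpt`-style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with "tcp" or "tcp6"; the state is the 6th column.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


sample = """\
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8529 0.0.0.0:* LISTEN 3478/arangod
tcp 0 0 127.0.0.1:45218 127.0.0.1:8529 ESTABLISHED 6902/arangosh
tcp 1 0 127.0.0.1:46985 127.0.0.1:8529 CLOSE_WAIT 485/arangosh
"""
states = count_tcp_states(sample)
print(dict(states))  # {'LISTEN': 1, 'ESTABLISHED': 1, 'CLOSE_WAIT': 1}
```

A count in the tens of thousands for a single state is the kind of signal to look for when connect() starts failing with "Cannot assign requested address".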
As you see, TCP ports are 16-bit unsigned integers, ranging from 0 to 65535. Server sockets start from the lower end, and most operating systems require processes to be running as root to bind ports below 1024. Client sockets start from the upper end and range down to a specified limit on the client. Since multiple clients can connect to one server, the server port range may seem narrow, but it's usually the client-side ports that wear out. If the client frequently closes and reopens the connection you may see many sockets lingering in TIME_WAIT state (or in CLOSE_WAIT if a process fails to close its end); as many discussions across the net hint, these are the symptoms of your system eventually running out of resources. In general the solution to this problem is to re-use existing connections through the keepalive feature.
The documentation of the Solaris ndd command explains thoroughly which parameters it can modify in the Solaris kernel, and with which consequences. The terms explained there are rather generic to TCP sockets and can be found on many other operating systems in other forms; on Linux, which we focus on here, they are exposed through the /proc/sys/net filesystem.
Some valuable switches there are:
ipv4/ip_local_port_range: This is the range for the local sockets. You can try to narrow it, and use arangob --keep-alive false to explore what happens if your system runs out of these.
Time wait (often shortened to tw) is the section that controls what the TCP stack should do with already closed sockets in TIME_WAIT state. The Linux kernel can do a trick here: it can instantly re-use connections in that state for new connections. Vincent Bernat explains very nicely which screws to turn and what the different parameters in the kernel mean.
So once you have decided to change some of your values in /proc so your host scales better to the given situation, you need to make them persistent across reboots, since /proc is volatile and won't remember values.
Most Linux systems therefore offer the /etc/sysctl.conf file and the /etc/sysctl.d/ directory. Entries there map slashes in the proc filesystem (below /proc/sys) to dots, so /proc/sys/net/ipv4/tcp_tw_reuse will translate into net.ipv4.tcp_tw_reuse.
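The slash-to-dot mapping is mechanical, so it can be sketched in a few lines. The helper names below are my own, and the sketch ignores corner cases such as network interface names that themselves contain dots:

```python
def proc_path_to_sysctl(path):
    """Translate a /proc/sys path into the dotted sysctl key."""
    prefix = "/proc/sys/"
    if not path.startswith(prefix):
        raise ValueError("not a /proc/sys path: " + path)
    return path[len(prefix):].replace("/", ".")


def sysctl_conf_line(path, value):
    """Build a 'key = value' line as used in sysctl configuration files."""
    return f"{proc_path_to_sysctl(path)} = {value}"


key = proc_path_to_sysctl("/proc/sys/net/ipv4/tcp_tw_reuse")
line = sysctl_conf_line("/proc/sys/net/ipv4/tcp_tw_reuse", 1)
print(key)   # net.ipv4.tcp_tw_reuse
print(line)  # net.ipv4.tcp_tw_reuse = 1
```

The resulting line is what you would put into the sysctl configuration so that the value survives a reboot.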

Permanent TCP connection for administration use

I am facing the following situation:
I have several devices (embedded devices running Arch Linux) and I would like to have administration access to each device at any time. The problem is that the devices are behind a NAT, so establishing a connection from a server to a device is not possible. How could I achieve this?
I thought I could write a simple service running on the device that opens a connection to a server at startup. This TCP connection remains open and can be used from the server to administer the device. But is it a good idea to keep TCP connections open for a long time? If I have a lot of devices, for example 1000, will I have a problem on the server side with 1000 open TCP connections?
Is there maybe another way?
Thanks a lot!
But is it a good idea to keep TCP connections open for a long time?
It's not necessarily a bad idea, although in practice the connections will fail from time to time (e.g. due to network reconfiguration, temporary network outages, etc.), so your clients should contain logic to reconnect automatically when this happens. Also note that TCP will usually not detect it when a completely idle TCP connection no longer has connectivity, so to avoid "zombie connections" that aren't actually connected, you may want to either enable SO_KEEPALIVE, or have your clients and/or server send the (very occasional) bit of dummy data on the socket just to goose the TCP stack into checking whether connectivity still exists on the socket.
If I have a lot of devices, for example 1000, will I have a problem on the server side with 1000 open TCP connections?
Scaling is definitely an issue you'll need to think about. For example, select() is typically implemented to only handle up to a fixed number of connections (often 1024), or if your server is using the thread-per-connection model, you'd find that a process with 1000+ threads is not very efficient. Check out the c10k problem article for lots of interesting details about various approaches and how well they scale up (or don't).
Is there maybe another way?
If you don't need immediate access to the clients, you could always have them check in periodically instead (e.g. once every 5 minutes); or you could have them occasionally send a UDP packet to the server instead of keeping a TCP connection all the time, just to let the server know their presence, and have the server indicate to them somehow (e.g. by updating a well-known web page that the clients check from time to time) when it wanted one of them to open a full TCP connection. Or maybe just use multiple servers to share the load...
The only limit I know of is imposed by state tracking in the iptables code. Check the value of net.ipv4.netfilter.ip_conntrack_max on both sides if you're using this to make sure you have enough headroom for other activities.
If you set the socket option SO_KEEPALIVE before the connect() call, the kernel will send TCP keepalives to make sure the far end is still there. This will mean that connections won't linger forever in the event of a reboot.
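As a rough illustration of the SO_KEEPALIVE suggestion, here is a minimal sketch in Python (the option itself is the same one you would set from C). The helper name is made up. Keep in mind that on Linux the default idle time before the first keepalive probe is typically two hours unless you tune it.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE enabled before connect()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
# Read the option back to confirm the kernel accepted it.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
# A real client would now call sock.connect((host, port)) with its own
# server address; that part is left out so the sketch runs offline.
sock.close()
print(keepalive_on)  # True
```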

tcp_tw_reuse vs tcp_tw_recycle : Which to use (or both)?

I have a website and application which use a significant number of connections. It normally has about 3,000 connections statically open, and can receive anywhere from 5,000 to 50,000 connection attempts within a few seconds.
I have had the problem of running out of local ports to open new connections due to TIME_WAIT status sockets. Even with tcp_fin_timeout set to a low value (1-5), this seemed to just be causing too much overhead/slowdown, and it would still occasionally be unable to open a new socket.
I've looked at tcp_tw_reuse and tcp_tw_recycle, but I am not sure which of these would be the preferred choice, or if using both of them is an option.
According to the Linux documentation, you should use the TCP_TW_REUSE setting to allow reusing sockets in TIME_WAIT state for new connections.
It seems to be a good option when dealing with a web server that has to handle many short TCP connections left in a TIME_WAIT state.
As described here, TCP_TW_RECYCLE could cause some problems when using load balancers...
EDIT (to add some warnings ;) ):
As mentioned in a comment by #raittes, the "problems when using load balancers" are about public-facing servers. When recycle is enabled, the server can't distinguish new incoming connections from different clients behind the same NAT device.
NOTE: net.ipv4.tcp_tw_recycle has been removed from Linux in 4.12 (4396e46187ca tcp: remove tcp_tw_recycle).
SOURCE: https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux
pevik mentioned an interesting blog post going the extra mile in describing all available options at the time.
Modifying kernel options must be seen as a last-resort option, and shall generally be avoided unless you know what you are doing... if that were the case you would not be asking for help over here. Hence, I would advise against doing that.
The most suitable piece of advice I can provide is pointing out the part describing what a network connection is: quadruplets (client address, client port, server address, server port).
If you can make the available ports pool bigger, you will be able to accept more concurrent connections:
Client address & client ports you cannot multiply (out of your control)
Server ports: you can only change these by tweaking a kernel parameter, which is less critical than changing TCP buckets or reuse, provided you know how many ports you need to leave available for other processes on your system
Server addresses: adding addresses to your host and balancing traffic on them:
behind L4 systems already sized for your load or directly
resolving your domain name to multiple IP addresses (and hoping the load will be shared across addresses through DNS for instance)
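To put rough numbers on the port-pool argument above, here is a back-of-the-envelope calculation. The port range 32768-60999 and the 60 second TIME_WAIT duration are assumptions (commonly seen Linux values), not figures from this thread; check ip_local_port_range on your own system.

```python
def max_new_connections_per_second(port_low, port_high, time_wait_seconds,
                                   server_addresses=1):
    """Upper bound on sustained new connections per second.

    Assumes every closed connection keeps its local port busy for the
    whole TIME_WAIT period, and that each extra server address gives a
    fresh set of (client port, server address) combinations.
    """
    ports = port_high - port_low + 1
    return ports * server_addresses / time_wait_seconds


one_addr = max_new_connections_per_second(32768, 60999, 60)
two_addrs = max_new_connections_per_second(32768, 60999, 60,
                                           server_addresses=2)
print(int(one_addr))   # 470
print(int(two_addrs))  # 941
```

This is why adding a second address (or widening the port range) helps: it multiplies the number of distinct quadruplets instead of shortening how long each one stays blocked.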
According to the VMWare document, the main difference is TCP_TW_REUSE works only on outbound communications.
TCP_TW_REUSE uses server-side time-stamps to allow the server to use a time-wait socket port number for outbound communications once the time-stamp is larger than the last received packet. The use of these time-stamps allows duplicate packets or delayed packets from the old connection to be discarded safely.
TCP_TW_RECYCLE uses the same server-side time-stamps, however it affects both inbound and outbound connections. This is useful when the server is the first party to initiate connection closure. This allows a new client inbound connection from the source IP to the server. Due to this difference, it causes issues where client devices are behind NAT devices, as multiple devices attempting to contact the server may be unable to establish a connection until the Time-Wait state has aged out in its entirety.

What is the meaning of SO_REUSEADDR (setsockopt option) - Linux? [duplicate]

From the man page:
SO_REUSEADDR Specifies that the rules used in validating addresses supplied to bind() should allow reuse of local addresses, if this is supported by the protocol. This option takes an int value. This is a Boolean option
When should I use it? What does "reuse of local addresses" give?
TCP's primary design goal is to allow reliable data communication in the face of packet loss, packet reordering, and — key, here — packet duplication.
It's fairly obvious how a TCP/IP network stack deals with all this while the connection is up, but there's an edge case that happens just after the connection closes. What happens if a packet sent right at the end of the conversation is duplicated and delayed, such that the 4-way shutdown packets get to the receiver before the delayed packet? The stack dutifully closes down its connection. Then later, the delayed duplicate packet shows up. What should the stack do?
More importantly, what should it do if a program with open sockets on a given IP address + TCP port combo closes its sockets, and then a brief time later, a program comes along and wants to listen on that same IP address and TCP port number? (Typical case: A program is killed and is quickly restarted.)
There are a couple of choices:
1) Disallow reuse of that IP/port combo for at least 2 times the maximum time a packet could be in flight. In TCP, this is usually called the 2×MSL delay. You sometimes also see 2×RTT, which is roughly equivalent.
This is the default behavior of all common TCP/IP stacks. 2×MSL is typically between 30 and 120 seconds, and it shows up in netstat output as the TIME_WAIT period. After that time, the stack assumes that any rogue packets have been dropped en route due to expired TTLs, so that socket leaves the TIME_WAIT state, allowing that IP/port combo to be reused.
2) Allow the new program to re-bind to that IP/port combo. In stacks with BSD sockets interfaces (essentially all Unixes and Unix-like systems, plus Windows via Winsock) you have to ask for this behavior by setting the SO_REUSEADDR option via setsockopt() before you call bind().
SO_REUSEADDR is most commonly set in network server programs, since a common usage pattern is to make a configuration change, then be required to restart that program to make the change take effect. Without SO_REUSEADDR, the bind() call in the restarted program's new instance will fail if there were connections open to the previous instance when you killed it. Those connections will hold the TCP port in the TIME_WAIT state for 30-120 seconds, so you fall into case 1 above.
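For reference, this is roughly what "set SO_REUSEADDR before bind()" looks like in practice. The sketch uses Python's socket module, which wraps the same setsockopt() and bind() calls; the function name and the loopback address are only for illustration.

```python
import socket


def make_listener(host, port, reuse_addr=True):
    """Create a listening TCP socket, optionally with SO_REUSEADDR."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_addr:
        # Must happen before bind(); setting it afterwards has no effect
        # on the bind that already took place.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


# Port 0 asks the OS for any free port, so the sketch runs anywhere.
listener = make_listener("127.0.0.1", 0)
bound_port = listener.getsockname()[1]
reuse_set = listener.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
listener.close()
print(reuse_set)  # True
```

A restarted server would pass its fixed port instead of 0; with reuse_addr=False the same bind() can fail with "address already in use" while old connections sit in TIME_WAIT.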
The risk in setting SO_REUSEADDR is that it creates an ambiguity: the metadata in a TCP packet's headers isn't sufficiently unique that the stack can reliably tell whether the packet is stale and so should be dropped rather than be delivered to the new listener's socket because it was clearly intended for a now-dead listener.
If you don't see that that is true, here's all the listening machine's TCP/IP stack has to work with per-connection to make that decision:
Local IP: Not unique per-conn. In fact, our problem definition here says we're reusing the local IP, on purpose.
Local TCP port: Ditto.
Remote IP: The machine causing the ambiguity could re-connect, so that doesn't help disambiguate the packet's proper destination.
Remote port: In well-behaved network stacks, the remote port of an outgoing connection isn't reused quickly, but it's only 16 bits, so you've got 30-120 seconds to force the stack to get through a few tens of thousands of choices and reuse the port. Computers could do work that fast back in the 1960s.
If your answer to that is that the remote stack should do something like TIME_WAIT on its side to disallow ephemeral TCP port reuse, that solution assumes that the remote host is benign. A malicious actor is free to reuse that remote port.
I suppose the listener's stack could choose to strictly disallow connections from the TCP 4-tuple only, so that during the TIME_WAIT state a given remote host is prevented from reconnecting with the same remote ephemeral port, but I'm not aware of any TCP stack with that particular refinement.
Local and remote TCP sequence numbers: These are also not sufficiently unique that a new remote program couldn't come up with the same values.
If we were re-designing TCP today, I think we'd integrate TLS or something like it as a non-optional feature, one effect of which is to make this sort of inadvertent and malicious connection hijacking impossible. That requires adding large fields (128 bits and up) which wasn't at all practical back in 1981, when the document for the current version of TCP (RFC 793) was published.
Without such hardening, the ambiguity created by allowing re-binding during TIME_WAIT means you can either a) have stale data intended for the old listener be misdelivered to a socket belonging to the new listener, thereby either breaking the listener's protocol or incorrectly injecting stale data into the connection; or b) have new data for the new listener's socket be mistakenly assigned to the old listener's socket and thus inadvertently dropped.
The safe thing to do is wait out the TIME_WAIT period.
Ultimately, it comes down to a choice of costs: wait out the TIME_WAIT period or take on the risk of unwanted data loss or inadvertent data injection.
Many server programs take this risk, deciding that it's better to get the server back up immediately so as to not miss any more incoming connections than necessary.
This is not a universal choice. Many programs — even server programs requiring a restart to apply a settings change — choose instead to leave SO_REUSEADDR alone. The programmer may know these risks and is choosing to leave the default alone, or they may be ignorant of the issues but are getting the benefit of a wise default.
Some network programs offer the user a choice among the configuration options, fobbing the responsibility off on the end user or sysadmin.
SO_REUSEADDR allows your server to bind to an address which is in a TIME_WAIT state.
This socket option tells the kernel that even if this port is busy (in the TIME_WAIT state), go ahead and reuse it anyway. If it is busy, but with another state, you will still get an address already in use error. It is useful if your server has been shut down, and then restarted right away while sockets are still active on its port.
From unixguide.net
When you create a socket, you don't really own it. The OS (TCP stack) creates it for you and gives you a handle (file descriptor) to access it. When your socket is closed, it takes time for the OS to "fully close it" while it goes through several states. As EJP mentioned in the comments, the longest delay is usually from the TIME_WAIT state. This extra delay is required to handle edge cases at the very end of the termination sequence and make sure the last termination acknowledgement either got through or had the other side reset itself because of a timeout. Here you can find some extra considerations about this state. The main considerations are pointed out as follows:
Remember that TCP guarantees all data transmitted will be delivered, if at all possible. When you close a socket, the server goes into a TIME_WAIT state, just to be really really sure that all the data has gone through. When a socket is closed, both sides agree by sending messages to each other that they will send no more data. This, it seemed to me was good enough, and after the handshaking is done, the socket should be closed. The problem is two-fold. First, there is no way to be sure that the last ack was communicated successfully. Second, there may be "wandering duplicates" left on the net that must be dealt with if they are delivered.
If you try to create multiple sockets with the same ip:port pair in quick succession, you get the "address already in use" error because the earlier socket will not have been fully released. Using SO_REUSEADDR will get rid of this error as it will override checks for any previous instance.

SYN packets dropped occasionally on Linux

We're running Debian with a 2.6.16 kernel, with iptables enabled. The system is running a custom-made HTTP proxy, which is subjected to a mild load (it works fine with the same load on other sites). The system consists of 4 servers that are preceded by a load balancer with a virtual IP, which is preceded by an array of 4 ISA 2004 machines, so the basic topology is:
Client -> ISA [1-4] -> Load Balancer -> Our Proxy [1-4] -> The Internet
Occasionally, the ISA will send us a SYN packet to which no SYN-ACK is sent. It will try again after 3 seconds, and a third time after another 6 seconds, after which it will report the proxy as down and switch to a direct connection. During this time, meaning before, in between and after those 3 SYNs, other SYNs from the same ISA arrive and are answered successfully.
A very similar problem is being reported by others (with no solution, however):
All coming from a flavor of Linux called CentOS. Its peculiarity is that it has iptables enabled by default.
http://www.linuxhelpforum.com/showthread.php?t=931912&mode=linear
http://www.centos.org/modules/newbb/viewtopic.php?topic_id=16147
Almost the same, but a bit different:
http://www.linuxquestions.org/questions/linux-networking-3/tcp-handshake-fails-synack-ignored-by-system.-637171/
Also seems to be relevant:
http://groups.google.com/group/comp.os.linux.networking/browse_thread/thread/b1c000e2d65e0034
I suspect iptables to be a culprit, but any additional feedback will be welcome.
Look at the second parameter to the listen call, as mentioned in the first link you posted. It's the maximum number of pending (not accepted yet) connections. According to the listen(2) man page, if the protocol supports retransmission (TCP does), the connection request will be dropped when the queue is full (expecting a later retransmission which will create the connection if there is enough space in the queue again).
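For readers unsure where that second parameter lives, here is a small sketch using Python's socket module, where the backlog is the only argument of listen(). The value 128 is an arbitrary example, and as far as I know Linux additionally caps it with net.core.somaxconn, so a larger number may be silently reduced.

```python
import socket

BACKLOG = 128  # hypothetical value; tune for your workload

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(BACKLOG)         # same role as the 2nd argument of listen(2)
port = listener.getsockname()[1]
listener.close()
print(port > 0)  # True
```

If the application accepts connections more slowly than they arrive, this queue fills up and further connection requests may be dropped until there is room again.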
Indeed, iptables turned out to be the culprit, with the rule that dropped INVALID packets. We still do not know for sure what made iptables think those SYNs were invalid (no TIME_WAIT for sure, since we did not have any traffic with the same source ports for at least 30 mins prior to the drops).

Resources