Nginx does not properly free TCP sockets? - linux

I'm using Nginx and it seems that TCP sockets are not properly released by Nginx. Clients that connect to my Nginx go through a proxy, so the same 4-tuple (source IP, source port, destination IP, destination port) can be re-used within a very short period (less than 1 minute). When that happens, Nginx seems to get lost.
Here is what I can see in a tcpdump trace:
- FIN,ACK initiated by Nginx to close the session
- ACK from the client
- FIN,ACK from the client
- ACK from the server
If the client tries to reconnect very quickly (less than 1 minute later) with the same 4-tuple, it fails. The client sends a SYN packet but Nginx replies with an ACK containing an unexpected sequence number (the sequence number is very high and does not correspond to the previous TCP session).
If the same 4-tuple is re-used after more than 1 minute, there is no problem.
Thanks in advance to anyone who has an idea how to solve this problem.
Aurélien

I am not familiar with Nginx, but in general, TCP sockets can remain in a TIME_WAIT state after being closed for up to several minutes in order to catch stray out-of-order packets. The 4-tuple cannot be reused until the TIME_WAIT state expires.
See:
What are the CLOSE_WAIT and TIME_WAIT states?
The TIME-WAIT state in TCP and Its Effect on Busy Servers
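For illustration only, here is a generic socket-level sketch (not anything Nginx-specific from the question) of the abortive-close workaround sometimes used when a 4-tuple must be reusable immediately: enabling SO_LINGER with a zero timeout makes close() send an RST instead of a FIN, so no TIME_WAIT entry is left behind. It discards any unsent data, so it is only appropriate when the application protocol tolerates that.

```c
/* Sketch only: an abortive close that avoids TIME_WAIT by sending RST.
 * WARNING: this discards any data still queued for transmission. */
#include <sys/socket.h>
#include <unistd.h>

static int close_with_rst(int fd)
{
    struct linger lg;
    lg.l_onoff  = 1;   /* enable linger                                   */
    lg.l_linger = 0;   /* zero timeout => close() sends RST, no TIME_WAIT */
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
        return -1;
    return close(fd);
}
```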

Related

Simultaneous TCP termination and subsequent connect(): EADDRNOTAVAIL

My company releases a special TCP stack for special purposes and I'm tasked with implementing an RFC 793-compliant closing sequence. One of the unit tests has a server working on top of the special TCP stack talking to a normal Linux TCP client, and I'm running into some strange behaviour that I'm not sure is caused by a programming error on my part or is to be expected.
Before my work started, we used to send an RST packet when the user application calls close(). I've implemented the FIN handshake, but I noticed that in the case of simultaneous TCP termination (FIN_WAIT_1 -> CLOSING -> TIME_WAIT on both ends, see the picture), the standard Linux TCP client cannot connect to the same destination address and port again, with connect() returning EADDRNOTAVAIL, until after TIME_WAIT passes into CLOSED.
Now, the standard Linux client application sets the option SO_REUSEADDR, binds the socket to port 8888 each time, and connects to destination port 6666. My question is, why does bind() succeed and why does connect() fail? I would have thought SO_REUSEADDR could take over a local TIME_WAIT port, which it did, but what does connect() have against talking to the destination-ip:6666 again?
Is my code doing something it shouldn't or is this expected behaviour?
I can confirm no SYN packet for the failed connect() makes it out of the client machine at all. I've attached a screenshot of the FIN handshake for the above session.
Your previous implementation used RST to end the connection. Receipt of an RST packet immediately removes the connection from the active connection table. That's because there is no further possibility of receiving a valid packet on that connection: the peer has just told your system that that session is not valid.
If you do a proper session termination with FIN, on the other hand, there is the last packet problem: how do you know for sure whether the peer actually received the last acknowledgment you sent to their FIN (this is the definition of TCP's TIME_WAIT state)? And if the peer didn't receive it, they might validly send another copy of the FIN packet which your machine should then re-ACK.
Now, your bind succeeds because you're using SO_REUSEADDR, but you still cannot create a new connection with the exact same ports on both sides because that entry is still in your active connection table (in TIME_WAIT state). The 4-tuple (IP1, port1, IP2, port2) must always be unique.
As @EJP suggested in the comments, it is unusual for the client to specify a source port, and there is typically no reason to. I would change your test, as sketched below.
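A minimal sketch of that suggestion, assuming the destination port 6666 from the question and a placeholder loopback address: the client simply skips bind(), so the kernel assigns a fresh ephemeral source port and the new connection never collides with a 4-tuple still in TIME_WAIT.

```c
/* Sketch: client connect without a fixed source port. The kernel assigns an
 * ephemeral port, so repeated runs never reuse a 4-tuple still in TIME_WAIT. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(6666);                   /* server port from the question   */
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr); /* placeholder address for the test */

    /* No bind(): the OS chooses the source port. */
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }
    /* ... exchange data ... */
    close(fd);
    return 0;
}
```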

TCP retransmission timer overrides/kills TCP keepalive timer, delaying disconnect discovery

Machine - linux, 3.10.19 kernel
This is in a large distributed system, there are several servers and clients (on same as well as different nodes/machines) having TCP connections with each other.
Test case:
The client program's node/machine is switched off (on purpose, as a test case) and the only way for the server to learn about the disconnection is via the keepalive timer (idle time = 40 sec, 4 probes, probe time = 10 sec).
Good case:
This works fine in most cases: the server learns that the client has gone down within 40-70 seconds.
Bad case:
But I am hitting another situation where, while the keepalive timer is running, the server tries sending some data to the client; this in turn starts the TCP retransmission timer, which overrides/kills the keepalive timer. It then takes ~15 minutes for the retransmission timer to detect that the other end is gone.
15 minutes is a long time for the server to realize this. I am looking for ways others handle such a situation. Do I need to tweak my retransmission timer values?
Thanks!
There is a completely separate configuration for retransmission timeouts.
From Linux's tcp.7 man page:
tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is retransmitted in
established state before giving up. The default value is 15, which
corresponds to a duration of approximately between 13 to 30 minutes,
depending on the retransmission timeout. The RFC 1122 specified
minimum limit of 100 seconds is typically deemed too short.
This is likely the value you'll want to adjust to change how long it takes to detect if your connection has vanished.
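If you would rather not change the system-wide sysctl, a per-socket alternative that addresses the same problem is TCP_USER_TIMEOUT (Linux 2.6.37 and later), which caps how long transmitted data may remain unacknowledged before the kernel aborts the connection. A minimal sketch, with the 30-second value chosen purely as an example:

```c
/* Sketch: cap the time unacknowledged data may sit in the send queue so a
 * dead peer is detected long before tcp_retries2 would give up. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18   /* from linux/tcp.h, for older toolchains */
#endif

static int set_user_timeout(int fd, unsigned int timeout_ms)
{
    /* e.g. timeout_ms = 30000 aborts the connection after ~30 s of
     * unacknowledged retransmissions instead of ~15 min. */
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```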
I have the same issue with Linux kernel release 4.3.0-1-amd64:
I used a server and a client, connected to the same switch.
The TCP keep-alive mechanism works correctly for client and server in the following cases:
When no message is sent between the cable disconnection and the socket disconnection (by the TCP keep-alive mechanism).
When the cable is disconnected between the client/server and the switch (which sets the link state to down), even if the client/server application tries to send a message.
When the wire is unplugged on the other side of the switch, TCP keep-alive frames are transmitted until an application message is sent. Then TCP retransmission frames are sent and TCP keep-alive frames stop being sent, which prevents the socket from being closed.

What does the ESTABLISHED indicator mean after running lsof

I ran the command sudo lsof -i -n -P | grep TCP and I was wondering if I could get some more clarification on its output.
Specifically, in this image:
Why do I have an IP:PORT pointing to another IP:PORT and then back at itself with the label 'ESTABLISHED'? I am confused about what this means exactly.
I'm not sure how familiar you are with networking and TCP in general, so I'll try to provide a brief description with a couple of details; I hope this helps:
The TCP protocol has various states. Think of it as a state machine. States on the client side include CLOSED, SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2 and TIME_WAIT.
Thus, the ESTABLISHED label means that the TCP connection is in the ESTABLISHED state. Being in the established state means that both hosts successfully completed the TCP 3-way handshake (and in doing so, transitioned from SYN_SENT to ESTABLISHED). The transition from CLOSED to SYN_SENT happens when the client side sends the TCP SYN request to the server.
In an established connection, both sides transmit and receive application specific data. Basically, a session is established and a bidirectional stream of bytes flows between the two end systems.
TCP sockets are uniquely identified by the 4-tuple (source-ip, source-port, destination-ip, destination-port). The IP identifies an end system's network interface, and the port number is used to multiplex and demultiplex packet arrival at that network interface (so that the target system knows which service to deliver the packets to). That's the meaning of the IP:PORT fields.
I'm not sure why you have two entries for the same connection. This might be system-dependent, although it's odd (on my system I get only one entry per socket). But sockets are bidirectional, so it may be that your system shows each packet-flow direction as a distinct entry. This might also depend on how the system implements sockets.
ESTABLISHED means that the TCP connection has completed the 3-way handshake. (Not sure though whether accept must have been called). See TCP state diagram.
Why do I have an IP:PORT pointing to another IP:PORT and then back at itself
That means you have two TCP sockets open in your process. Most likely, one listens on port 9092, and the other one connected from port 57633 to that listening socket. Port 57633 belongs to the ephemeral port range, i.e. the range of ports that the OS automatically assigns to sockets that call connect() but did not call bind() to request a specific port.
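To make that concrete, here is a small self-contained sketch (port 9092 is taken from the output above; everything else is invented for illustration) of one process that listens and connects to itself over loopback. Pointing lsof -i -n -P at it should show a LISTEN entry plus two ESTABLISHED entries for the same connection, one per endpoint.

```c
/* Sketch: a single process that both listens on 9092 and connects to itself,
 * producing one LISTEN and two ESTABLISHED lines in lsof/ss output. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(9092);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);       /* listening socket        */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 16);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);       /* client side, ephemeral  */
    connect(cfd, (struct sockaddr *)&addr, sizeof(addr));
    int sfd = accept(lfd, NULL, NULL);                /* server side of the pair */

    printf("inspect with lsof/ss now, then Ctrl-C\n");
    pause();                                          /* keep the sockets open   */

    close(sfd); close(cfd); close(lfd);
    return 0;
}
```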

linux doesn't detect dead tcp connections

After restarting my server-side application, the client-side OS doesn't detect the dead TCP connections. The zombie connections stay in the ESTABLISHED state and are never closed by the OS. Does anyone have any idea about this?
This is the server side connections on port 9888:
This is the client side connections to the server:
Some information of my OS:
You might want to use the TCP keepalive mechanism to detect dead peers. As rightly mentioned in the comments, you need to set the following socket options using the setsockopt() function:
SO_KEEPALIVE - To enable/disable the TCP keep alive mechanism
TCP_KEEPIDLE - IDLE time (in seconds) after which TCP starts sending keepalive probes
TCP_KEEPCNT - Maximum number of keepalive probes TCP should send before dropping the connection
TCP_KEEPINTVL - The time (in seconds) between individual keepalive probes
So, for example, if you set idle time = 60 seconds, cnt = 5 and interval = 2 seconds, the system will drop the connection after 70 seconds of inactivity.
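A minimal sketch of those four setsockopt() calls with the example values above (60 s idle, 5 probes, 2 s apart); the helper name is an assumption:

```c
/* Sketch: enable TCP keepalive on an already-connected socket so a dead peer
 * is dropped after roughly idle + cnt * intvl seconds (60 + 5*2 = 70 s here). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int enable_keepalive(int fd)
{
    int on    = 1;   /* SO_KEEPALIVE:  turn the mechanism on           */
    int idle  = 60;  /* TCP_KEEPIDLE:  seconds of idle before probing  */
    int cnt   = 5;   /* TCP_KEEPCNT:   probes before dropping          */
    int intvl = 2;   /* TCP_KEEPINTVL: seconds between probes          */

    if (setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof(on))    < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle))  < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt))   < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0) return -1;
    return 0;
}
```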
More details are available at the following website:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
Hope this helps.

How to get BACKLOG of listening socket

I have a listening socket on port 80 on ubuntu linux.
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 12248/nginx
Is there any way to get the backlog value of that socket (the backlog value that was passed to the listen() call)?
I know that I could look at the nginx configuration, but the configuration file could be changed without reloading nginx, so the backlog argument in the configuration and in the actual listen() call could be different.
ss -lt gives this value in the Send-Q column.
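Note that the value reported there can also differ from what the application requested, because the kernel silently caps the backlog at net.core.somaxconn. A minimal sketch of where that cap applies (the port and backlog values are placeholders):

```c
/* Sketch: the backlog passed to listen() is only a request; Linux clamps it
 * to net.core.somaxconn, and `ss -lt` shows the effective value in Send-Q. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int make_listener(unsigned short port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* If backlog exceeds net.core.somaxconn the kernel silently lowers it,
     * so the configured value and the effective one can disagree. */
    listen(fd, backlog);
    return fd;
}
```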
I'd use the current backlog information to manage the number of connections received, because I can respond to incoming connections and tell the sender to modify their connection interval, thus reducing (or increasing) the load. I cannot control how many incoming connections I get, but I can control how frequently they connect, hence keeping the backlog down and preventing timeouts on incoming connections.
In my case this happens to be a feature of the incoming connection source's firmware, so it might be unique to my situation and not relevant to others.
There is no standard TCP API for getting the backlog. There is also no reason to need it: you created the socket and you put it into the listen state, so you should know what backlog you specified. The system is entitled to adjust it up or down, but even then there is nothing useful you can do with that information in your application.
