Simultaneous TCP termination and subsequent connect(): EADDRNOTAVAIL - linux

My company ships a special-purpose TCP stack, and I'm tasked with implementing an RFC 793-compliant closing sequence. One of the unit tests has a server running on top of the special TCP stack talking to a normal Linux TCP client, and I'm running into some strange behaviour that I'm not sure is caused by a programming error on my part or is to be expected.
Before my work started, we used to send an RST packet when the user application called close(). I've implemented the FIN handshake, but I noticed that in the case of simultaneous TCP termination (FIN_WAIT_1 -> CLOSING -> TIME_WAIT on both ends, see the picture), the standard Linux TCP client cannot connect to the same destination address and port again, connect() failing with EADDRNOTAVAIL, until TIME_WAIT passes into CLOSED.
Now, the standard Linux client application sets the option SO_REUSEADDR, binds the socket to port 8888 each time, and connects to destination port 6666. My question is, why does bind() succeed and why does connect() fail? I would have thought SO_REUSEADDR could take over a local TIME_WAIT port, which it did, but what does connect() have against talking to the destination-ip:6666 again?
Is my code doing something it shouldn't or is this expected behaviour?
I can confirm no SYN packet for the failed connect() makes it out of the client machine at all. I've attached a screenshot of the FIN handshake for the above session.

Your previous implementation used RST to end the connection. Receipt of an RST packet immediately removes the connection from the active connection table. That's because there is no further possibility of receiving a valid packet on that connection: the peer has just told your system that that session is not valid.
If you do a proper session termination with FIN, on the other hand, there is the last packet problem: how do you know for sure whether the peer actually received the last acknowledgment you sent to their FIN (this is the definition of TCP's TIME_WAIT state)? And if the peer didn't receive it, they might validly send another copy of the FIN packet which your machine should then re-ACK.
Now, your bind succeeds because you're using SO_REUSEADDR, but you still cannot create a new connection with the exact same ports on both sides because that entry is still in your active connection table (in TIME_WAIT state). The 4-tuple (IP1, port1, IP2, port2) must always be unique.
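For illustration, here is a minimal sketch of the client pattern being discussed (the server IP is a placeholder; ports 8888 and 6666 come from the question). bind() succeeds because of SO_REUSEADDR, but connect() fails with EADDRNOTAVAIL while the old (local:8888, remote:6666) entry is still in TIME_WAIT, since the new connection would collide with that 4-tuple:

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in local = {0}, remote = {0};
    local.sin_family = AF_INET;
    local.sin_port = htons(8888);              /* fixed client port, as in the test */
    local.sin_addr.s_addr = htonl(INADDR_ANY);

    remote.sin_family = AF_INET;
    remote.sin_port = htons(6666);             /* server port */
    inet_pton(AF_INET, "192.0.2.1", &remote.sin_addr);  /* placeholder server IP */

    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
        perror("bind");        /* succeeds: SO_REUSEADDR allows the TIME_WAIT port */
    if (connect(fd, (struct sockaddr *)&remote, sizeof(remote)) < 0)
        perror("connect");     /* EADDRNOTAVAIL while the 4-tuple is in TIME_WAIT */

    close(fd);
    return 0;
}
```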
As @EJP suggested in the comments, it is unusual for the client to specify a port, and there is typically no reason to. I would change your test.

Related

TCP server ignores incoming SYN

I have a TCP server running. A client connects to the server and sends packets periodically. For the TCP server, this incoming connection shows as CONNECTED, and the server socket still listens for other connections.
Say this client suddenly gets powered off, with no FIN sent to the server. When it powers up again, it still uses the same port to connect, but the server doesn't reply to the SYN request. It just ignores the incoming request, since a connection with this port already exists.
How to let server close the old connection and accept new one?
My tcp server runs on Ubuntu 14.04, it's a Java program using ServerSocket.
That's not correct: a server can accept multiple connections and will accept a new connection from a rebooted client as long as it's connecting from a different port (and that's usually the case). If your program is not accepting it, it's because you haven't called accept() a second time. This probably means that your application is only handling one blocking operation at a time (for example, it might be stuck in a read() operation on the connected socket). The solution is to simultaneously read from the connected sockets and accept new connections, as in the sketch below. This can be done using an I/O multiplexer, like select(), or multiple threads.
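As a rough sketch (in C rather than the question's Java, but the structure maps directly onto a Java NIO Selector), a select()-based loop that both accepts new connections and reads from existing ones might look like this; error handling is trimmed for brevity:

```c
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>

void serve(int listen_fd)
{
    fd_set all, ready;
    int maxfd = listen_fd;
    FD_ZERO(&all);
    FD_SET(listen_fd, &all);

    for (;;) {
        ready = all;
        select(maxfd + 1, &ready, NULL, NULL, NULL);

        for (int fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &ready))
                continue;
            if (fd == listen_fd) {
                /* New connection: accept it without blocking the others. */
                int conn = accept(listen_fd, NULL, NULL);
                FD_SET(conn, &all);
                if (conn > maxfd)
                    maxfd = conn;
            } else {
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0) {               /* EOF or error: drop this client */
                    close(fd);
                    FD_CLR(fd, &all);
                }
                /* ...otherwise handle the n bytes in buf... */
            }
        }
    }
}
```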

Given one established tcp connection from client to server, is a second one resolved without server explicitly calling accept?

Apologies in advance if my terminology is very rudimentary:
I am working with a client that establishes a tcp connection to a server. The client's socket is nonblocking, so after calling connect(), the client waits for the socket to become writable.
Upon accept()ing the connection from the client, the server performs a blocking operation (call it function X) and does not return to blocking at accept() for a long time.
During this time that the server is occupied performing function X, the client does another connect() to the same server, again using a nonblocking socket (different than the socket used with the first connection), then waiting for the socket to become writable in order to consider the tcp connection as "established."
I naively expected the second socket to remain non-writable until the server called accept() a second time to accept that second connection. But I've observed that this is not the case: the second socket becomes writable quickly, so the client again considers this new TCP connection "established."
Is this expected?
From one of the comments at this question, I (very loosely) understand that nonblocking sockets that are in the middle of a tcp connect will remain not-writable for the duration that TCP handshaking is being performed - is this true? Does this relate to the above question? Is it something like: if there is an existing tcp connection from a client to a server, then subsequent tcp connections from that same client to that same server are immediately/quickly "resolved" (the socket becomes writable without the server explicitly performing a second accept)?
What I tried:
I tried writing up a unit test to simulate this scenario with one thread each for client and server running on a single PC, but I think this is not a valid way to test: per this Q&A I think if client and server are on the same PC, "TCP handshaking" is not quite the same as with two separate PCs, and for example, the client's connecting socket becomes writable without the server even listening let alone accepting the connection.
Every connect() needs a corresponding accept() in order for client and server to communicate with each other.
However, it is possible, indeed likely, that the 3-way TCP handshake is completed while the connection is still in the server's backlog, before accept() creates a new socket for it. Once the handshake is complete, the connection is "established", and that completes the connect() operation on the client's side, even if the connection has not been accept()ed yet on the server side.
See How TCP backlog works in Linux
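A minimal sketch of how one might observe this, assuming a server is already listening (but never accepting) at a placeholder address with a nonzero backlog:

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, O_NONBLOCK);

    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(6666);                       /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);   /* placeholder IP */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0
        && errno != EINPROGRESS) {
        perror("connect");
        return 1;
    }

    /* Wait for writability: this fires as soon as the 3-way handshake
     * completes, i.e. while the connection may still be sitting in the
     * server's backlog, before any accept() on the server side. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    select(fd + 1, NULL, &wfds, NULL, NULL);

    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    printf("connect result: %s\n", err ? strerror(err) : "established");
    close(fd);
    return 0;
}
```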

Can TCP RST packet reduce the connection timeout?

As a way to learn how raw sockets work, I programmed a dummy firewall which drops packets based on the TCP destination port. It is working, but the problem is that the client retries for quite some time until the timeout is finally reached.
I was wondering if the client retries for so long because it does not receive any answer. In that case, would it help if the firewall replied with a TCP RST to the TCP SYN messages from the client? If not, is there any way to force the client to stop retrying (not by reducing the timeout in Linux, but rather by sending a specific answer to its packets that will make the client stop)?
You can think of your firewall as the same case as if the port were closed on the host OS. What would the host OS's TCP/IP stack do?
RFC 793 (original TCP RFC) has the following to say about this case:
If the connection does not exist (CLOSED) then a reset is sent
in response to any incoming segment except another reset. In
particular, SYNs addressed to a non-existent connection are rejected
by this means.
You should read the TCP RFCs and make sure your TCP RST packet conforms to the requirements for this case. A malformed RST will be ignored by the client.
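As an illustrative sketch, this is roughly how the TCP header of such a RST could be filled in when answering a SYN, following RFC 793's rules for the CLOSED state and using the Linux struct tcphdr. The IP header, checksum computation, and raw-socket plumbing are left out:

```c
#include <netinet/tcp.h>
#include <arpa/inet.h>

void build_rst_for_syn(const struct tcphdr *syn, struct tcphdr *rst)
{
    rst->source = syn->dest;        /* swap the ports */
    rst->dest   = syn->source;
    rst->doff   = 5;                /* 20-byte header, no options */

    /* The SYN carries no ACK, so per RFC 793 the RST has SEQ = 0 and
     * ACK = SEG.SEQ + SEG.LEN; a SYN occupies one sequence number. */
    rst->seq     = htonl(0);
    rst->ack_seq = htonl(ntohl(syn->seq) + 1);
    rst->rst     = 1;
    rst->ack     = 1;
    /* window, checksum, etc. must still be filled in before sending. */
}
```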
RFC 1122 also indicates that an ICMP Destination Unreachable message with codes 2-4 should cause the connection to be aborted. It's important to note the codes, because codes 0, 1, and 5 are listed as MUST NOTs for aborting the connection:
Destination Unreachable -- codes 2-4
These are hard error conditions, so TCP SHOULD abort
the connection.
Your firewall is behaving correctly. It is a cardinal principle of information security not to disclose any information to the attacker. Sending an RST would disclose that the host exists.
There were firewalls that did that 15-20 years ago, but they were frowned upon. Nowadays they behave like yours: they just drop the packet and send nothing in reply.
It is normal for the client to retry a few times before giving up if there is no response, but contrary to what you have been told in comments, a client will give up immediately with 'connection refused' if it receives an RST. It only retries if there is no response at all.
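A quick way to observe the difference for yourself (the address and ports here are placeholders): a port that answers with RST fails immediately with ECONNREFUSED, while a port whose SYNs are silently dropped retries until ETIMEDOUT.

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static void try_connect(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_port = htons(port);
    inet_pton(AF_INET, ip, &a.sin_addr);

    if (connect(fd, (struct sockaddr *)&a, sizeof(a)) < 0)
        printf("port %d: %s\n", port, strerror(errno));
    else
        printf("port %d: connected\n", port);
    close(fd);
}

int main(void)
{
    try_connect("192.0.2.1", 7);   /* closed port: fast ECONNREFUSED (RST)  */
    try_connect("192.0.2.1", 8);   /* firewalled port: slow ETIMEDOUT (drop) */
    return 0;
}
```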

What does the ESTABLISHED indicator mean after running lsof

I ran the command sudo lsof -i -n -P | grep TCP and I was wondering if I could get some more clarification on its output.
Specifically, in this image:
Why do I have an IP:PORT pointing to another IP:PORT and then back at itself with the label 'ESTABLISHED'? I am confused about what this means exactly.
I'm not sure how familiar you are with networking and TCP in general, so I'll try to provide a brief description with a couple of details. Some of these concepts may be hard to follow without a background in networking internals, but I hope this helps:
The TCP protocol has various states. Think of it as a state machine. States on the client side include CLOSED, SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2 and TIME_WAIT.
Thus, the ESTABLISHED label means that the TCP connection is in the ESTABLISHED state. Being in the established state means that both hosts successfully completed the TCP 3-way handshake (and in doing so, transitioned from SYN_SENT to ESTABLISHED). The transition from CLOSED to SYN_SENT happens when the client side sends the TCP SYN request to the server.
In an established connection, both sides transmit and receive application specific data. Basically, a session is established and a bidirectional stream of bytes flows between the two end systems.
TCP sockets are uniquely identified by the 4-tuple (source-ip, source-port, destination-ip, destination-port). The IP identifies an end system's network interface, and the port number is used to multiplex and demultiplex packet arrival at that network interface (so that the target system knows which service to deliver the packets to). That's the meaning of the IP:PORT fields.
I'm not sure why you have two entries for the same connection. This might be system-dependent, although it's odd (in my system I get only one entry per socket). But sockets are bidirectional, so it may be the case that your system shows you each packet flow direction as a distinct entry. This might also depend on how the system implements sockets.
ESTABLISHED means that the TCP connection has completed the 3-way handshake (accept() need not have been called yet; a connection can be established while still sitting in the listen backlog). See the TCP state diagram.
Why do I have an IP:PORT pointing to another IP:PORT and then back at itself
That means you have two TCP sockets open in your process. Most likely, one listens on port 9092, and the other one connected from port 57633 to that listening socket. Port 57633 belongs to the ephemeral port range, i.e. the range of ports that the OS automatically assigns to sockets that call connect() without first calling bind() to request a specific port.
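For illustration, a minimal sketch that produces that kind of lsof picture: one process owning a listener on 9092 (the port from the question), a client end on an ephemeral port, and the accepted server end, so the two ESTABLISHED lines point at each other.

```c
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_port = htons(9092);
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lfd, (struct sockaddr *)&a, sizeof(a));
    listen(lfd, 1);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);   /* gets an ephemeral port */
    connect(cfd, (struct sockaddr *)&a, sizeof(a));
    int sfd = accept(lfd, NULL, NULL);

    pause();   /* keep the sockets open so lsof can show them */
    close(sfd); close(cfd); close(lfd);
    return 0;
}
```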

What is the meaning of SO_REUSEADDR (setsockopt option) - Linux? [duplicate]

From the man page:
SO_REUSEADDR: Specifies that the rules used in validating addresses supplied to bind() should allow reuse of local addresses, if this is supported by the protocol. This option takes an int value. This is a Boolean option.
When should I use it? What does "reuse of local addresses" give me?
TCP's primary design goal is to allow reliable data communication in the face of packet loss, packet reordering, and — key, here — packet duplication.
It's fairly obvious how a TCP/IP network stack deals with all this while the connection is up, but there's an edge case that happens just after the connection closes. What happens if a packet sent right at the end of the conversation is duplicated and delayed, such that the 4-way shutdown packets get to the receiver before the delayed packet? The stack dutifully closes down its connection. Then later, the delayed duplicate packet shows up. What should the stack do?
More importantly, what should it do if a program with open sockets on a given IP address + TCP port combo closes its sockets, and then a brief time later, a program comes along and wants to listen on that same IP address and TCP port number? (Typical case: A program is killed and is quickly restarted.)
There are a couple of choices:
Disallow reuse of that IP/port combo for at least 2 times the maximum time a packet could be in flight. In TCP, this is usually called the 2×MSL delay, where MSL is the maximum segment lifetime.
This is the default behavior of all common TCP/IP stacks. 2×MSL is typically between 30 and 120 seconds, and it shows up in netstat output as the TIME_WAIT period. After that time, the stack assumes that any rogue packets have been dropped en route due to expired TTLs, so that socket leaves the TIME_WAIT state, allowing that IP/port combo to be reused.
Allow the new program to re-bind to that IP/port combo. In stacks with BSD sockets interfaces — essentially all Unixes and Unix-like systems, plus Windows via Winsock — you have to ask for this behavior by setting the SO_REUSEADDR option via setsockopt() before you call bind().
SO_REUSEADDR is most commonly set in network server programs, since a common usage pattern is to make a configuration change, then be required to restart that program to make the change take effect. Without SO_REUSEADDR, the bind() call in the restarted program's new instance will fail if there were connections open to the previous instance when you killed it. Those connections will hold the TCP port in the TIME_WAIT state for 30-120 seconds, so you fall into case 1 above.
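As a sketch, the typical server startup pattern looks like this; the key point is that the option must be set before bind():

```c
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* Must happen before bind(), or it has no effect on this bind. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");   /* without SO_REUSEADDR: EADDRINUSE after a restart */
        return -1;
    }
    listen(fd, SOMAXCONN);
    return fd;
}
```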
The risk in setting SO_REUSEADDR is that it creates an ambiguity: the metadata in a TCP packet's headers isn't sufficiently unique for the stack to reliably tell whether a packet is stale, clearly intended for a now-dead listener and so to be dropped, or meant for the new listener's socket and so to be delivered.
If you don't see why that is true, here's everything the listening machine's TCP/IP stack has to work with, per connection, to make that decision:
Local IP: Not unique per-conn. In fact, our problem definition here says we're reusing the local IP, on purpose.
Local TCP port: Ditto.
Remote IP: The machine causing the ambiguity could re-connect, so that doesn't help disambiguate the packet's proper destination.
Remote port: In well-behaved network stacks, the remote port of an outgoing connection isn't reused quickly, but it's only 16 bits, so you've got 30-120 seconds to force the stack to get through a few tens of thousands of choices and reuse the port. Computers could do work that fast back in the 1960s.
If your answer to that is that the remote stack should do something like TIME_WAIT on its side to disallow ephemeral TCP port reuse, that solution assumes that the remote host is benign. A malicious actor is free to reuse that remote port.
I suppose the listener's stack could choose to strictly disallow connections from the TCP 4-tuple only, so that during the TIME_WAIT state a given remote host is prevented from reconnecting with the same remote ephemeral port, but I'm not aware of any TCP stack with that particular refinement.
Local and remote TCP sequence numbers: These are also not sufficiently unique that a new remote program couldn't come up with the same values.
If we were re-designing TCP today, I think we'd integrate TLS or something like it as a non-optional feature, one effect of which is to make this sort of inadvertent and malicious connection hijacking impossible. That requires adding large fields (128 bits and up) which wasn't at all practical back in 1981, when the document for the current version of TCP (RFC 793) was published.
Without such hardening, the ambiguity created by allowing re-binding during TIME_WAIT means you can either a) have stale data intended for the old listener be misdelivered to a socket belonging to the new listener, thereby either breaking the listener's protocol or incorrectly injecting stale data into the connection; or b) new data for the new listener's socket mistakenly assigned to the old listener's socket and thus inadvertently dropped.
The safe thing to do is wait out the TIME_WAIT period.
Ultimately, it comes down to a choice of costs: wait out the TIME_WAIT period or take on the risk of unwanted data loss or inadvertent data injection.
Many server programs take this risk, deciding that it's better to get the server back up immediately so as to not miss any more incoming connections than necessary.
This is not a universal choice. Many programs — even server programs requiring a restart to apply a settings change — choose instead to leave SO_REUSEADDR alone. The programmer may know these risks and is choosing to leave the default alone, or they may be ignorant of the issues but are getting the benefit of a wise default.
Some network programs offer the user a choice among the configuration options, fobbing the responsibility off on the end user or sysadmin.
SO_REUSEADDR allows your server to bind to an address which is in a TIME_WAIT state.
This socket option tells the kernel that even if this port is busy (in the TIME_WAIT state), go ahead and reuse it anyway. If it is busy, but with another state, you will still get an address already in use error. It is useful if your server has been shut down, and then restarted right away while sockets are still active on its port.
From unixguide.net
When you create a socket, you don't really own it. The OS (the TCP stack) creates it for you and gives you a handle (a file descriptor) to access it. When your socket is closed, it takes time for the OS to "fully close it" as it goes through several states. As EJP mentioned in the comments, the longest delay is usually from the TIME_WAIT state. This extra delay is required to handle edge cases at the very end of the termination sequence, making sure the last termination acknowledgement either got through or that the other side reset itself after a timeout. Here you can find some extra considerations about this state. The main considerations are as follows:
Remember that TCP guarantees all data transmitted will be delivered,
if at all possible. When you close a socket, the server goes into a
TIME_WAIT state, just to be really really sure that all the data has
gone through. When a socket is closed, both sides agree by sending
messages to each other that they will send no more data. This, it
seemed to me was good enough, and after the handshaking is done, the
socket should be closed. The problem is two-fold. First, there is no
way to be sure that the last ack was communicated successfully.
Second, there may be "wandering duplicates" left on the net that must
be dealt with if they are delivered.
If you try to create multiple sockets with the same ip:port pair in quick succession, you get the "address already in use" error because the earlier socket has not been fully released. Using SO_REUSEADDR gets rid of this error by relaxing bind()'s check against the lingering previous instance.
