TCP close() vs shutdown() in Linux

I know there are already a lot of similar questions on Stack Overflow, but none of them seems convincing. Basically I am trying to understand under what circumstances I need to use one over the other, or use both.
I would also like to understand whether close() and shutdown() with SHUT_RDWR are the same.

Closing TCP connections has gathered so much confusion that we can rightfully say either this aspect of TCP has been poorly designed, or the documentation is lacking.
Short answer
To do it the proper way, you should use all three: shutdown(SHUT_WR), shutdown(SHUT_RD) and close(), in this order. No, shutdown(SHUT_RDWR) and close() are not the same. Read their documentation carefully, along with questions on SO and articles about the topic; you need several sources to get an overview.
Longer answer
The first thing to clarify is what you are aiming for when closing a connection. Presumably you use TCP for a higher-level protocol (request-response, steady stream of data etc.). Once you decide to "close" (terminate) the connection, everything you had to send and receive has been sent and received (otherwise you would not decide to terminate), so what more do you want? Here is an outline of what you may want at the time of termination:
1. to know that all data sent in either direction reached the peer;
2. if there are any errors (in transmitting the data still in flight when you decided to terminate, as well as after that, and in performing the termination itself, which also requires data to be sent and received), the application is informed;
3. optionally, some applications want to be non-blocking up to and including the termination.
Unfortunately TCP doesn't make these features easily available, and the user needs to understand what's under the hood and how the system calls interact with what's under the hood. A key sentence is in the recv manpage:
When a stream socket peer has performed an orderly shutdown, the
return value will be 0 (the traditional "end-of-file" return).
What the manpage means here is that an orderly shutdown is done by one end (A) choosing to call shutdown(SHUT_WR), which causes a FIN packet to be sent to the peer (B), and this packet takes the form of a 0 return code from recv inside B. (Note: the FIN packet, being an implementation aspect, is not mentioned by the manpage.) The "EOF", as the manpage calls it, means there will be no more transmission from A to B, but application B can, and should, continue to send what it was in the process of sending, and potentially even send some more (A is still receiving). When that sending is done (shortly), B should itself call shutdown(SHUT_WR) to close the other half of the duplex. Now app A receives EOF and all transmission has ceased. The two apps are then OK to call shutdown(SHUT_RD) to close their sockets for reading, and then close() to free the system resources associated with the socket. (TODO: I haven't found clear documentation that says the two calls to shutdown(SHUT_RD) are what send the ACKs in the termination sequence FIN --> ACK, FIN --> ACK, but this seems logical.)
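As a rough illustration, the initiator's (A's) side of this sequence might look like the following sketch (error handling elided; sock is assumed to be a connected, blocking TCP socket):

#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Initiator (A): half-close, drain the peer's remaining data, then close. */
void terminate_connection(int sock)
{
    char buf[4096];
    ssize_t n;

    /* Sends a FIN: tells B "no more data from A". */
    shutdown(sock, SHUT_WR);

    /* Keep reading; B may still be sending. recv() returning 0
       means B has sent its own FIN (the "EOF" from the manpage). */
    while ((n = recv(sock, buf, sizeof(buf), 0)) > 0)
        ; /* process or discard the remaining data from B */

    /* Both directions are now shut; release the descriptor. */
    shutdown(sock, SHUT_RD);
    close(sock);
}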
On to our aims: for (1) and (2), the application must basically somehow wait for the shutdown sequence to happen and observe its outcome. Notice how, if we follow the small protocol above, it is clear to both apps that the termination initiator (A) has sent everything to B. This is because B received EOF (and EOF is received only after everything else). A also received EOF, which is issued in reply to its own EOF, so A knows B received everything (there is a caveat here: the termination protocol must have a convention of who initiates the termination, so that both peers do not do so at once). However, the reverse is not true. After B calls shutdown(SHUT_WR), there is nothing coming back at app level to tell B that A received all the data sent, plus the FIN (the A->B transmission had ceased!). Correct me if I'm wrong, but I believe at this stage B is in state LAST_ACK, and when the final ACK arrives (step #4 of the 4-way handshake), it concludes the close; but the application is not informed unless it had set SO_LINGER with a long-enough timeout. SO_LINGER "ON" instructs the closing call to block (be performed in the foreground), hence the call itself will do the waiting.
In conclusion, what I recommend is to configure SO_LINGER ON with a long timeout, which causes the call to block and hence report any errors. What is not entirely clear is whether it is shutdown(SHUT_WR) or shutdown(SHUT_RD) that blocks in expectation of the LAST_ACK, but that is of less importance as we need to call both.
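Setting the option takes a few lines before initiating the termination (a minimal sketch; the 10-second timeout is an arbitrary choice):

#include <sys/socket.h>

struct linger lng = {
    .l_onoff  = 1,   /* lingering ON: the closing call blocks... */
    .l_linger = 10,  /* ...for at most this many seconds */
};
setsockopt(sock, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));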
Blocking on shutdown is problematic for requirement #3 above where e.g. you have a single-threaded design that serves all connections. Using SO_LINGER may block all connections on the termination of one of them. I see 3 routes to address the problem:
1. shutdown with SO_LINGER, from a different thread. This will of course complicate a design.
2. linger in the background and either:
2A. "promote" the FIN and the reply FIN to app-level messages which you can read and hence wait for. This basically moves the problem that TCP was meant to solve one level higher, which I consider hack-ish, also because the ensuing shutdown calls may still end up in limbo.
2B. try to find a lower-level facility such as the SIOCOUTQ ioctl described here, which queries the number of unACKed bytes in the network stack. The caveats are many: it is Linux-specific, we are not sure whether it applies to the FIN's ACKs (to know whether closing is fully done), plus you'd need to poll it periodically, which is complicated. Overall I'm leaning towards option 1.
I tried to write a comprehensive summary of the issue, corrections/additions welcome.

TCP sockets are bidirectional - you send and receive over the one socket. close() stops communication in both directions. shutdown() takes an additional parameter that allows you to specify which direction you want to stop using.
Another difference (between close() and shutdown(SHUT_RDWR)) is that close() will keep the connection open if another process shares the socket, while shutdown() shuts down the connection irrespective of other processes.
shutdown() is often used by clients to provide framing - to indicate the end of their request. E.g. an echo service might buffer up what it receives until the client shutdown()s its send side, which tells the server that the client has finished; the server then replies. The client can still receive the reply because it has only shut down writing, not reading, on its socket.
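For example, the client side of such an echo-style exchange might look like this sketch (connection setup omitted; sock is assumed to be a connected TCP socket):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send one request, half-close to signal "end of request",
   then read the full reply until the server closes its side. */
void echo_request(int sock, const char *request, size_t request_len)
{
    char buf[4096];
    ssize_t n;

    send(sock, request, request_len, 0);
    shutdown(sock, SHUT_WR);               /* server's recv() returns 0 */

    while ((n = recv(sock, buf, sizeof(buf), 0)) > 0)
        fwrite(buf, 1, (size_t)n, stdout); /* reading still works */

    close(sock);
}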

close() will close both the sending and receiving ends of the socket. If you want to close only the sending part, not the receiving part (or vice versa), you can use shutdown().
close() -------> closes both the sending and receiving ends.
shutdown() -------> closes only the sending or the receiving end, depending on its argument:
SHUT_RD (shut down the reading end (receiving end))
SHUT_WR (shut down the writing end (sending end))
SHUT_RDWR (shut down both)

Related

What really is the "linger time" that can be set with SO_LINGER on sockets?

The man page explains little to nothing about that option, and while there is a ton of information available on the web and in answers on Stack Overflow, I discovered that much of the information provided there even contradicts itself. So what is that setting really good for, and why would I need to set or alter it?
When a TCP socket is disconnected, there are three things the system has to consider:
There might still be unsent data in the send-buffer of that socket which would get lost if the socket is closed immediately.
There might still be data in flight, that is, data has already been sent out to the other side but the other side has not yet acknowledged to have received that data correctly and it may have to be resent or otherwise is lost.
Closing a TCP socket is a three-way handshake with no confirmation of the third packet. As the sender doesn't know if the third packet has ever arrived, it has to wait some time and see if the second one gets resent. If it does, the third one has been lost and must be resent.
When you close a socket using the close() call, the system will usually not immediately destroy the socket but will first try to resolve all the three issues above to prevent data loss and ensure a clean disconnect. All of that happens in the background (usually within the operating system kernel), so despite the close() call returning immediately, the socket may still be alive for a while and even send out remaining data. There is a system specific upper time bound how long the system will try to get a clean disconnect before it will eventually give up and destroy the socket anyway, even if that means that data is lost. Note that this time limit can be in the range of minutes!
There is a socket option named SO_LINGER that controls how the system will close a socket. You can turn lingering on or off using that option and, if it is turned on, set a timeout (you can also set a timeout while it is turned off, but that timeout has no effect).
The default is that lingering is turned off, which means close() returns immediately and the details of the socket closing process are left up to the system which will usually deal with it as described above.
If you turn lingering on and set a timeout other than zero, close() will not return immediately. It will only return once issues (1) and (2) have been resolved (all data has been sent, no data is in flight anymore) or when the timeout has been hit. Which of the two was the case can be seen from the result of the close call. If it is success, all remaining data got sent and acknowledged; if it is failure and errno is set to EWOULDBLOCK, the timeout was hit and some data might have been lost.
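In code, checking that outcome might look like this sketch (blocking socket assumed; the 5-second timeout is arbitrary):

#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close with a 5-second linger and report the outcome. */
int close_lingering(int sock)
{
    struct linger lng = { .l_onoff = 1, .l_linger = 5 };
    setsockopt(sock, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));

    if (close(sock) == 0)
        return 0;   /* all remaining data sent and acknowledged */
    if (errno == EWOULDBLOCK)
        return -1;  /* timeout hit: some data might have been lost */
    return -1;      /* some other error, see errno */
}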
In case of a non-blocking socket, close() will not block, not even with a linger time other than zero. In that case there is no way to get the result of the close operation as you cannot ever call close() twice on the same socket. Even if the socket is lingering, once close returned, the socket file descriptor should have been invalidated and calling close again with that descriptor should result in a failure with errno set to EBADF ("bad file descriptor").
However, even if you set linger time to something really short, like one second and the socket won't linger for longer than one second, it will still stay around for a while after lingering to deal with issue (3) above. To ensure a clean disconnect, the implementation must ensure that the other side also has disconnected that connection, otherwise remaining data may still arrive for that already dead connection. So the socket will go into a state most systems call TIME_WAIT and stay in that state for a system specific amount of time, regardless if lingering is on and regardless what linger time has been set.
Except for one special case: If you enable lingering but set the linger time to zero, this changes pretty much everything. In that case a call to close() will really close the socket immediately. That means no matter if the socket is blocking or non-blocking, close() returns at once. Any data still in the send buffer is just discarded. Any data in flight is ignored and may or may not have arrived correctly at the other side. And the socket is also not closed using a normal TCP close handshake (FIN-ACK), it is killed instantly using a reset (RST). As a result, if the other side tries to send something over the socket after the reset, this operation will fail with ECONNRESET ("A connection was forcibly closed by the peer."), whereas a normal close would result in EPIPE ("The socket is no longer connected."). While most programs will treat EPIPE as a harmless event, they tend to treat ECONNRESET as a hard error if they didn't expect that to happen.
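To illustrate, the abortive variant described above boils down to this sketch (use with care: it discards unsent data and resets the connection):

#include <sys/socket.h>
#include <unistd.h>

/* Abortive close: discard pending data and send RST instead of FIN. */
void abort_connection(int sock)
{
    struct linger lng = { .l_onoff = 1, .l_linger = 0 };
    setsockopt(sock, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));
    close(sock);  /* returns at once; the connection is reset */
}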
Please note that this describes the socket behavior as found in the original BSD socket implementation (original means that this may not even match the behavior of modern BSD implementations such as FreeBSD, OpenBSD, NetBSD, etc.). While the BSD socket API has been copied by pretty much all other major operating systems today (Linux, Android, Windows, macOS, iOS, etc.), the behavior on these systems sometimes varies, as is also true with many other aspects of that API.
E.g. If a non-blocking socket with data in the send buffer is closed on BSD, linger is on and linger time is not zero, the close call will return at once but it will indicate a failure and the error will be EWOULDBLOCK (just like in case of a blocking socket after the linger timeout has been hit). Same holds true for Windows. On macOS this is not the case, close() will always return at once and indicate success, regardless of data in the send buffer or not. And in case of Linux, the close() call will actually block in that case up to the linger timeout, despite the socket being non-blocking.
To learn more about how different systems actually deal with different linger settings, have a look at the following two links (please note that the TLS certificates on that site have expired and the site is only available via HTTPS; sorry about that, but it's the best information currently available):
For blocking sockets:
https://www.nybek.com/blog/2015/03/05/cross-platform-testing-of-so_linger/
For non-blocking sockets:
https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/
As you can see, the behavior might also change depending on whether shutdown() has been called prior to close() and on other system-specific aspects, including oddities like a linger timeout having an effect despite lingering being turned off completely.
Another system-specific behavior is what happens if your process dies without closing a socket first. In that case the system will close the socket on your behalf, and some systems tend to ignore any linger setting when they have to do so and just fall back to the system's default behavior. They cannot "block" on the socket close in that case anyway, but some systems will even ignore a timeout of zero and do a FIN-ACK in that case.
So it's not true that setting a linger timeout of zero will prevent sockets from ever entering the TIME_WAIT state. It depends on how the socket has been closed (shutdown(), close()), by whom it has been closed (your own code or the system), whether it was blocking or non-blocking, and ultimately, on the system your code is running on. The only true statement that can be made is:
If you manually close a socket that is blocking (at least at the moment you close it; it might have been non-blocking before) and this socket has lingering enabled with a timeout of zero, this is your best chance to avoid that socket going into the TIME_WAIT state. There is no guarantee it won't, but if that doesn't prevent it, there is nothing else you could do to prevent it, unless you have a way to ensure that the peer on the other side initiates the close for you; only the side initiating the close operation may end up in a TIME_WAIT state.
So my personal pro tip is: if you design a server-client protocol, design it in such a way that normally the client closes the connection first, because it is quite undesirable for server sockets to end up in the TIME_WAIT state, but it is even more undesirable for connections to be closed by RST, as that can lead to loss of data previously sent to the client.

Linux: send whole message or none of it on TCP socket

I'm sending various custom message structures down a non-blocking TCP socket. I want to either send the whole structure in one send() call, or get an error with no bytes sent if there's only room in the send buffer for part of the message (i.e. send() returns EWOULDBLOCK). If there's not enough room, I will throw away the whole structure and report overflow, but I want to be able to recover after that, i.e. the receiver only ever receives a sequence of valid, complete structures. Is there a way of either checking the send buffer's free space, or telling the send() call to behave as described? Datagram-based sockets aren't an option; it must be connection-based TCP. Thanks.
Linux provides a SIOCOUTQ ioctl() to query how much data is in the TCP output buffer:
http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html
You can use that, plus the value of SO_SNDBUF, to determine whether the outgoing buffer has enough space for any particular message. So strictly speaking, the answer to your question is "yes".
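A sketch of such a check (error handling elided; note that Linux reports SO_SNDBUF at twice the requested value, so treat the result as approximate, and the check is inherently racy if other threads can send):

#include <linux/sockios.h>  /* SIOCOUTQ */
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Returns 1 if the send buffer currently has room for len more bytes. */
int send_buffer_has_room(int sock, int len)
{
    int queued = 0, sndbuf = 0;
    socklen_t optlen = sizeof(sndbuf);

    ioctl(sock, SIOCOUTQ, &queued);  /* bytes still queued in the send buffer */
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen);

    return queued + len <= sndbuf;
}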
But there are two problems with this approach. First, it is Linux-specific. Second, what are you planning to do when there is not enough space to send your whole message? Loop and call select again? But that will just tell you the socket is ready for writing again, causing you to busy-loop.
For efficiency's sake, you should probably bite the bullet and just deal with partial writes; let the network stack worry about breaking your stream up into packets for optimal throughput.
TCP has no support for transactions; this is something which you must handle on layer 7 (application).

TCP Message framing + recv() [linux]: Good conventions?

I am trying to create a p2p application on Linux, which I want to run as efficiently as possible.
The issue I have is with managing packets. As we know, there may be more than one packet in the recv() buffer at any time, so there is a need to have some kind of message framing system to make sure that multiple packets are not treated as one big packet.
So at the moment my packet structure is:
(u16int Packet Length):(Packet Data)
Which requires two calls to recv(); one to get the packet size, and one to get the packet.
There are two main problems with this:
1. A malicious peer could send a packet with a size header of something large, but not send any more data. The application will hang on the second recv(), waiting for data that will never come.
2. Assuming that calling recv() has a noticeable performance penalty (I actually have no idea, correct me if I am wrong), calling recv() twice will slow the program down.
What is the best way to structure the packets/receiving system for both the best efficiency and stability? How do other applications do it? What do you recommend?
Thank you in advance.
I think your "framing" of messages within a TCP stream is right on.
You could consider putting a "magic cookie" in front of each frame (e.g. write the 32-bit int 0xdeadbeef at the top of each frame header in addition to the packet length), such that it becomes obvious that you are reading a frame header on the first of each recv() pair. If the magic integer isn't present at the start of the message, you have gotten out of sync and need to tear the connection down.
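A frame header along those lines might look like this sketch (the exact layout is up to you; the packed attribute is GCC/Clang-specific):

#include <stdint.h>

#define FRAME_MAGIC 0xdeadbeefu

/* Wire format of each frame header, sent in network byte order. */
struct frame_header {
    uint32_t magic;   /* must equal FRAME_MAGIC, else we are out of sync */
    uint16_t length;  /* number of payload bytes that follow */
} __attribute__((packed));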
Multiple recv() calls will not likely be a performance hit. As a matter of fact, because TCP messages can get segmented, coalesced, and stalled in unpredictable ways, you'll likely need to call recv() in a loop until you get all the data you expected. This includes your two-byte header as well as the larger read of the payload bytes. It's entirely possible you call recv with a 2-byte buffer to read the "size" of the message, but only get 1 byte back. (Call recv again, and you'll get the subsequent bytes.) What I tell the developers on my team: code your network parsers as if it were possible that recv delivered only 1 byte at a time.
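In practice that means wrapping recv() in a loop that keeps going until the requested byte count has arrived, as in this sketch (blocking socket assumed):

#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly len bytes; recv() may legally return fewer per call.
   Returns 0 on success, -1 on error or premature EOF. */
int recv_exact(int sock, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(sock, p, len, 0);
        if (n <= 0)            /* 0: peer closed mid-frame; <0: error */
            return -1;
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}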
You can use non-blocking sockets and the "select" call to avoid hanging. If the data doesn't arrive within a reasonable amount of time (or more data arrives than expected - such that syncing on the next message becomes impossible), you just tear the connection down.
I'm working on a P2P project of my own. Would love to trade notes. Follow up with me offline if you like.
I disagree with the others: TCP is a reliable protocol, so a magic header on each packet is useless unless you fear that your own client code isn't stable, or that unsolicited clients connect to your port number.
Create a buffer for each client and use non-blocking sockets and select/poll/epoll/kqueue. If there is data available from a client, read as much as you can; it doesn't matter if you read more "packets" than one. Then check whether you've read enough that the size field is available, and if so, check that you've read the whole packet (or more). If so, process the packet. Then, if there's more data, repeat the procedure. If a partial packet is left over, you can move it to the start of your buffer, or use a circular buffer so you don't have to do those memmoves. (A sketch of this loop follows at the end of this answer.)
Client timeout can be handled in your select/... loop.
That's what I would use if you're doing something complex with the received packet data. If all you do is write the results to a file (in bigger chunks), then sendfile/splice yields better performance. Just read the packet length (could be multiple reads) and then use multiple calls to sendfile until you've read the whole packet (keep track of how much is left to read).
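For the buffered-parsing route, a sketch of the per-client loop might look like this (length-prefixed frames as in the question; process_packet is a hypothetical handler you supply):

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

void process_packet(const char *data, size_t len);  /* your handler */

struct client { char buf[65536]; size_t used; int sock; };

/* Called when select/epoll reports the socket readable: append whatever
   arrived, then peel off as many complete packets as the buffer holds. */
void on_readable(struct client *c)
{
    ssize_t n = recv(c->sock, c->buf + c->used, sizeof(c->buf) - c->used, 0);
    if (n <= 0)
        return;  /* caller handles EOF, errors and EWOULDBLOCK */
    c->used += (size_t)n;

    for (;;) {
        uint16_t len;
        if (c->used < sizeof(len))
            break;                        /* size field not complete yet */
        memcpy(&len, c->buf, sizeof(len));
        len = ntohs(len);
        if (c->used < sizeof(len) + len)
            break;                        /* whole packet not here yet */
        process_packet(c->buf + sizeof(len), len);
        c->used -= sizeof(len) + len;
        memmove(c->buf, c->buf + sizeof(len) + len, c->used);
    }
}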
You can use non-blocking calls to recv() (by creating the socket with SOCK_NONBLOCK, or setting O_NONBLOCK with fcntl()), and wait for descriptors to become ready for reading using select() (with a timeout) in a loop.
Then if a file descriptor is in the "waiting for data" state for too long, you can just close the socket.
TCP is a stream-oriented protocol - it doesn't actually have any concept of packets. So, in addition to receiving multiple application-layer packets in one recv() call, you might also receive only part of an application-layer packet, with the remainder coming in a future recv() call.
This implies that robust receiver behaviour is obtained by receiving as much data as possible at each recv() call, then buffering that data in an application-layer buffer until you have at least one full application-layer packet. This also avoids your two-calls-to-recv() problem.
To always receive as much data as possible at each recv(), without blocking, you should use non-blocking sockets and call recv() until it returns -1 with errno set to EWOULDBLOCK.
As others said, a leading magic number (OT: man file) is a good (99.999999%) solution for identifying datagram boundaries, and a timeout (using non-blocking recv()) is good for detecting missing/late packets.
If you are worried about attackers, you should put a CRC in your packets. If a professional attacker really wants to, he/she will figure out, sooner or later, how your CRC works, but it is still harder than creating packets without a CRC. (Also, if safety is critical, you will find SSL libs/examples/code on the net.)

How do I use EPOLLHUP

Could you guys provide me a good code sample using EPOLLHUP for dead peer handling? I know that it is a signal to detect a user disconnection, but I am not sure how to use it in code. Thanks in advance.
You use EPOLLRDHUP to detect peer shutdown, not EPOLLHUP (which signals an unexpected close of the socket, i.e. usually an internal error).
Using it is really simple, just "or" the flag with any other flags that you are giving to epoll_ctl. So, for example instead of EPOLLIN write EPOLLIN|EPOLLRDHUP.
After epoll_wait, do an if(my_event.events & EPOLLRDHUP) followed by whatever you want to do if the other side closed the connection (you'll probably want to close the socket).
Note that getting a "zero bytes read" result when reading from a socket also means that the other end has shut down the connection, so you should always check for that too, to avoid nasty surprises (the FIN might arrive after you have woken up from EPOLLIN but before you call read; if you are in ET mode, you will not get another notification).
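Putting that together, a minimal sketch (error handling elided; epfd is assumed to be an existing epoll instance and conn a connected socket):

#include <sys/epoll.h>
#include <unistd.h>

/* Register the connection asking for read and peer-shutdown events,
   then service one batch of events. */
void watch_connection(int epfd, int conn)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
    ev.data.fd = conn;
    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);

    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLRDHUP) {
            /* the other side closed the connection: clean up */
            close(events[i].data.fd);
            continue;
        }
        if (events[i].events & EPOLLIN) {
            /* read here; remember recv() == 0 also means peer closed */
        }
    }
}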

winsock 2. thread safety for simultaneous send's. tcp

Is it possible to have multiple threads sending on the same socket? Will there be interleaving of the streams, or will the socket block on the first thread (assuming TCP)? The majority of opinions I've found seem to warn against doing this for obvious fears of interleaving, but I've also found a few comments that state the opposite. Are interleaving fears a carryover from Winsock 1, and are they well-founded for Winsock 2? Is there a way to set up a Winsock 2 socket that would allow for lack of local synchronization?
Two of the contrary opinions are below... who's right?
comment 1
"Winsock 2 implementations should be completely thread safe. Simultaneous reads / writes on different threads should succeed, or fail with WSAEINPROGRESS, depending on the setting of the overlapped flag when the socket is created. Anyway by default, overlapped sockets are created; so you don't have to worry about it. Make sure you don't use NT SP6, if ur on SP6a, you should be ok !"
source
comment 2
"The same DLL doesn't get accessed by multiple processes as of the introduction of Windows 95. Each process gets its own copy of the writable data segment for the DLL. The "all processes share" model was the old Win16 model, which is luckily quite dead and buried by now ;-)"
source
looking forward to your comments!
jim
~edit1~
To clarify what I mean by interleaving: thread 1 sends the msg "Hello", thread 2 sends the msg "world!". The recipient receives: "Hwoel lorld!". This assumes both messages were NOT sent in a while loop. Is this possible?
I'd really advise against doing this in any case. The send functions might send less than you tell them to for various very legitimate reasons, and if another thread enters and tries to also send something, you're just messing up your data.
Now, you can certainly write to a socket from several threads, but you no longer have any control over what goes on the wire unless you have proper locking at the application level.
Consider sending some data:
WSABUF wsabuf = { buflen, buf };
DWORD sent = 0;
WSASend(sock, &wsabuf, 1, &sent, 0, NULL, NULL);
The sent parameter will hold the number of bytes actually sent - similar to the return value of the send() function. To send all the data in buf you will have to loop, calling WSASend until all the data actually gets sent.
If, say, the first WSASend sends all but the last 4 bytes, another thread might go and send something while you loop back and try to send the last 4 bytes.
With proper locking to ensure that can't happen, it should be no problem sending from several threads - though I wouldn't do it anyway, just for the pure hell it will be to debug when something does go wrong.
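If you do share the socket anyway, the lock has to cover the whole send-until-done loop, not just the individual send call. A sketch using POSIX threads (on Windows, a CRITICAL_SECTION around the WSASend loop would play the same role):

#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>

static pthread_mutex_t send_lock = PTHREAD_MUTEX_INITIALIZER;

/* Send an entire message atomically with respect to other threads. */
int send_all_locked(int sock, const char *buf, size_t len)
{
    int rc = 0;
    pthread_mutex_lock(&send_lock);  /* no interleaving inside the loop */
    while (len > 0) {
        ssize_t n = send(sock, buf, len, 0);
        if (n < 0) { rc = -1; break; }  /* a partial message may be on the wire */
        buf += n;
        len -= (size_t)n;
    }
    pthread_mutex_unlock(&send_lock);
    return rc;
}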
is it possible to have multiple threads sending on the same socket?
Yes - although, depending on implementation this can be more or less visible. First, I'll clarify where I am coming from:
C# / .Net 3.5
System.Net.Sockets.Socket
The overall visibility (i.e. required management) of threading and the headaches incurred will depend directly on how the socket is implemented (synchronously or asynchronously). If you go the synchronous route then you have a lot of work to manually manage connecting, sending, and receiving over multiple threads. I highly recommend avoiding this implementation. The effort to perform the synchronous methods correctly and efficiently in a threaded model simply is not worth it compared to the effort of implementing the asynchronous methods.
I have implemented an asynchronous Tcp server in less time than it took for me to implement the threaded synchronous version. Async is much easier to debug - and if you are intent on Tcp (my favorite choice) then you really have few worries in lost messages, missing data, or whatever.
will there be interleaving of the streams or will the socket block on the first thread (assuming tcp)?
I had to research interleaved streams (from wiki) to ensure that I was accurate in my understanding of what you are asking. To further understand interleaving and mixed messages, refer to these links on wiki:
Real Time Messaging Protocol
Transmission Control Protocol
Specifically, the power of Tcp is best described in the following section:
Due to network congestion, traffic load balancing, or other unpredictable network behavior, IP packets can be lost, duplicated, or delivered out of order. TCP detects these problems, requests retransmission of lost packets, rearranges out-of-order packets, and even helps minimize network congestion to reduce the occurrence of the other problems. Once the TCP receiver has finally reassembled a perfect copy of the data originally transmitted, it passes that datagram to the application program. Thus, TCP abstracts the application's communication from the underlying networking details.
What this means is that interleaved messages will be re-ordered into their respective messages as sent by the sender. It is expected that threading is or would be involved in developing a performance-driven Tcp client/server mechanism - whether through async or sync methods.
In order to keep a socket from blocking, you can set its Blocking property to false.
I hope this gives you some good information to work with. Heck, I even learned a little bit...
