Non-blocking TCP connect with epoll - Linux

My Linux application performs a non-blocking TCP connect() and then uses epoll_wait to detect completion of the three-way handshake.
Sometimes epoll_wait returns with both POLLOUT & POLLERR events set for the same socket descriptor.
I would like to understand what's going on at the TCP level. My guess is that between two calls to epoll_wait inside my event loop there was a SYN+ACK/ACK/FIN sequence, but I'm not able to reproduce it on demand.

This is likely to happen if the connect has failed - for example with "connection timed out". For a socket doing a non-blocking connect, POLLOUT becomes set when the connect operation finishes, whether it succeeded or failed.
When POLLOUT becomes set for the socket, use getsockopt(sock, SOL_SOCKET, SO_ERROR, ...) to check whether the connect succeeded: SO_ERROR is 0 on success, and otherwise holds the error code explaining why the connect failed.
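As a rough illustration, a minimal sketch of that pattern might look like the following (127.0.0.1:8080 is a placeholder address and most error handling is omitted):

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);   /* non-blocking mode */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                    /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 && errno != EINPROGRESS) {
        perror("connect");                          /* immediate failure */
        return 1;
    }

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLOUT, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    struct epoll_event out;
    epoll_wait(epfd, &out, 1, -1);                  /* wait for the handshake to finish */

    /* EPOLLOUT is reported for success and failure alike; SO_ERROR tells them apart. */
    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0)
        fprintf(stderr, "connect failed: %s\n", strerror(err));
    else
        printf("connected\n");

    close(epfd);
    close(fd);
    return 0;
}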

Here is some good information on non-blocking TCP connect().
When a socket error is detected (e.g. connection closed/refused/timed out), epoll will return the registered interest events POLLIN/POLLOUT together with POLLERR. So epoll_wait() will return POLLOUT|POLLERR if you registered POLLOUT, or POLLIN|POLLOUT|POLLERR if POLLIN|POLLOUT was registered.
Just because epoll returns POLLIN doesn't mean there will be data available to read, since recv() may just return the error from the non-blocking connect() call. I think epoll returns all the registered events along with POLLERR to make sure the program calls send()/recv()/etc. and gets the socket error. Some programs never check for POLLERR/POLLHUP and only catch socket errors on the next send()/recv() call.

Related

What constitutes "readable" (kqueue/epoll)

I know that if the remote host gracefully shuts down a connection, epoll will report EPOLLIN, and calling read or recv will not block, and will return 0 bytes (i.e. end of stream).
However, if the connection is not closed gracefully, and a write or send operation fails, does this cause epoll to subsequently return EPOLLIN for that socket, producing the same/similar end of stream scenario?
I've tried to find documentation on this behaviour, but have not succeeded, and while I could test it, I'm not interested in what happens on a specific distribution with a specific kernel version.
It is indeed not entirely obvious from the specification, but it works as follows for poll():
If there is data available to be read, even if the connection is closed, POLLIN is returned.
If neither reading nor writing is possible because of a closed connection, POLLHUP or POLLERR is returned.
If reading is no longer possible but writing is (such as if the other side did shutdown(SHUT_WR)), POLLIN is returned and POLLHUP and POLLERR are not returned. (This allows waiting for POLLOUT normally.)
The simple thing to do is to try a read when any of POLLIN, POLLHUP and POLLERR are set.
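As a sketch of that rule (assuming fd is a connected, non-blocking TCP socket; the buffer size is arbitrary):

#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

void wait_and_read(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, -1) <= 0)
        return;

    /* Attempt a read for POLLIN, POLLHUP and POLLERR alike. */
    if (pfd.revents & (POLLIN | POLLHUP | POLLERR)) {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* data available, even if the peer has already closed its side */
        } else if (n == 0) {
            /* orderly end of stream: the peer closed or did shutdown(SHUT_WR) */
        } else if (errno != EAGAIN && errno != EWOULDBLOCK) {
            perror("read");   /* abnormal termination, e.g. ECONNRESET */
        }
    }
}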
In kqueue(), there is just an EVFILT_READ filter that may be triggered. This is described in the man page and should be clear enough.
Note that if you don't enable TCP keepalives (FreeBSD enables them by default but most other operating systems do not), waiting for data to read may get stuck forever if the network breaks in certain ways. Even if TCP keepalives are on, it tends to take a few hours to detect a broken connection.
It may not return EPOLLIN when the peer machine goes down unexpectedly. In the past, I encountered this kind of phenomenon with VirtualBox, using the following steps:
Launch server on one VM.
Launch a client on the other VM, connect to the server, and keep the connection open without doing anything.
Save client VM state (something like hibernate).
And I saw the connection was still ESTABLISHED on the server VM with
netstat -anp --tcp
In other words, EPOLLIN was not triggered in server.
http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/ says that, by default, the kernel waits about 7200 seconds of idle time before it starts sending keepalive probes.
Of course, you can change the keepalive timeout values with setsockopt() or via kernel parameters.
But some books say a better solution is to detect it at the application layer, e.g. design the protocol so that dummy messages are sent periodically to probe the connection state.
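For example, on Linux the per-socket keepalive knobs look roughly like this (TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific options; the values below are arbitrary placeholders, not recommendations):

#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT */
#include <sys/socket.h>

static int enable_keepalive(int fd)
{
    int on = 1;
    int idle = 60;     /* seconds of idle time before the first probe (default 7200) */
    int intvl = 10;    /* seconds between probes */
    int cnt = 5;       /* unanswered probes before the connection is declared dead */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) return -1;
    return 0;
}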
epoll() is basically poll(), but it scales better as the number of fds grows. I am not sure what it does when you use it as an edge-triggered interface, but for level-triggered mode - yes, it will always return EPOLLIN if end of stream is detected, provided you are listening for that event.
Though you must know TCP is not perfect. If the connection is terminated abnormally on the other side (e.g. the physical link goes down), your side may never detect this until you write to the socket. TCP keepalive may help, but not much.
However, if the connection is not closed gracefully, and a write or send operation fails, does this cause epoll to subsequently return EPOLLIN for that socket, producing the same/similar end of stream scenario?
No. That would imply receipt of a FIN, which means normal termination of the connection, which didn't happen. I would expect you would get an EPOLLERR or maybe EPOLLHUP.
But I'm curious why you wouldn't have already closed the socket on getting the write error, and why you would still be polling it. That's not correct behaviour.

How to check if a TCP peer has closed the connection

I have a TCP network connection with a remote host (Windows or Linux).
If the remote host's process is terminated, recv() fails and I know the connection is closed.
But is there any way to check whether the remote host has closed the connection without actually receiving data?
The point is, I want to periodically check if the remote host is still alive, but I don't want to send or receive any data.
Thank you in advance.
To be clear, recv() should not fail if the connection has been closed properly. It should return EOF (0 bytes). recv() will fail if the connection was closed abnormally.
To check if the connection has been closed without actually receiving data if it hasn't been, the best you can probably do is call recvmsg() with the MSG_PEEK flag. Ask for just one byte. If the connection has been closed then you'll get EOF (normal close) or an error (abnormal close). If it hasn't been closed, you'll either get EAGAIN (assuming you put the socket in non-blocking mode) or one byte of data. So, yes, technically you received a byte of data, but because of MSG_PEEK the kernel doesn't record the fact that you did, so it's as if you didn't. This all assumes that you've already read out of the kernel's buffer all of the data from the stream that arrived before the prospective error might have happened.
Of course rakib's comment applies: "There's no way to check if the remote host is alive, if you don't want to give or recv data.". Meaning this method won't detect scenarios like the remote host disappearing off the network without closing the connection, etc...
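A sketch of that check (recv() with MSG_PEEK works the same way as recvmsg() here; MSG_DONTWAIT stands in for putting the socket in non-blocking mode, and peer_closed is just an illustrative name):

#include <errno.h>
#include <sys/socket.h>

/* Returns 1 if the peer has closed the connection (normally or abnormally),
 * 0 if it still looks alive. Assumes all previously queued data has already
 * been read from the socket. */
static int peer_closed(int fd)
{
    char c;
    ssize_t n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0)
        return 1;                       /* orderly close: FIN received */
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
        return 1;                       /* abnormal close, e.g. ECONNRESET */
    return 0;                           /* no data yet, or one byte peeked and left in place */
}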
If you are monitoring the socket's ready-for-read state via select(), select() will return and indicate that the socket has data ready to read. Then when you try to recv() that data, recv() will either fail (if there was an error) or return 0 (to indicate EOF if the connection was closed cleanly).

Winsock send() blocks server?

I have read that the send() function on Winsock blocks until the ACK for the last packet is received. Now I am playing with a server for a turn-based role-playing game. Everything is handled by one thread (for 64 sockets). A request is received, handled, and a response written to the socket(s). This process cannot be interrupted.
Is it possible to handle, say 1000 clients (one thread for every 64 sockets) with this method?
Wouldn't it block the whole server if a send() takes too long to complete or the client maliciously does not send the ACK or the connection gets interrupted?
Shall I split the networking logic and request handling into two threads? Even then, the thread handling the network transfers could still be blocked by a send() or recv().
Or would it be best to use overlapped I/O?
send() blocks only if the socket is running in blocking mode and the socket's outbound buffer fills up with queued data. If you are managing multiple sockets in the same thread, do not use blocking mode: if one receiver does not read data in a timely manner, it can stall every connection handled by that thread. Use non-blocking mode instead; send() will then report (with WSAEWOULDBLOCK) when a socket cannot accept more data right now, and you can use select() to detect when the socket becomes writable again. A better option is to use overlapped I/O or I/O completion ports: submit outbound data to the OS and let the OS handle all of the waiting for you, notifying you when the data has eventually been accepted/sent. Do not submit new data for a given socket until you receive that notification. For scalability to a large number of connections, I/O completion ports are generally the better choice.
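A rough sketch of the non-blocking-plus-select() variant (WSAStartup and error handling omitted; send_all is just an illustrative helper, and in a real one-thread-many-sockets server you would queue the unsent remainder and add the socket to your main select()'s write set instead of blocking here):

#include <winsock2.h>

int send_all(SOCKET sock, const char *buf, int len)
{
    u_long nonblocking = 1;
    ioctlsocket(sock, FIONBIO, &nonblocking);       /* put the socket in non-blocking mode */

    int sent = 0;
    while (sent < len) {
        int n = send(sock, buf + sent, len - sent, 0);
        if (n > 0) {
            sent += n;
        } else if (n == SOCKET_ERROR && WSAGetLastError() == WSAEWOULDBLOCK) {
            fd_set wfds;
            FD_ZERO(&wfds);
            FD_SET(sock, &wfds);
            select(0, NULL, &wfds, NULL, NULL);     /* wait until the socket can take more data */
        } else {
            return SOCKET_ERROR;                    /* real error */
        }
    }
    return sent;
}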
No, it doesn't work like that. From the MSDN documentation on send:
The successful completion of a send function does not indicate that the data was successfully delivered and received to the recipient. This function only indicates the data was successfully sent.

Proper way to shutdown() a socket on Linux, differences between Linux and Windows?

I am losing data on my sockets because I am doing a close().
The Linux-specific shutdown() manpage is not helpful:
The shutdown() call causes all or part of a full-duplex connection on
the socket associated with sockfd to be shut down. If how is SHUT_RD,
further receptions will be disallowed. If how is SHUT_WR, further
transmissions will be disallowed. If how is SHUT_RDWR, further
receptions and transmissions will be disallowed.
Microsoft's MSDN is MUCH better, but it being Windows specific, there are differences between it and Linux:
To assure that all data is sent and received on a connected socket
before it is closed, an application should use shutdown to close
connection before calling closesocket. One method to wait for
notification that the remote end has sent all its data and initiated a
graceful disconnect uses the WSAEventSelect function as follows :
1. Call WSAEventSelect to register for FD_CLOSE notification.
2. Call shutdown with how=SD_SEND.
3. When FD_CLOSE received, call the recv or WSARecv until the function completes with success and indicates that zero bytes were received. If SOCKET_ERROR is returned, then the graceful disconnect is not possible.
4. Call closesocket.
My Question
Under Linux, what is the equivalent of waiting for FD_CLOSE (step 1)?
I am getting answers and comments that assume I am asking about the behavior on Windows. I am asking about the behavior on Linux; I am merely referencing the Windows documentation because it is much more clear and complete than the Linux manpages.
The MSDN suggestions are followed in my very detailed answer to close() is not closing socket properly. This is essentially the same as Remy Lebeau's answer here.
Notice how the Microsoft documentation you quoted says: "One method to wait for notification ..." using WSAEventSelect. That means it is not the only way to do it. The same documentation also describes a similar approach using "overlapped receive calls" via WSARecv() instead. However, a more common (and not event-driven) approach is to call shutdown(), then call recv() in a loop until it returns 0 (graceful disconnect) or SOCKET_ERROR (-1), then call closesocket().
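On Linux, that non-event-driven sequence looks roughly like this (a sketch assuming a blocking socket; error paths trimmed):

#include <sys/socket.h>
#include <unistd.h>

static void graceful_close(int fd)
{
    char buf[4096];
    ssize_t n;

    shutdown(fd, SHUT_WR);                  /* send our FIN; we will not write any more */

    /* Drain until the peer finishes sending and closes its side. */
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
        ;                                   /* discard (or process) any remaining data */

    /* n == 0: peer's FIN received (graceful); n < 0: abnormal termination. */
    close(fd);
}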

send(2) succeeds with an established connection on an unreachable network

I have some trouble understanding the send(2) syscall on my Linux x86 box.
Suppose I have established an SSH connection from my app to another host on the LAN. Then I take down the network (e.g. unplug the cable) and call the function (from my app) that sends some SSH packets through the connection. Internally this function calls send like:
w = send(s->fd_out,buffer, len, 0);
In the debugger I found that send returns len (i.e. w == len after the call).
How can this be if the network is unreachable? When I call netstat, it says my SSH connection is in the ESTABLISHED state even though the network is down.
I can't understand why send completes normally and doesn't return any error (like EPIPE or ECONNRESET). Maybe the SSH connection lives on for some time after the network goes down?
Thanks to all.
It's due to the implementation of TCP (and SSH uses TCP). Your send() just writes to a socket, which is just a file descriptor; a successful return only means the write operation succeeded, not that the data has been delivered. The kernel keeps the TCP state alive for a long time before failing a session; in fact, it is allowed to keep the session indefinitely until you explicitly call close() or kill your process. So your data is actually buffered in kernel space, to be delivered by the network stack later.
Here is a quick experiment you can do:
Write a server that keeps receiving messages after establishing a connection:
socket();
bind();
listen();
accept();
while (1) {
    recv();
}
Write a client that establishes a connection, reads lines from stdin, and sends a message to the server whenever you hit return:
socket();
connect();
while (1) {
    getline();
    send();
}
Be careful that you NEVER call close() in the while loop on either side. Now, if you unplug your cable AFTER you've established the connection, send a message, plug the cable back in, and send another message, you will find both messages on the server side.
What you will NEVER observe is that you receive the second message before the first one. You either lose them all, or receive them in order.
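For concreteness, here is a fleshed-out sketch of the server half of that experiment (port 9000 is a placeholder; error checking omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                 /* placeholder port */

    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 1);

    int cfd = accept(lfd, NULL, NULL);           /* one connection, kept open */

    char buf[1024];
    ssize_t n;
    while ((n = recv(cfd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[n] = '\0';
        printf("got: %s", buf);                  /* both messages show up here, in order */
    }
    return 0;                                    /* close() is deliberately never called in the loop */
}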
Now let me explain why it behaves like this. This is the state diagram of a TCP session.
https://dl.dropbox.com/u/17011409/TCP_State.png
You can see clearly that until you explicitly call close(), the connection stays in the ESTABLISHED state. That's the expected behavior of TCP. Establishing a TCP connection is expensive, and keeping a session alive is good for performance. (That's partially how TCP DoS attacks work: attackers keep establishing connections until the server runs out of resources for TCP state information.)
In this state, your send() is delegated to the kernel for actual transmission. TCP guarantees in-order, reliable delivery, but the network can lose packets at any time, so TCP has to buffer your packets and keep retrying. There are algorithms to throttle these retries, and data stays buffered for quite a long time before TCP declares failure: the initial retransmission timeout is on the order of a few seconds on Linux, and TCP retries again after progressively longer intervals. From TCP's point of view, your unplugged cable is just packet loss somewhere along the way to the destination. Once you plug the cable back in, a retransmission succeeds, and TCP sends the remaining messages in order.
I know I must have failed to explain it thoroughly. You really need to know the details of TCP to reason about this behavior; it follows from the properties TCP gives you, and exposing these internals to the programmer would not be acceptable. (Imagine a send call that sometimes returns within milliseconds and sometimes after 10 seconds; nobody would want that performance bomb in their code. The point of a TCP stack is exactly to hide this ugly nature of networks.) In fact, you need to understand multiple RFCs and the algorithms by which TCP achieves in-order, reliable delivery over a lossy network; congestion control also affects how long the data stays buffered. Wikipedia is a good starting point, but it's a full semester's undergraduate course if you really want to understand the details.
With a zero flags argument, send() is equivalent to write(2): it writes your data to the file descriptor (i.e. stores it in kernel space for later delivery).
You would have to use other flags; MSG_CONFIRM may help you.
