How to kill "CLOSE_WAIT" and "FIN_WAIT2" network connections - linux

I created a game using node.js and socket.io. Everything works well, but from time to time the game's socket server stops responding to any connections. When I go to Process information -> Files and connections (in webmin), I see that there are many connections in the CLOSE_WAIT and FIN_WAIT2 states. I think the problem is in these connections, because the game fails when there are about 1,000 connections. The server OS is Ubuntu Linux 12.04.
How can I kill these connections or increase maximum allowed connections?

To add to Jim's answer, I think the problem is in how your client closes its socket connections. It seems your client is not closing the sockets properly (for both server-initiated and client-initiated closes), and that is why your server has so many connections in wait states.

You don't need to kill connections or increase the number allowed. You need to fix a defect in the application on one side of the connection, specifically, the side which does not initiate the close.
See Figure 13 of RFC 793. Your programs are at step 3 of the close sequence. The side which you see in FIN-WAIT-2 is behaving correctly. It has initiated the close and the TCP stack has sent a FIN packet on the network. The side in CLOSE-WAIT has the defect. The TCP stack on that side has received and acknowledged the FIN packet, but the application has failed to notice. How the application is expected to detect that the remote side has closed the connection will depend on your platform. Unfortunately, I am old, and don't know node.js or socket.io.
What happens in C is that the socket appears readable, but read() returns zero bytes (end of file). When the application sees this, it is expected to call close(). You will find something equivalent in the docs for node.js or socket.io.
When you find it, consider answering your own question here and accepting the answer.
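As a rough illustration of the rule above, here is a minimal sketch in Python (used only for illustration; it is not the asker's toolset, and the names serve_one and demo are made up for this example). In Python, recv() returning an empty bytes object plays the role of read() returning 0 in C.

```python
import socket
import threading


def serve_one(conn: socket.socket) -> int:
    """Read until the peer closes; return the number of bytes received."""
    total = 0
    with conn:  # leaving this block calls close(), so the socket leaves CLOSE_WAIT
        while True:
            data = conn.recv(4096)
            if not data:  # b"" means the peer sent FIN (end of stream)
                break
            total += len(data)
    return total


def demo() -> int:
    """Run one client against one server on the loopback interface."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: any free port
    port = listener.getsockname()[1]
    result = []

    def accept_and_serve() -> None:
        conn, _ = listener.accept()
        result.append(serve_one(conn))

    worker = threading.Thread(target=accept_and_serve)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello")
    # Leaving the with-block closed the client, which sent FIN to the server.
    worker.join(timeout=5)
    listener.close()
    return result[0]
```

If the loop ignored the empty read and never reached close(), the server-side socket would stay in CLOSE_WAIT for as long as the process kept the descriptor open.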

Linux has the SO_REUSEADDR socket option. It allows immediate reuse of the same port. Someone who knows your toolset can tell you how to set socket options. You may already know how. I do not know this toolset.
From an older Java docset:
http://docs.oracle.com/javase/1.5.0/docs/guide/net/socketOpt.html
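For reference, setting the option looks roughly like this in Python (a sketch only; make_listener is a name invented for this example, and the equivalent call in node.js or Java will differ). Note that SO_REUSEADDR affects whether bind() succeeds while old connections on the port are in TIME_WAIT; it does not clear sockets stuck in CLOSE_WAIT.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind(); setting it afterwards is too late.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port 0 asks the kernel for any free port
    sock.listen()
    return sock
```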

Related

Does X11 have a lifesign or constant stream?

I have a fault tolerant application, where an X Server requests to start an Application on a remote client (by some other mechanism) and receive and display its X-window. Fault tolerance means that the server needs to detect loss of the connection to the client and then call a different back-up-client and start the application there and show the window.
My question is whether there exists a mechanism in the X11 protocol that allows to reliably detect in an X11-Server whether the connection has been broken or not.
Experiments show that when unplugging a cable connection it needs some TCP-Timeout to detect the connection loss on socket level. This is very OS-dependent. In our case it was about 30 minutes after which the X-Server eventually closed the window.
So another assumption could be that the X11-stream constantly delivers some commands and the server could implement some logic like this: If the X11-stream does not deliver any X11 traffic for a timeout y (e.g. 3 seconds), we assume the connection is lost and actively close the window and establish the connection to the fall-back-client.
Is the assumption true? I did not see any such statement in the X11-protocol about how to detect connection loss. Is there any explicit lifesign that is regularly transmitted? Or is the assumption valid that there is constant traffic? Or could there be longer periods of inactivity where nothing is transmitted at all while the connection is perfectly up and running?
There is a NoOperation command from the client that could be used for such purpose. But do clients usually implement something like that as a lifesign?
I have a fault tolerant application, where an X Server needs to start an Application...
I don't think that an X server can "start an application". Some setups may allow something similar, but normally that is not the case.
...whether there exists a mechanism in the X11 protocol that allows to reliably detect in an X11-Server whether the connection has been broken or not.
No, it does not exist. The X11 protocol is based on TCP/IP, which does not directly provide such a "heartbeat". I think the assumption is that, if you click or otherwise stimulate an X11 window, the TCP layer will time out or raise another error if the client application is gone.
I did not see any statement in the X11-protocol about how to detect connection loss.
There is a NoOperation command from the client that could be used for such purpose. But do clients usually implement something like that as a lifesign?
Maybe some applications use NoOperation, but the purpose would be different from what you need. From an application's point of view, the X11 server is like an extension: the application may be interested in knowing whether the server is up and working, but the reverse is not true. And anyway, even if the server could detect that the application is gone, there is probably no way to tell the server to launch another application.
A special proxy could probably be deployed; it could launch the application, monitor the connection (in both directions) and take the required steps if the application goes away. But then again, who would monitor the proxy application?
First of all, X Protocol relies completely on TCP to send/receive information.
You cannot safely wrap a transaction in a timeout in order to detect a TCP failure. TCP retransmits only those segments that have already been sent but not acknowledged. The X protocol is completely asynchronous, in the sense that you send a request and can receive many replies or events unrelated to it before you receive its reply. There is no heartbeat mechanism in the X protocol. The closest thing is a round-trip request sent to synchronize with the server (NoOperation itself generates no reply), but you cannot overuse it, as that severely slows down the X connection; just launch any client with the -synchronous option to see the effect (see X(7)).
A TCP connection can stay alive for years without exchanging a single packet. The SO_KEEPALIVE option makes TCP send such a heartbeat on a connection that has no data to transmit, but the X11 protocol normally doesn't make use of it.
You don't post any code, nor a description of how the system is configured. The standard X server never starts a connection by itself, except when launched specifically to negotiate with an XDMCP server (and this is done over UDP) to serve as an X terminal.
From your words, you probably don't know that the roles of server and client are exchanged in the X protocol: the client is the remote application that connects to the server to display its output, and the server is the application that controls your display, mouse and keyboard. There is no means for the server to create a new client, so you must be creating this connection by other means (probably through SSH, but you don't describe it).
By the way, when you say:
Experiments show that when unplugging a cable connection it needs some TCP-Timeout to detect the connection loss on socket level. This is very OS-dependent. In our case it was about 30 minutes after which the X-Server eventually closed the window.
That is not OS-dependent. It is precisely the standard behaviour: when you have no traffic to send, no packets are exchanged, so nothing is detected. The exception is when your client (remember, this is the remote application program that wants to show its data on your local server) activates the SO_KEEPALIVE option, and even then several probes must be lost before the connection is declared lost. In your case the delay varies because the timers don't start until some data is sent over the unplugged connection; that is what makes it variable, not the OS.
On the other hand, you cannot expect the server to turn on your monitor if you leave the office and turn it off by mistake or by accident. What is the fault tolerance specification in that case?
IMHO, as far as the presentation protocol is concerned, the application should be ready to show you as much information about the system as possible as soon as you activate the connection (but the connection must be something that is allowed to fail). What matters is the means you develop for the application to be fault tolerant, even when you are not there to see the display. Will somebody be notified that no one is looking at the screen? Are you going to detect the absence of operators in that case? Don't take this as a flame, but common sense should prevail here.
If you need to ensure that connectivity to the remote host is available, you need to use another means to check for it. I recommend having a simple application ping the remote host and raise an alert when it doesn't get a positive result. Or you can open a connection to the server and then close it as soon as you get a positive response from the server (the first packet, for example). This leads us to the next step, which is to ensure that some human is looking at the (turned on) screen of the display :)
For example, you can run a client in parallel to the one you are interested in, and force a heartbeat by asking for some server atom name (or a root window property value) in a loop with some delay. This will make the connection fail or your client can alert in case it doesn't receive the answer in some configurable time.
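The "open a connection and close it as soon as the server answers" check suggested above could be sketched like this in Python (an illustration only; host_reachable is a made-up name, and the 3-second default simply reuses the example timeout from the question). It tests that the host accepts TCP connections on a port, not that the X application itself is healthy.

```python
import socket


def host_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # The with-block closes the probe connection immediately after connecting.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, name lookup failure, ...
        return False
```

A monitoring loop would call this every few seconds and switch to the backup client after one or more consecutive failures; how many failures to tolerate is a design choice the source does not specify.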

Is SO_REUSEADDR socket options useful on the client-side?

I came across the sentence in one java client library:
socket.setReuseAddress(true);
I thought this was used to improve performance, since the SO_REUSEADDR option indicates that a socket can forcibly use a port in TIME_WAIT even if it belongs to another socket.
But I also found that this option is mostly used on the server side, to let the server restart quickly without waiting for TIME_WAIT sockets to close.
My question is: is this option useful on the client side, as in this client library? Could it harm the other socket, for example by enabling some kind of attack?
Thanks a lot!
-Dimi
It depends on what you mean by "client". You also mention "client library", which has nothing to do with it.
This is often misunderstood. SO_REUSEADDR exists so that a local address still held by a socket in TIME_WAIT can be reused, and TIME_WAIT only happens on one side of the TCP connection: the side that initiates the termination sequence, i.e. sends the first FIN packet, i.e. calls shutdown(SHUT_WR) first or calls close first. The latter case is less clear-cut and may depend on other things such as connection state or platform, which is why you should not call close without first calling shutdown(SHUT_WR). This article is very informative, as are the two referenced at the end of it. It makes clear that TIME_WAIT may occur on the listening (server) side as well as on the client side, and it recommends having clients always initiate termination ("active close") so that the server, where it would be more of a problem, doesn't accumulate sockets in TIME_WAIT.
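The shutdown-before-close order described above might look like this in Python (a sketch under the assumption of a plain blocking stream socket; close_gracefully and its 5-second timeout are invented for this example). On a TCP connection, the side that runs this first is the one performing the active close, so its socket is the one that passes through TIME_WAIT.

```python
import socket


def close_gracefully(sock: socket.socket, timeout: float = 5.0) -> bytes:
    """Half-close, read whatever the peer still sends, then close."""
    leftover = bytearray()
    sock.settimeout(timeout)
    try:
        sock.shutdown(socket.SHUT_WR)  # sends FIN: nothing more to send
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the peer's end of stream arrived
                break
            leftover += chunk
    except OSError:
        pass  # peer vanished or timed out; fall through to close()
    finally:
        sock.close()
    return bytes(leftover)
```

Draining until end of stream before close() avoids closing with unread data in the receive buffer, which on many stacks turns the close into a reset.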

socket.io disconnects clients when idle

I have a production app that uses socket.io (node.js back-end) to distribute messages to all the logged-in clients. Many of my users are experiencing disconnections from the socket.io server. The normal use case for a client is to keep the web app open for the entire working day. For most of the work day the app sits idle but is still open, until the socket.io connection is lost and the app kicks the user out.
Is there any way I can make the connection more reliable so my users are not constantly losing their connection to the socket.io server?
It appears that all we can do here is give you some debugging advice so that you might learn more about what is causing the problem. So, here's a list of things to look into.
Make sure that socket.io is configured for automatic reconnect. In the latest versions of socket.io, auto-reconnect defaults to on, but you may need to verify that no piece of code is turning it off.
Make sure the client is not going to sleep such that all network connections become inactive and get disconnected.
In a working client (before it has disconnected), use the Chrome debugger, Network tab, webSockets sub-tab to verify that you can see regular ping messages going between client and server. You will have to open the debug window, get to the network tab and then refresh your web page with that debug window open to start to see the network activity. You should see a funky looking URL that has ?EIO=3&transport=websocket&sid=xxxxxxxxxxxx in it. Click on that. Then click on the "Frames" sub-tab. At that point, you can watch individual websocket packets being sent. You should see tiny packets with length 1 every once in a while (these are the ping and pong keep-alive packets). There's a sample screen shot below that shows what you're looking for. If you aren't seeing these keep-alive packets, then you need to resolve why they aren't there (likely some socket.io configuration or version issue).
Since you mentioned that you can reproduce the situation, one thing you want to know is how the socket is getting closed (client-end initiated or server-end initiated). One way to gather info on this is to install a network analyzer on your client so you can literally watch every packet that goes over the network to/from your client. There are many different analyzers and many are free. I personally have used Fiddler, but I regularly hear people talking about WireShark. What you want to see is exactly what happens on the network when the client loses its connection. Does the client decide to send a close socket packet? Does the client receive a close socket packet from someone? What happens on the network at the time the connection is lost?
webSocket network view in Chrome Debugger
The most likely cause is one end closing a WebSocket due to inactivity. This is commonly done by load balancers, but there may be other culprits. The fix for this is to simply send a message every so often (I use 30 seconds, but depending on the issue you may be able to go higher) to every client. This will prevent it from appearing to be inactive and thus getting closed.

Identifying remote disconnection in socket client

How do I find out from a socket client program that the remote connection is down (e.g. the server is down)? When I do a recv and the server is down, it blocks if I do not set a timeout. However, in my case I cannot pick a reliable timeout value to get around this, because otherwise the recv times out even when the server is up but the response simply takes longer than the timeout I have set.
Unfortunately, ZeroMQ just passes this on to the next layer. So the protocol you are implementing on top of ZeroMQ will have to handle this.
Heartbeats are recommended. Basically, just have one side send a message if the connection is otherwise idle. The other side can treat the absence of such messages as a failure condition and close the connection.
You may wish to modify your higher level protocols to be more robust. For example, you can submit a command, query its status, and allow the other side to forget about the command. That way, if the connection is lost, you can reconnect and query any outstanding commands. Any command the other side doesn't know about did not get through, and you can resubmit it. Once you get a reply with the result of a command, you can tell the other side that it can now forget the response.
This allows you to keep the connection active while a long-running command is ongoing. Every so often you ask, "is everything okay". The other side responds, "yes". You can use long polling where the other side delays responding for a second or so while the command is in process. This allows it to return the results immediately rather than having to wait a second for your next query.
The specifics depend on your exact requirements, but you must design this correctly into your protocol.
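The receiving half of the heartbeat idea above can be sketched as follows in Python (an illustration only; the one-byte HEARTBEAT message and the name peer_alive are hypothetical, and a real protocol would need framing so heartbeats can be told apart from data). The sender is assumed to write HEARTBEAT whenever it has been idle for less than max_silence.

```python
import socket

HEARTBEAT = b"\x00"  # hypothetical one-byte "still alive" message


def peer_alive(sock: socket.socket, max_silence: float) -> bool:
    """Wait up to max_silence seconds for any message, heartbeat included."""
    sock.settimeout(max_silence)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        return False  # silent for too long: treat the connection as dead
    except OSError:
        return False  # reset or another socket error
    return data != b""  # b"" means the peer closed the connection
```

Because the timeout only bounds the silence between messages, not the time a full response takes, it can be short even when individual commands run for a long time.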
If the remote host goes down without sending you a TCP FIN packet, you have no way to detect that. You can test this behaviour by firewalling a port after a connection has been established on it. Your program will "hang" forever.
However, the Linux kernel supports a mechanism called TCP keepalive, which probes an idle connection and closes it once the peer has failed to answer for a given time. If you can't specify a timeout for your application, then you cannot rely on that either. Your last option might be to use features of the application protocol (can you name it?); if that protocol has no support for connection handling, you may have to invent something of your own on top of it.
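For completeness, enabling keepalive on a socket might look like this in Python (a sketch; enable_keepalive and the default values are assumptions chosen for illustration). SO_KEEPALIVE is widely available, while the three TCP_KEEP* tuning options are the Linux names and may be missing on other platforms, hence the getattr guard.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive and, where supported, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # None where the platform lacks it
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values, a silently dead peer would be noticed after roughly idle + interval * probes = 110 seconds of inactivity, at which point a blocked recv fails with an error instead of hanging.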

ZeroMQ: Check if someone is listening behind Unix domain socket

Context: Linux (Ubuntu), C, ZeroMQ
I have a server which listens on ipc:// SUB ZeroMQ socket (which physically is a Unix domain socket).
I have a client which should connect to the socket, publish its message and disconnect.
The problem: If server is killed (or otherwise dies unnaturally), socket file stays in place. If client attempts to connect to this stale socket, it blocks in zmq_term().
I need to prevent client from blocking if server is not there, but guarantee delivery if server is alive but busy.
Assume that I can not track server lifetime by some external magic (e.g. by checking a PID file).
Any hints?
Non-portable solution seems to be to read /proc/net/unix and search there for a socket name.
Without showing your code all of this is guesswork... that said...
If you have a PUB/SUB pair, the PUB will hang around to make sure that its message gets through. Perhaps you're not using the right type of zmq pair. Sounds more like you have a REP/REQ pair instead.
This way, once you connect from the client (the REQ side), you can do a zmq_poll to determine if the socket is available for writing. If yes, then go ahead with your write, otherwise shutdown the client and handle the error condition (if it is an error in your system).
Maybe you can try to connect to the socket first, with your own native socket. If the connection succeeds, there is a good chance that your publisher will work fine.
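The native-socket probe suggested above could look roughly like this (shown in Python rather than the asker's C; unix_listener_alive is a made-up name, and the sketch assumes the ipc:// endpoint is a stream-type Unix domain socket at a known filesystem path). A stale socket file left behind by a dead server refuses the connection.

```python
import socket


def unix_listener_alive(path: str, timeout: float = 1.0) -> bool:
    """Return True if some process is accepting connections on the socket file."""
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(timeout)
    try:
        probe.connect(path)
        return True
    except OSError:  # ECONNREFUSED for a stale file, ENOENT if it is missing
        return False
    finally:
        probe.close()  # the probe connection is dropped right away
```

This is only a hint, not a guarantee: the server can die between the probe and the real connect, and the server will see one short-lived extra connection for each probe.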
There is another solution. Don't use ipc:// sockets. Instead use something like tcp://127.0.0.101:10001. On most UNIXes that will be almost as fast as IPC because the OS recognizes that it is a local connection and shortcuts the full IP stack processing.

Resources