Too many persistent TCP connections - linux

We have around 500 clients connected to a Linux RedHat ES 5 server.
Recently it occurs, that the server still holds connections to clients which have been rebootet without stopping the application, which communicates with the server, before.
A netstat on the client always returns only one established connection to the server. After a client reboot, communication runs over a new established connection. On server side sometimes the old connection is closed, sometimes it stays in state established so that we have a growing number of established connections to each client.
Because various client operating systems are affected, I think that this isn't an application issue, but one of the Linux OS of the server.
I tried to tune the values of
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
without success.
Also I tried to set the maximum file handles value from 1024 to 2048, but connections still never get closed, not even after the TCP keepalive time expires.
Does somebody have an idea what could cause that strange behaviour?

Those settings allow you to configure the default keep-alive behavior (when keep-alives are enabled). However, they do not make keep-alives automatic. The feature must still be explicitly enabled on a per-socket basis via the SO_KEEPALIVE socket option.
See http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/ for details. From section 3:
Remember that keepalive support, even if configured in the kernel, is not the default behavior in Linux. Programs must request keepalive control for their sockets using the setsockopt interface.

Related

Websocket (node.js) connection limit, clients are getting disconnected after reaching 400-450 connections

I have a big problem with socket.io connection limit. If the number of connections is more than 400-450 connected clients (by browsers) users are getting disconnected. I increased soft and hard limits for tcp but it didn't help me.
The problem is only for browsers. When I tried to connect by socket-io-client module from other node.js server I reached 5000 connected clients.
Its very big problem for me and totally blocked me. Please help.
Update
I have tried with standard Websocket library (ws module with node.js) and problem was similar. I can reach only 456 connected clients.
Update 2
I devided connected clients between a few instances of server. Every group of clients were connecting by other port. Unfortunately this change didn't help me. Sum of connected users was the same like before.
Solved (2018)
There were not enough open ports for a Linux user which run the pm2 manager ("pm2" or "pm" username).
You may be hitting a limit in your operating system. There are security limits in the number of concurrent files open, take a look at this thread.
https://github.com/socketio/socket.io/issues/1393
Update:
I wanted to expand this answer because I was answering from mobile before. Each new connection that gets established is going to open a new file descriptor under your node process. Of course, each connection is going to use some portion of RAM. You would most likely run into the FD limit first before running out of RAM (but that depends on your server).
Check your FD limits: https://rtcamp.com/tutorials/linux/increase-open-files-limit/
And lastly, I suspect your single client concurrency was not using the correct flags to force new connections. If you want to test concurrent connections from one client, you need to set a flag on the webserver:
var socket = io.connect('http://localhost:3000', {'force new connection': true});

Permanent TCP connection for administration use

I am facing the following situation:
I have several devices (embedded devices running ARCH Linux) and i would like to have administration access to each device at any time. The problem is the devices are behind a NAT, so establishing a connection from a server to a device is not possible. How could i achieve this?
I thought i could write a simple service running on the device that opens a connection to a server at startup. This TCP connection remains open and can be used from the server to administrate the device. But is it a good idea to keep TCP connections open for a long time? If i have a lot of devices, for example 1000, will i have a problem on the server side with 1000 open TCP connections?
Is there maybe another way?
Thanks a lot!
But is it a good idea to keep TCP connections open for a long time?
It's not necessarily a bad idea; although in practice the connections will fail from time to time (e.g. due to network reconfiguration, temporary network outages, etc), so your clients should contain logic to reconnect automatically when this happens. Also note that TCP will not usually not detect it when a completely-idle TCP connection no longer has connectivity, so to avoid "zombie connections" that aren't actually connected, you may want to either enable SO_KEEPALIVE, or have your clients and/or server send the (very occasional) bit of dummy data on the socket just to goose the TCP stack into checking whether connectivity still exists on the socket.
If i have a lot of devices, for example 1000, will i have a problem on the server side with 1000 open TCP connections?
Scaling is definitely an issue you'll need to think about. For example, select() is typically implemented to only handle up to a fixed number of connections (often 1024), or if your server is using the thread-per-connection model, you'd find that a process with 1000+ threads is not very efficient. Check out the c10k problem article for lots of interesting details about various approaches and how well they scale up (or don't).
Is there maybe another way?
If you don't need immediate access to the clients, you could always have them check in periodically instead (e.g. once every 5 minutes); or you could have them occasionally send a UDP packet to the server instead of keeping a TCP connection all the time, just to let the server know their presence, and have the server indicate to them somehow (e.g. by updating a well-known web page that the clients check from time to time) when it wanted one of them to open a full TCP connection. Or maybe just use multiple servers to share the load...
The only limit I know of is imposed by state tracking in the iptables code. Check the value of net.ipv4.netfilter.ip_conntrack_max on both sides if you're using this to make sure you have enough headroom for other activities.
If you set the socket option SO_KEEPALIVE before the connect() call, the kernel will send TCP keepalives to make sure the far end is still there. This will mean that connections won't linger forever in the event of a reboot.

linux refuse to open listening port from localhost

I have problem to open a listening port from localhost in a heavy loaded production system.
Sometimes some request to my port 44000 failed. During that time , I checked the telnet to the port with no response, I'm wonder to know the underneath operations takes there. Is the application that is listening to the port is failing to response to the request or it is some problem in kernel side or number of open files.
I would be thankful if someone could explain the underneath operation to opening a socket.
Let me clarify more. I have a java process which accept state full connection from 12 different server.requests are statefull SOAP message . this service is running for one year without this problem. Recently we are facing a problem that sometimes connection from source is not possible to my server in port 44000. As I checked During that time telnet to the service is not possible even from local server. But all other ports are responding good. they all are running with same user and number of allowed open files are much more bigger than this all (lsof | wc -l )
As I understood there is a mechanism in application that limits the number of connection from source to 450 concurrent session, And the problem will likely takes when I'm facing with maximum number of connection (but not all the time)
My application vendor doesn't accept that this problem is from his side and points to os / network / hardware configuration. To be honest I restarted the network service and the problem solved immediately for this special port. Any idea please???
Here's a quick overview of the steps needed to set up a server-side TCP socket in Linux:
socket() creates a new socket and allocates system resources to it (*)
bind() associates a socket with an address
listen() causes a bound socket to enter a listening state
accept() accepts a received incoming attempt, and creates a new socket for this connection. (*)
(It's explained quite clearly and in more detail on wikipedia).
(*): These operations allocate an entry in the file descriptor table and will fail if it's full. However, most applications fork and there shouldn't be issues unless the number of concurrent connections you are handling is in the thousands (see, the C10K problem).
If a call fails for this or any other reason, errno will be set to report the error condition (e.g., to EMFILE if the descriptor table is full). Most applications will report the error somewhere.
Back to your application, there are multiple reasons that could explain why it isn't responding. Without providing more information about what kind of service you are trying to set up, we can only guess. Try testing if you can telnet consistently, and see if the server is overburdened.
Cheers!
Your description leaves room for interpretation, but as we talked above, maybe your problem is that your terminated application is trying to re-use the same socket port, but it is still in TIME_WAIT state.
You can set your socket options to reuse the same address (and port) by this way:
int srv_sock;
int i = 1;
srv_sock = socket(AF_INET, SOCK_STREAM, 0);
setsockopt(srv_sock, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(i));
Basically, you are telling the OS that the same socket address & port combination can be re-used, without waiting the MSL (Maximum Segment Life) timeout. This timeout can be several minutes.
This does not permit to re-use the socket when it is still in use, it only applies to the TIME_WAIT state. Apparently there is some minor possibility of data coming from previous transactions, though. But, you can (and should anyway) program your application protocol to take care of unintelligible data.
More information for example here: http://www.unixguide.net/network/socketfaq/4.5.shtml
Start TCP server with sudo will solve or, in case, edit firewalls rules (if you are connecting in LAN).
Try to scan ports with nmap (like with TCP Sync Handshake), or similar, to see if the port is opened to any protocol (maybe network security trunkates pings ecc.. to don't show hosts up). If the port isn't responsive, check privileges used by the program, check firewalls rules maybe the port is on but you can't get to it.
Mh I mean.. you are talking about enterprise network so I'm supposing you are on a LAN environment so you are just trying to localhost but you need it to work on LAN.
Anyway if you just need to open localhost port check privileges and routing, try to "tracert" and see what happens and so on...
Oh and check if port is used by a higher privilege service or deamon.
Anyway I see now that this is a 2014 post, np gg nice coding byebye

TCP Servers: Drop Connection, instead of resetting or responding?

Is it possible in Node.JS to "drop" a connection in such a way that
The client never receives a response (200, 404 or otherwise)
The client is never notified that the connection is terminated (never receives connection reset or end of stream)
The server's resources are released (the server should not attempt to maintain the connection in any way)
I am specifically asking about Node.JS HTTP Servers (which are really just complex TCP servers) on Solaris., but if there are cases on other OSes (Windows, Linux) or programming languages (C/C++, Java) that permit this, I am interested.
Why do I want this?
To annoy or slow down (possibly single-threaded) robots such as phpMyAdmin Probe.
I know this is not really something that matters, but these types of questions can better help me learn the boundaries of my programs.
I am aware that the client host is likely to re-transmit the packets of the connection since I am never sending reset.
These are not possible in a generic TCP stack (vs. your own custom TCP stack). The reasons are:
Closing a socket sends a RST
Even if you avoid sending a RST, the client continues to think the connection is open while the server has closed the connection. If the client sends any packet on this connection, the server is going to send a RST.
You may want to explore firewalling these robots and block / rate limit their IP addresses with something like iptables (linux) or the equivalent on solaris.
closing a connection should NOT send an RST. There is a 3 way tear down process.

tcp_tw_reuse vs tcp_tw_recycle : Which to use (or both)?

I have a website and application which use a significant number of connections. It normally has about 3,000 connections statically open, and can receive anywhere from 5,000 to 50,000 connection attempts in a few seconds time frame.
I have had the problem of running out of local ports to open new connections due to TIME_WAIT status sockets. Even with tcp_fin_timeout set to a low value (1-5), this seemed to just be causing too much overhead/slowdown, and it would still occasionally be unable to open a new socket.
I've looked at tcp_tw_reuse and tcp_tw_recycle, but I am not sure which of these would be the preferred choice, or if using both of them is an option.
According to Linux documentation, you should use the TCP_TW_REUSE flag to allow reusing sockets in TIME_WAIT state for new connections.
It seems to be a good option when dealing with a web server that have to handle many short TCP connections left in a TIME_WAIT state.
As described here, The TCP_TW_RECYCLE could cause some problems when using load balancers...
EDIT (to add some warnings ;) ):
as mentionned in comment by #raittes, the "problems when using load balancers" is about public-facing servers. When recycle is enabled, the server can't distinguish new incoming connections from different clients behind the same NAT device.
NOTE: net.ipv4.tcp_tw_recycle has been removed from Linux in 4.12 (4396e46187ca tcp: remove tcp_tw_recycle).
SOURCE: https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux
pevik mentioned an interesting blog post going the extra mile in describing all available options at the time.
Modifying kernel options must be seen as a last-resort option, and shall generally be avoided unless you know what you are doing... if that were the case you would not be asking for help over here. Hence, I would advise against doing that.
The most suitable piece of advice I can provide is pointing out the part describing what a network connection is: quadruplets (client address, client port, server address, server port).
If you can make the available ports pool bigger, you will be able to accept more concurrent connections:
Client address & client ports you cannot multiply (out of your control)
Server ports: you can only change by tweaking a kernel parameter: less critical than changing TCP buckets or reuse, if you know how much ports you need to leave available for other processes on your system
Server addresses: adding addresses to your host and balancing traffic on them:
behind L4 systems already sized for your load or directly
resolving your domain name to multiple IP addresses (and hoping the load will be shared across addresses through DNS for instance)
According to the VMWare document, the main difference is TCP_TW_REUSE works only on outbound communications.
TCP_TW_REUSE uses server-side time-stamps to allow the server to use a time-wait socket port number for outbound communications once the time-stamp is larger than the last received packet. The use of these time-stamps allows duplicate packets or delayed packets from the old connection to be discarded safely.
TCP_TW_RECYCLE uses the same server-side time-stamps, however it affects both inbound and outbound connections. This is useful when the server is the first party to initiate connection closure. This allows a new client inbound connection from the source IP to the server. Due to this difference, it causes issues where client devices are behind NAT devices, as multiple devices attempting to contact the server may be unable to establish a connection until the Time-Wait state has aged out in its entirety.

Resources