Node stalls with ESTABLISHED TCP connections - node.js

I have a CentOS 7 server running several Node-scripts at specific times with crontab.
The scripts are supposed to make a few web requests before exiting, which works fine every time on my local machine (running Mac OS X).
However, on the server the script sometimes seems to stall around a web request and nothing more happens, leaving the process hanging and taking up memory on the server. Since the script works on my machine, I'm guessing there is some issue on the server. I looked at netstat -tnp and found that the stalled PIDs have left connections open in the ESTABLISHED state, without sending or receiving any data. The connections are left like this:
tcp 0 0 x.x.x.x:39448 x.x.x.x:443 ESTABLISHED 17143/node
It happens on different ports, with different PIDs, in different scripts, and to different IP addresses.
My guess is that the script stalls because Node is waiting for some I/O operation (the request) to finish, but I can't find any reason why this would happen. Has anyone else had issues with Node leaving connections open at random?

This problem was apparently not related to any OS or Node setting. Our server provider had made a change to their network which caused massive packet loss between the router and the server. They reverted the change for us and now it's working again.
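Regardless of the root cause, a request-level timeout would have let the cron scripts fail fast instead of stalling indefinitely. Here is a minimal sketch using Node's built-in https module; the URL and the 30-second value are placeholders, not taken from the original scripts:

const https = require('https');

// Hypothetical endpoint; substitute whatever the script actually requests.
const req = https.get('https://example.com/api', (res) => {
  res.resume(); // drain the body so the socket can be released
  res.on('end', () => console.log('request finished:', res.statusCode));
});

// Abort if the socket goes idle for 30 seconds; without some timeout, a
// connection that stops receiving packets can sit in ESTABLISHED forever.
req.setTimeout(30000, () => {
  req.destroy(new Error('request timed out'));
});

req.on('error', (err) => {
  console.error('request failed:', err.message);
  process.exitCode = 1; // let the cron job exit instead of lingering
});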

Related

node.js, TCP sockets, and docker - why connection reset by peer with more than 32 concurrent connections?

I am completely new to Docker as of yesterday. I have this node.js server that I am running that simply creates a TCP server and then handles messages coming in from a client. There is a test harness that was written by someone else that I have been running that allows for as many concurrent connections as you want.
When I run it locally, I can have 100+ concurrent connections no problem. I do sometimes see a problem locally when I go to 200+ concurrent connections: it actually hits the socket.on('error') case and gives me an error. If the server has just started up, though, a run of 200 concurrent connections usually completes successfully, but subsequent tries will error out.
However, when I run the harness against the server running in Docker, as soon as I try 33 concurrent connections, it does not work: I get a "connection reset by peer" error from the test harness. If I try 32 connections, it works fine. With 33 or more I also do not see any error on the server, as I did when running locally. If I try 32 concurrent connections again after it fails with 33+, it succeeds just fine.
Does anyone know why this could be happening? Is there some sort of setting within Docker I have to change to allow more than 32 concurrent connections? I find it interesting that it's a hard limit: it always works with 32 and never works with 33.
FWIW, in my node.js server, I have it listening on port 8080 and host 0.0.0.0. Initially I had it on 127.0.0.1 but while that worked locally, it would not work in Docker. To run it, I'm using the command docker run -p 8080:8080 app.
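For reference, a minimal sketch of that kind of listener (hypothetical; the actual server code isn't shown in the post), including the per-socket 'error' handler mentioned above and an explicit backlog:

const net = require('net');

const server = net.createServer((socket) => {
  socket.on('data', (chunk) => {
    // ... handle a message from the client ...
  });
  // The socket.on('error') case mentioned above; without a handler,
  // an ECONNRESET would crash the whole process.
  socket.on('error', (err) => {
    console.error('socket error:', err.code);
  });
});

// 0.0.0.0 so the port is reachable from outside the container; the third
// argument is the accept backlog (511 is Node's default if omitted).
server.listen(8080, '0.0.0.0', 511, () => {
  console.log('listening on 8080');
});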
I am running Docker on my MacBook Pro running the latest OS X.
Any ideas?

TCP route was working, now is not

Running CentOS 6.7 on several VMs. On Friday I ran a test that sent data from machine 1 to machine 2, and the test was successful. Today (no changes on either end that anybody will admit to) I can't repeat the test: the Python script that sends the data gives me a traceback that says 'no route to host'. I can ping both ways (1 to 2, 2 to 1) and ssh both ways; nmap from 1 to 2 doesn't show the port I sent to on Friday, while nmap on 2 against 127.0.0.1 shows the port is open. Of the 12 machines on this network, none can send data to this machine on this port, where all could before. There are no changes in the sending script and no changes in the VM network (that I am aware of); as I said, theoretically no changes anywhere, but obviously something changed. If it matters, machine 2 is running Splunk 6.3.1 with a modular input that receives the data over port 2008, and it worked great until today. I have disabled and re-enabled the modular input and restarted Splunk.
Not really sure where to look next.

socket.io max connection test on multicore machine

To answer my own question: it was a client issue, not a server one. For some unknown reason, my Mac OS X machine could not make more than ~7,800 connections. Using an Ubuntu machine as the client solved the problem.
[Question]
I'm trying to estimate the maximum number of connections my server can keep. So I wrote a simple socket.io server and client test code. You can see it here: gist
The gist above does a very simple job. The server accepts all incoming socket requests and periodically prints out the number of established connections along with CPU and memory usage. The client tries to open a given number of connections to the socket.io server and does nothing but keep the connections alive.
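For illustration, a minimal sketch of what the server side of such a test might look like on the socket.io 0.9 API listed below (this is not the actual gist, which isn't reproduced here; the port is a placeholder):

// server.js (sketch): count established connections, report usage
var io = require('socket.io').listen(9000);
var count = 0;

io.sockets.on('connection', function (socket) {
  count++;
  socket.on('disconnect', function () { count--; });
});

// Periodically print connection count and memory use (the CPU reporting
// from the original gist is omitted in this sketch).
setInterval(function () {
  console.log(count + ' connections, rss ' + process.memoryUsage().rss);
}, 5000);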
When I ran this test with one server (an Ubuntu machine) and one client (my Mac OS X machine), roughly 7,800 connections were successfully made before it started to drop connections. So next I ran more server processes on different CPU cores and ran the test again. What I expected was that more connections could be made in total, because the major bottleneck would be CPU power. But instead what I saw was that no matter how many cores I utilized, the total number of connections the server could keep was around 7,800. It's hard to understand why my server behaves like this. Can anyone give me the reason behind this behavior, or point out what I am missing?
Number of connections made before dropping any connection.
1 server : 7800
3 servers : 2549, 2299, 2979 (each)
4 servers : 1904, 1913, 1969, 1949 (each)
Server-side command
taskset -c [cpu_num] node --stack-size=99999 server.js -p [port_num]
Client-side command
node client.js -h http://serveraddress:port -b 10 -n 500
b=10, n=500 means that the client waits until a batch of 10 connections is established before opening another 10, until 10*500 connections are made.
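For illustration, the batching behaviour those flags describe might look roughly like this on the socket.io-client 0.9 API (a sketch, not the actual gist; the URL placeholder mirrors the command above):

// client.js (sketch): open connections in batches of `batch` until
// batch * count connections have been attempted, then just hold them.
var io = require('socket.io-client');

var batch = 10;   // the -b flag
var count = 500;  // the -n flag
var made = 0;

function openBatch() {
  var pending = batch;
  for (var i = 0; i < batch; i++) {
    var socket = io.connect('http://serveraddress:port', {
      'force new connection': true // one real connection per socket object
    });
    socket.on('connect', function () {
      made++;
      if (--pending === 0 && made < batch * count) openBatch();
    });
  }
}

openBatch();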
Package versions
socket.io, socket.io-client : 0.9.16
express : 3.4.8
CPU is rarely the bottleneck in these types of situations. It is more likely the maximum number of TCP connections allowed by the operating system, or a RAM limitation.

Linux refuses to open listening port from localhost

I have a problem reaching a listening port from localhost on a heavily loaded production system.
Sometimes requests to my port 44000 fail. During that time I checked with telnet to the port and got no response, and I would like to understand the operations underneath. Is the application listening on the port failing to respond to the request, or is it some problem on the kernel side, or with the number of open files?
I would be thankful if someone could explain the operations underneath opening a socket.
Let me clarify more. I have a Java process which accepts stateful connections from 12 different servers; the requests are stateful SOAP messages. This service has been running for a year without this problem. Recently we have been facing a problem where a connection from a source to my server on port 44000 is sometimes not possible. As I checked, during that time telnet to the service is not possible even from the local server, but all other ports respond fine. They all run under the same user, and the number of allowed open files is much bigger than the current count (lsof | wc -l).
As I understand it, there is a mechanism in the application that limits the number of connections from a source to 450 concurrent sessions, and the problem most likely occurs when I'm at the maximum number of connections (but not all the time).
My application vendor doesn't accept that this problem is on their side and points to the OS / network / hardware configuration. To be honest, I restarted the network service and the problem was solved immediately for this particular port. Any ideas, please?
Here's a quick overview of the steps needed to set up a server-side TCP socket in Linux:
socket() creates a new socket and allocates system resources to it (*)
bind() associates a socket with an address
listen() causes a bound socket to enter a listening state
accept() accepts an incoming connection attempt and creates a new socket for that connection (*)
(It's explained quite clearly and in more detail on Wikipedia.)
(*): These operations allocate an entry in the file descriptor table and will fail if it's full. However, most applications fork, and there shouldn't be issues unless the number of concurrent connections you are handling is in the thousands (see the C10K problem).
If a call fails for this or any other reason, errno will be set to report the error condition (e.g., to EMFILE if the descriptor table is full). Most applications will report the error somewhere.
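In Node terms (a sketch only; the post says the service is a Java process, so this is just to make the steps concrete), net.createServer() plus server.listen() wrap the socket()/bind()/listen() steps, and each 'connection' event corresponds to an accept():

const net = require('net');

const server = net.createServer((conn) => {
  // Each 'connection' event is the result of an accept(): a new socket,
  // and therefore a new file descriptor table entry, per client.
  conn.on('error', (err) => console.error('client error:', err.code));
});

// listen() performs the socket()/bind()/listen() steps in one call.
// The port is taken from the question; the address is a placeholder.
server.listen(44000, '0.0.0.0');

// Failures such as EMFILE (descriptor table full) surface here.
server.on('error', (err) => console.error('server error:', err.code));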
Back to your application: there are multiple reasons that could explain why it isn't responding. Without more information about what kind of service you are trying to set up, we can only guess. Try testing whether you can telnet consistently, and see if the server is overburdened.
Cheers!
Your description leaves room for interpretation, but as discussed above, maybe your problem is that your terminated application is trying to re-use the same socket port while it is still in the TIME_WAIT state.
You can set your socket options to reuse the same address (and port) like this:
#include <sys/socket.h>

int srv_sock;
int i = 1;

/* Create the listening socket. */
srv_sock = socket(AF_INET, SOCK_STREAM, 0);

/* Allow rebinding to an address/port that is still in TIME_WAIT. */
setsockopt(srv_sock, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(i));
Basically, you are telling the OS that the same socket address and port combination can be re-used without waiting out the MSL (Maximum Segment Lifetime) timeout. This timeout can be several minutes.
This does not permit re-using the socket while it is still in use; it only applies to the TIME_WAIT state. Apparently there is some minor possibility of data arriving from previous transactions, though, but you can (and should anyway) design your application protocol to cope with unintelligible data.
More information for example here: http://www.unixguide.net/network/socketfaq/4.5.shtml
Starting the TCP server with sudo may solve it; alternatively, edit the firewall rules (if you are connecting over a LAN).
Try scanning the ports with nmap (e.g. a TCP SYN scan) or similar, to see if the port is open to any protocol (network security may drop pings etc. so hosts don't show up). If the port isn't responsive, check the privileges used by the program and check the firewall rules; maybe the port is up but you can't get to it.
I mean, you are talking about an enterprise network, so I'm assuming you are in a LAN environment: you are testing against localhost, but you need it to work across the LAN.
Anyway, if you just need to open a localhost port, check privileges and routing; try a traceroute and see what happens, and so on.
Also check whether the port is used by a higher-privileged service or daemon.
I see now that this is a 2014 post, so this probably arrives too late. Good luck, and happy coding.

Trouble with state FIN_WAIT_1

Recently I've had some ports stuck in the FIN_WAIT_1 state for as long as two days. The target port is only ever used by one server process, and clients connect to the server process through this port.
The situation is that we stopped the server process, and obviously some clients were still connected to the server at that moment. From my understanding, the server process sends a FIN packet to the client and waits for the ACK packet back. Unfortunately, that ACK packet seems not to have reached the server side until two days later.
My question is whether there is any configuration, like a timeout, for the FIN_WAIT_1 state. I walked through the internet searching but found nothing. Please tell me if you have any experience with this.
BTW, the server process was already gone while FIN_WAIT_1 was occupying the port.
Thanks in advance
The FIN_WAIT_1 state is waiting for the peer to ACK the FIN that this end has just sent. That outgoing FIN is subject to all the normal TCP retry and timeout processing, so if the other end has completely disappeared and never responds, TCP should time out the connection and reset it as a matter of course. That's why you couldn't find a specific FIN_WAIT_1 timeout: there isn't one, just the normal TCP write timers (on Linux, the retransmission count for connections the process has already closed is bounded by the net.ipv4.tcp_orphan_retries sysctl).
All that should have happened within ten minutes or so.
If the state persists and it causes other problems, I don't think you have much option but to reboot.
Are you sure it's the same ports stuck in FIN_WAIT? It could be a load balancer or NAT device that drops connections after an inactivity timeout and silently discards any further packets, which is the default behavior on some devices.
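If a middlebox with an idle timeout does turn out to be involved, enabling TCP keepalives keeps long-lived connections from going silent. A sketch in Node (purely illustrative; the thread doesn't say what the client is written in, and the host and port are placeholders):

const net = require('net');

const socket = net.connect(4000, 'server.example');

socket.on('connect', () => {
  // Send TCP keepalive probes after 60 seconds of inactivity, so NAT and
  // load-balancer idle timers keep seeing traffic on the connection.
  socket.setKeepAlive(true, 60000);
});

socket.on('error', (err) => console.error('connection error:', err.message));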
