Node.js SSL server frozen, high CPU, not crashed but no connections - node.js

I hope someone can help me with this issue.
In our company we are setting up a Node.js server, connected to a Java push server.
I'm using the https module instead of http, with SSL certificates.
The connection between Node and the clients is made with socket.io, on both the server and the client.
At the same time, the Node.js server is a client of the Java server; that connection is made with regular sockets (net.connect).
The idea is that users connect to the server, join some channels, and when data arrives from the Java server, it is dispatched to the corresponding users.
Everything seems to work fine, but after a while, seemingly at random and with somewhere between 450 and 700 users connected, the server's CPU reaches 100% and all the connections break, yet the server has not crashed. The odd thing is that if you open the https://... URL in the browser, you don't get a 404 or anything like that, but an SSL connection error, and it comes back really fast.
I tried adding logs everywhere, but there is no pattern; it seems random.
If anybody has had the same problem, or can give me a clue or a tip for debugging this better, I'd appreciate anything.
Thanks a lot.

Okay, the problem is solved. It is a problem that can occur on any Linux server, so if you are running one, you should read this.
The cause was the default limit on open files that the Linux server imposes on each process.
It seems that every Linux server comes with a default limit of 1024 open files per process. You can check your limit with:
# ulimit -n
To increase this number:
# ulimit -n 5000 (for example)
Each socket counts as an open file (a file descriptor), so every connection pushes you towards that limit.
For some reason my server was not showing any error; it just froze, the log stopped, and there was no signal or evidence of anything. It was only when I set up a copy of the server on another machine that it started to print:
warn: error raised: Error: accept EMFILE
warn: error raised: Error: accept EMFILE
warn: error raised: Error: accept EMFILE
...
Be careful: if you are not root, you will only change this for the current session, not permanently.
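A way to make the limit persistent (a sketch; details vary by distribution, and "nodeuser" is a placeholder for the account that runs node) is to add these lines to /etc/security/limits.conf and log in again:
nodeuser    soft    nofile    65535
nodeuser    hard    nofile    65535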
Trick: if you want to count the number of files opened by your node process, take note of its process id and run this command:
# ls -l /proc/XXXXX/fd | wc -l
where XXXXX is the process id. This will help you figure out whether this is your problem: once you launch your node server, you can use this command to check whether the count climbs to a ceiling (by default 1024, or whatever "ulimit -n" reports) and stops growing right when the server freezes.
If you only want to see which files the process has open:
# ls -l /proc/XXXXX/fd
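If you prefer to watch this from inside the application, here is a minimal sketch (Linux only, since it relies on /proc; the interval and log format are arbitrary choices):
// Log how many file descriptors this Node process currently has open.
const fs = require('fs');
setInterval(function () {
  try {
    // Each entry in /proc/self/fd is one open descriptor (sockets included).
    console.log('open file descriptors:', fs.readdirSync('/proc/self/fd').length);
  } catch (err) {
    // Non-Linux system, or the readdir itself failed (e.g. EMFILE).
    console.error('could not read /proc/self/fd:', err.message);
  }
}, 10000);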
Hope this helps. Anyway, if you are setting up a Node.js server, I'm pretty sure you'll want to do this to make sure it doesn't melt down.
Finally, if you need to debug future errors that leave nothing in the log, you can try stracing or dtrussing the process:
# strace -p <process-id>
should do the job.

Related

Is there a hard limit on socket.io connections?

Background
We have a server running socket.io 2.0.4. This server receives requests from a stress script that simulates clients using socket.io-client 2.0.4.
The script simulates the creation of clients (each with its own socket) that send one request and immediately die afterwards, using socket.disconnect();
Problem
During the first few seconds all goes well, but every test reaches a point at which the script starts spitting out the following error:
connect_error: Error: websocket error
This means that the clients my script is creating are unable to connect to the server.
The script creates 7 clients per second (spaced evenly throughout the second); each client makes one request and then dies.
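For context, the stress script does roughly the following (a sketch with placeholder names, event and URL, not the actual code):
// Spawn ~7 short-lived socket.io clients per second.
const io = require('socket.io-client');
setInterval(function () {
  const socket = io('https://server.example.org', { transports: ['websocket'] });
  socket.on('connect', function () {
    socket.emit('request', { ts: Date.now() }); // one request...
    socket.disconnect();                        // ...then the client dies
  });
  socket.on('connect_error', function (err) {
    console.error('connect_error:', err.message);
  });
}, 1000 / 7);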
Research
At first I thought there was an issue with file descriptors and the limits imposed by UNIX, since the server is a Debian machine:
https://github.com/socketio/socket.io/issues/1393
However, after following those suggestions the issue remained.
Then I thought maybe my test script was not connecting correctly, so I changed the connection options as in this discussion:
https://github.com/socketio/socket.io-client/issues/1097
Still, to no avail.
What could be wrong?
I see the machine's CPUs are constantly at 100%, so I guess I am pounding the server with requests.
But if I am not mistaken, the server should simply accept more requests and process them when possible.
Questions
Is there a limit to the number of connections a socket.io server can handle?
When running stress tests like this, you need to be aware of the protections and gatekeepers along the way.
In our case the stack was deployed on AWS. First, the AWS load balancers started blocking us because they thought the system was being DDoSed.
Then the Debian machine itself was getting flooded, and its SYN flood protection kicked in and started refusing connections.
But after fixing that we were still getting the error. It turned out we had to increase the TCP connection buffers and change how TCP connections were handled in the kernel.
Now it accepts all connections, but I wouldn't wish on anyone the suffering we went through to figure it out...
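For reference, this kind of kernel tuning is usually done through sysctl. The parameters and values below only illustrate the sort of settings involved; they are not necessarily the exact ones changed here:
# /etc/sysctl.conf -- illustrative values only; apply with: sysctl -p
# ceiling for the listen() backlog
net.core.somaxconn = 4096
# queue of half-open (SYN_RECV) connections
net.ipv4.tcp_max_syn_backlog = 4096
# keep accepting connections during a SYN flood instead of dropping them
net.ipv4.tcp_syncookies = 1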

My application stops writing to files and opening new TCP connections after some time

I have no idea what could be causing this.
I have a Node application which connects to an external server over TCP and communicates with it. Part of its functionality also includes making relatively frequent HTTP requests.
Each instance of the application establishes up to 30 TCP connections to the external server and makes HTTP requests as needed. Previously I hosted the application on relatively cheap VPSes, with one instance of the application per server.
Now I'm setting it up on a proper dedicated server. I could run a single instance on the dedicated server and raise the connection limit I've set, so that one instance covers what several of the smaller VPS instances did, but I'd rather run several instances of the application on the dedicated server, each limited to 30 connections.
The application also writes logs to disk (just a plain flat file), and sends logs via UDP to an external logging server. This is done using winston.
After some uptime, however, I'm experiencing an issue where HTTP requests time out (ETIMEDOUT) and the logs stop being written to disk. The application itself is still running, and the TCP connection to the server is still active and working. I can communicate with the application through that connection and it responds as expected. The logging server is still receiving the UDP packets as well. I've noticed that the log files stop being written to, but after a few minutes they appear to be flushed to disk finally, and the missed logs then appear.
My first suspicion was an open-files limit being hit, but the OS (Ubuntu) doesn't have a limit that I'm hitting. I tried disabling any Node HTTP Agent behavior (I'm using the request module, so I just passed false for the agent option).
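For reference, the call with the agent disabled looks roughly like this (a sketch with a placeholder URL, not my actual code):
// One-off HTTP request without a shared keep-alive agent.
const request = require('request');
request({
  url: 'https://api.example.org/endpoint', // placeholder
  agent: false,                            // no Agent / connection pooling for this request
  timeout: 10000
}, function (err, res, body) {
  if (err) {
    return console.error('request failed:', err.code); // e.g. ETIMEDOUT
  }
  console.log('status:', res.statusCode);
});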
It's not the webserver on the other end rejecting my connections. While the issue was occurring I was able to successfully wget a file from the webserver using the same external IP as the Node app is using.
I'm tailing the log file and noticing that the time between when a line is generated and when it's flushed to the disk is gradually increasing.
CPU and memory usage are low so there's no way that's the issue. iowait in top is 0.0. I have no idea where to go from here. Any help at all would be greatly appreciated.
I have Node 5.10.1.

Not getting output from websocket-bench when testing Primus

Hopefully this tool isn't too obscure and somebody can help me out, since it would be super useful if I could figure out what I'm doing wrong.
I'm running:
websocket-bench -t primus -p engine.io https://dev.example.org
I see a bunch of connections and eventually disconnects on the server so it's definitely hitting it, but the program hangs on my command line with this:
Launch bench with 100 total connection, 20 concurent connection
0 message(s) send by client
1 worker(s)
WS server : primus
Even when I kill the server, no output. I tried running with the -v option but no luck. I tried a trivial example with -a 1, same thing.
It's kind of useful like this since at least I can see that opening 10k concurrent connections isn't causing anything catastrophic on the server, but it would be really nice to see the neat table that I can get it to output when I use an incorrect transport like this:
websocket-bench -t primus -p websockets -v https://dev.example.org
Anyone have a clue what I'm doing wrong?

Throttling express server

I'm using a very simple Express server with PUT and GET routes on an Ubuntu machine, but if I use several clients (around 8) making requests at the same time, it very easily gets flooded and starts returning connect EADDRNOTAVAIL errors. I have found no way to avoid this other than reducing the number of requests per client, but is there a way to throttle responses on the server so that instead of returning errors it queues requests and serves them in due time?
Maybe it's better to check on the client whether previous requests have been answered, and not issue new ones until they have been served? The client is here.
Queuing seems like the wrong approach; you should first check your current ulimit (every connection needs a file handle).
To solve your problem, just raise the ulimit.

NodeJS load test poor performance (EADDRNOTAVAIL)

I'm getting started with web applications using Node.js, and there is one problem with my app I don't know how to solve.
The application (we use Express) runs smoothly on my local machine, but when we deploy it to our dev server for load testing, we get an error like this:
Error: connect EADDRNOTAVAIL
at errnoException (net.js:770:11)
at connect (net.js:646:19)
at Socket.connect (net.js:711:9)
at asyncCallback (dns.js:68:16)
at Object.onanswer [as oncomplete] (dns.js:121:9)
GET XXXXXXX 500 21ms
Our application does not have a database; it talks to a REST API backend. Every page we build needs one or more calls to that backend. I know we should use a caching system, but we want to test without it.
Our load test simulates user navigation. It starts with 5 users and adds another user every minute. Once we have over 25 users, we begin to see the error in our logs.
At first I thought it could be a problem with too many open connections, but our sysadmins say that's not the case.
So it would be great if anyone could give me a hint about where I should look.
EDIT: Our dev machine has 16 cores and we're running our application with the cluster module. Calls to the backend are made with Mikael's popular request module.
As robertklep suggested, this is a problem of the OS running out of ephemeral ports when opening too many outgoing connections. Follow his link for a detailed explanation.
When I increased the port range as the article says, I still had the problem. With some more googling I found out about problems with the garbage collector and Node network objects. It seems a good idea (when you need many, many outgoing connections) to handle garbage collection manually.
Check out this post.
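If you go down the manual garbage collection route mentioned above, the usual pattern is to start node with --expose-gc and trigger collections yourself (a sketch; the interval is arbitrary and this should be treated as a last resort):
// Run with: node --expose-gc app.js
if (global.gc) {
  setInterval(function () {
    global.gc(); // force a full collection so unreferenced socket objects are cleaned up sooner
  }, 30000);
} else {
  console.warn('start node with --expose-gc to enable manual garbage collection');
}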
If you are sure it is not a problem in your program, you can change the Linux system configuration to solve it. These settings let the kernel reuse sockets stuck in TIME_WAIT, which frees ephemeral ports faster (note that tcp_tw_recycle is known to cause trouble for clients behind NAT and has been removed from recent kernels):
[xxx#xxx ~]$ vim /etc/sysctl.conf
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
[xxx#xxx ~]$ sysctl -p
