Handling many connections in node.js - node.js

I am making a live app with the use of websockets (express-ws npm package) in node.js.
The users send a message via ws every 10 seconds. Each of such requests takes about 1-1.5 milliseconds to handle (I have made some .time benchmarks). Everything works perfectly while there are less than ~9000 connections. However, if it grows above that, those 9000 requests every 10 seconds take 9000*1.5=13500ms > 10s and some users do not get their requests handled (as node.js is single-threaded). This is my first live app that gets so many online users at the same time so I do not know what to do. How to handle that many connections correctly?
I have read some articles about that and I have found some solutions which do not seem to work for me (at least I do not understand how to make them working).
Use the cluster module. The problem is that the requests have to share variables. I have an array of data which is updated or read during every request and clusters, as I have read and tested, are basically another processes which cannot share memory.
The same applies to worker_threads. They can kinda share memory, but I have to set up the communication between all threads and it still comes up to handling 9000 connections in 10 seconds which are not significantly faster than 9000 connections that have been in the beginning (that 9000 requests are simply a database search and an update with a few validations whether a user is registered and has provided valid data). Probably, if I throw the validation to a worker thread, the connections limit will grow up to 13000, but it is still insufficient.
I thought of creating a separate server on an another port (probably even in c++) and send all the requests that have passed the validation there (websocket between the servers). That seems like the best solution as for now but it still comes up to handling 9000 requests in one thread which will not make it much better.
So, how do I handle that many requests that need to share a variable efficiently? How do game servers which need to update the states of thousands of players multiple times per second do that?

Related

how many requests a node http server can handle per second without queueing any requests?

Does anyone know how many requests a basic single instance of node http server can handle per second without queueing any requests?
Actually, i need to write a nodejs app, that should be able to respond to about thousand incoming requests within 100ms consistently. I'm trying to test it in a 4 cpus server and running 4 instances in cluster mode. but currently it only able to handle less than 1000 requests per second consistently within 100ms
See this answer:
How many clients can an http-server can handle?
It's not about queuing but about how many concurrent requests Node can handle at the same time. It contains example benchmarks that you can use and tweak to your needs.
Now, about queuing: in a sense, every request is queued because only one thing can run at the same time in one Node process. So you can say that the answer to how many requests a Node http server can handle without queuing any requests is: one. Just like for Nginx for example.
Now, the time to respond is an entirely different matter and depends on many factors like what you actually do in those requests handlers.
For example a mean time per request that I got in my experiments here for 10000 concurrent connections was less than 2 milliseconds. If it does more then it can take more of course but there is no one number for all Node servers. It depends how efficient is your implementation.
Now, a big no-no for handling concurrency in Node are:
Never use any "Sync" function
Never use long running for and while loops
Never do any CPU-intensive work in your process that handles the reqests

In Node js. How many simultaneous requests can I send with the "request" package

How many simultaneous requests can I make with the request package?
I am expecting data back from every request confirming the request was received and processed successfully. Is this hardware or OS dependent? Where do I start looking?
One of the more recent versions of node.js does not enforce a limit on outgoing requests (older versions did). If you were literally trying to make millions of outgoing connections at the same time, then you would probably hit a limit on your own node.js server that would be OS specific. But, the practical limit is more likely going to be determined by the target host.
Since all your requests are being sent to the same host, the more likely limit will be determined by the server you are making the requests to. It will have some sort of limit for how many simultaneous requests it can have "in-flight" at the same time before it starts refusing new connections. What that number is depends entirely upon how the server is configured and built. For http://www.google.com, the number is probably hundreds of thousands or millions of requests because they have a huge server farm and requests are balanced across all of them. For some simple single CPU server, the limit would obviously be much smaller than that.
In addition, there will little use in sending zillions of requests to a single CPU server anyway because it won't be able to work on all of them at once anyway.
So, if you want to know what would work best for a given target host, you would have to set up an adjustable test harness so you could test scenarios where you send from 1, 2, 5, 10, 50, 100, 200, 500, 1000 at a time and see what the average response time is and where you start to get errors (if any).
If you don't want to do any of that type of testing, then a reasonably safe choice that doesn't attempt to fully optimize things is to put no more than 5 requests in flight at the same time.
You can either build something yourself to manage to N requests in flight at a time or you can use one of the existing libraries that will do that for you. The Bluebird promise library has a concurrency option on some of it's functions such as Promise.map() which will automatically do that for you for whatever concurrency value you set. The async library also has something similar.
If you want more specific help crafting the code to manage how many requests are in flight at a time or to build a test harness for it, please show us some of your code for the source of all the requests so we have some idea how that works (if it's a giant array of requests or what the source of the URLs is).

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" message? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release, it is quite possible your server hits the max connection limit even there is no communications.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specified Logging. Trace client behaviors and analysis what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to debug the application using Apache JMeter for number of connection and use Jconsole or Zabbix to look for heap space or thread dump for tomcat server.
Nio Connector of Apache tomcat can have maximum connections of 10000 but I don't think thats a good idea to provide that much connection to one instance of tomcat better way to do this is to run multiple instance of tomcat.
In my view best way for Production server: To Run Apache http server in front and point your tomcat instance to that http server using AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting an inflood of requests. Log files can usually help you on this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, and intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.

Using Fleck Websocket for 10k simultaneous connections

I'm implementing a websocket-secure (wss://) service for an online game where all users will be connected to the service as long they are playing the game, this will use a high number of simultaneous connections, although the traffic won't be a big problem, as the service is used for chat, storage and notifications... not for real-time data synchronization.
I wanted to use Alchemy-Websockets, but it doesn't support TLS (wss://), so I have to look for another service like Fleck (or other).
Alchemy has been tested with high number of simultaneous connections, but I didn't find similar tests for Fleck, so I need to get some real info from users of fleck.
I know that Fleck is non-blocking and uses Async calls, but I need some real info, cuz it might be abusing threads, garbage collector, or any other aspect that won't be visible to lower number of connections.
I will use c# for the client as well, so I don't need neither hybiXX compatibility, nor fallback, I just need scalability and TLS support.
I finally added Mono support to WebSocketListener.
Check here how to run WebSocketListener in Mono.
10K connections is not little thing. WebSocketListener is asynchronous and it scales well. I have done tests with 10K connections and it should be fine.
My tests shows that WebSocketListener is almost as fast and scalable as the Microsoft one, and performs better than Fleck, Alchemy and others.
I made a test on a Windows machine with Core2Duo e8400 processor and 4 GB of ram.
The results were not encouraging as it started delaying handshakes after it reached ~1000 connections, i.e. it would take about one minute to accept a new connection.
These results were improved when i used XSockets as it reached 8000 simultaneous connections before the same thing happened.
I tried to test on a Linux VPS with Mono, but i don't have enough experience with Linux administration, and a few system settings related to TCP, etc. needed to change in order to allow high number of concurrent connections, so i could only reach ~1000 on the default settings, after that he app crashed (both Fleck test and XSocket test).
On the other hand, I tested node.js, and it seemed simpler to manage very high number of connections, as node didn't crash when reached the limits of tcp.
All the tests where echo test, the servers send the same message back to the client who sent the message and one random other connected client, and each connected client sends a random ~30 chars text message to the server on a random interval between 0 and 30 seconds.
I know my tests are not generic enough and i encourage anyone to have their own tests instead, but i just wanted to share my experience.
When we decided to try Fleck, we have implemented a wrapper for Fleck server and implemented a JavaScript client API so that we can send back acknowledgment messages back to the server. We wanted to test the performance of the server - message delivery time, percentage of lost messages etc. The results were pretty impressive for us and currently we are using Fleck in our production environment.
We have 4000 - 5000 concurrent connections during peak hours. On average 40 messages are sent per second. Acknowledged message ratio (acknowledged messages / total sent messages) never drops below 0.994. Average round-trip for messages is around 150 miliseconds (duration between server sending the message and receiving its ack). Finally, we did not have any memory related problems due to Fleck server after its heavy usage.

Node js avoid pyramid of doom and memory increases at the same time

I am writing a socket.io based server and I'm trying to avoid the pyramid of doom and to keep the memory low.
I wrote this client - http://jsfiddle.net/QUDXU/1/ which i run with node client-cluster 1000. So 1000 connections that are making continuous requests.
For the server side a tried 3 different solutions which i tested. The results in terms of RAM used by the server, after i let everything run for an hour are:
Simple callbacks - http://jsfiddle.net/DcWmJ/ - 112MB
Q module - http://jsfiddle.net/hhsja/1/ - 850MB and increasing
Async module - http://jsfiddle.net/SgemT/ - 1.2GB and increasing
The server and clients are on different machines. (Softlayer cloud instances). Node 0.10.12 and Socket.io 0.9.16
Why is this happening? How can I keep the memory low and use some kind of library which allows to keep the code readable?
Option 1. You can use the cluster module and gracefully kill your workers from time to time (make sure you disconnect() first). You can check process.memoryUsage().rss > 130000000 in the master and kill the workers when they exceed 130MB, for example :)
Option 2. NodeJS has the habit of using memory and rarely doing rigorous cleanups. As V8 reaches the maximum memory limit, GC calls are more aggressive. So you could lower the maximum memory a node process can take up by running node --max-stack-size <amount>. I do this when running node on embedded devices (often with less than 64 MB of ram available).
Option 3. If you really want to keep the memory low, use weak references where it is possible (anywhere except in long-running calls) https://github.com/TooTallNate/node-weak . This way, the objects will get garbage collected sooner. Extensive tests to make sure everything works are needed, though. GL if u use this one :) https://github.com/TooTallNate/node-weak
It seems like the problem was on the client script, not on the server one. I ran 1000 processes, each of them emitting messages to the server at every second. I think the server was getting very busy resolving all of those requests and thus using all of that memory. I rewrote the client side like this, spawning a number of processes proportional to the number of processors, each of them connecting multiple times like this:
client = io.connect(selectedEnvironment, { 'force new connection': true, 'reconnect': false });
Notice the 'force new connection' flag that allows to connect multiple clients using the same instance of socket.io-client.
The part that solved my problem was actually how the requests were made: any client would make another request after a second from receiving the acknowledge of the previous request, not at every second.
Connecting 1000 clients is making my server using ~100MB RSS. I also used async on the server script which seems very elegant and easier to understand than Q.
The bad part is that I've been running the server for about 2-3 days and the memory rised at 250MB RSS. This, I don't know why.

Resources