Linux Server Benchmarking - Stuck at 31449 requests [closed]

Linux Server Benchmarking - Stuck at 31449 requests [closed] - node.js

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
Improve this question
I apologize in advance for the length of this question, but I wanted to make it clear what I have already attempted.
Setup:
4 t1.micro EC2 instances (clients)
1 c1.medium EC2 instance (server) in a VPC (behind an Amazon Elastic Load Balancer (ELB))
1 simple node.js server running on the c1.medium (listening on http port 3000, returns simple "hello" html string)
4 node.js servers (1 on each t1.micro) doing distributed load testing with a custom benchmarking suite against the c1.medium
*Clients and server are running Ubuntu and have their file descriptor limits raised to 102400.
Run Case:
The 4 clients try to make n number of connections (simple http get request) a second, ranging from 400 to 1000, until 80,000 requests are made. The server has a hard response wait time, y, tested at 500, 1000, 2000, and 3000 milliseconds before it responds with "hello".
Problem:
At anything more than 500 connections/second, there is a several second (up to 10 or 15) halt, where the server no longer responds to any of the clients, and clients are left idle waiting for the responses. This is consistently at exactly 31449 requests. The clients show the appropriate amount of ESTABLISHED connections (using netstat) holding during this time. Meanwhile, the server shows around 31550 TIME_WAIT connections. After a few seconds this number reported by the server begins to drop, and eventually it starts responding to the clients again. Then, the same issue occurs at some later total request count, e.g. 62198 (though this is not as consistent). The file descriptor count for that port also drops to 0.
Attempted Resolutions:
Increasing ephemeral port range. Default was 32768-61000, or about 30k. Note that despite coming from 4 different physical clients, the traffic is routed through the local ip of the ELB and all ports are thus assigned to that ip. Effectively, all 4 clients are treated as 1 instead of the usually expected result of each of them being able to use the full port range. So instead of 30k x 4 total ports, all 4 are limited to 30k.
So I increased the port range to 1024-65535 with net.ipv4.ip_local_port_range, restarted the server and the following was observed:
The new port range is used. Ports as low as in the 1000's and as high as 65000's are observed being used.
The connections are still getting stuck at 31449.
Total port's in TIME_WAIT state are observed going as high as 50000, after getting stuck around 31550 for 10-15 seconds.
Other tcp configurations were also changed, independent of each other and in conjuction with each other such as tc_fin_timeout, tcp_tw_recycle, tcp_tw_reuse, and several others without any sizable improvement. tcp_tw_recycle seems to help the most, but it makes the status results on the clients print out strangely and in the wrong order, and still doesn't guarantee that the connections don't get stuck. Also I understand that this is a dangerous option to enable.
Question:
I simply want to have as many connections as possible, so that the real server that gets put on the c1.medium has a high baseline when it is benchmarked. What else can I do to avoid hitting this 31449 connection wall short of recompiling the kernel or making the server unstable? I feel like I should be able to go much higher than 500/s, and I thought that increasing the port range alone should have shown some improvement, but I am clearly missing something else.
Thanks!

Related

Handling many connections in node.js

I am making a live app with the use of websockets (express-ws npm package) in node.js.
The users send a message via ws every 10 seconds. Each of such requests takes about 1-1.5 milliseconds to handle (I have made some .time benchmarks). Everything works perfectly while there are less than ~9000 connections. However, if it grows above that, those 9000 requests every 10 seconds take 9000*1.5=13500ms > 10s and some users do not get their requests handled (as node.js is single-threaded). This is my first live app that gets so many online users at the same time so I do not know what to do. How to handle that many connections correctly?
I have read some articles about that and I have found some solutions which do not seem to work for me (at least I do not understand how to make them working).
Use the cluster module. The problem is that the requests have to share variables. I have an array of data which is updated or read during every request and clusters, as I have read and tested, are basically another processes which cannot share memory.
The same applies to worker_threads. They can kinda share memory, but I have to set up the communication between all threads and it still comes up to handling 9000 connections in 10 seconds which are not significantly faster than 9000 connections that have been in the beginning (that 9000 requests are simply a database search and an update with a few validations whether a user is registered and has provided valid data). Probably, if I throw the validation to a worker thread, the connections limit will grow up to 13000, but it is still insufficient.
I thought of creating a separate server on an another port (probably even in c++) and send all the requests that have passed the validation there (websocket between the servers). That seems like the best solution as for now but it still comes up to handling 9000 requests in one thread which will not make it much better.
So, how do I handle that many requests that need to share a variable efficiently? How do game servers which need to update the states of thousands of players multiple times per second do that?

What are some good settings for seeding a ton of torrents? (>10000)

I'm running into a lot of trouble when trying to seed a lot of torrents( > 10k) with libtorrent.
They include:
Choking my network connection
Tracker requests timing out(libtorrent tracker error)
When using auto-manage(they go from checking to seeding very slowly, even when my active_seeding is set to unlimited.
I used to let them be automanaged, but I'd find that it makes nearly all of them unavailable.
Here are my current settings:
sessionSettings.setActiveDownloads(5);
sessionSettings.setActiveLimit(-1);
sessionSettings.setActiveSeeds(-1);
sessionSettings.setActiveDHTLimit(5);
sessionSettings.setPeerConnectTimeout(25);
sessionSettings.announceDoubleNAT(true);
sessionSettings.setUploadRateLimit(0);
sessionSettings.setDownloadRateLimit(0);
sessionSettings.setHalgOpenLimit(5);
sessionSettings.useReadCache(false);
sessionSettings.setMaxPeerlistSize(500);
My current method is to loop over all my 10k+ torrents, and run torrent.resume(). When using automanage, this basically only starts ~ 50 of the torrents, and the others start about at a rate of 1 torrent per 10 minutes, which wouldn't work. When not using automanage, it chokes my connection.
BUT, when I do only 30 of them, they all seem to seed correctly, so my next plan is try to resume() them in groupings either with a time delay, or after they've received a tracker_reply.
I tried to garner what I could from this, but don't know what my settings should be specifically:
http://blog.libtorrent.org/2012/01/seeding-a-million-torrents/
I'd really appreciate someone sharing their settings for seeding thousands of torrents,

When not using automanage, it chokes my connection.
Since you say it can run either on a hosted server or domestic internet connection then you will have not much of a choice but to throttle torrent startups. Domestic internet connections are generally behind consumer grade routers and possibly CGNAT, both of which have fairly small NAT tables that will eventually choke from concurrently established TCP connections (peer-peer connections, tracker announces) or UDP pseudo-connections (UDP trackers, µTP, DHT)
So to run many torrents at once you will have to limit all active maintenance traffic of that kind so that the torrents are only started to listen passively for incoming connections.

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" message? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).

It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.

While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release, it is quite possible your server hits the max connection limit even there is no communications.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specified Logging. Trace client behaviors and analysis what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.

The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.

I think you need to debug the application using Apache JMeter for number of connection and use Jconsole or Zabbix to look for heap space or thread dump for tomcat server.
Nio Connector of Apache tomcat can have maximum connections of 10000 but I don't think thats a good idea to provide that much connection to one instance of tomcat better way to do this is to run multiple instance of tomcat.
In my view best way for Production server: To Run Apache http server in front and point your tomcat instance to that http server using AJP connector.
Hope this helps.

Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting an inflood of requests. Log files can usually help you on this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, and intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.

linux outbound connections timeout or fail after approx 700 established connections

One of our linux servers today experienced problems opening outbound requests.
I've reviewed this answer, Increasing the maximum number of tcp/ip connections in linux and it appears as though we are well within the maximum limits.
At the time, netstat -an showed approximately 700 established connections.
Any new socket connections would fail, but nothing would be written to /var/log.
All connections are long-term, and usually open for several hours at a time.
Is there any logging that would help determine what configuration parameter we are bumping against?

nf_conntrack_tcp_timeout_established
It turns out that there’s another timeout value you need to be concerned with. The established connection timeout. Technically this should only apply to connections that are in the ESTABLISHED state, and a connection should get out of this state when a FIN packet goes through in either direction. This doesn’t appear to happen and I’m not entirely sure why.
So how long do connections stay in this table then? The default value for nf_conntrack_tcp_timeout_established is 432000 seconds. I’ll wait for you to do the long division…
Fun times.
I changed the timeout value to 10 minutes (600 seconds) and in a few days time I noticed conntrack_count go down steadily until it sat at a very manageable level of a few thousand.
We did this by adding another line to the sysctl file:
net.netfilter.nf_conntrack_tcp_timeout_established=600

Linux TCP weirdly unresponsive when under heavy load

I'm trying to get an HTTP server I'm writing on to behave well when under heavy load, but I'm getting some weird behavior that I cannot quite understand.
My testing consists of using ab (the Apache benchmark program) over the loopback interface at a concurrency level of 1000 (ab -n 50000 -c 1000 http://localhost:8080/apa), while straceing the server process. Strace both slows processing down well enough for the problem to be readily reproducible and allows me to debug the server internals post completion to some extent. I also capture the network traffic with tcpdump while the test is running.
What happens is that ab stops running a while into the test, complaining that a connection returned ECONNRESET, which I find a bit weird. I could easily buy into a connection timing out since the server might simply not have the bandwidth to process them all, but shouldn't that reasonably return ETIMEDOUT or even ECONNREFUSED if not all connections can be accepted?
I used Wireshark to extract the packets constituting the first connection to return ECONNRESET, and its brief packet list looks like this:
(The entire tcpdump file of this connection is available here.)
As you can see from this dump, the connection is accepted (after a few SYN retransmissions), and then the request is retransmitted a few times, and then the server resets the connection. I'm wondering, what could cause this to happen? Normally, Linux' TCP implementation ACKs data before the reading process even chooses to receive it so long as their is space in the TCP window, so why doesn't it do that here? Are there some kind of shared buffers that are running out? Most importantly, why is the kernel responding with a RST packet all of a sudden instead of simply waiting and letting the client re-transmit further?
For the record, the strace of the process indicates that it never even accepts a connection from the port in this connection (port 56946), so this seems to be something Linux does on its own. It is also worth noting that the server works perfectly well as long as ab's concurrency level is low enough (it works perfectly well up to about 100, and then starts failing intermittently somewhere between 100-500), and that its request throughput is rather constant regardless of the concurrency level (it processes somewhere between 6000-7000 requests per second as long as it isn't being straced). I have not found any particular correlation between the frequency of the problem occurring and my backlog setting to listen() (I'm currently using 128, but I've tried up to 1024 without it seeming to make a difference).
In case it matters, I'm running Linux 3.2.0 on this AMD64 box.

The backlog queue filled up: hence the SYN retransmissions.
Then a slot became available: hence the SYN/ACK.
Then the GET was sent, followed by four retransmissions, which I can't account for.
Then the server gave up and reset the connection.
I suspect you have a concurrency or throughput problem in your server which is preventing you from accepting connections rapidly enough. You should have a thread that is dedicated to doing nothing else but calling accept() and either starting another thread to handle the accepted socket or else queueing a job to handle it to a thread pool. I would then speculate that Linux resets connections on connections which are in the backlog queue and which are receiving I/O retries, but that's only a guess.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string