Linux TCP weirdly unresponsive when under heavy load - linux

I'm trying to get an HTTP server I'm writing to behave well under heavy load, but I'm getting some weird behavior that I cannot quite understand.
My testing consists of using ab (the Apache benchmark program) over the loopback interface at a concurrency level of 1000 (ab -n 50000 -c 1000 http://localhost:8080/apa), while straceing the server process. Strace both slows processing down well enough for the problem to be readily reproducible and allows me to debug the server internals post completion to some extent. I also capture the network traffic with tcpdump while the test is running.
What happens is that ab stops running a while into the test, complaining that a connection returned ECONNRESET, which I find a bit weird. I could easily buy into a connection timing out since the server might simply not have the bandwidth to process them all, but shouldn't that reasonably return ETIMEDOUT or even ECONNREFUSED if not all connections can be accepted?
I used Wireshark to extract the packets constituting the first connection to return ECONNRESET, and its brief packet list looks like this:
(The entire tcpdump file of this connection is available here.)
As you can see from this dump, the connection is accepted (after a few SYN retransmissions), and then the request is retransmitted a few times, and then the server resets the connection. I'm wondering, what could cause this to happen? Normally, Linux's TCP implementation ACKs data before the reading process even chooses to receive it, so long as there is space in the TCP window, so why doesn't it do that here? Are there some kind of shared buffers that are running out? Most importantly, why is the kernel responding with a RST packet all of a sudden instead of simply waiting and letting the client retransmit further?
For the record, the strace of the process indicates that it never even accepts a connection from the port in this connection (port 56946), so this seems to be something Linux does on its own. It is also worth noting that the server works perfectly well as long as ab's concurrency level is low enough (it works perfectly well up to about 100, and then starts failing intermittently somewhere between 100-500), and that its request throughput is rather constant regardless of the concurrency level (it processes somewhere between 6000-7000 requests per second as long as it isn't being straced). I have not found any particular correlation between the frequency of the problem occurring and my backlog setting to listen() (I'm currently using 128, but I've tried up to 1024 without it seeming to make a difference).
In case it matters, I'm running Linux 3.2.0 on this AMD64 box.

The backlog queue filled up: hence the SYN retransmissions.
Then a slot became available: hence the SYN/ACK.
Then the GET was sent, followed by four retransmissions, which I can't account for.
Then the server gave up and reset the connection.
I suspect you have a concurrency or throughput problem in your server which is preventing you from accepting connections rapidly enough. You should have a thread that is dedicated to doing nothing but calling accept() and then either starting another thread to handle the accepted socket or queueing a job for a thread pool to handle it. I would then speculate that Linux resets connections that are still in the backlog queue and are receiving retransmitted data, but that's only a guess.
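The dedicated accept thread suggested above could look roughly like the following. This is only a minimal sketch in Python, not the asker's server: the names handle, serve and start_server, the pool size of 8 and the fixed response are all assumptions made for illustration.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(conn: socket.socket) -> None:
    """Hypothetical handler: read one request, send a fixed response, close."""
    with conn:
        conn.recv(4096)
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")


def serve(listener: socket.socket, pool: ThreadPoolExecutor) -> None:
    """Accept loop: does nothing except accept() and hand the socket off."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            return  # the listener was closed; stop accepting
        pool.submit(handle, conn)  # the request is processed on a pool thread


def start_server(host: str = "127.0.0.1", port: int = 0, backlog: int = 128):
    """Start the accept thread; port 0 lets the OS pick a free port."""
    listener = socket.create_server((host, port), backlog=backlog)
    pool = ThreadPoolExecutor(max_workers=8)  # assumed pool size
    thread = threading.Thread(target=serve, args=(listener, pool), daemon=True)
    thread.start()
    return listener, pool
```

Because the accept thread never does request processing itself, the backlog queue is drained as fast as accept() can run, regardless of how slow the handlers are.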

Related

TCP: Improving reliability with a broken connection

I'm working on an application where I need to ensure that even if the network goes down, messages will still arrive at their destination reliably, in-order, and unmodified. I've been using TCP, and up until now, I was just using a strategy of:
If a send/receive fails, do it again until no error.
If the remote disconnects, wait until the next connection and replace the socket I was send/receiving from with this new one (achieved through some threading and blocking to ensure it's swapped cleanly).
I recently realised that this doesn't work, as send can't report errors indicating that the remote hasn't received the message (cite eg. here).
I did also learn that TCP connections can survive brief network outages, as the kernel buffers the packets until the connection is declared dead after the timeout period (cite here).
The question: Is it a feasible strategy to just crank the timeout period waaaay higher on both client/server side (using setsockopt and the SO_KEEPALIVE options), so that a connection "never times out"? I'd have to handle errors related to the kernel's buffer filling up, but that should be relatively simple.
Are there any other failure cases?
If neither end explicitly disconnects and no data is in flight, the TCP connection will stay open indefinitely even if you unplug the cable. An idle TCP connection has no timeout unless keepalives are enabled.
However, I would use (or design) an application protocol on top of tcp, making it possible to resume data transmission after re-connects. You may use HTTP for example.
That would be much more robust: as you say, relying on the kernel buffers means they will eventually fill up, and their contents would also be lost on, say, a power outage.
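One way to picture such an application protocol is sequence numbers plus acknowledgements, so that a reconnecting sender knows what to resend. The sketch below is an assumption about how this might be structured, not an existing library: ResumableSender, ResumableReceiver and their methods are hypothetical names, and the network transport is left out entirely.

```python
class ResumableSender:
    """Keeps every message until the peer acknowledges it (hypothetical protocol)."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._unacked = {}  # seq -> payload, kept in insertion (send) order

    def queue(self, payload: bytes) -> int:
        """Assign the next sequence number and remember the message."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        return seq

    def on_ack(self, acked_seq: int) -> None:
        """The peer confirmed everything up to and including acked_seq."""
        for seq in [s for s in self._unacked if s <= acked_seq]:
            del self._unacked[seq]

    def pending(self) -> list:
        """Messages to (re)send in order, e.g. right after a reconnect."""
        return list(self._unacked.items())


class ResumableReceiver:
    """Delivers each message exactly once, in order, ignoring duplicates."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.delivered = []

    def on_message(self, seq: int, payload: bytes) -> int:
        """Accept only the next expected message; return the number to ack."""
        if seq == self.last_seq + 1:
            self.delivered.append(payload)
            self.last_seq = seq
        return self.last_seq
```

After a reconnect the sender simply replays pending(); anything the receiver already has is dropped as a duplicate, so a lost acknowledgement does not cause double delivery.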

Node.js Server Timeout Problems (EC2 + Express + PM2)

I'm relatively new to running production node.js apps and I've recently been having problems with my server timing out.
Basically after a certain amount of usage & time my node.js app stops responding to requests. I don't even see routes being fired on my console anymore - it's like the whole thing just comes to a halt and the HTTP calls from my client (iPhone running AFNetworking) don't reach the server anymore. But if I restart my node.js app server everything starts working again, until things inevitably stop again. The app never crashes, it just stops responding to requests.
I'm not getting any errors, and I've made sure to handle and log all DB connection errors so I'm not sure where to start. I thought it might have something to do with memory leaks so I installed node-memwatch and set up a listener for memory leaks but that doesn't get called before my server stops responding to requests.
Any clue as to what might be happening and how I can solve this problem?
Here's my stack:
Node.js on AWS EC2 Micro Instance (using Express 4.0 + PM2)
Database on AWS RDS volume running MySQL (using node-mysql)
Sessions stored w/ Redis on same EC2 instance as the node.js app
Clients are iPhones accessing the server via AFNetworking
Once again no errors are firing with any of the modules mentioned above.
First of all you need to be a bit more specific about timeouts.
TCP timeouts: TCP divides a message into packets which are sent one by one. The receiver needs to acknowledge having received each packet. If the receiver does not acknowledge a packet within a certain period of time, a TCP retransmission occurs, which means sending the same packet again. If this happens a few more times, the sender gives up and kills the connection.
HTTP timeout: An HTTP client like a browser, or your server while acting as a client (e.g: sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it will disconnect and call it a timeout.
Now, there are many, many possible causes for this... from more trivial to less trivial:
Wrong Content-Length calculation: If you send a request with a Content-Length: 20 header, that means "I am going to send you 20 bytes". If you send 19, the other end will wait for the remaining 1. If that takes too long... timeout.
Not enough infrastructure: Maybe you should assign more machines to your application. If (total load / # of CPU cores) is over 1, or your memory usage is high, your system may be over capacity. However keep reading...
Silent exception: An error was thrown but not logged anywhere. The request never finished processing, leading to the next item.
Resource leaks: Every request needs to be handled to completion. If you don't do this, the connection will remain open. In addition, the IncomingMessage object (usually called req in Express code) will remain referenced by other objects (e.g. Express itself). Each one of those objects can use a lot of memory.
Node event loop starvation: I will get to that at the end.
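For the Content-Length item in the list above, the usual mistake is counting characters instead of bytes. The following is a minimal sketch in Python rather than Node; build_response is a hypothetical helper, not part of Express or any framework.

```python
def build_response(body_text: str) -> bytes:
    """Build a plain-text HTTP response whose Content-Length matches the body."""
    body = body_text.encode("utf-8")
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        # Count encoded bytes, not characters: non-ASCII text is longer on the wire.
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body
```

A five-character string containing one accented letter encodes to six bytes in UTF-8, so using the character count would advertise one byte fewer than is actually sent.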
For memory leaks, the symptoms would be:
the node process would be using an increasing amount of memory.
To make things worse, if available memory is low and your server is misconfigured to use swapping, Linux will start moving memory to disk (swapping), which is very I/O and CPU intensive. Servers should not have swapping enabled.
cat /proc/sys/vm/swappiness
will return you the level of swappiness configured in your system (goes from 0 to 100). You can modify it in a persistent way via /etc/sysctl.conf (requires restart) or in a volatile way using: sysctl vm.swappiness=10
Once you've established you have a memory leak, you need to get a core dump and download it for analysis. A way to do that can be found in this other Stackoverflow response: Tools to analyze core dump from Node.js
For connection leaks (you leaked a connection by not handling a request to completion), you would see an increasing number of established connections to your server. You can count your established connections with netstat -a -p tcp | grep ESTABLISHED | wc -l
Now, the event loop starvation is the worst problem. If you have short lived code node works very well. But if you do CPU intensive stuff and have a function that keeps the CPU busy for an excessive amount of time... like 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code taking 50 ms), operations being handled by the event loop such as processing HTTP requests start falling behind and eventually timing out.
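The effect can be reproduced in any single-threaded event loop. The sketch below uses Python's asyncio as a stand-in for Node's event loop, which is an assumption made purely for illustration; measure_lag is a hypothetical helper that reports how late a 10 ms timer fires while another task holds the loop.

```python
import asyncio
import time


async def measure_lag(block_seconds: float) -> float:
    """Return how much later than planned a 10 ms timer fires, in seconds."""
    loop = asyncio.get_running_loop()

    async def blocker() -> None:
        # Synchronous sleep stands in for CPU-bound work: nothing else
        # on this event loop can run until it returns.
        time.sleep(block_seconds)

    start = loop.time()
    task = asyncio.create_task(blocker())
    await asyncio.sleep(0.01)  # should wake after about 10 ms
    lag = loop.time() - start - 0.01
    await task
    return lag
```

Calling asyncio.run(measure_lag(0.05)) shows the timer being delayed by roughly the blocking time; every pending request on the loop is delayed in the same way.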
The way to find a CPU bottleneck is to use a performance profiler. nodegrind/qcachegrind are my preferred profiling tools, but others prefer flamegraphs and such. However, it can be hard to run a profiler in production. Instead, take a development server and slam it with requests, i.e. run a load test. There are many tools for this.
Finally, another way to debug the problem is:
env NODE_DEBUG=tls,net node <...arguments for your app>
node has optional debug statements that are enabled through the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net will make node emit debugging information for the tls and net modules... so basically everything being sent or received. If there's a timeout you will see where it's coming from.
Source: Experience of maintaining large deployments of node services for years.

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" limit? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
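The two /proc checks above can also be scripted. This is a rough sketch that assumes a Linux /proc filesystem; parse_open_files_limit and fd_usage are hypothetical helper names, and reading another user's process normally requires appropriate privileges.

```python
import os
import re


def parse_open_files_limit(limits_text: str) -> tuple:
    """Return (soft, hard) from the 'Max open files' row of /proc/<PID>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            soft, hard = re.split(r"\s{2,}", line.strip())[1:3]
            return int(soft), int(hard)
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid: int) -> tuple:
    """Return (descriptors in use, soft limit) for a Linux process."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_open_files_limit(f.read())
    used = len(os.listdir(f"/proc/{pid}/fd"))
    return used, soft
```

If fd_usage(pid) returns a used count at or near the soft limit for the Tomcat PID, the descriptor limit is a likely suspect.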
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release them, it is quite possible that your server hits the maximum connection limit even when there is no communication.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited to:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specific logging. Trace client behaviors and analyze what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the <Connector> element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to load-test the application with Apache JMeter to see the number of connections, and use JConsole or Zabbix to inspect heap space or take a thread dump of the Tomcat server.
The NIO connector of Apache Tomcat can have a maximum of 10000 connections, but I don't think it's a good idea to give that many connections to one instance of Tomcat; a better way is to run multiple instances of Tomcat.
In my view the best setup for a production server is to run Apache HTTP Server in front and connect your Tomcat instances to it using the AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting a flood of requests. Log files can usually help you with this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, an intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.

What constitutes "readable" (kqueue/epoll)

I know that if the remote host gracefully shuts down a connection, epoll will report EPOLLIN, and calling read or recv will not block, and will return 0 bytes (i.e. end of stream).
However, if the connection is not closed gracefully, and a write or send operation fails, does this cause epoll to subsequently return EPOLLIN for that socket, producing the same/similar end of stream scenario?
I've tried to find documentation on this behaviour, but have not succeeded, and while I could test it, I'm not interested in what happens on a specific distribution with a specific kernel version.
It is indeed not entirely obvious from the specification, but it works as follows for poll():
If there is data available to be read, even if the connection is closed, POLLIN is returned.
If neither reading nor writing is possible because of a closed connection, POLLHUP or POLLERR is returned.
If reading is no longer possible but writing is (such as if the other side did shutdown(SHUT_WR)), POLLIN is returned and POLLHUP and POLLERR are not returned. (This allows waiting for POLLOUT normally.)
The simple thing to do is to try a read when any of POLLIN, POLLHUP and POLLERR are set.
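The rule of simply trying a read on any of the three flags can be sketched as follows. This uses Python's select.poll as a thin wrapper over poll(); read_when_ready and the 4096-byte read size are assumptions for illustration.

```python
import select
import socket

READ_OR_HANGUP = select.POLLIN | select.POLLHUP | select.POLLERR


def read_when_ready(sock: socket.socket, timeout_ms: int = 1000):
    """Poll one socket; return data, b"" at end of stream, or None on timeout."""
    poller = select.poll()
    poller.register(sock, READ_OR_HANGUP)
    for _fd, events in poller.poll(timeout_ms):
        if events & READ_OR_HANGUP:
            # recv() returns data, b"" at end of stream, or raises the
            # pending socket error, so one code path covers all three flags.
            return sock.recv(4096)
    return None
```

The caller then only has to distinguish three outcomes (data, end of stream, exception) instead of interpreting each poll flag separately.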
In kqueue(), there is just an EVFILT_READ filter that may be triggered. This is described in the man page and should be clear enough.
Note that if you don't enable TCP keepalives (FreeBSD enables them by default but most other operating systems do not), waiting for data to read may get stuck forever if the network breaks in certain ways. Even if TCP keepalives are on, it tends to take a few hours to detect a broken connection.
It may not return EPOLLIN when the peer machine goes away unexpectedly. I once encountered this with VirtualBox, using the following steps:
Launch server on one VM.
Launch client on the other VM, connect the server and keep the connection without doing anything.
Save client VM state (something like hibernate).
I then saw that the connection was still ESTABLISHED on the server VM with
netstat -anp --tcp
In other words, EPOLLIN was not triggered in server.
http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/ says that the default keepalive time is 7200 seconds, i.e. the connection has to be idle for about two hours before the first probe is sent.
Of course, you can change keep alive timeout value by setsockopt or kernel parameters.
But some books say the better solution is to detect this in the application layer, e.g. by designing the protocol so that dummy messages are sent periodically to check the connection state.
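Changing the keepalive timing with setsockopt might look like this. It is a sketch under the assumption of a Linux-style socket API: TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are not available on every platform, and enable_keepalive and its default values are made up for the example.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalives and, where supported, shorten their timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timing options below are platform-specific (present on
    # Linux); skip them where the constants are not defined.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # idle seconds before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # failed probes before giving up
```

With these example values a dead peer would be noticed after roughly idle + interval * count seconds, i.e. a couple of minutes instead of two hours.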
epoll() is basically poll() but it scales better when you increase number of fds. I am not sure what it does when you are using it as edge-triggered interface. But for level triggered - yes, it will always return EPOLLIN, provided you are listening to this event, if end of stream is detected.
Though you must know TCP is not perfect. If the connection is terminated abnormally by the other side (e.g. the physical link is down), your side may never detect this until you write to the socket. TCP_KEEPALIVE may help, but not much.
However, if the connection is not closed gracefully, and a write or send operation fails, does this cause epoll to subsequently return EPOLLIN for that socket, producing the same/similar end of stream scenario?
No. That would imply receipt of a FIN, which means normal termination of the connection, which didn't happen. I would expect you would get an EPOLLERR or maybe EPOLLHUP.
But I'm curious why you wouldn't have already closed the socket on getting the write error, and why you would still be polling it. That's not correct behaviour.

send (2) succeed with established connection on unreachable network

I have some troubles understanding send (2) syscall on my linux x86 box.
Suppose my app has established an SSH connection with another host on the LAN. Then I take the network down (e.g. unplug the cable) and call the function (from my app) that sends some SSH packets through the connection. Internally, this function calls send like this:
w = send(s->fd_out,buffer, len, 0);
In debugger I found that send returns len (i.e. w == len after the call).
How can this be if the network is unreachable? When I call netstat, it says my SSH connection is in state ESTABLISHED even though the network is down.
I can't understand why send executes normally and doesn't return any error (like EPIPE or ECONNRESET). Maybe an SSH connection lives on for some time after the network goes down?
Thanks to all.
It's due to the implementation of TCP (and SSH uses TCP). Your send() just writes to a socket, which is just a file descriptor, and a successful return means only that the write succeeded. It doesn't mean the data has been sent. A file descriptor is, after all, just a handle to state held in the kernel. The kernel is implemented to keep TCP state for a while before failing a session. In fact, the kernel is allowed to keep the session indefinitely until you explicitly call close() or kill your process. So your data is actually buffered in kernel space for the network card to deliver later.
Here is a quick experiment you can do:
Write a server that keeps receiving messages after establishing a connection
s = socket();
bind(s);
listen(s);
c = accept(s);   /* accept the connection once */
while (1) {
    recv(c);     /* keep reading from the same connection */
}
Write a client that establishes a connection, takes cin input, and sends a message to the server whenever you hit return.
socket();
connect();
while (1) {
getline();
send();
}
Be careful that you NEVER call close() in while loop on either side. Now, if you unplug your cable AFTER you've established a connection, send a message, reconnect again, and send another message, you will find both messages on the server side.
What you will NEVER observe is that you receive the second message before the first one. You either lose them all, or receive them in order.
Now let me explain why it behaves like this. This is the state diagram of a TCP session.
https://dl.dropbox.com/u/17011409/TCP_State.png
You can see clearly that until you explicitly call close(), the connection will always be in the ESTABLISHED state. That's the expected behavior of TCP. Establishing a TCP connection is expensive, and keeping a session alive is good for performance. (That's partly how TCP DoS attacks work: attackers keep establishing connections until the server runs out of resources to keep TCP state information.)
In this state, your send() is handed to the kernel for the actual sending. TCP guarantees in-order, reliable delivery, but the network can lose packets at any time. So TCP has to buffer your packets and keep retrying. There are algorithms to throttle these retries, but the data is buffered for quite a long time before TCP declares failure. The initial timeout before Linux assumes a packet was lost is on the order of seconds (3 seconds in older kernels). After a loss, TCP retries, then retries again after a longer interval. Unplugging your cable is just the same situation as packets being lost on the way to the destination. Once you plug the cable in again, a retry succeeds, and TCP starts sending the remaining messages in order.
I know I must have failed to explain it thoroughly. You really need to know the details of TCP to reason about this behavior. It's required for the properties TCP is giving you. And it's not acceptable to expose internal implementation details to the programmer. (How about a send call that sometimes returns within milliseconds, and sometimes returns after 10 seconds? I bet no one wants this performance bomb in their code. The point of having a TCP library is exactly to hide this ugly nature of networks.) In fact, you need to understand multiple RFCs and algorithms to see how TCP achieves in-order, reliable delivery over a lossy network. Congestion control also affects how long the data stays buffered. Wikipedia is a good starting point, but it's a full semester's undergraduate course if you really want to understand the details.
With a zero flags argument, send() is equivalent to write(2). And it will write your data on file descriptor (stores in kernel space to deliver).
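Other flags will not change this. MSG_CONFIRM, for example, only tells the link layer that forward progress happened, and per send(2) it is valid only on SOCK_DGRAM and SOCK_RAW sockets, so it cannot confirm delivery on a TCP connection.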
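That a successful send() only means the kernel accepted the data can be checked without unplugging anything. The sketch below is an assumption-laden stand-in for the cable experiment: it connects over loopback to a peer that never calls recv(), and send_to_silent_peer is a hypothetical helper name.

```python
import socket


def send_to_silent_peer(payload: bytes) -> int:
    """Send to a peer that accepts the connection but never reads from it."""
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    peer, _addr = listener.accept()  # connection is established, nothing is read
    try:
        # Returns as soon as the kernel has copied the bytes into its buffer,
        # whether or not any application ever receives them.
        return client.send(payload)
    finally:
        client.close()
        peer.close()
        listener.close()
```

For a small payload the call returns the full length even though no receiver ever consumed a byte, which is the same thing the asker observed with the network down.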

Resources