I set up a Cassandra cluster with several coordinator nodes.
Sometimes one of the coordinator nodes becomes unavailable...my code handles this with a retry policy which moves to the next node and the problem is solved.
However, it seems that the problematic node still receives traffic even if the driver keeps throwing OperationTimedOutException...it is a time consuming since this node useless.
Further details:
Cassandra Driver -
I'm using Cassandra driver version 3.11.0 (cassandra-driver-core-3.11.0.jar)
Loading balancing policy -
I didn't set any load balancing policy - thus, the default is used.
Retry Policy -
I implemented my own retry policy -
In case of read/write timeout or unavailable retry cause - I'm using retry while reducing the consistency level to one. In case of request error - I'm trying a different host.
Is there anyway to configure that if the driver keeps throwing OperationTimedOutException while sending query to a specific coordinator node, this node will not be called for some period of time?
Cassandra client connection does the Cassandra co-ordinator node caching. So, It will continue sending the query to the same node. Tune your application layer socket config with the client connection timeout.
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(30000);
options.setTcpNoDelay(true);
There are a few misconceptions in your question so let me begin by correcting them.
Misconception #1
I set up a Cassandra cluster with several coordinator nodes.
All nodes in a Cassandra cluster are the same. This is one of the attributes that makes Cassandra awesome. Any node in the cluster can be picked as a coordinator. You can NOT configure/nominate/setup a node to be a coordinator while others aren't.
Misconception #2
... if a coordinator node keeps throwing OperationTimedOutException ...
Cassandra nodes are not capable of throwing OperationTimedOutException. OperationTimedOutException is a client-side exception which gets thrown by the driver when it doesn't get a response from a coordinator within the configured client timeout period.
It is a different exception from read or write timeout exceptions which are thrown when the coordinator sends a response back to the driver when a read or write request timed out on the server-side.
Picking nodes
You didn't specify which driver + version you're using. OperationTimedOutException is in Java driver v3.x but not in v4.x (it was replaced with DriverTimeoutException which makes it clearer that the exception is client-side) so for the purposes of my response, I'm going to assume that you're using Java driver v3.11 (latest in the v3 series).
You also didn't specify which load balancing policies (LBP) you've configured and which retry policies. If you're using the latency-aware LBP LatencyAwarePolicy, the likely scenario is that the problematic node has the lowest latency so it is listed as the "preferred node" by the policy.
Handling misbehaving nodes is a very tough thing to do for the drivers, particularly if the nodes are unresponsive because a driver won't know what is really going on if a node doesn't respond at all. The drivers can't be too aggressive at marking nodes as "down" because if the node is just temporarily unavailable (for example, due to a GC pause), it won't get picked again as a coordinator for a bit of time.
Sometimes, the latency "signal" from a problematic node takes a while to bubble up for a driver to effectively route around it because of the algorithm used by the driver to average out the reported latencies over a period of one or two minutes, scaled such that older latencies are weighted less than newer latencies. In the case of an unresponsive node, the driver can only base the average/scaling on the last time the node reported its latency.
For this reason, the LatencyAwarePolicy was dropped in Java driver v4 in preference for the new DefaultLoadBalancingPolicy which has a much better detection algorithm for slow replicas.
Your workaround using tryNextHost() is a bit clunky because you have to effectively wait for the retry policy to kick in. What you really need to focus on is the fact that your nodes become unresponsive. If your cluster is getting overloaded, you should consider increasing the capacity by adding more nodes.
Trying to come up with a software solution for what is an infrastructure capacity issue is never going to be successful in the long run. Cheers!
I've recently written a simple TCP server using epoll, but I want to explore other mechanisms for high performance mutliplexing, to that end I came across io_uring, and am planning on making another simple TCP server using it.
However I read that the number of entries for io_uring is limited to 4096 in here https://kernel.dk/io_uring.pdf, which seems to imply that I won't be able to theoretically have more than that number of persistent connections.
To my understanding where normally I'd use something like epoll_wait() to wait on an event for epoll, I instead submit a specific request in io_uring and am notified when the request has completed/failed, so does that mean I can submit up to 4096 read() requests for example?
Have I misunderstood the use case of io_uring or have I misunderstood how to use it?
In the same document I linked, it says:
Normally an application would ask for a ring of a given size, and the
assumption may be that this size corresponds directly to how many requests the application can have pending in the
kernel. However, since the sqe lifetime is only that of the actual submission of it, it's possible for the application to
drive a higher pending request count than the SQ ring size would indicate.
Which is precisely what you'd do for the case of listening for messages on lots of sockets - it's just that the upper limit of how many submissions you can send at once is 4096.
The DataStax Cassandra driver of version 4 has got a feature of the throttling.
The documentation states:
Similarly, the request timeout encompasses throttling: the timeout starts ticking before the
throttler has started processing the request; a request may time out while it is still in the
throttler's queue, before the driver has even tried to send it to a node.
Great. However, let's say I have a dynamic list of some ids and I want to execute select requests to cassandra in parallel (using executeAsync()) for all ids in the list. Having list too large I will eventually face timeouts if requests are residing in the throttler's queue too long.
How can I overcome this issue? Is there any built-in rate limiting technique so I can do not care about how many requests in parallel I can execute, but just throw all of them to cassandra and then wait until they all are completed??
UPD: I am not interested in custom code solutions, as ofc we are capable to implement our own rate limit solution. I am asking precisely about driver's built-in mechanisms to achieve this.
In Mongo and HBase, we have a way to track the client connection, Is there a way to get the total client connection in Cassandra?
On client-side Datastax driver reports metrics on connections, task queues, queries and errors (connection errors, r/w timeouts, retries, speculative executions).
Can be accessed via the Cluster.getMetrics() operation (java).
Datastax provides a cassandra structure, CassMetrics (cpp driver). This structure contains min and max microseconds to execute a query, pending requests, timed-out requests, total connections and available connections. It gives the entire performance metrics of a particular snapshot of the session.
Each member of the structure is explained clearly in the following datastax documentation- https://docs.datastax.com/en/developer/cpp-driver/2.3/api/struct.CassMetrics/
Here, is an example program on how to use the structure effectively - https://github.com/datastax/cpp-driver/blob/master/examples/perf/perf.c
In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" message? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release, it is quite possible your server hits the max connection limit even there is no communications.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specified Logging. Trace client behaviors and analysis what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to debug the application using Apache JMeter for number of connection and use Jconsole or Zabbix to look for heap space or thread dump for tomcat server.
Nio Connector of Apache tomcat can have maximum connections of 10000 but I don't think thats a good idea to provide that much connection to one instance of tomcat better way to do this is to run multiple instance of tomcat.
In my view best way for Production server: To Run Apache http server in front and point your tomcat instance to that http server using AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting an inflood of requests. Log files can usually help you on this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, and intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.