Cassandra Nodes Going Down - cassandra

I have a 3 node Cassandra cluster setup (replication set to 2) with Solr installed, each node having RHEL, 32 GB Ram, 1 TB HDD and DSE 4.8.3. There are lots of writes happening on my nodes and also my web application reads from my nodes.
I have observed that all the nodes go down after every 3-4 days. I have to do a restart of every node and then they function quite well till the next 3-4 days and again the same problem repeats. I checked the server logs but they do not show any error even when the server goes down. I am unable to figure out why is this happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042) at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout) at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task``1 task, Int32 timeout) at Cassandra.ControlConnection.Init() at Cassandra.Cluster.Init()`
But when I check the OpsCenter, none of the nodes are down. All nodes status show perfectly fine. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from nuget, but now I updated even that to version 3.0.3 still this errors persists.
Any help on this would be appreciated. Thanks in advance.

If you haven't done so already, you may want to look at setting your logging levels to default by running: nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG on all your nodes
Your first issue is most likely an OutOfMemory Exception.
For your second issue, the problem is most likely that you have really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint but typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config and then tail your gc logs taking note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
The Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer error is typically inter-node timeouts. i.e. The coordinator times out waiting for a response from another node and sends a TCP RST packet to close the connection.

Related

Why does a Cassandra node get picked as coordinator even when the driver keeps throwing OperationTimedOutException?

I set up a Cassandra cluster with several coordinator nodes.
Sometimes one of the coordinator nodes becomes unavailable...my code handles this with a retry policy which moves to the next node and the problem is solved.
However, it seems that the problematic node still receives traffic even if the driver keeps throwing OperationTimedOutException...it is a time consuming since this node useless.
Further details:
Cassandra Driver -
I'm using Cassandra driver version 3.11.0 (cassandra-driver-core-3.11.0.jar)
Loading balancing policy -
I didn't set any load balancing policy - thus, the default is used.
Retry Policy -
I implemented my own retry policy -
In case of read/write timeout or unavailable retry cause - I'm using retry while reducing the consistency level to one. In case of request error - I'm trying a different host.
Is there anyway to configure that if the driver keeps throwing OperationTimedOutException while sending query to a specific coordinator node, this node will not be called for some period of time?
Cassandra client connection does the Cassandra co-ordinator node caching. So, It will continue sending the query to the same node. Tune your application layer socket config with the client connection timeout.
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(30000);
options.setTcpNoDelay(true);
There are a few misconceptions in your question so let me begin by correcting them.
Misconception #1
I set up a Cassandra cluster with several coordinator nodes.
All nodes in a Cassandra cluster are the same. This is one of the attributes that makes Cassandra awesome. Any node in the cluster can be picked as a coordinator. You can NOT configure/nominate/setup a node to be a coordinator while others aren't.
Misconception #2
... if a coordinator node keeps throwing OperationTimedOutException ...
Cassandra nodes are not capable of throwing OperationTimedOutException. OperationTimedOutException is a client-side exception which gets thrown by the driver when it doesn't get a response from a coordinator within the configured client timeout period.
It is a different exception from read or write timeout exceptions which are thrown when the coordinator sends a response back to the driver when a read or write request timed out on the server-side.
Picking nodes
You didn't specify which driver + version you're using. OperationTimedOutException is in Java driver v3.x but not in v4.x (it was replaced with DriverTimeoutException which makes it clearer that the exception is client-side) so for the purposes of my response, I'm going to assume that you're using Java driver v3.11 (latest in the v3 series).
You also didn't specify which load balancing policies (LBP) you've configured and which retry policies. If you're using the latency-aware LBP LatencyAwarePolicy, the likely scenario is that the problematic node has the lowest latency so it is listed as the "preferred node" by the policy.
Handling misbehaving nodes is a very tough thing to do for the drivers, particularly if the nodes are unresponsive because a driver won't know what is really going on if a node doesn't respond at all. The drivers can't be too aggressive at marking nodes as "down" because if the node is just temporarily unavailable (for example, due to a GC pause), it won't get picked again as a coordinator for a bit of time.
Sometimes, the latency "signal" from a problematic node takes a while to bubble up for a driver to effectively route around it because of the algorithm used by the driver to average out the reported latencies over a period of one or two minutes, scaled such that older latencies are weighted less than newer latencies. In the case of an unresponsive node, the driver can only base the average/scaling on the last time the node reported its latency.
For this reason, the LatencyAwarePolicy was dropped in Java driver v4 in preference for the new DefaultLoadBalancingPolicy which has a much better detection algorithm for slow replicas.
Your workaround using tryNextHost() is a bit clunky because you have to effectively wait for the retry policy to kick in. What you really need to focus on is the fact that your nodes become unresponsive. If your cluster is getting overloaded, you should consider increasing the capacity by adding more nodes.
Trying to come up with a software solution for what is an infrastructure capacity issue is never going to be successful in the long run. Cheers!

com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response

We are using Apache Cassandra-v3.0.9 with com.datastax.cassandra:cassandra-driver-core:3.1.3. Our application works good all the time, but once in a week we start getting the following exception from our applications:
com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:62)
at com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
at com.til.cms.graphdao.cassandra.dao.generic.CassandraICMSGenericDaoImpl.getCmsEntityMapForLimitedSize(CassandraICMSGenericDaoImpl.java:2824)
.....
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
These applications are hitting Cassandra datacenter for read requests. The data-center consists of 5 physical servers each with 2 disks, 64 GB RAM, 40 cores, 16GB heap with G1 GC.
There was no problem with Cassandra servers as per our investigation like there was no load average/iowait increase, gc pauses or nodetool/cqlsh connectivity etc. We just started getting these exceptions in our application logs until we restarted Cassandra servers. This exception was reported randomly for different Cassandra servers in the datacenter and we had to restart each of them. In normal time each of these Cassandra server servers 10K read requests/seconds and hardly 10 write requests/seconds. When we encounter this problem read requests are dramatically affected to 2-3 K/seconds.
The replication factor of our cassandra datacenter is 3 and following is way we are making connections
Cluster.builder()
.addContactPoints(nodes)
.withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
.withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder().withLocalDc(localDatacenter).build())
.withSpeculativeExecutionPolicy(PerHostPercentileTracker.builder(13000).build())
.build()
EDIT:
We have observed before we start getting these exceptions, we getting the following WARN level exceptions in our java application.
2018-04-03 23:40:06,456 WARN [cluster1-timeouter-0]
com.datastax.driver.core.RequestHandler [RequestHandler.java:805] Not
retrying statement because it is not idempotent (this message will be
logged only once). Note that this version of the driver changes the
default retry behavior for non-idempotent statements: they won't be
automatically retried anymore. The driver marks statements
non-idempotent by default, so you should explicitly call
setIdempotent(true) if your statements are safe to retry. See
https://docs.datastax.com/en/developer/java-driver/3.1/manual/retries/ for more details.
2018-04-04 00:04:24,856 WARN [cluster1-nio-worker-2]
com.datastax.driver.core.PercentileTracker
[PercentileTracker.java:108] Got request with latency of 16632 ms,
which exceeds the configured maximum trackable value 13000
2018-04-04 00:04:24,858 WARN [cluster1-timeouter-0]
com.datastax.driver.core.PercentileTracker
[PercentileTracker.java:108] Got request with latency of 16712 ms,
which exceeds the configured maximum trackable value 13000

Cassandra nodejs driver time out after a node moves

We use vnodes on our cluster.
I noticed that when the token space of a node changes (automatically on vnodes, during a repair or a cleanup after adding new nodes), the datastax nodejs driver gets a lot of "Operation timed out - received only X responses" for a few minutes.
I tried using ONE and LOCAL_QUORUM consistencies.
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
What do you guys suggest we should do to avoid this ? Having a custom retry policy ? Caching ? Changing the consistency ?
Example of behavior
when we see this:
4/7/2016, 10:43am Info Host 172.31.34.155 moved from '8185241953623605265' to '-1108852503760494577'
We see a spike of those:
{
"message":"Operation timed out - received only 0 responses.",
"info":"Represents an error message from the server",
"code":4608,
"consistencies":1,
"received":0,
"blockFor":1,
"isDataPresent":0,
"coordinator":"172.31.34.155:9042",
"query":"SELECT foo FROM foo_bar LIMIT 10"
}
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
In fact, when adding new node, there will be token range movement but Cassandra can still serve read requests using the old token ranges until the scale out has finished completely. So the behavior you're facing is very suspicious.
If you can reproduce this error, please activate query tracing to narrow down the issue.
The error can also be related to a node under heavy load and not replying fast enough

Cassandra 2.1.2 node stuck on joining the cluster

I'm trying but failing to join a new (well old, but wiped out) node to an existing cluster.
Currently cluster consists of 2 nodes and runs C* 2.1.2. I start a third node with 2.1.2, it gets to joining state, it bootstraps, i.e. streams some data as shown by nodetool netstats, but after some time, it gets stuck. From that point nothing gets streamed, the new node stays in joining state. I restarted node twice, everytime it streamed more data, but then got stuck again. (I'm currently on a third round like that).
Other facts:
I don't see any errors in the log on any of the nodes.
The connectivity seems fine, I can ping, netcat to port 7000 all ways.
I have 267 GB load per running node, replication 2, 16 tokens.
Load of a new node is around 100GBs now
I'm guessing that the node after few rounds of restarts, will finally suck in all of the data from running nodes and join the cluster. But definitely it's not the way it should work.
EDIT: I discovered some more info:
The bootstrapping process stops in the middle of streaming some table, always after sending exactly 10MB of some SSTable, e.g.:
$ nodetool netstats | grep -P -v "bytes\(100"
Mode: NORMAL
Bootstrap e0abc160-7ca8-11e4-9bc2-cf6aed12690e
/192.168.200.16
Sending 516 files, 124933333900 bytes total
/home/data/cassandra/data/leadbullet/page_view-2a2410103f4411e4a266db7096512b05/leadbullet-page_view-ka-13890-Data.db 10485760/167797071 bytes(6%) sent to idx:0/192.168.200.16
Read Repair Statistics:
Attempted: 2016371
Mismatch (Blocking): 0
Mismatch (Background): 168721
Pool Name Active Pending Completed
Commands n/a 0 55802918
Responses n/a 0 425963
I can't diagnose the error & I'll be grateful for any help!
Try to telnet from one node to another using correct port.
Make sure you are joining the correct name cluster.
Try use: nodetool repair
You might be pinging the external IP addressed, and your cluster communicates using internal IP addresses.
If you are running on Amazon AWS, make sure you have firewall open on both internal IP addresses.

Cassandra handshake and internode communication

While trying to use the cassandra 2.0.1 version, i started facing the handshaking with version problem .
There was an exception from OutboundTcpConnection.java stating that handshaking is not possible with a particular node.
I had a look at the TCP dump and cleared off the doubts that there was no problem in the network layer.
The application is not completing the handshaking process .Moreover , the port 7000 is still active.
For example, all my 8 nodes are up . But when i try a nodetool status, some nodes give a DN- down node status. Later on, after examining , the TCP backlog queue was found overflowing and the particular server has stopped listening for other servers in the cluster.
Am still not able to spot the root cause of this problem.
Note: I have tried with previous version of cassandra , 1.2.4, and it was working ok at that time. Before going to production , i thought it is better to go to 2.0.x version to avoid a migration overhead mainly. Can anyone provide an idea on this ?
Exception am getting is
NFO [HANDSHAKE-/aa.bb.cc.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 385) Handshaking version with /aa.bb.cc.XX
INFO [HANDSHAKE-/aa.bb.cc.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 396) Cannot handshake version with /aa.bb.cc.YY
This sounds like https://issues.apache.org/jira/browse/CASSANDRA-6349. You should upgrade.

Resources