Sporadic Cassandra WriteErrors when using Lightweight Transactions

I have a service that connects to our Cassandra cluster and executes tens of thousands of queries per day using Lightweight Transactions (LWT) to implement the consensus system described here. For the most part it works fine, but sporadically the writes fail with an error saying "Operation timed out - received only 1 responses" (or, less commonly, only 0 responses). We're using the DataStax Python driver. When the error occurs, the full error line (at the end of the stack trace) reads:
WriteTimeout: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 1 responses." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_SERIAL'}
Is this something that seems expected to occur from time to time in a production Cassandra setup? Or does it seem like something where we could have a configuration problem with our Cassandra cluster or network?
Some information about our Cassandra cluster: It is an 8-node setup spread across 2 Amazon EC2 regions (4 nodes per region). All of the nodes are running version 3.3.0 of the Datastax Cassandra distribution.

From https://issues.apache.org/jira/browse/CASSANDRA-9328
There are cases where, under contention, the coordinator loses track of whether the value it submitted to Paxos might be applied or not (see CASSANDRA-6013). At that point we can't do anything other than answering "sorry, I don't know". And since a WriteTimeoutException already means "I don't know", we throw it in that case too, even though it's not a proper timeout per se.
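Since a CAS WriteTimeout means "outcome unknown" rather than "failed", a common pattern is to read the row back at serial consistency (a serial read completes any in-flight Paxos round it observes) and decide from what it returns. A minimal sketch with the DataStax Python driver; the contact point, keyspace, table, and column names are hypothetical:

from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])           # hypothetical contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

insert = SimpleStatement(
    "INSERT INTO leases (name, owner) VALUES (%s, %s) IF NOT EXISTS",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL,
)

try:
    result = session.execute(insert, ("lock-1", "worker-42"))
    acquired = result.was_applied  # [applied] column of the CAS result
except WriteTimeout:
    # Per CASSANDRA-9328 the Paxos proposal may or may not have been applied.
    # Read back at serial consistency to learn the actual outcome.
    read = SimpleStatement("SELECT owner FROM leases WHERE name = %s",
                           consistency_level=ConsistencyLevel.LOCAL_SERIAL)
    row = session.execute(read, ("lock-1",)).one()
    acquired = row is not None and row.owner == "worker-42"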

Related

Why does a Cassandra node get picked as coordinator even when the driver keeps throwing OperationTimedOutException?

I set up a Cassandra cluster with several coordinator nodes.
Sometimes one of the coordinator nodes becomes unavailable...my code handles this with a retry policy which moves to the next node and the problem is solved.
However, it seems that the problematic node still receives traffic even while the driver keeps throwing OperationTimedOutException... this is time-consuming, since that node is effectively useless.
Further details:
Cassandra Driver -
I'm using Cassandra driver version 3.11.0 (cassandra-driver-core-3.11.0.jar)
Load balancing policy -
I didn't set any load balancing policy - thus, the default is used.
Retry Policy -
I implemented my own retry policy -
In case of a read/write timeout or unavailable error, I retry while reducing the consistency level to ONE. In case of a request error, I try a different host.
Is there any way to configure the driver so that, if it keeps throwing OperationTimedOutException for queries sent to a specific coordinator node, that node is not contacted for some period of time?
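(For reference, here is a minimal sketch of the kind of custom retry policy described above, written against the Java driver 3.x API. The downgrade-to-ONE and try-next-host behaviour is an assumption based on the description in the question, not the asker's actual code.)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

public class DowngradingNextHostRetryPolicy implements RetryPolicy {

    @Override
    public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
            int required, int received, boolean dataRetrieved, int nbRetry) {
        // Retry once at a lower consistency level, as described in the question.
        return nbRetry == 0 ? RetryDecision.retry(ConsistencyLevel.ONE) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl,
            WriteType writeType, int required, int received, int nbRetry) {
        return nbRetry == 0 ? RetryDecision.retry(ConsistencyLevel.ONE) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
            int required, int alive, int nbRetry) {
        return nbRetry == 0 ? RetryDecision.retry(ConsistencyLevel.ONE) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl,
            DriverException e, int nbRetry) {
        // On request errors, move on to the next coordinator in the query plan.
        return RetryDecision.tryNextHost(cl);
    }

    @Override
    public void init(Cluster cluster) { }

    @Override
    public void close() { }
}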
The Cassandra client connection caches the coordinator node, so it will keep sending queries to the same node. Tune your application-layer socket configuration, including the connection and read timeouts:
import com.datastax.driver.core.SocketOptions;

SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000); // time allowed to establish a connection to a node
options.setReadTimeoutMillis(30000);    // per-request time the driver waits for a node's response
options.setTcpNoDelay(true);            // disable Nagle's algorithm for lower latency
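Note that the options object only takes effect once it is passed to the cluster builder; a short usage sketch (the contact point is a placeholder):

import com.datastax.driver.core.Cluster;

Cluster cluster = Cluster.builder()
    .addContactPoint("10.0.0.1")      // placeholder contact point
    .withSocketOptions(options)       // the SocketOptions configured above
    .build();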
There are a few misconceptions in your question so let me begin by correcting them.
Misconception #1
I set up a Cassandra cluster with several coordinator nodes.
All nodes in a Cassandra cluster are the same. This is one of the attributes that makes Cassandra awesome. Any node in the cluster can be picked as a coordinator. You can NOT configure/nominate/setup a node to be a coordinator while others aren't.
Misconception #2
... if a coordinator node keeps throwing OperationTimedOutException ...
Cassandra nodes are not capable of throwing OperationTimedOutException. OperationTimedOutException is a client-side exception which gets thrown by the driver when it doesn't get a response from a coordinator within the configured client timeout period.
It is different from the read and write timeout exceptions, which are thrown when the coordinator does respond to the driver but reports that the read or write request timed out on the server side.
Picking nodes
OperationTimedOutException exists in the Java driver v3.x but not in v4.x (it was replaced by DriverTimeoutException, which makes it clearer that the exception is client-side), so the rest of this answer assumes Java driver v3.11 (the latest in the v3 series), matching the version noted in the question.
Which load balancing policy (LBP) and retry policy are in effect also matters here. If you were using the latency-aware LatencyAwarePolicy, the likely scenario would be that the problematic node has the lowest recorded latency, so it is listed as the "preferred node" by the policy.
Handling misbehaving nodes is a very tough thing to do for the drivers, particularly if the nodes are unresponsive because a driver won't know what is really going on if a node doesn't respond at all. The drivers can't be too aggressive at marking nodes as "down" because if the node is just temporarily unavailable (for example, due to a GC pause), it won't get picked again as a coordinator for a bit of time.
Sometimes, the latency "signal" from a problematic node takes a while to bubble up for a driver to effectively route around it because of the algorithm used by the driver to average out the reported latencies over a period of one or two minutes, scaled such that older latencies are weighted less than newer latencies. In the case of an unresponsive node, the driver can only base the average/scaling on the last time the node reported its latency.
For this reason, LatencyAwarePolicy was dropped in Java driver v4 in favour of the new DefaultLoadBalancingPolicy, which has a much better detection algorithm for slow replicas.
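If you stay on driver v3 with LatencyAwarePolicy, the averaging window and penalty behaviour described above can be tuned through its builder. A sketch with illustrative values (not recommendations):

import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LatencyAwarePolicy;
import com.datastax.driver.core.policies.LoadBalancingPolicy;

LoadBalancingPolicy lbp = LatencyAwarePolicy
    .builder(DCAwareRoundRobinPolicy.builder().build())  // child policy that supplies candidate hosts
    .withScale(100, TimeUnit.MILLISECONDS)               // how quickly older latency samples lose weight
    .withRetryPeriod(10, TimeUnit.SECONDS)               // how long a penalized host is bypassed before being retried
    .withExclusionThreshold(2.0)                         // exclude hosts slower than 2x the fastest average
    .build();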
Your workaround using tryNextHost() is a bit clunky because you have to effectively wait for the retry policy to kick in. What you really need to focus on is the fact that your nodes become unresponsive. If your cluster is getting overloaded, you should consider increasing the capacity by adding more nodes.
Trying to come up with a software solution for what is an infrastructure capacity issue is never going to be successful in the long run. Cheers!

Cassandra: hinted handoff in case of multiple nodes down

Cassandra uses the concept of hinted handoff for consistency.
It means that if a node is down, the coordinator takes note of it and waits till it's up, and then resends the write request to it.
Does it mean that Cassandra sends success response back to the client even while it's waiting for the unavailable node to be up? If yes, then what if all of the target nodes were down? Won't it mean a successful response to the client even without a single write?
Hints are not stored if the requested consistency cannot be achieved.
For example, consider RF=3 with all replica nodes down. If the write consistency is QUORUM, no hints are stored and the write fails. With QUORUM, a hint is stored only when one replica is down and the coordinator received success responses from the other two.
The only exception is write consistency ANY: in that case, even if all replicas are down, a hint is stored and the write is reported as successful.
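To illustrate that exception with the Java driver (3.x): a write issued at consistency ANY can be acknowledged even when no replica has applied it, because a hint stored on the coordinator is enough. The keyspace, table, and session object below are placeholders/assumptions:

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// With CL=ANY the coordinator may acknowledge the write after only storing a hint.
Statement write = new SimpleStatement("INSERT INTO ks.t (id, val) VALUES (1, 'x')")
    .setConsistencyLevel(ConsistencyLevel.ANY);
session.execute(write);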

driver default retry policy

I am testing our Cassandra cluster for resiliency; it's a 9-node cluster with RF=3. When I block all traffic on port 7000 of one node, the client gets a
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
The reason is that this host was only partially able to communicate with the other nodes, but the driver then retries on the same host:
INFO - c.d.d.c.p.LoggingRetryPolicy - - Retrying on read timeout on same host at consistency QUORUM (initial consistency: QUORUM, required responses: 2, received responses: 1, data retrieved: true, retries: 0)
It performs ALL of its retries on the same host and never recovers; eventually the request fails.
I can create a custom policy, but I'm wondering why it never tries any other node?
With your settings the database keeps only 3 copies of the data (RF=3), so even though you have 9 nodes, QUORUM is evaluated against only the 3 nodes that actually own the data; which nodes those are is determined by the token assignment.
Before disabling the port on that node, was the cluster reported as healthy (in other words, did nodetool status report all nodes as UN, Up and Normal)? Is the latency reported by all the nodes similar? If a node has increased latencies, the query will time out before a response arrives from it.
Before creating "custom policies", and once you have confirmed that all nodes are healthy, reachable and available, you may want to explore using a lower consistency level (like ANY or ONE), which can improve resiliency and performance at the cost of consistency guarantees, or increasing the replication factor, which increases the number of nodes where the data can be found, at the cost of higher disk utilization.

com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response

We are using Apache Cassandra v3.0.9 with com.datastax.cassandra:cassandra-driver-core:3.1.3. Our application works fine most of the time, but about once a week we start getting the following exception from our applications:
com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:62)
at com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
at com.til.cms.graphdao.cassandra.dao.generic.CassandraICMSGenericDaoImpl.getCmsEntityMapForLimitedSize(CassandraICMSGenericDaoImpl.java:2824)
.....
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
These applications hit the Cassandra datacenter for read requests. The datacenter consists of 5 physical servers, each with 2 disks, 64 GB RAM, 40 cores, and a 16 GB heap with G1 GC.
As per our investigation there was no problem on the Cassandra servers themselves: no increase in load average or iowait, no GC pauses, and nodetool/cqlsh connectivity was fine. We just kept getting these exceptions in our application logs until we restarted the Cassandra servers. The exception was reported randomly for different Cassandra servers in the datacenter, and we had to restart each of them. In normal operation each of these Cassandra servers serves about 10K read requests/second and hardly 10 write requests/second. When we encounter this problem, read throughput drops dramatically to 2-3K/second.
The replication factor of our Cassandra datacenter is 3, and the following is how we create the connection:
Cluster.builder()
.addContactPoints(nodes)
.withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
.withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder().withLocalDc(localDatacenter).build())
.withSpeculativeExecutionPolicy(PerHostPercentileTracker.builder(13000).build())
.build()
EDIT:
We have observed that, before we start getting these exceptions, we get the following WARN-level messages in our Java application:
2018-04-03 23:40:06,456 WARN [cluster1-timeouter-0] com.datastax.driver.core.RequestHandler [RequestHandler.java:805] Not retrying statement because it is not idempotent (this message will be logged only once). Note that this version of the driver changes the default retry behavior for non-idempotent statements: they won't be automatically retried anymore. The driver marks statements non-idempotent by default, so you should explicitly call setIdempotent(true) if your statements are safe to retry. See https://docs.datastax.com/en/developer/java-driver/3.1/manual/retries/ for more details.

2018-04-04 00:04:24,856 WARN [cluster1-nio-worker-2] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16632 ms, which exceeds the configured maximum trackable value 13000

2018-04-04 00:04:24,858 WARN [cluster1-timeouter-0] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16712 ms, which exceeds the configured maximum trackable value 13000
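The first warning above points at a real knob: since Java driver 3.1, retries on write timeouts and request errors, as well as speculative executions, are skipped for statements that are not explicitly marked idempotent. A sketch of opting in (table and values are placeholders, the session is assumed):

import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Per statement: mark it as safe to retry / speculatively execute.
Statement stmt = new SimpleStatement("UPDATE ks.t SET val = ? WHERE id = ?", "x", 1)
    .setIdempotent(true);
session.execute(stmt);

// Or as a cluster-wide default when building the Cluster:
// new QueryOptions().setDefaultIdempotence(true)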

How will data be kept consistent on a Cassandra cluster

I have a doubt after reading the DataStax documentation about Cassandra write consistency. My question is how Cassandra maintains a consistent state in the following scenario:
Write consistency level = Quorum
replication factor = 3
As per the docs, when a write occurs, the coordinator node sends the write request to all replicas for that partition. If one replica succeeds and the others fail, the coordinator sends an error response back to the client (QUORUM was not met), but node-1 has successfully written the data, and that write will not be rolled back.
In this case,
Will read repair (or hinted handoff, or nodetool repair) replicate the inconsistently written data from node-1 to node-2 and node-3?
If not, how does Cassandra take care of not replicating inconsistent data to the other replicas?
Can you please clarify?
You are completely right: read repair or the other anti-entropy mechanisms will update node-2 and node-3.
This means that even a failed write will eventually reach the other nodes (as long as at least one replica succeeded). Cassandra doesn't have anything like the rollback that relational databases have.
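To make the client-side implication concrete, here is a sketch with the Java driver 3.x (keyspace/table are placeholders, the session is assumed): a timed-out QUORUM write has to be treated as "outcome unknown", since there is no rollback and the partially written value may still propagate later.

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

try {
    session.execute(new SimpleStatement("UPDATE ks.t SET val = ? WHERE id = ?", "x", 1)
        .setConsistencyLevel(ConsistencyLevel.QUORUM));
} catch (WriteTimeoutException e) {
    // The write may have landed on some replicas and can later spread via
    // read repair or hints; retry the (idempotent) write or read back to resolve.
}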
I don't see anything wrong here: the system does what you tell it. Since an error response was sent back to the client, the client should treat the write as failed, even though the copy that was written may later be propagated to the other replicas by read repair.
The coordinator stores the write for a failed replica as a hint and periodically retries delivering it; if the replica comes back within the hint window (3 hours by default), the hinted data is replayed to it, otherwise the hint is discarded.
For a read query, the coordinator sends requests to as many replica nodes as the consistency level requires and compares their responses. If one of those replicas does not return the latest data, the coordinator sends a read repair to that node to bring the replicas back in sync.
