Cassandra LOCAL_QUORUM is waiting for remote datacenter responses

We have a cluster with 2 datacenters (one in the EU and one in the US), with 4 nodes each, deployed in AWS.
The nodes in each datacenter are spread across 3 racks (availability zones).
In the cluster we have a keyspace test with replication: NetworkTopologyStrategy, eu-west:3, us-east:3
In the keyspace we have a table called mytable that has only one column, 'id' text
Now, we were doing some tests on the performance of the database.
In CQLSH, with a consistency level of LOCAL_QUORUM, we were doing some inserts with TRACING ON and noticed that the requests were not behaving as we expected.
From the tracing data we found that the coordinator node was, as expected, contacting 2 other local nodes, and was also sending a request to one of the remote datacenter nodes. The problem was that the coordinator was waiting not only for the local nodes (which finished in no time) but for the remote nodes too.
Now since our 2 datacenters are geographically far away from each other, our requests were taking a very long time to complete.
Notes:
- This does not happen with DSE, but our understanding was that we shouldn't need to pay crazy money for LOCAL_QUORUM to work as expected

There is a high probability that you're hitting CASSANDRA-9753, where a non-zero dclocal_read_repair_chance triggers a query against the remote DC. Check the trace for an indication that read repair was triggered for your query. If that is what you see, you can set dclocal_read_repair_chance to 0; this parameter is deprecated anyway.

For functional and performance tests it would be better to use the driver instead of CQLSH, as most of the time that will be the way that you are interacting with the database.
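For this case, you may use a DC-aware policy like:
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

    // Route requests only to nodes in the local datacenter.
    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withLoadBalancingPolicy(
            DCAwareRoundRobinPolicy.builder()
                .withLocalDc("myLocalDC")  // must match the datacenter name the cluster reports
                .build()
        ).build();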
This is modified from the example here: all the clauses that allow interaction with remote datacenters have been removed, since your purpose is to keep the calls local.

Related

Cassandra DB - Node is down and a request is made to fetch data in that Node

If we configure our replication factor in such a way that there are no replica nodes (data is stored in one place/node only) and the node that contains the requested data is down, how will the request be handled by Cassandra?
Will it return no data, or will the other nodes gossip and somehow pick up the data from the failed node's storage and send the required response? If the data is picked up, will the transfer between nodes happen as soon as the node goes down (gossip protocol) or only after a request is made?
I have researched for a long time how gossip works and how Cassandra achieves high availability, but I was wondering about the availability of data in the "no replicas" case, since I do not want to waste additional storage on occasional failures and, at the same time, I need availability and no data loss (even if delayed).

I assume that when you say there are "no replica nodes" you mean that you have set the replication factor to 1. In this case, if the request is a read it will fail; if the request is a write it will be stored as a hint, up to the maximum hint time, and will be replayed. If the node is down for longer than the hint time then that write will be lost. Note that a hint on its own only satisfies a write at consistency level ANY; at ONE or higher the client still receives an error while the only replica is down. See "Hinted Handoff: repair during write path".
In general, having only a single replica of the data in your C* cluster goes against some of the basic design of how C* is meant to be used and is an anti-pattern. Data duplication is a normal and expected part of using C* and is what allows for its high availability. RF=1 introduces a single point of failure into the system, as the server containing the data can go down for any of a variety of reasons (including things like maintenance), which will cause requests to fail.
If you are truly looking for a solution that provides high availability and no data loss, then you need to increase your replication factor (the standard I usually see is RF=3) and set up your cluster's hardware in such a manner as to reduce or remove potential single points of failure.
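To make the hint window mentioned above concrete, here is a minimal sketch. It only models the rule "no new hints once the replica has been down longer than the window"; the class and method names are made up for illustration, the 3 hour figure is the default of max_hint_window_in_ms in cassandra.yaml, and the exact boundary behaviour is an assumption.

```java
import java.time.Duration;

// Toy model of the hint window; not Cassandra source code.
final class HintWindowSketch {
    // Default of max_hint_window_in_ms in cassandra.yaml (3 hours).
    static final Duration DEFAULT_MAX_HINT_WINDOW = Duration.ofHours(3);

    // True if a hint would still be recorded for a write that arrives when the
    // target replica has already been down for `downFor`.
    // Assumption: the comparison is strict; the real boundary may differ.
    static boolean hintRecorded(Duration downFor, Duration maxHintWindow) {
        return downFor.compareTo(maxHintWindow) < 0;
    }

    public static void main(String[] args) {
        // Replica down for 30 minutes: still inside the window.
        System.out.println(hintRecorded(Duration.ofMinutes(30), DEFAULT_MAX_HINT_WINDOW)); // true
        // Replica down for 5 hours: outside the window, no hint is kept.
        System.out.println(hintRecorded(Duration.ofHours(5), DEFAULT_MAX_HINT_WINDOW));    // false
    }
}
```

With RF=1 there is no other replica to fall back on, so any write that misses the window simply never reaches the node.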

Cassandra throws ReadTimeout exception for LOCAL_QUORUM queries when one of three replicas goes down

I have a Cassandra cluster of 3 nodes in one DC. Each keyspace is configured with replication factor 3.
All my queries are executed with LOCAL_QUORUM consistency. If one of my nodes goes down during a read request (for test purposes I just kill it via a shell command), the request fails with ReadTimeoutException, saying that only one replica responded (2 expected), but all subsequent read requests return data.
In my understanding this error shouldn't happen, because two nodes are still running and that should be enough for LOCAL_QUORUM consistency. How can I fix the exception?
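For reference, the arithmetic behind that expectation can be sketched as below. This is plain quorum math with hypothetical names, not Cassandra code, and it does not model what happens to a request that is already in flight when a node dies.

```java
// Toy arithmetic only; not Cassandra source code.
final class QuorumMath {
    // Quorum for a replication factor: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Number of replicas that may be down while a quorum is still reachable.
    static int toleratedFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 3; // replication factor from the question
        System.out.println("quorum = " + quorum(rf));                        // quorum = 2
        System.out.println("tolerated failures = " + toleratedFailures(rf)); // tolerated failures = 1
    }
}
```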

How data will be consistent on cassandra cluster

I have a doubt after reading the DataStax documentation about Cassandra write consistency. My question is how Cassandra maintains a consistent state in the following scenario:
Write consistency level = Quorum
replication factor = 3
As per the docs, when a write occurs the coordinator node sends the write request to all replicas in the cluster. If one replica succeeds and the others fail, the coordinator node sends an error response back to the client, but node-1 has successfully written the data and that will not be rolled back.
In this case,
Will read repair (or hinted handoff, or nodetool repair) replicate the inconsistent data from node-1 to node-2 and node-3?
If not, how does Cassandra avoid replicating inconsistent data to the other replicas?
Could you please clarify?

You are completely right: read repair or the other methods will update node-2 and node-3.
This means that even the failed write will eventually update the other nodes (if at least one replica succeeded). Cassandra doesn't have anything like the rollback that relational databases have.

I don't see anything wrong here: the system does what you tell it, i.e., two override one, and since the error message sent back to the client was "fail", the ultimate status should be "fail" after read repair.

The Cassandra coordinator node keeps the data for the failed replica in its own storage and retries periodically (3 times or so). If a retry succeeds, it sends the latest data; otherwise it truncates the data in its storage.
For a read query, the coordinator node sends requests to the replica nodes and compares their results. If one of the replica nodes does not return the latest data, the coordinator sends a read repair to that node in order to keep the nodes in sync.
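The comparison step described above can be sketched roughly as follows. Cassandra reconciles replica responses by write timestamp (the newest write wins), which is why the write that reached only node-1 ends up on the other replicas. Everything else here is an assumption for illustration: the class, the field names, the timestamps, and the simplification that all three replicas are read.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of timestamp-based reconciliation; not Cassandra source code.
final class ReadRepairSketch {
    static final class Response {
        final String node;
        final String value;
        final long writeTimestamp; // any monotonically increasing number works here

        Response(String node, String value, long writeTimestamp) {
            this.node = node;
            this.value = value;
            this.writeTimestamp = writeTimestamp;
        }
    }

    // The response with the highest write timestamp wins (last write wins).
    static Response newest(List<Response> responses) {
        Response best = responses.get(0);
        for (Response r : responses) {
            if (r.writeTimestamp > best.writeTimestamp) {
                best = r;
            }
        }
        return best;
    }

    // Nodes whose data is older than the winner; these would receive a repair write.
    static List<String> staleNodes(List<Response> responses) {
        Response winner = newest(responses);
        List<String> stale = new ArrayList<>();
        for (Response r : responses) {
            if (r.writeTimestamp < winner.writeTimestamp) {
                stale.add(r.node);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<Response> responses = new ArrayList<>();
        responses.add(new Response("node-1", "new", 200L)); // the write that "failed" at QUORUM
        responses.add(new Response("node-2", "old", 100L));
        responses.add(new Response("node-3", "old", 100L));
        System.out.println(newest(responses).value); // new
        System.out.println(staleNodes(responses));   // [node-2, node-3]
    }
}
```

Note that the winner is chosen by timestamp, not by majority, so two replicas holding the old value do not outvote the one holding the new value.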

Cassandra - reading with consistency level ONE

How is reading with CL ONE implemented by Cassandra?
Does the coordinator query all replicas and wait for the first to answer?
According to the documentation, the coordinator should query the single closest replica. What happens if a timeout occurs during this query: does it try another replica, or does it return an error to the client?

"Does the coordinator query all replicas and wait for the first to answer?"
As you mentioned, it queries the closest node, as determined by the snitch.
"What happens if a timeout occurs during this query"
There is additional documentation on the Dynamic Snitch, which states that:
"By default, all snitches also use a dynamic snitch layer that monitors read latency and, when possible, routes requests away from poorly-performing nodes."
By that definition, if the node chosen by the snitch should fail, the snitch should route the transaction to the [next] closest node.
Note that as of 2.0.2, Cassandra has a feature called Rapid Read Protection, which:
"[A]llows Cassandra to tolerate node failure without dropping a single request"
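As a very rough illustration of the idea in the quotes above, the sketch below orders replicas by observed latency and falls back to the next one when the first does not answer. The class, the method names, and the latency figures are all made up; Cassandra's dynamic snitch and Rapid Read Protection are considerably more involved than this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of latency-aware replica selection; not Cassandra source code.
final class ReplicaSelectionSketch {
    // Order replicas from lowest to highest observed read latency.
    static List<String> byLatency(final Map<String, Double> latencyMs) {
        List<String> ordered = new ArrayList<>(latencyMs.keySet());
        ordered.sort((a, b) -> Double.compare(latencyMs.get(a), latencyMs.get(b)));
        return ordered;
    }

    // Try replicas in order and return the first one that answers.
    // `unresponsive` stands in for nodes whose request would time out.
    static String firstResponsive(List<String> ordered, Set<String> unresponsive) {
        for (String replica : ordered) {
            if (!unresponsive.contains(replica)) {
                return replica;
            }
        }
        return null; // nobody answered; a real client would see a timeout
    }

    public static void main(String[] args) {
        Map<String, Double> latencyMs = new HashMap<>();
        latencyMs.put("replica-a", 5.0);
        latencyMs.put("replica-b", 1.5);
        latencyMs.put("replica-c", 20.0);

        List<String> ordered = byLatency(latencyMs);
        System.out.println(ordered); // [replica-b, replica-a, replica-c]
        // replica-b is the closest but does not answer, so the read goes to replica-a.
        System.out.println(firstResponsive(ordered, Collections.singleton("replica-b"))); // replica-a
    }
}
```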

Cassandra - write with CL.ALL with multiple data centers

I have two Data Centers, each one with replication factor 3.
Will a write with CL.ALL block until the data is stored in both DCs (6 nodes, or 3 + 1)?
I would assume that it blocks until all 3 replicas in the local DC have acknowledged a successful write.
I would like to have something like CL.ALL_LOCAL, which stores data on all replicas in a single DC, so I can read with CL.ONE. The idea is that the write blocks until all replicas in a single DC have persisted the data, so a following read has a high probability of reading fresh data.

There isn't currently a consistency level that provides what you are describing. The closest is LOCAL_QUORUM, which returns after a quorum of nodes in the local datacenter has responded.
You can file a ticket on jira to add this functionality if you would like.
https://issues.apache.org/jira/browse/CASSANDRA
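To put numbers on the question, here is a small sketch of how many acknowledgements each level waits for, assuming NetworkTopologyStrategy with a replication factor of 3 in each of two datacenters. The datacenter names "dc1" and "dc2" and the class itself are hypothetical; only the arithmetic is the point.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy arithmetic for acknowledgement counts; not Cassandra source code.
final class BlockForSketch {
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // ALL: every replica in every datacenter.
    static int all(Map<String, Integer> rfPerDc) {
        int total = 0;
        for (int rf : rfPerDc.values()) {
            total += rf;
        }
        return total;
    }

    // QUORUM: a majority of all replicas across datacenters.
    static int quorumAcrossDcs(Map<String, Integer> rfPerDc) {
        return quorum(all(rfPerDc));
    }

    // LOCAL_QUORUM: a majority of the replicas in the coordinator's datacenter.
    static int localQuorum(Map<String, Integer> rfPerDc, String localDc) {
        return quorum(rfPerDc.get(localDc));
    }

    public static void main(String[] args) {
        Map<String, Integer> rfPerDc = new LinkedHashMap<>();
        rfPerDc.put("dc1", 3); // hypothetical datacenter names
        rfPerDc.put("dc2", 3);
        System.out.println("ALL          = " + all(rfPerDc));                // ALL          = 6
        System.out.println("QUORUM       = " + quorumAcrossDcs(rfPerDc));    // QUORUM       = 4
        System.out.println("LOCAL_QUORUM = " + localQuorum(rfPerDc, "dc1")); // LOCAL_QUORUM = 2
    }
}
```

Under these assumptions LOCAL_QUORUM waits for 2 of the 3 local replicas, so it is the closest available level but still one acknowledgement short of the "all local replicas" behaviour asked for.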

I've checked the Cassandra 1.1 code and noticed interesting behavior when writing with CL.ALL in a multi-DC deployment. I may have interpreted the code wrong... anyway:
At the beginning, the IP addresses of the nodes that will receive the row mutation are collected; this is independent of the consistency level provided by the client. In 1.0 this was all nodes from all DCs; from 1.1 on it is all nodes from the local DC plus one node from each remote DC (the remaining nodes are listed as "forward to" in the message). Each mutation is sent by a separate thread, so the requests can run in parallel. Each such mutation is handled as a message by the messaging service. When a node in a remote DC receives the message, it forwards it to the remaining nodes, which are listed in "forward to".
The consistency level provided by the client defines the number of nodes that must acknowledge the received message. In the case of CL.ALL this number is determined by the replication factor. Now it gets interesting: since we've sent the message to all nodes in the local DC and to nodes in remote DCs, we will also get acknowledgements from those remote nodes. The count is still the number defined by the replication factor, but depending on network latency we cannot be sure which nodes have confirmed the message: it could be a mix of nodes from the local and remote DCs, or only nodes from the local DC. In the worst case, none of the local nodes got the message and the confirmations came from remote DCs (if you have many). This means that writing with CL.ALL does not guarantee that you can immediately read the message from your local DC.
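The target selection described above for 1.1 (all local replicas directly, plus one replica per remote DC that forwards to the rest) can be modeled like this. It is only a sketch of the description in this answer, with made-up class, method, and node names, and it simply picks the first replica listed for each remote DC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of cross-DC write forwarding; not Cassandra source code.
final class WriteForwardingSketch {
    // Replicas the coordinator messages directly: every local replica plus
    // the first replica of each remote datacenter.
    static List<String> directTargets(Map<String, List<String>> replicasByDc, String localDc) {
        List<String> targets = new ArrayList<>(replicasByDc.get(localDc));
        for (Map.Entry<String, List<String>> e : replicasByDc.entrySet()) {
            if (!e.getKey().equals(localDc) && !e.getValue().isEmpty()) {
                targets.add(e.getValue().get(0));
            }
        }
        return targets;
    }

    // For each remote replica that is messaged directly, the "forward to" list
    // it receives: the remaining replicas of its own datacenter.
    static Map<String, List<String>> forwardTo(Map<String, List<String>> replicasByDc, String localDc) {
        Map<String, List<String>> forwards = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : replicasByDc.entrySet()) {
            List<String> replicas = e.getValue();
            if (!e.getKey().equals(localDc) && !replicas.isEmpty()) {
                forwards.put(replicas.get(0), new ArrayList<>(replicas.subList(1, replicas.size())));
            }
        }
        return forwards;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicasByDc = new LinkedHashMap<>();
        replicasByDc.put("local", Arrays.asList("l1", "l2", "l3"));
        replicasByDc.put("remote", Arrays.asList("r1", "r2", "r3"));
        System.out.println(directTargets(replicasByDc, "local")); // [l1, l2, l3, r1]
        System.out.println(forwardTo(replicasByDc, "local"));     // {r1=[r2, r3]}
    }
}
```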

Resources