How does Cassandra handle inconsistencies between two replicas? - cassandra

I have a simple question about the strategy Cassandra uses when the following scenario happens:
Scenario
At T1, replica 1 receives the write mutation like name = amit, language = english
At T1 + 1, replica 2 receives the update like language = japanese where name = amit
Assume that the original write has not yet been replicated to replica 2 when the update for that record arrives. How does Cassandra handle this scenario?
My guess - maybe replica 2 will check the Lamport timestamp of the update message (say it is 102) and ask replica 1 for any records with a timestamp less than 102, so that it (replica 2) can execute those first and then execute the update statement.
Any help would be appreciated.

Under the hood (for normal operations, not LWTs), both INSERTs and UPDATEs are upserts - they don't depend on the previous state of the data. When you perform an UPDATE, Cassandra just writes the corresponding value without checking whether the primary key already exists, and that's all. And even if an earlier operation arrives later, Cassandra uses the write timestamp to resolve the conflict (last write wins).
For your case, it will go as follows:
replica 1 receives the write and retransmits it to the other replicas in the cluster, including replica 2. If replica 2 isn't available at that moment, the mutation will be written as a hint that will be replayed when replica 2 is back up.
replica 2 may receive new updates and will also retransmit them to the other replicas.
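To make the upsert and last-write-wins behaviour concrete, here is a minimal sketch using the Python driver; the keyspace demo, the users table, and the explicit timestamps 100/102 are made up for illustration and a running cluster is assumed:

```python
from cassandra.cluster import Cluster

# Assumes a running cluster and a keyspace 'demo' containing:
#   CREATE TABLE users (name text PRIMARY KEY, language text);
session = Cluster(['127.0.0.1']).connect('demo')

# Both statements are upserts: the UPDATE creates the row if it does not
# exist yet, and neither statement reads the previous state first.
session.execute(
    "INSERT INTO users (name, language) VALUES ('amit', 'english') "
    "USING TIMESTAMP 100")
session.execute(
    "UPDATE users USING TIMESTAMP 102 SET language = 'japanese' "
    "WHERE name = 'amit'")

# Even if a replica applied the UPDATE (timestamp 102) before the INSERT
# (timestamp 100) arrived, the cell with the highest write timestamp wins.
row = session.execute(
    "SELECT language, WRITETIME(language) FROM users WHERE name = 'amit'").one()
print(row)  # -> 'japanese', with write time 102
```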

The coordinator deals with the inconsistencies depending on the consistency level (CL) used. There are also other nuanced behaviours which, again, are tied to the consistency level of the read and write requests.
CASE A - Failed writes
If your application uses a weak consistency of ONE or LOCAL_ONE for writes, the coordinator will (1) return a successful write response to the client/driver even if just one replica acknowledges the write, and (2) will store a hint for the replica(s) which did not respond.
When the replica(s) is back online, the coordinator will (3) replay the hint (resend the write/mutation) to the replica to keep it in sync with other replicas.
If your application uses a strong consistency of LOCAL_QUORUM or QUORUM for writes, the coordinator will (4) return a successful write response to the client/driver when the required number of replicas have acknowledged the write. If any replicas did not respond, the same hint storage in (2) and hint replay in (3) applies.
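As an illustration of the write path just described, here is a hedged sketch with the Python driver; the keyspace/table names are made up and a running cluster with RF=3 is assumed:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')

# CL ONE: the coordinator reports success after a single replica acks;
# any replica that missed the write is caught up later via a stored hint.
session.execute(SimpleStatement(
    "INSERT INTO users (name, language) VALUES ('amit', 'english')",
    consistency_level=ConsistencyLevel.ONE))

# CL LOCAL_QUORUM: with RF=3 in the local DC, two replicas must ack before
# the coordinator reports success; a non-responding third replica still
# gets a hint that is replayed when it comes back.
session.execute(SimpleStatement(
    "UPDATE users SET language = 'japanese' WHERE name = 'amit'",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM))
```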
CASE B - Read with weak CL
If your application issues a read request with a CL of ONE or LOCAL_ONE, the coordinator will only ever query one replica and will return the result from that one replica.
Since the app requested the data from just one replica, the data does NOT get compared to any other replicas. This is the reason we recommend using a strong consistency level like LOCAL_QUORUM.
CASE C - Read with strong CL
For a read request with a CL of LOCAL_QUORUM against a keyspace with a local replication factor of 3, the coordinator will (5) request the data from 2 replicas. If the replicas don't match, the (6) data with the latest timestamp wins, and (7) a read-repair is triggered to repair the inconsistent replica.
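And a matching sketch for the read path (same assumptions as above; the queries are illustrative):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')

# CASE B: CL LOCAL_ONE - a single replica is queried and its answer is
# returned as-is, with no comparison against the other replicas.
weak_read = SimpleStatement(
    "SELECT language FROM users WHERE name = 'amit'",
    consistency_level=ConsistencyLevel.LOCAL_ONE)

# CASE C: CL LOCAL_QUORUM with RF=3 - two replicas are consulted; if they
# disagree, the newest timestamp wins and a read repair fixes the stale one.
strong_read = SimpleStatement(
    "SELECT language FROM users WHERE name = 'amit'",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)

print(session.execute(weak_read).one())
print(session.execute(strong_read).one())
```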
For more info, see the following documents:
How read requests are accomplished in Cassandra
How writes are accomplished in Cassandra

Related

How does Cassandra know that it has completed QUORUM?

I have always used Cassandra in Spark applications, but I never wondered how it works internally. Reading the Cassandra documentation, I got a small doubt (which may be a beginner's doubt).
I read in a book (Cassandra The Definitive Guide) and in the official Cassandra documentation that the formula would be:
(RF / 2) + 1.
So theoretically, if I have a cluster with 6 nodes and a replication factor of 3, I would only need a response from 2 nodes.
And here come the small doubts:
1 - What would this response be? (The query return with the data?)
2 - If there was no data with the filters used in the query, is the empty return considered a response?
3 - And last but not least, if the empty return is considered a response, if these two nodes that complete the QUORUM don't have the replica data yet, my application that did the SELECT will understand that this data doesn't exist in the database, right?
Your reasoning sounds correct to me.
Basically, if you're reading at LOCAL_QUORUM and have an RF of 3, it's possible that the coordinator accepts a response from two replicas that are both inconsistent and leaves out the third replica that had consistent data.
It's one of the reasons Cassandra is considered an eventually consistent db, and also why regular repairs of the data are so important for production databases. Of course, if consistency mattered above all else, you could always read with a CL of ALL, but you'd sacrifice some amount of response time as a tradeoff. Assuming the db is provisioned well, though, while it's certainly in the realm of possibility, it isn't likely that only a single replica receives an incoming write unless you make a habit of only writing at a CL of ONE/LOCAL_ONE. If consistency mattered, you'd be writing to the db with a CL of at least LOCAL_QUORUM to avoid this very scenario.
To try and answer your questions directly, yes, having no data to return can be a valid response, and yes if the two replicas chosen by the coordinator both agree there is no data to return, the app will report that result.
1 - What would this response be? (The query return with the data?)
The coordinator node will wait for 2 replicas of the 3 (because CL=QUORUM) to respond to the query (with the request results). It will then send the response to the client.
2 - If there was no data with the filters used in the query, is the empty return considered a response?
Yes, the empty response will be sufficient and will be considered a valid response. Note that there is a last-write-wins mechanism (based on the row write time) used in case of conflict.
3 - And last but not least, if the empty return is considered a response, if these two nodes that complete the QUORUM don't have the replica data yet, my application that did the SELECT will understand that this data doesn't exist in the database, right?
You have to understand that Apache Cassandra uses eventual consistency, meaning that the client decides on the desired CL. If you have strong consistency, meaning the write CL and read CL overlap (write CL + read CL > RF), then you will always retrieve the latest data. I recommend watching this video: https://www.youtube.com/watch?v=Gx-pmH-b5mI
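A tiny sketch of the arithmetic discussed above (plain Python, no cluster needed; the helper names are made up):

```python
def quorum(rf: int) -> int:
    # (RF / 2) + 1 with integer division: for RF = 3 this is 2.
    return rf // 2 + 1

def overlaps(write_replicas: int, read_replicas: int, rf: int) -> bool:
    # Strong consistency: the write and read replica sets must overlap,
    # i.e. write CL + read CL > RF.
    return write_replicas + read_replicas > rf

rf = 3
print(quorum(rf))                            # 2 -> only 2 of the 3 replicas must respond
print(overlaps(quorum(rf), quorum(rf), rf))  # True: QUORUM writes + QUORUM reads (2 + 2 > 3)
print(overlaps(1, 1, rf))                    # False: ONE/ONE can miss the latest write
```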

Will Cassandra reach eventual consistency without manual repair if there is no read for that data during gc_grace_seconds?

Assume the following
Replication factor is 3
A delete was issued with consistency 2
One of the replica was busy (not down) so it drops the request
The other two replicas add the tombstone and send the response. So currently the record is marked for deletion in only two replicas.
No read repair happened, as there was no read for that data within gc_grace_seconds.
Q1.
Will this data be resurrected when a read happens for that record after gc_grace_seconds, if there was no manual repair?
(I am not talking about a replica being down for more than gc_grace_seconds.)
One of the replica was busy (not down) so it drops the request
In this case, the coordinator node realizes that the replica could not be written to and stores the mutation as a hint. Once the overwhelmed node starts taking requests again, the hint is "replayed" to make the replica consistent.
However, hints are only kept (by default) for 3 hours. After that time, they are dropped. So if the busy node does not recover within that 3-hour window, it will not be made consistent. And yes, in that case a query at consistency ONE could allow that data to "ghost" its way back.
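A rough sketch of that timeline as code; the 3-hour and 10-day values are Cassandra's usual defaults for max_hint_window_in_ms and gc_grace_seconds, and the helper itself is hypothetical:

```python
MAX_HINT_WINDOW_S = 3 * 60 * 60   # default max_hint_window_in_ms (3 hours)
GC_GRACE_S = 10 * 24 * 60 * 60    # default gc_grace_seconds (10 days)

def resurrection_possible(replica_lagging_for_s: int, repaired: bool) -> bool:
    """Can the deleted row 'ghost' back on a later CL=ONE read?

    The tombstone never reached the busy replica. If that replica does not
    catch up within the hint window, the hint is dropped; if no repair runs
    before gc_grace_seconds expires either, the two good replicas may
    compact the tombstone away while the lagging replica still holds the
    live row, which a low-consistency read can then resurrect.
    """
    hint_replayed = replica_lagging_for_s <= MAX_HINT_WINDOW_S
    tombstone_collectable = replica_lagging_for_s > GC_GRACE_S
    return not hint_replayed and not repaired and tombstone_collectable

print(resurrection_possible(11 * 24 * 60 * 60, repaired=False))  # True
print(resurrection_possible(2 * 60 * 60, repaired=False))        # False: hint still replayable
```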

Cassandra Read, Read Repair

Scenario: Single data centre with replication factor 7 and read consistency level quorum.
During a read request, the fastest replica gets a data request. But how many of the remaining replicas send a digest?
Q1: Do all the remaining replicas (excluding the fastest one) send a digest to the coordinator, with the fastest 3 considered to satisfy the consistency level? Or are only 3 replicas ((7 / 2 + 1) - 1 fastest = 3) chosen to send a digest?
Q2: In both cases, how will read repair work? How many nodes, and which ones, will get in sync after read repair runs?
This is taken from this excellent blog post which you should absolutely read: https://academy.datastax.com/support-blog/read-repair
There are broadly two types of read repair: foreground and background. Foreground here means blocking -- we complete all operations before returning to the client. Background means non-blocking -- we begin the background repair operation and then return to the client before it has completed.
In your case, you'll be doing a foreground read-repair as it is performed on queries which use a consistency level greater than ONE/LOCAL_ONE. The coordinator asks one replica for data and the others for digests of their data (currently MD5). If there's a mismatch in the data returned to the coordinator from the replicas, Cassandra resolves the situation by doing a data read from all replicas and then merging the results.
This is one of the reasons why it's important to make sure you continually have anti-entropy repair running and completing. This way, the chances of digest mismatches on reads are lower.
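A small sketch of the numbers in this scenario, plus the digest idea in miniature; the MD5-over-repr comparison is only conceptual, not Cassandra's actual digest format:

```python
import hashlib

rf = 7
quorum = rf // 2 + 1        # 4 replicas must respond for QUORUM
data_reads = 1              # the fastest replica is asked for the full data
digest_reads = quorum - 1   # 3 more replicas are asked only for a digest
print(quorum, data_reads, digest_reads)  # 4 1 3

def digest(rows) -> str:
    # Conceptual stand-in: Cassandra computes an MD5 digest over the
    # serialized response, not over a Python repr() like this.
    return hashlib.md5(repr(sorted(rows)).encode()).hexdigest()

up_to_date = [('amit', 'japanese')]
stale      = [('amit', 'english')]
if digest(up_to_date) != digest(stale):
    # Mismatch: the coordinator falls back to full data reads from the
    # contacted replicas, merges them by newest timestamp, and writes the
    # merged row back to the out-of-date replica(s) (foreground read repair).
    print("digest mismatch -> data reads + merge + repair")
```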

Does Cassandra read have inconsistency?

I am new to Cassandra and am trying to understand how it works. Say I write to a number of nodes. My understanding is that, depending on the hash value of the key, it is decided which node owns the data, and then replication happens. While reading the data, the hash of the key determines which node has the data, and that node responds. Now my question is: if reading and writing happen against the same set of nodes, which always have the data, how does read inconsistency occur and how can Cassandra return stale data?
For tuning consistency, Cassandra allows you to set the consistency level on a per-query basis.
Now for your question, let's assume the consistency level is set to ONE and the replication factor is 3.
During WRITE request coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the memtable.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
By default, hints are saved for three hours after a replica fails because if the replica is down longer than that, it is likely permanently dead. You can configure this interval of time using the max_hint_window_in_ms property in the cassandra.yaml file. If the node recovers after the save time has elapsed, run a repair to re-replicate the data written during the down time.
Now, when a READ request is performed, the coordinator node sends the request to the replica that can currently respond the fastest (hence it might go to any 1 of the 3 replicas).
Now imagine a situation where the data has not yet been replicated to the third replica and that replica is selected during the READ (the chances are small); then you get inconsistent data.
This scenario assumes all nodes are up. If one of the nodes is down and repair is not done once the node is back up, it can add to the problem.
READ With Different CONSISTENCY LEVEL
READ Request in Cassandra
Consider a scenario where the CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request will go to all 3 replicas as usual; if the write fails on 2 replicas and succeeds on 1, Cassandra returns a failure. Since Cassandra does not roll back, the record continues to exist on the successful replica. Now, when a read comes with CL=QUORUM, the read request is forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra will return the new record since it has the latest timestamp. But from the client's perspective this record was never written, as Cassandra returned a failure during the write.
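A hedged sketch of that edge case with the Python driver; the exception type and the scenario are illustrative, and whether you actually get a WriteTimeout depends on how the two replicas failed:

```python
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')

try:
    session.execute(SimpleStatement(
        "UPDATE users SET language = 'japanese' WHERE name = 'amit'",
        consistency_level=ConsistencyLevel.QUORUM))
except WriteTimeout:
    # Only 1 of the 3 replicas acknowledged in time, so the QUORUM write is
    # reported as failed -- but that replica keeps the data, because
    # Cassandra never rolls a mutation back.
    pass

# A later QUORUM read that happens to include the replica holding the
# "failed" write can still return it, since its cell has the newest timestamp.
row = session.execute(SimpleStatement(
    "SELECT language FROM users WHERE name = 'amit'",
    consistency_level=ConsistencyLevel.QUORUM)).one()
```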

Does Cassandra write to a node (which is up) even if consistency cannot be met?

The below statement from Cassandra documentation is the reason for my doubt.
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
Ref : http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
So does Cassandra write to a node (which is up) even if consistency cannot be met?
I got it. Cassandra will not even attempt the write if it knows that the consistency level cannot be met. If the consistency level CAN be met but there are not enough replicas up to satisfy the replication factor, Cassandra writes to the currently available replicas and returns a success message. Later, when the missing replica is up again, the write is replayed to it.
For example, if the replication factor is 3 and 1 of the 3 nodes is down, a write with a consistency of TWO will succeed. But if the replication factor is 2 and 1 of the 2 nodes is down, then a write with a consistency of TWO will not even be attempted on the single node that is available.
What is mentioned in the documentation is the case where the write was initiated while the consistency level could be met, but in between, one node went down and couldn't complete the write while the write succeeded on another node. Since the consistency level can no longer be met, the client gets a failure message. The record that was written to the single node is not removed, though; it remains there and will eventually be propagated to the other replicas by repair.
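A sketch of how the two situations differ at the driver level (Python driver; the exception classes exist in the driver, but the comments describing when each fires are my reading of the answer above):

```python
from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('demo')

write = SimpleStatement(
    "INSERT INTO users (name, language) VALUES ('amit', 'english')",
    consistency_level=ConsistencyLevel.TWO)
try:
    session.execute(write)
except Unavailable:
    # The coordinator already knows fewer than 2 replicas are alive, so it
    # rejects the request up front and does not write to any node.
    pass
except WriteTimeout:
    # Enough replicas were alive when the write started, but too few
    # acknowledged in time; any replica that did apply it keeps the data.
    pass
```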
Consistency in Cassandra can be (and usually is) defined at the statement level. That means you specify, for a particular query, what level of consistency you need.
This implies that if the consistency level is not met, the statement has simply failed its consistency requirement for that query.
There is no rollback in Cassandra. What you have in Cassandra is eventual consistency. That means your statement might succeed in the future if not immediately. When a replica node comes back online, the cluster (i.e., Cassandra's fault tolerance mechanisms) will take care of writing to that replica.
So, if your statement failed, it might still succeed in the future. This is contrary to the RDBMS world, where an uncommitted transaction is rolled back as if nothing had happened.
Update:
I stand corrected. Thanks Arun.
From:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_about_hh_c.html
During a write operation, when hinted handoff is enabled and consistency can be met, the coordinator stores a hint about dead replicas in the local system.hints table under either of these conditions:
So it's still not a rollback. Nodes know the current cluster state and don't initiate the write if the consistency level cannot be met.
At the driver level, you get an exception.
On the nodes where the write succeeded, the data is actually written, and it is not going to be rolled back.
From the client's point of view, however, you should treat it as if the data had not been written.
From the documentation:
If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
