Can Cassandra reads be inconsistent?

I am new to Cassandra and am trying to understand how it works. Say a write goes to a number of nodes. My understanding is that the hash of the key decides which node owns the data, and then replication happens. When reading the data, the hash of the key determines which node has the data, and that node responds. Now my question: if reads and writes always go to the same set of nodes that hold the data, how does read inconsistency occur, and how can Cassandra return stale data?

For tuning consistency, Cassandra allows you to set the consistency level on a per-query basis.
Now for your question, let's assume the consistency level is ONE and the replication factor is 3.
During a WRITE request, the coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the memtable.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
By default, hints are saved for three hours after a replica fails because if the replica is down longer than that, it is likely permanently dead. You can configure this interval of time using the max_hint_window_in_ms property in the cassandra.yaml file. If the node recovers after the save time has elapsed, run a repair to re-replicate the data written during the down time.
Now when a READ request is performed, the coordinator node sends the request to the replica that can currently respond the fastest (hence it might go to any one of the 3 replicas).
Now imagine a situation where the data has not yet been replicated to the third replica, and during the READ that replica is selected (the chances are small); then you get inconsistent data.
This scenario assumes all nodes are up. If one of the nodes is down and read repair has not happened once the node comes back up, that can compound the issue.
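The stale-read scenario above can be sketched in a toy simulation (illustrative only, not Cassandra's actual implementation): three in-memory "replicas", one of which missed a write, and a CL=ONE read that happens to be routed to the stale one.

```python
# Hypothetical sketch (not Cassandra code): how a CL=ONE read can
# return stale data when one replica missed a write.

# Three replicas for the same partition; each maps key -> (value, write_time).
replicas = [
    {"amit": ("english", 100)},   # replica 0: got the new write at t=100
    {"amit": ("english", 100)},   # replica 1: got the new write
    {"amit": ("french", 90)},     # replica 2: still holds the old value
]

def read_one(replica_index, key):
    """CL=ONE: query a single replica (whichever responds fastest) and
    return its value without comparing against any other replica."""
    value, _ = replicas[replica_index][key]
    return value

# If the coordinator happens to pick the lagging replica, the client
# sees the stale value even though two replicas hold the newer one.
print(read_one(2, "amit"))  # stale read: french
print(read_one(0, "amit"))  # up-to-date read: english
```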

Consider a scenario where the CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request will go to all 3 replicas as usual; if the write fails on 2 replicas and succeeds on 1, Cassandra returns a failure. Since Cassandra does not roll back, the record will continue to exist on the successful replica. Now, when a read comes with CL=QUORUM, the read request is forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra will return the new record, as it has the latest timestamp. But from the client's perspective this record was never written, since Cassandra returned a failure during the write.
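This failed-QUORUM-write scenario can be sketched as a toy simulation (illustrative, not Cassandra's actual code): the write only reaches 1 of 3 replicas, the client is told it failed, yet a later QUORUM read can still surface the value via last-write-wins.

```python
# Hypothetical sketch: a QUORUM write that "failed" (1 of 3 acks) still
# leaves the new value on one replica, visible to a later QUORUM read.

RF = 3
QUORUM = RF // 2 + 1  # 2 of 3 replicas must acknowledge

replicas = [
    {"k": ("new", 200)},  # the one replica where the write succeeded
    {"k": ("old", 100)},  # write failed here
    {"k": ("old", 100)},  # and here
]

acks = 1  # only replica 0 acknowledged
write_result = "success" if acks >= QUORUM else "failure"  # client sees failure

def read_quorum(indices, key):
    """CL=QUORUM: query QUORUM replicas and return the value with the
    latest write timestamp (last-write-wins)."""
    responses = [replicas[i][key] for i in indices]
    value, _ = max(responses, key=lambda vt: vt[1])
    return value

# If the read touches the replica that took the "failed" write, the
# client reads data it was told was never written.
print(write_result)               # failure
print(read_quorum([0, 1], "k"))   # new
print(read_quorum([1, 2], "k"))   # old
```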

Related

How does Cassandra handle inconsistencies between two replicas?

I have a simple question on the strategy Cassandra adopts when the following scenario happens.
Scenario
At T1, replica 1 receives the write mutation, e.g. name = amit, language = english
At T1 + 1, replica 2 receives the update, e.g. language = japanese where name = amit
Assume the write has not been replicated to replica 2 when the update for the record arrives. How does Cassandra handle this scenario?
My guess: maybe replica 2 will check the Lamport timestamp of the update message, say 102, and ask replica 1 for any record with a timestamp less than 102 so that it (replica 2) can execute those first and then execute the update statement.
Any help would be appreciated.
Under the hood (for normal operations, not LWTs) both INSERTs and UPDATEs are UPSERTs - they don't depend on the previous state of the data. When you perform an UPDATE, Cassandra just puts the corresponding value without checking whether the corresponding primary key exists, and that's all. And even if an earlier operation arrives later, Cassandra will check the "write time" to resolve the conflict.
For your case, it will go as following:
replica 1 receives the write and retransmits it to the other replicas in the cluster, including replica 2. If replica 2 isn't available at that moment, the mutation will be written as a hint that is replayed when replica 2 comes back up.
replica 2 may receive new updates and also retransmits them to the other replicas.
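The upsert-plus-write-time behaviour described above can be sketched as follows (a toy model, not Cassandra's storage engine): each cell carries a write timestamp, and a replica applies a mutation only if its timestamp is newer than what it already holds, so even an out-of-order arrival resolves correctly.

```python
# Hypothetical sketch of UPSERT + last-write-wins conflict resolution.

def apply_mutation(store, key, column, value, write_time):
    """UPSERT: no check that the row already exists; keep the newest cell."""
    cells = store.setdefault(key, {})
    old = cells.get(column)
    if old is None or write_time > old[1]:
        cells[column] = (value, write_time)

replica2 = {}
# The UPDATE (write time 102) arrives before the original INSERT (101):
apply_mutation(replica2, "amit", "language", "japanese", 102)
apply_mutation(replica2, "amit", "language", "english", 101)  # late INSERT loses

print(replica2["amit"]["language"][0])  # japanese
```

Even though the INSERT reaches replica 2 last, its older write time means it does not overwrite the UPDATE, so both replicas converge on the same value.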
The coordinator deals with the inconsistencies depending on the consistency level (CL) used. There are also other nuanced behaviours which, again, are tied to the consistency level used for read and write requests.
CASE A - Failed writes
If your application uses a weak consistency of ONE or LOCAL_ONE for writes, the coordinator will (1) return a successful write response to the client/driver even if just one replica acknowledges the write, and (2) will store a hint for the replica(s) which did not respond.
When the replica(s) is back online, the coordinator will (3) replay the hint (resend the write/mutation) to the replica to keep it in sync with other replicas.
If your application uses a strong consistency of LOCAL_QUORUM or QUORUM for writes, the coordinator will (4) return a successful write response to the client/driver when the required number of replicas have acknowledged the write. If any replicas did not respond, the same hint storage in (2) and hint replay in (3) applies.
CASE B - Read with weak CL
If your application issues a read request with a CL of ONE or LOCAL_ONE, the coordinator will only ever query one replica and will return the result from that one replica.
Since the app requested the data from just one replica, the data does NOT get compared to any other replicas. This is the reason we recommend using a strong consistency level like LOCAL_QUORUM.
CASE C - Read with strong CL
For a read request with a CL of LOCAL_QUORUM against a keyspace with a local replication factor of 3, the coordinator will (5) request the data from 2 replicas. If the replicas don't match, the (6) data with the latest timestamp wins, and (7) a read-repair is triggered to repair the inconsistent replica.
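CASE C can be sketched in a toy simulation (illustrative only): a quorum read compares 2 of 3 replicas, the latest timestamp wins, and the stale replica is rewritten as part of the read.

```python
# Hypothetical sketch of a LOCAL_QUORUM read with read repair.

replicas = [
    {"k": ("v2", 200)},  # up to date
    {"k": ("v1", 100)},  # stale
    {"k": ("v2", 200)},  # up to date
]

def read_quorum_with_repair(indices, key):
    """Query a quorum of replicas, return the latest-timestamped value,
    and repair any replica that disagreed."""
    responses = {i: replicas[i][key] for i in indices}
    winner = max(responses.values(), key=lambda vt: vt[1])
    # Read repair: rewrite the winning cell to any replica that disagreed.
    for i, vt in responses.items():
        if vt != winner:
            replicas[i][key] = winner
    return winner[0]

print(read_quorum_with_repair([0, 1], "k"))  # v2 (latest timestamp wins)
print(replicas[1]["k"][0])                   # v2 (replica 1 was repaired)
```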
For more info, see the following documents:
How read requests are accomplished in Cassandra
How writes are accomplished in Cassandra

Cassandra write consistency: what happens after a "transaction" fails?

I didn’t get from the documentation what happens after a node fails during writes.
I got the idea of quorum but what happens after “the write transaction” fails?
For example:
I inserted the record and chose the level of consistency equal to QUORUM.
Assume QUORUM over 3 nodes, and 2 of 3, or just 1 of 3, nodes wrote the data but the rest failed.
I got an error.
What happens with the record on the nodes which wrote it?
How can Cassandra prevent propagating this row to other nodes through replica synchronization?
Or if I get errors on writing it actually means that this row could appear within some time on each replica?
Cassandra doesn't have transactions (except lightweight transactions, which are a different kind of thing). When some nodes have received and written the data and others have not, there is no rollback or anything like it - that data is written. But the coordinator node sees that the consistency level couldn't be reached and reports an error back to the client application, so the write can be retried if necessary. If it's not retried, the data can still be propagated through repair operations - either read repair or an explicit repair. But because the data is on a single node, that node may fail before a repair happens, and the data could be lost.
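The coordinator's decision is just an acknowledgment count against the consistency level. A small helper (illustrative, not from Cassandra's codebase) shows how many replica acknowledgments each common CL requires:

```python
# Hypothetical helper: required acknowledgments per consistency level.

def required_acks(cl, rf):
    """How many replicas must acknowledge a write for it to 'succeed'."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1  # majority of the replication factor
    if cl == "ALL":
        return rf
    raise ValueError(f"unknown consistency level: {cl}")

# With RF=3, QUORUM needs 2 acks; 1 successful replica is not enough,
# so the client gets an error -- but that one replica keeps the data.
print(required_acks("QUORUM", 3))  # 2
```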

Will Cassandra reach eventual consistency without manual repair if there is no read for that data during gc_grace_seconds?

Assume the following
Replication factor is 3
A delete was issued with consistency 2
One of the replica was busy (not down) so it drops the request
The other two replicas add the tombstone and send the response. So currently the record is marked for deletion in only two replicas.
There was no read repair, as there was no read for that data within gc_grace_seconds.
Q1.
Will this data be resurrected when a read happens for that record after gc_grace_seconds, if there was no manual repair?
(I am not talking about a replica being down for more than gc_grace_seconds.)
One of the replica was busy (not down) so it drops the request
In this case, the coordinator node realizes that the replica could not be written and stores it as a hint. Once the overwhelmed node starts taking requests again, the hint is "replayed" to get the replica consistent.
However, hints are only kept (by default) for 3 hours. After that time, they are dropped. So, if the busy node does not recover within that 3 hour window, then it will not be made consistent. And yes, in that case a query at consistency-ONE could allow that data to "ghost" its way back.
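The "ghost" resurrection can be sketched in a toy simulation (illustrative only; timestamps and compaction are greatly simplified): one replica missed the delete, the others purge their tombstones after gc_grace_seconds, and a later read resolves in favour of the only cell left.

```python
# Hypothetical sketch of tombstone resurrection after gc_grace_seconds.

GC_GRACE = 10 * 24 * 3600  # Cassandra's default gc_grace_seconds (10 days)

TOMBSTONE = object()  # marker for a deleted cell

replicas = [
    {"k": (TOMBSTONE, 100)},  # delete applied at t=100
    {"k": (TOMBSTONE, 100)},  # delete applied
    {"k": ("old", 50)},       # busy replica: dropped the delete request
]

def compact(replica, now):
    """Compaction purges tombstones older than gc_grace_seconds."""
    for key, (value, t) in list(replica.items()):
        if value is TOMBSTONE and now - t > GC_GRACE:
            del replica[key]

def read(indices, key):
    """Latest-timestamped cell wins; a tombstone means 'deleted',
    a missing key means 'no data'."""
    responses = [replicas[i][key] for i in indices if key in replicas[i]]
    if not responses:
        return None
    value, _ = max(responses, key=lambda vt: vt[1])
    return None if value is TOMBSTONE else value

# After gc_grace_seconds, compaction removes the tombstones...
now = 100 + GC_GRACE + 1
for r in replicas[:2]:
    compact(r, now)

# ...and the un-deleted copy on the busy replica is the only cell left,
# so the deleted row "ghosts" back even on a read touching all replicas.
print(read([0, 1, 2], "k"))  # old
```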

Cassandra: what node will data be written if the needed node is down?

Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, Replication Factor (RF) determines how many copies of data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". They could receive the data several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one/some of the nodes are unavailable for a configured amount of time (< 3 hours by default), Cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair" or if you're using DSE, ops center can repair/reconcile data for a table, keyspace, or entire cluster (nodesync is also a tool that is new to DSE and similar to repair)
During a read repair - Read operations, depending on the configurable client consistency level (described next) can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not.
The configurable client consistency level (CL) determines how many nodes must acknowledge that they have successfully received the data for the client to move on (for writes) - or how many nodes to compare when data is read, to ensure accuracy (for reads). The number of available nodes must be equal to or greater than the CL specified, or the request will error (for example, Cassandra can't compare a QUORUM of nodes if a QUORUM of nodes is not available). This setting does not dictate how many nodes will receive the data - again, that's the RF keyspace setting, and it always holds true. What we're specifying here is how many nodes must acknowledge each write, or be compared for each read, for the client to be satisfied at that moment. Hopefully that makes sense.
Now...
In your scenario with a RF=1, the application will receive an error upon the write as the single node that should receive the data (based off of a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to be the unavailable node). Does that make sense?
If you had an RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm chooses where the cop(ies) go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose an RF=3 (3 copies), then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.
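The placement logic described above can be sketched as a toy model (illustrative only; Cassandra's real partitioner and replication strategies are more involved): the partition key hashes to a "base" node on the ring, and RF-1 additional copies go to the next nodes walking the ring.

```python
# Hypothetical sketch of token-ring replica placement (not Cassandra's
# actual Murmur3 partitioner or NetworkTopologyStrategy).

import hashlib

NODES = ["node0", "node1", "node2"]

def endpoints(partition_key, rf):
    """Return the rf nodes that own a key: the hash picks the base node,
    then copies go to the following nodes around the ring."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    base = h % len(NODES)
    return [NODES[(base + i) % len(NODES)] for i in range(rf)]

# With RF=1 there is exactly one owner; if that node is down, the write
# fails rather than landing somewhere else.
print(endpoints("amit", 1))
# With RF=3, every node holds a copy.
print(endpoints("amit", 3))
```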

Cassandra's atomicity and "rollback"

The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we wait for 2 ACKs. Let's assume we receive only 1 ACK (before the timeout). It's clear that if we read with consistency ONE, we may read the value.
But which of the following statements is also true:
1. It may occur that the write was persisted on a second node, but that node's ACK got lost? (Note: this could result in a read of the value even at read consistency QUORUM!)
2. It may occur that the write will be persisted on a second node later (e.g. due to hinted handoff)? (Note: this could result in a read of the value even at read consistency QUORUM!)
3. It's impossible that the write is persisted on a second node, and the written value will eventually be removed from that node via read repair?
4. It's impossible that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct: Cassandra will store a hint so that the failed replica eventually receives this row.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
