I need to understand read repair in Cassandra 3.0. For example, I have three nodes A, B & C, and my replication factor is 3. Now, I wrote with QUORUM, and the write succeeded on nodes A & B, so the client receives success, but somehow the data was not written on node C (it was down, and the hint window elapsed).
I have not run a manual repair, and my read_repair_chance is 0.1.
After a few days, node A goes down, leaving me with nodes B & C. If I now issue a read query with QUORUM, will read repair always write the data to node C and return to the client successfully, or is there a scenario where the client can receive an "unable to achieve consistency level" error?
If 2 out of 3 replicas are up, then QUORUM consistency can be achieved, so the client will be able to read the data. Since one of the nodes doesn't have the data, read repair will happen.
As per my understanding (I'm new to Cassandra), whenever a query is executed, the coordinator node checks whether the desired number of replicas (the requested consistency level) can respond to the query. If they can, the client receives the most recent version of the data (the timestamps of the data returned by each node are compared), and that recent version is then written to the remaining replicas via read repair in case of a mismatch.
I didn't get from the documentation what happens when a node fails during a write.
I get the idea of quorum, but what happens after "the write transaction" fails?
For example:
I inserted a record with a consistency level of QUORUM.
Assume QUORUM = 3 nodes, and 2 of 3 (or just 1 of 3) nodes wrote the data but the rest failed.
I got an error.
What happens to the record on the nodes that did write it?
How can Cassandra prevent this row from propagating to other nodes through replica synchronization?
Or does getting a write error actually mean that this row could still appear on each replica after some time?
Cassandra doesn't have transactions (lightweight transactions exist, but they are a different kind of thing). When some nodes have received and written the data and others have not, there is no rollback or anything like it: that data is written. But the coordinator node sees that the consistency level couldn't be reached and reports an error back to the client application, so the write can be retried if necessary. If it's not retried, the data can still be propagated through the repair operations, either read repair or an explicit repair. But because the data is on a single node, that node may fail before a repair happens, and the data could be lost.
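Here's a minimal sketch of what handling that looks like on the client side, using the DataStax Python driver (the keyspace ks and table users are made up for illustration):

```python
from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')  # hypothetical keyspace

insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

try:
    session.execute(insert, (5, 'Alex'))
except Unavailable:
    # Not enough replicas were alive to even attempt the write,
    # so nothing was written and a retry (or fallback) is safe.
    pass
except WriteTimeout:
    # Some replicas may have persisted the write before the timeout;
    # there is no rollback. Retrying the same (idempotent) write is
    # the usual way to push it toward full replication.
    session.execute(insert, (5, 'Alex'))
```

If the retry is skipped, the copy that did land stays on that one replica until a read repair or nodetool repair picks it up, which is exactly the window in which a single node failure can lose it.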
I am trying to build a Cassandra backup and recovery process.
Let's say I have 2 nodes, A and B, and table C with a replication factor of 2.
In table C we have a row with ID=5 and Name="Alex".
Now, something bad happens to node B and we need to take it down for a few minutes to perform a restore.
In the meantime, while node B is down, someone changes the row with ID=5 from Name="Alex" to Name="Alehandro".
Node B comes up again with the restored data, so on this node the row with ID=5 still contains Name="Alex".
What will happen when I try to read the row with ID=5?
Will node A synchronize with node B?
Thanks.
Cassandra has several ways to synchronize data to nodes that missed writes because they were down, there was a garbage collection pause, etc. These include:
Hints - the coordinator node will, for some time (3 hours by default, configurable via max_hint_window_in_ms), collect the write operations that another node has missed, and when that node is back, these operations will be replayed against it
Repair - explicit synchronization of data, triggered manually via nodetool repair; tools like Reaper can be used to automate it
Read repair - if you're using a consistency level that requires reading from several nodes (TWO, LOCAL_QUORUM, QUORUM, etc.), the coordinator node will detect discrepancies and return the data with the newest timestamp, fixing the data on the node that holds the old data if necessary
Answering your last question: when the 2nd node is back, you can get old data if the hints aren't replayed yet and you're reading directly from that node with consistency level ONE or LOCAL_ONE.
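As a rough sketch of that difference, using the DataStax Python driver (keyspace and table names are taken from your question, lowercased as CQL would store them, and are otherwise assumptions):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')  # hypothetical keyspace

query = "SELECT id, name FROM c WHERE id = 5"

# CL=ONE: the coordinator asks a single replica. If that replica is
# the freshly restored node B and its hints haven't been replayed yet,
# this can return the stale row (Name='Alex').
maybe_stale = session.execute(
    SimpleStatement(query, consistency_level=ConsistencyLevel.ONE))

# CL=QUORUM (2 of 2 with RF=2): the coordinator must hear from both
# replicas, returns the row with the newest timestamp ('Alehandro'),
# and read repair fixes the stale replica along the way.
fresh = session.execute(
    SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM))
```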
P.S. I recommend looking through the DSE Architecture Guide - it covers how Cassandra works.
Assume I have a 2-node cluster with a replication factor of 2.
1. A write is initiated with consistency level TWO.
2. One node completes the write; the other node fails just before completing it.
3. The client receives a failure response, as the consistency level cannot be met.
4. Another client reads the same row with consistency level TWO.
5. One node has the latest data (the "failed" record) and the other has old data, so read repair is initiated.
Q1. During read repair, will Cassandra discard the failed write even though it has the latest timestamp, or will it propagate the value written during the failed write at step 2 above, since it is the value with the latest timestamp?
Q2. What is the outcome during read repair if we replace steps 1 to 3 with a successful write at consistency level ONE? How does read repair differentiate this from the partially failed write?
The text below is from the DataStax documentation:
https://docs.datastax.com/en/ddac/doc/datastax_enterprise/dbInternals/dbIntTransactionsDiffer.html
For example, if using a write consistency level of QUORUM with a replication factor of three, the database replicates the write to all nodes in the cluster and waits for acknowledgement from two nodes. If the write fails on one node but succeeds on another node, Cassandra reports a failure to replicate the write on that node, but the replicated write that succeeds on the other node is not automatically rolled back.
Cassandra uses client-side timestamps to determine the most recent update to a column. The latest timestamp always wins when requesting data, so if multiple client sessions update the same columns in a row concurrently, the most recent update is the one seen by readers.
"Q1. During read repair, will Cassandra discard the failed write even though it has the latest timestamp or will it propagate the value written during failed write at step 2 above as it is the value with latest timestamp?"
A: Although the write failed because the required number of successful replica writes was not met, the latest value stored on the node that had a successful write will still be persisted and not rolled back. When the other nodes are online and a read repair is initiated, they will be updated with the new data.
"Q2. what is outcome during read repair, if we replace the steps 1 to 3 with a write of consistency 1 and successful write? How read repair differentiates this from the partially failed write?"
A: If we have a write consistency of ONE and one node is successfully updated with the data, the write is considered successful. When the other node comes back online and a read repair is initiated, the value is added/updated on that node. Reads will succeed as long as 2 nodes are alive; if the data does not match, the latest data is returned and the lagging node is updated with it. Read repair does not distinguish the two cases at all: it only compares timestamps.
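If it helps, here's a sketch of the Q2 case with the DataStax Python driver (the keyspace ks and table t are invented for the example):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')  # hypothetical keyspace

# CL=ONE: one replica acknowledging is enough, so this write is
# reported as successful even if the other replica is down.
write = SimpleStatement(
    "INSERT INTO t (id, val) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(write, (1, 'x'))

# A later QUORUM read forces both replicas to answer; the coordinator
# sees the mismatch and read repair copies the newer value to the
# replica that missed it. At the storage layer this is identical to
# repairing a partially failed QUORUM write: only the cell timestamps
# matter, not how the write was reported to the client.
read = SimpleStatement(
    "SELECT val FROM t WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(read, (1,)).one()
```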
You can take a look at this blog.
I am new to Cassandra and am trying to understand how it works. Say I write to a number of nodes. My understanding is that the hash value of the key decides which node owns the data, and then replication happens. While reading the data, the hash of the key determines which node has the data, and that node responds. Now my question is: if reading and writing happen against the same set of nodes, which always have the data, then how does read inconsistency occur and Cassandra return stale data?
For tunable consistency, Cassandra allows you to set the consistency level on a per-query basis.
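For example, with the DataStax Python driver a consistency level can be attached per statement (the keyspace ks and table users are just placeholders):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')  # hypothetical keyspace

q = "SELECT * FROM users WHERE id = %s"

# Same query, different consistency per execution:
fast = SimpleStatement(q, consistency_level=ConsistencyLevel.ONE)
safe = SimpleStatement(q, consistency_level=ConsistencyLevel.QUORUM)

session.execute(fast, (5,))  # one replica answers: lowest latency
session.execute(safe, (5,))  # 2 of 3 replicas must answer

# Prepared statements can carry a consistency level as well:
prepared = session.prepare("SELECT * FROM users WHERE id = ?")
prepared.consistency_level = ConsistencyLevel.ONE
session.execute(prepared, (5,))
```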
Now for your question, Let's assume CONSISTENCY is set to ONE and Replication factor is 3.
During a WRITE request, the coordinator sends the write to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level only determines how many replica nodes must respond with a success acknowledgment for the write to be considered successful. Success means that the data was written to the commit log and the memtable.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
By default, hints are saved for three hours after a replica fails because if the replica is down longer than that, it is likely permanently dead. You can configure this interval of time using the max_hint_window_in_ms property in the cassandra.yaml file. If the node recovers after the save time has elapsed, run a repair to re-replicate the data written during the down time.
Now, when a READ request is performed, the coordinator node sends the request to the replica that can currently respond the fastest (hence it might go to any 1 of the 3 replicas).
Now imagine a situation where the data has not yet been replicated to the third replica, and during the READ that replica happens to be selected (the chances are small): then you get inconsistent data.
This scenario assumes all nodes are up. If one of the nodes goes down and read repair has not happened by the time it comes back up, the problem compounds.
READ With Different CONSISTENCY LEVEL
READ Request in Cassandra
Consider a scenario where the CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request will go to all 3 replicas as usual; if the write fails on 2 replicas and succeeds on 1, Cassandra will return a failure. Since Cassandra does not roll back, the record will continue to exist on the successful replica. Now, when a read comes in with CL=QUORUM, the read request will be forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra will return the new record, as it has the latest timestamp. But from the client's perspective this record was never written, since Cassandra returned a failure during the write.
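A sketch of that timeline from the client's side, using the DataStax Python driver (the keyspace ks and table t are invented names):

```python
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')  # hypothetical keyspace

write = SimpleStatement(
    "UPDATE t SET val = 'new' WHERE id = 1",
    consistency_level=ConsistencyLevel.QUORUM)
try:
    session.execute(write)
except WriteTimeout:
    # Reported as failed: only 1 of the 2 required replicas acked.
    # The replica that did ack keeps the value; nothing is rolled back.
    pass

# A later QUORUM read that includes the replica holding the "failed"
# write returns 'new', because it carries the newest timestamp.
read = SimpleStatement(
    "SELECT val FROM t WHERE id = 1",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(read).one())
```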
Let's say I have a 3 node cluster.
I am writing to node #1.
If node #2 in that cluster goes down, and then comes back up and is resyncing the data from the other nodes, and I continue writing to node #1, will the data be synchronously replicated to node #2? That is, is the replication factor of that write honored synchronously or is it behind the queue post resync?
Thanks
Steve
Yes, granted that you are reading and writing at a consistency level that can handle 1 node becoming unavailable.
Consider the following scenario:
You have a 3 node cluster with a keyspace 'ks' with a replication factor of 3.
You are writing at a Consistency Level of 'QUORUM'
You are reading at a Consistency level of 'QUORUM'.
Node 2 goes down for 10 minutes.
Reads and writes can successfully continue while the node is down, since QUORUM only requires 2 (⌊3/2⌋ + 1 = 2) nodes to be available. While Node 2 is down, both Node 1 and Node 3 store 'hints' for Node 2.
Node 2 comes back online. Nodes 1 and 3 send it the hints they recorded while it was down.
If a read happens and the coordinating Cassandra node detects that nodes are missing data or are inconsistent, it may execute a 'read repair'.
If Node 2 was down for a long time, Node 1 and Node 3 may not retain all hints destined for it. In this case, an operator should consider running repairs on a scheduled basis.
Also note that when doing reads, if Cassandra finds that there is a data mismatch during a digest request, it will always consider the data with the newest timestamp as the right one (see 'Why cassandra doesn't need vector clocks').
Node 2 will immediately start taking new writes and will also receive any hints stored for it by the other nodes. It is a good idea to run a repair (nodetool repair) on the node after it is back up, which will ensure its data is consistent with the other nodes.
Note that each column has a timestamp stored against it, which helps Cassandra determine which data is most recent when running a repair.
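You can inspect those timestamps yourself with the WRITETIME() CQL function; a small sketch with the DataStax Python driver (the keyspace ks, table users, and its columns are assumptions for illustration):

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')  # hypothetical keyspace

# WRITETIME() returns the per-column timestamp (microseconds since
# the epoch) that Cassandra compares when reconciling replicas.
row = session.execute(
    "SELECT name, WRITETIME(name) FROM users WHERE id = 5").one()
print(row)
```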