Learning Cassandra. There are a couple of things about read repair that I don't understand.
The docs say this about BLOCKING read repair:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
To be honest, this entire sentence just doesn't make sense to me. First, why would read repair need to block writes? Isn't read repair in essence just a simple write of reconciled data? Second, how can read repair block writes on other replicas?
The docs also say that BLOCKING read repair breaks partition level write atomicity.
Cassandra attempts to provide partition level write atomicity, but since only the data covered by a SELECT statement is repaired by a read repair, read repair can break write atomicity when data is read at a more granular level than it is written. For example, read repair can break write atomicity if you write multiple rows to a clustered partition in a batch, but then select a single row by specifying the clustering column in a SELECT statement.
Again, I don't understand how write atomicity gets broken. A single-partition batch is atomic and isolated, right? Can someone explain it more?
What implications does this breaking of atomicity have for developers? I mean, it sure doesn't sound good.
EDIT:
For the first question see the accepted answer. For the second question this issue explains how atomicity gets broken.
I can see where the docs are a bit confusing. Allow me to expand on the subject and hopefully clarify it for you.
The wording in this paragraph could probably use a rewrite:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
It's referred to as a blocking read-repair because the reads are blocked (result is not returned to the client/driver by the coordinator) until the problematic replicas are repaired. The mutation/write is sent to the offending replica and the replica must acknowledge that the write is successful (i.e. persisted to commitlog).
The read-repair does not block ordinary writes -- it's just that the read request by the coordinator is blocked until the offending replica(s) involved in the request is repaired.
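To make the blocking behaviour concrete, here is a minimal Python sketch of a coordinator handling one read. It is a toy model under stated assumptions, not Cassandra's implementation: the Replica class, the blocking_read function and the in-memory dicts are hypothetical names invented for illustration, and reconciliation is reduced to "highest timestamp wins".

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: maps key -> (timestamp, value). Hypothetical, not a Cassandra API."""
    name: str
    data: dict = field(default_factory=dict)

    def read(self, key):
        return self.data.get(key)  # None if the replica never saw the key

    def write(self, key, timestamp, value):
        # Last-write-wins: only apply the mutation if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)
        return True  # acknowledgement sent back to the coordinator


def blocking_read(replicas, key):
    """Read from the given replicas, repair stale ones, and only then return."""
    responses = {r.name: r.read(key) for r in replicas}
    found = [v for v in responses.values() if v is not None]
    if not found:
        return None
    newest = max(found)  # tuples compare by timestamp first
    # The read "blocks" here: the result is withheld until every stale
    # replica contacted for this read has acknowledged the repair mutation.
    acks = [r.write(key, *newest) for r in replicas if responses[r.name] != newest]
    assert all(acks)
    return newest[1]


a = Replica("a", {"k": (2, "new")})
b = Replica("b", {"k": (1, "old")})
result = blocking_read([a, b], "k")
print(result, b.read("k"))
```

Nothing in this model stops other clients from writing; the only thing that waits is the read that triggered the repair.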
For the second part of your question, it's an extreme case where that scenario would take place because it's really a race condition between the batch and the read-repair. I've worked on a lot of clusters and I've never run into that situation (maybe I'm just extremely lucky 🙂). I've certainly never had to worry about it before.
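The scenario from the docs can be illustrated with a small simulation. This is only a sketch: replicas are plain dicts, the function names are made up, and the "missed batch" is simulated by simply not applying it to one replica.

```python
# Toy model: each replica stores a partition as {clustering_key: (timestamp, value)}.
# All names here are hypothetical; this is not how Cassandra stores data.
replica_a = {}
replica_b = {}


def apply_batch(replica, rows, timestamp):
    """Apply a single-partition batch to one replica in one step."""
    for clustering_key, value in rows.items():
        replica[clustering_key] = (timestamp, value)


def read_row_with_repair(replicas, clustering_key):
    """Read ONE row from all given replicas and repair only that row."""
    versions = [r[clustering_key] for r in replicas if clustering_key in r]
    if not versions:
        return None
    newest = max(versions)
    for r in replicas:
        if r.get(clustering_key) != newest:
            r[clustering_key] = newest  # repair is limited to the row that was read
    return newest[1]


batch = {"row1": "x", "row2": "y"}
apply_batch(replica_a, batch, timestamp=10)  # replica_a applies the whole batch
# replica_b missed the batch entirely (for example, a dropped mutation)

read_row_with_repair([replica_a, replica_b], "row1")

# replica_b now holds row1 from the batch but not row2:
# a reader that hits only replica_b sees half of the batch.
print(sorted(replica_b))
```

The batch itself was applied atomically on replica_a; it is the narrower read repair that leaves replica_b with a partial view of it.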
It has to be said that read-repairs are a problem because replicas miss mutations. In a distributed environment, you would expect the odd dropped mutation. But if it's a regular occurrence in the cluster, read-repair is the least of your worries since you probably have a bigger underlying issue -- unresponsive nodes from long GC pauses, commitlog disks not able to keep up with writes. Cheers!
If all replicas will sync up eventually, what's the point of read repairs?
Couldn't there be cases where a write is being sent to all replicas and a read repair happens before that write arrives, so that you essentially apply the same write twice?
There are a few things here: blocking read repairs, async read repairs, and whether either is needed.
Blocking read repairs: QUORUM reads have been monotonically consistent for a while now. If you read the same data twice you should get the same answer. People tend to use QUORUM reads when they want stronger consistency, so the blocking read repairs prevent reads from going back in time. If this behavior ended, it could surprise existing applications. However, the latency impact of these repairs has been causing issues, and it may still be changed in the very near future. You cannot currently disable this behavior; it is always on.
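A small sketch of what "going back in time" means, assuming three replicas where only one has received the latest write. The quorum_read and two_reads functions and the replica names are hypothetical, chosen only for this illustration.

```python
def quorum_read(replicas, contacted, repair):
    """Return the newest (timestamp, value) among the contacted replicas.

    replicas:  dict name -> (timestamp, value)
    contacted: names of the replicas that answer this read
    repair:    if True, write the newest version back before returning
    """
    newest = max(replicas[name] for name in contacted)
    if repair:
        for name in contacted:
            replicas[name] = newest
    return newest[1]


def two_reads(repair):
    # Only replica "a" received the latest write (timestamp 2).
    replicas = {"a": (2, "new"), "b": (1, "old"), "c": (1, "old")}
    first = quorum_read(replicas, ["a", "b"], repair)
    second = quorum_read(replicas, ["b", "c"], repair)
    return first, second


print(two_reads(repair=False))  # second read can go back in time
print(two_reads(repair=True))   # second read sees the repaired value
```

Without the repair, the second read is answered by two stale replicas and returns the old value after the client has already seen the new one. With the repair, two of the three replicas hold the new value before the first read returns, so any later quorum must include at least one of them.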
Async read repairs: Background repairs can be disabled, run only a small percentage of the time, or (recommended) run only within a DC. This reduces background impact because there isn't as much cross-DC traffic. This is controlled by the DC-local and global read repair chance settings (the dclocal_read_repair_chance and read_repair_chance table options). When you do a ONE or LOCAL_ONE (etc.) query, it will, depending on that chance, wait for the rest of the responses and compare results for a read repair.
Statistically, you're far more likely to be doing unnecessary work with async read repairs, because on a normally functioning system they are not needed. Hinted handoff, however, is not perfect, and there are cases where hints are lost. In these situations consistency will not be restored until an anti-entropy repair is run (this should be weekly or even daily, depending on how repairs are run: incremental, full, etc.).
So other than for the sake of monotonic consistency (blocking on QUORUM+ requests), read repairs are not really critical or needed. It's something people have used to statistically put the cluster in a more consistent state faster (maybe). Lots of people run without async read repairs (you cannot currently disable the read repair mechanism itself, FWIW), and there's even serious talk of removing the options around it completely due to misunderstandings.
One scenario that makes sense to me is this:
You write the data to a node (or a subset of the cluster)
You read the data (with QUORUM), and one of the nodes has the fresher data.
Because you specified QUORUM, several nodes are asked for the value before the response is sent to the client. But because the data is fresher on one of the nodes, a blocking read-repair must happen first, to update the others.
In this case, the read-repair needs to happen because the "eventual update" has yet to happen.
In highly dynamic applications with many nodes, there are times when an eventually consistent write doesn't make it to a node PRIOR to a read request for that piece of data on that node. This is common in environments with heavy load on an undersized cluster, unknown hardware issues, and other problems. It's also likely where write consistency is set to ONE while read consistency is set to LOCAL_QUORUM.
Example 1: random & sporadic network drops due to an unknown network switch failing that affects the write to the node but doesn't affect the read.
Example 2: the write occurs during a peak load time period, and as a result doesn't make it to the overloaded node, prior to the read request.
I cannot find much documentation about it. The only thing I can find is that when the consistency level is not set to EACH_QUORUM, cross region replication is done asynchronously.
But in asynchronous style, is it possible to lose messages? How does Cassandra handle losing messages?
If you don't use EACH_QUORUM and a destination node that should accept a write is down, the coordinator node saves the write as a hint ("hinted handoff").
When the destination node becomes available again, the coordinator replays the stored hints on it.
For any occasion when hinted handoffs are lost, you have to run a repair on your cluster.
Also be aware that, by default, hints are only stored for a maximum of 3 hours.
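The hint lifecycle described above can be sketched roughly as follows. This is a simplified model, not Cassandra's code: the Coordinator class, its methods and the down_for_seconds parameter are hypothetical, and the only fact taken from the text is the 3-hour default window.

```python
HINT_WINDOW_SECONDS = 3 * 60 * 60  # the 3-hour default mentioned above


class Coordinator:
    """Toy coordinator that stores hints for a down replica. Hypothetical names."""

    def __init__(self):
        self.hints = []  # list of (key, value) mutations waiting for replay

    def write(self, replica, replica_up, down_for_seconds, key, value):
        if replica_up:
            replica[key] = value
        elif down_for_seconds <= HINT_WINDOW_SECONDS:
            self.hints.append((key, value))  # keep the mutation for later replay
        # else: the node has been down too long, so no hint is stored;
        # only a repair can bring this write to the replica.

    def replay(self, replica):
        for key, value in self.hints:
            replica[key] = value
        self.hints.clear()


replica = {}
coordinator = Coordinator()
coordinator.write(replica, replica_up=False, down_for_seconds=60, key="k1", value="v1")
coordinator.write(replica, replica_up=False, down_for_seconds=4 * 60 * 60, key="k2", value="v2")
coordinator.replay(replica)  # the replica comes back
print(replica)
```

The second write arrives after the node has been down for longer than the window, so it is never hinted and the replica stays without it until a repair runs.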
For further info see documentation at:
http://www.datastax.com/dev/blog/modern-hinted-handoff
http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesHintedHandoff.html
Hope this helps.
When you issue a write in Cassandra, the coordinator sends the write to all online replicas, and then blocks. The duration of the block corresponds to consistency level - if you say "ALL", it blocks until all nodes ack the write. If you use "EACH_QUORUM", it blocks until a quorum of nodes in each datacenter ack the write.
For any replica that didn't ack the write, the coordinator will write a hint, and attempt to deliver that hint later (minutes, hours, no guarantee).
Note, though, that the writes were all sent at the same time - what you don't have is a guarantee as to which were delivered. Your guarantee is in the consistency level.
When you read, you'll do something similar - you'll block until you have an appropriate number of replicas answering. If you write with EACH_QUORUM, you can read with LOCAL_QUORUM and guarantee strong consistency. If you write with QUORUM, you could read with QUORUM. If you write with ONE, you could still guarantee strong consistency if you read with ALL.
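The combinations above all follow the same arithmetic: a read is guaranteed to see the latest successful write when the number of replicas that acknowledged the write plus the number that answer the read exceeds the replication factor. A minimal sketch, with helper names that are my own:

```python
def quorum(replication_factor):
    """Number of replicas that make up a quorum."""
    return replication_factor // 2 + 1


def overlaps(write_acks, read_responses, replication_factor):
    """True if every read set must share at least one replica with every write set."""
    return write_acks + read_responses > replication_factor


RF = 3
print(overlaps(quorum(RF), quorum(RF), RF))  # QUORUM write, QUORUM read
print(overlaps(1, RF, RF))                   # ONE write, ALL read
print(overlaps(1, 1, RF))                    # ONE write, ONE read
```

The EACH_QUORUM write plus LOCAL_QUORUM read case works the same way if you apply the check per datacenter, using that datacenter's replication factor.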
To guarantee eventual consistency, you don't have to do anything - it'll eventually get there, as long as you wrote with CL >= ONE (CL ANY isn't really a guarantee).
This is a two-part question regarding nodetool repair and garbage collection.
Let's consider a replication factor of 3 for all tables, and suppose reads and writes require two confirmations of success to succeed. Based on my understanding of Cassandra, a successful write or delete would never be in danger of being missed as long as a read requires at least two responses, accepting only the latest timestamp. This makes sense to me, but is it correct?
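For the narrow case of a write that succeeded on two of three replicas, the claim can be checked by brute force. The sketch below enumerates every pair of replicas that could have acknowledged the write and every pair that could answer the read. The replica names and the read_latest helper are hypothetical, and the model says nothing about failed or partially applied writes.

```python
from itertools import combinations

REPLICAS = ("a", "b", "c")  # replication factor 3


def read_latest(state, contacted):
    """Return the value with the highest timestamp among the contacted replicas."""
    return max(state[name] for name in contacted)[1]


always_seen = True
for written_to in combinations(REPLICAS, 2):       # replicas that acked the write
    state = {name: (1, "old") for name in REPLICAS}
    for name in written_to:
        state[name] = (2, "new")
    for contacted in combinations(REPLICAS, 2):    # replicas that answer the read
        if read_latest(state, contacted) != "new":
            always_seen = False

print(always_seen)
```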
As a closely related question, if I configure Cassandra never to perform GC, but still perform nodetool repair periodically, will this suffice to garbage-collect old tombstones? Intuitively, a successfully repaired key range should not need to keep tombstones, so they could in theory be discarded when a repair is performed. Is this the case?
If my above two hypotheses are correct, it seems like we can achieve the following:
Consistent reads and writes with no resurrected data (due to quorum reads and writes and avoiding GC completely)
No unbounded growth in stale tombstones (due to periodically running nodetool repair, which hopefully performs GC if my above hypothesis is correct)
This post explains that quorum doesn't guarantee consistency: Read Operation in Cassandra at Consistency level of Quorum?
Assuming "GC" means compaction, I don't think nodetool repair will suffice to delete tombstones or take care of other compaction tasks. https://issues.apache.org/jira/browse/CASSANDRA-6602 describes a compaction-less scenario that sounds like what you're considering. If this is what you're doing, the recommended solution is to use DateTieredCompactionStrategy (DTCS) to store data written within a certain period of time in the same SSTable. DTCS was released in Cassandra 2.1.1 today and is described here: http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/tabProp.html?scroll=tabProp__moreCompaction
The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before the timeout). It's clear that if we then read with consistency ONE, we may read the value.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but that node's ACK got lost? (Note: This could result in a read of the value even at read consistency QUORUM!)
It may occur that the write will be persisted later to a second node (e.g. due to hinted handoff)? (Note: This could result in a read of the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node, and the written value will eventually be removed from the node via ReadRepair?
It's impossible that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct. The coordinator will store a hint so that the failed node (replica) gets this row replicated later.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
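The "no rollback" behaviour from the quoted documentation can be shown with a short simulation. It is a sketch under simplifying assumptions: replicas are dicts, quorum_write is a made-up function, and "reachable" stands for the replicas that actually persisted and acknowledged the write before the timeout.

```python
def quorum_write(replicas, reachable, key, value):
    """Send a write to all replicas; succeed only if a quorum acknowledges it.

    replicas:  dict name -> dict acting as that replica's storage
    reachable: names of replicas that actually persist and acknowledge the write
    """
    needed = len(replicas) // 2 + 1
    acks = 0
    for name in reachable:
        replicas[name][key] = value  # persisted on this replica
        acks += 1
    # No rollback on failure: replicas that persisted the write keep it.
    return acks >= needed


replicas = {"n1": {}, "n2": {}, "n3": {}}
ok = quorum_write(replicas, reachable=["n1"], key="k", value="v")

print(ok)                       # the client is told the write failed
print(replicas["n1"].get("k"))  # yet a read at ONE that hits n1 returns the value
```

A later hint replay or repair would copy the value from n1 to the other replicas rather than remove it, which is why the "failed" write can still become visible at QUORUM.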
If I understand it correctly, upon a write request the write is sent to all N replicas, and the operation succeeds when the first W responses are received. Is this correct?
If it is, then combined with Hinted Handoff, it seems that all replicas will already get all writes as soon as possible. Do we really have to do read repair in this case?
Thanks.
Short answer: you still need read repair.
Longer answer: there wasn't a good discussion of Hinted Handoff anywhere, so I wrote one.
For Cassandra 1.0+, read the updated article. The crucial part being:
At first glance, it may appear that Hinted Handoff lets you safely get away without needing repair. This is only true if you never have hardware failure.
It is possible for hinted handoff to fail for various reasons; for example, the node the hint was written to can itself fail. With read repair enabled, if hinted handoff doesn't work for some reason, read repair will fix it. You should also run "nodetool repair" on your nodes to catch any cases where read repair and hinted handoff both fail to fix all the data.
Check the wiki for more info.
http://wiki.apache.org/cassandra/AntiEntropy
http://wiki.apache.org/cassandra/HintedHandoff
The consistency level can be varied for each write (and read).
For example, let's say we have 10 nodes, with a replication factor of 3.
But if we write with a consistency level of ANY, none of the eventual 3 replicas may initially have the data when the write call returns. If we use consistency level ONE, then only one of the eventual 3 replicas has to have the data before the write returns, so a read straight after the write may see outdated data if the read has a low consistency level.
See http://wiki.apache.org/cassandra/API for the definitions of the consistency levels, particularly the following:
Read level ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
See also http://wiki.apache.org/cassandra/ReadRepair :
Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas. If a low ConsistencyLevel was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.
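The digest-based flow in that quote can be sketched as follows. This is an illustration only: the digest and read_with_repair functions are hypothetical, SHA-256 over repr() merely stands in for whatever digest Cassandra actually computes, and everything runs synchronously rather than in the background.

```python
import hashlib


def digest(record):
    """Hash of a (timestamp, value) record; stands in for a digest response."""
    return hashlib.sha256(repr(record).encode("utf-8")).hexdigest()


def read_with_repair(replicas, key):
    """replicas is an ordered list of dicts; the first one is the 'closest' replica."""
    closest, others = replicas[0], replicas[1:]
    data = closest.get(key)                         # full data read
    digests = [digest(r.get(key)) for r in others]  # digest-only reads
    if all(d == digest(data) for d in digests):
        return data                                 # replicas agree, nothing to repair
    # Mismatch: fetch full records, pick the newest, push it to stale replicas.
    records = [r[key] for r in replicas if key in r]
    newest = max(records)
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest
    return newest


r1 = {"k": (1, "old")}
r2 = {"k": (2, "new")}
r3 = {}
print(read_with_repair([r1, r2, r3], "k"))
print(r1, r3)
```

Comparing digests keeps the common case cheap: full records only need to be exchanged when the hashes disagree.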