
If all replicas will sync up eventually, what's the point of read repairs?
Wouldn't there be cases where a write is being sent to all replicas, a read repair happens before the write completes, and you end up applying essentially the same write twice?

There are a few things to consider: blocking read repairs, async read repairs, and whether either is actually needed.
Blocking read repairs: QUORUM reads have been monotonically consistent for a while now: if you read twice, you should get the same answer. People tend to use QUORUM reads when they want stronger consistency, so blocking read repairs prevent reads from going back in time. If this behavior ended, it would cause potential surprises for existing applications. However, the latency impact of these repairs has been causing issues, and it may still be changed in the very near future. You cannot currently disable this behavior; it is always on.
Async read repairs: repairs in the background can be disabled, happen only a small percentage of the time, or (recommended) happen only within a DC. This reduces background impact, as there isn't as much cross-DC traffic. It is controlled by the dclocal_read_repair_chance and read_repair_chance table settings. When you do a ONE or LOCAL_ONE (etc.) query, depending on that chance, the coordinator will wait for the rest of the responses and compare results for a read repair.
Statistically, you're far more likely to be doing unnecessary work with async read repairs, since on a perfectly functioning system they are not needed. Hinted handoff, however, is not perfect, and there are cases where hints are lost. In those situations, consistency will not be met until an anti-entropy repair is run (which should happen weekly, or even daily, depending on how repairs are run: incremental or full, etc.).
So other than for the sake of monotonic consistency (blocking on QUORUM+ requests), read repairs are not really critical or needed. They are something people have used to statistically put the cluster in a more consistent state faster (maybe). Lots of people run without async read repairs (you cannot currently disable the blocking read repair mechanism, fwiw), and there is even serious talk of removing the options around it completely due to misunderstandings.
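As a rough mental model of the blocking path described above (names and structure are invented for illustration; this is not Cassandra's actual code), a coordinator reads from a quorum of replicas, reconciles by write timestamp, and pushes the winning cell back to any stale replica before answering the client:

```python
# Toy model of a blocking read repair on a QUORUM read.
# All names are illustrative, not Cassandra internals.

def quorum(replication_factor):
    return replication_factor // 2 + 1

def quorum_read_with_repair(replicas, key):
    """replicas: list of dicts mapping key -> (value, write_timestamp)."""
    contacted = replicas[:quorum(len(replicas))]
    responses = [r.get(key, (None, -1)) for r in contacted]
    # Reconcile with last-write-wins on the write timestamp.
    value, ts = max(responses, key=lambda vt: vt[1])
    # Blocking read repair: stale replicas get the newer cell before
    # the coordinator returns the result to the client, so a second
    # quorum read cannot go "back in time".
    for r in contacted:
        if r.get(key, (None, -1))[1] < ts:
            r[key] = (value, ts)
    return value

replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
print(quorum_read_with_repair(replicas, "k"))  # prints "new"
```

Note that only the replicas involved in this read get repaired; the third replica stays stale until some other mechanism (hints, anti-entropy repair, or a later read) catches it up.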

One scenario that makes sense to me is this:
You write the data to a node (or a subset of the cluster)
You read the data (with QUORUM), and one of the nodes has fresher data.
Because you specified QUORUM, several nodes are asked for the value before the response is sent to the client. But because the data is fresher on one of the nodes, a blocking read repair must first happen to update all of them.
In this case, the read repair needs to happen because the "eventual update" has not happened yet.

In highly dynamic applications with many nodes, there are times when an eventually consistent write doesn't make it to a node prior to a read request for that piece of data on that node. This is common in environments with heavy load on an undersized cluster, unknown hardware issues, and other reasons. It is also likely where write consistency is set to ONE while read consistency is set to LOCAL_QUORUM.
Example 1: random and sporadic network drops, due to an unknown failing network switch, that affect the write to the node but not the read.
Example 2: the write occurs during a peak-load period and, as a result, doesn't make it to the overloaded node prior to the read request.

Related

BLOCKING read repair blocks writes on other replicas?

Learning Cassandra. There are a couple of things about read repair that I don't understand.
The docs say this about BLOCKING read repair:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
To be honest, this entire sentence just doesn't make sense to me. First, why would read repair need to block writes? Isn't read repair in essence just a simple write of reconciled data? Second, how can read repair block writes on other replicas?
The docs also say that BLOCKING read repair breaks partition level write atomicity.
Cassandra attempts to provide partition level write atomicity, but since only the data covered by a SELECT statement is repaired by a read repair, read repair can break write atomicity when data is read at a more granular level than it is written. For example, read repair can break write atomicity if you write multiple rows to a clustered partition in a batch, but then select a single row by specifying the clustering column in a SELECT statement.
Again, I don't understand how write atomicity gets broken. Single-partition batch is atomic and isolated, right? Can someone explain it more?
What implications does this breaking of atomicity have for developers? I mean, it sure doesn't sound good.
EDIT:
For the first question see the accepted answer. For the second question this issue explains how atomicity gets broken.
I can see where the docs are a bit confusing. Allow me to expand on the subject and hopefully clarify it for you.
The wording in this paragraph could probably use a rewrite:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
It's referred to as a blocking read-repair because the reads are blocked (result is not returned to the client/driver by the coordinator) until the problematic replicas are repaired. The mutation/write is sent to the offending replica and the replica must acknowledge that the write is successful (i.e. persisted to commitlog).
The read-repair does not block ordinary writes -- it's just that the read request by the coordinator is blocked until the offending replica(s) involved in the request is repaired.
For the second part of your question, it's an extreme case where that scenario would take place, because it's really a race condition between the batch and the read repair. I've worked on a lot of clusters and I've never run into that situation (maybe I'm just extremely lucky 🙂). I've certainly never had to worry about it before.
It has to be said that read repairs happen because replicas miss mutations. In a distributed environment, you would expect the odd dropped mutation. But if it's a regular occurrence in the cluster, read repair is the least of your worries, since you probably have a bigger underlying issue: unresponsive nodes from long GC pauses, or commitlog disks not able to keep up with writes. Cheers!
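To make the atomicity point concrete, here is a toy Python sketch (all names are made up for illustration): a single-partition batch lands on one replica but is dropped by another, and a narrow SELECT repairs only the selected row, leaving the stale replica exposing half of the "atomic" batch:

```python
# Toy illustration of how read repair can break batch (partition-level)
# write atomicity. Names are invented; this is not Cassandra code.

def read_repair(healthy, stale, selected_rows):
    # Only the rows covered by the SELECT are repaired.
    for row in selected_rows:
        if row in healthy:
            stale[row] = healthy[row]

healthy = {}   # replica that received the batch
stale = {}     # replica that dropped the batch mutation

# A single-partition batch writes two clustered rows; on the replicas
# that receive it, both rows appear together.
healthy.update({"row1": "a", "row2": "b"})

# A narrow SELECT (one clustering key) triggers repair of that row only.
read_repair(healthy, stale, selected_rows=["row1"])

print(stale)  # {'row1': 'a'} -- row2 is still missing, so the stale
              # replica now shows a partial view of the "atomic" batch
```

This is the race the docs warn about: until the rest of the partition is repaired (or the missed mutation arrives), readers hitting the stale replica can observe the batch half-applied.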

Cassandra repairs on TWCS

We have a 13-node Cassandra cluster (version 3.10) with a replication factor of 2 and read/write consistency of ONE.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set with TWCS, with read repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster where it belongs. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons, I'm sure, some of which include: network issues, hardware overload (I/O, CPU, etc., which can cause dropped mutations), Cassandra/DSE being unavailable for whatever reason, and so on.
If none of your nodes have been "off-line" for at least a few hours (whether DSE or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you since last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously with log rotation so I can go back and check a few things out if our nodes start "mis-behaving". It could help.
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level ONE can sometimes cause problems for many reasons: data not replicating properly across the cluster due to dropped mutations, cluster/node overload, high CPU, high I/O, or network problems. In those cases you can suffer data inconsistency; read repair handles this problem some of the time, if it is enabled. You can go with a manual repair to ensure consistency of the cluster, but in your case you could get some zombie data back too.
I think, to avoid this kind of issue, you should consider a CL of at least QUORUM for writes, or you should run a manual repair within gc_grace_seconds (default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra runs repair in the background for chunks of data. For more details you can refer to the links below:
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html
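The arithmetic behind this advice is the usual overlap rule: a read is guaranteed to see the latest write only when the write CL plus the read CL exceeds the replication factor, so the two replica sets must intersect. A quick sketch (helper names are invented for illustration):

```python
# Strong consistency condition for Cassandra-style replication:
# the write and read replica sets must overlap, i.e. W + R > RF.

def overlapping(rf, write_cl, read_cl):
    return write_cl + read_cl > rf

rf = 2
print(overlapping(rf, 1, 1))  # False: write ONE + read ONE can miss each other
print(overlapping(rf, 2, 1))  # True: writing to both replicas always overlaps

quorum = rf // 2 + 1          # QUORUM for RF=2 is 2
print(overlapping(rf, quorum, quorum))  # True: quorum/quorum always overlaps
```

With RF=2 and CL=ONE on both sides, a write can land on one replica while the read goes to the other, which is exactly the "data replicated only once" symptom in the question.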

Dealing with eventual consistency in Cassandra

I have a 3 node cassandra cluster with RF=2. The read consistency level, call it CL, is set to 1.
I understand that whenever CL=1, a read repair would happen when a read is performed against Cassandra, if it returns inconsistent data. I like the idea of having CL=1 instead of setting it to 2, because then even if a node goes down, my system would run fine. Thinking in terms of the CAP theorem, I'd like my system to be AP rather than CP.
The read requests are seldom (more like 2-3 per second), but are very important to the business. They are performed against log-like data (which is immutable, and hence never updated). My temporary fix for this is to run the query more than once, say 3 times, instead of running it once. This way, I can be sure that even if I don't get my data on the first read request, the system will trigger read repairs, and I will eventually get my data on the 2nd or 3rd read request. Of course, these 3 queries happen one after the other, without any blocking.
Is there any way that I can direct Cassandra to perform read repairs in the background without having the need to actually perform a read request in order to trigger a repair?
Basically, I am looking for ways to tune my system in a way as to circumvent the 'eventual consistency' model, by which my reads would have a high probability of succeeding.
Help would be greatly appreciated.
reads would have a high probability of succeeding
Look at DowngradingConsistencyRetryPolicy. This policy allows retrying queries with a lower CL than the initial one. With this policy, your queries will have strong consistency when all nodes are available, and you will not lose availability if some node fails.
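A minimal sketch of what a downgrading retry does, as a toy model rather than the real driver API: attempt the read at QUORUM first, and if too few replicas are alive, retry at ONE instead of failing:

```python
# Toy model of a downgrading consistency retry (illustrative only,
# not the actual DowngradingConsistencyRetryPolicy implementation).

def read_with_downgrade(live_replicas, rf):
    quorum = rf // 2 + 1
    for required in (quorum, 1):      # downgrade QUORUM -> ONE
        if live_replicas >= required:
            return required           # the CL actually satisfied
    raise RuntimeError("no replicas available")

print(read_with_downgrade(live_replicas=3, rf=3))  # 2: quorum succeeds
print(read_with_downgrade(live_replicas=1, rf=3))  # 1: downgraded to ONE
```

The trade-off is visible in the return value: when the cluster is healthy you get quorum-level consistency, and when it is degraded you silently accept weaker consistency in exchange for availability, which is exactly the AP-leaning behavior the question asks for.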

Is Repair needed if all operations are quorum

Is repair really needed if all operations execute at quorum?
Repair is generally needed to ensure all nodes are in sync, but quorum already ensures success is only returned when the quorum is in sync.
So if all operations execute at quorum, then do we need repair?
In our use-case, we never update records; we simply add and then delete the record. (If we see the message again after a 'delete', it is ok; it is not disastrous.) In fact, a repair could bring the record back to life, which would be undesirable (but not disastrous).
I would think with this situation, unless there was corruption of one of the nodes, we would not need repair.
I would also argue that with this setup, even if a delete succeeded and we saw the record again, it would not be a big deal. As such, I think we could in fact set gc_grace=0: if the quorum operation succeeded, then only 2 stale replicas would be left, which would never give us quorum against those offending nodes, so we would never see those records anyway (unless a node dies).
So if a node dies post-delete (assume 5 nodes, 3 for quorum),
then we have a 'stalemate', 2 vs 2, and cannot achieve quorum; however, a read repair would kick in if one of those records were read again (I'm not clear if this WILL run, or only runs at the configured chance, i.e. the 10% default, if we had a quorum failure?).
Either way, with gc_grace=0 the record would likely come back to life after the delete, so maybe having gc_grace=24 hours (to allow read repair to correct it) would reduce the chance of seeing the record again.
Thoughts?
Your basic thought process is sound - if you write with quorum and read with quorum and never overwrite, then yes, you can likely get by without repair.
You MAY need to run repair if you have to rebuild a failed node, as it's possible that the replacement could miss one of the replicas, and you'd be left with one of three, which may be missed upon read. If that happens, having run incremental repairs previously would make subsequent repairs faster, but it's not strictly necessary.
Your final two paragraphs aren't necessarily accurate - your logic in those is flawed (with 5 nodes and 1 dying, there is no 2v2 stalemate for quorum, that's fundamentally misunderstanding how quorum works). Hints are also best effort and only within a limited window, and read repair isn't guaranteed unless you change read repair to non-default settings.
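The quorum arithmetic behind that correction can be checked directly (the helper below is illustrative): with 5 replicas a quorum is 3, so losing one node still leaves enough live replicas, and no "2 vs 2 stalemate" is possible:

```python
# Quorum arithmetic: quorum(n) = floor(n / 2) + 1, a strict majority.

def quorum(replica_count):
    return replica_count // 2 + 1

rf = 5
print(quorum(rf))              # 3
live = rf - 1                  # one node down
print(live >= quorum(rf))      # True: quorum reads/writes still succeed

# Two quorums over the same replicas always intersect, which is why
# a quorum read after a quorum write must see at least one up-to-date
# replica: 2 * quorum(n) > n for every n >= 1.
print(2 * quorum(rf) > rf)     # True
```

Because any two majorities overlap, a 2-vs-2 split simply cannot both block quorum: with 4 live replicas out of 5, a quorum of 3 is still reachable.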

Using Cassandra as a Queue

Using Cassandra as Queue:
Is it really that bad?
Setup: 5 node cluster, all operations execute at quorum
Using DateTieredCompactionStrategy should significantly reduce the cost of tombstones and allow entire SSTables to be dropped at once.
We add all messages to the queue with the same TTL
We partition messages based on time (say 1 minute intervals), and keep track of the read-position.
Messages consumed will be explicitly deleted. (only 1 thread extracts messages)
Some messages may be explicitly deleted prior to being read (i.e. we may have tombstones after the read position; the TTL initially used is an upper limit). gc_grace would probably be set to 0, as quorum reads will do a blocking read repair (i.e. we can have repair turned off, as messages only reside in one cluster (DC), and all operations are at quorum).
Messages can be added/deleted only, no updates allowed.
In our use case, if a tombstone does not replicate, it's not a big deal; it's ok for us to see the same message multiple times occasionally. (Also, we would likely not run repair on a regular basis, as all operations execute at quorum.)
Thoughts?
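The time-bucketed layout described above could be sketched roughly like this (names are illustrative, not a real schema): messages are grouped into 1-minute partitions keyed by enqueue time, so that once a whole bucket is TTL-expired, its SSTable can be dropped wholesale:

```python
# Sketch of time-bucketed queue partitioning (illustrative only).

BUCKET_SECONDS = 60

def bucket_for(ts_seconds):
    # Partition key: the start of the minute the message was enqueued in.
    return int(ts_seconds // BUCKET_SECONDS) * BUCKET_SECONDS

queue = {}  # bucket -> list of messages (stands in for a partition)

def enqueue(ts_seconds, message):
    queue.setdefault(bucket_for(ts_seconds), []).append(message)

enqueue(120.5, "m1")
enqueue(130.0, "m2")
enqueue(185.0, "m3")
print(sorted(queue))   # [120, 180]: two 1-minute partitions
print(queue[120])      # ['m1', 'm2']
```

A single consumer would then track its read position as a (bucket, offset) pair, advancing bucket by bucket; bounding each partition to one minute keeps the tombstones from a consumed bucket out of the reads against later buckets.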
Generally, it is an anti-pattern; this link talks at length about the impact of tombstones: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
My opinion is, try to avoid that if possible, but if you really understand the performance impact, and it is not an issue in your architecture, of course you could do that.
Another reason to avoid it if possible: the Cassandra data model is not designed for queues, so it will always look ugly. UGLY!
Strongly suggest to consider Redis or RabbitMQ before making your final decision.
