Is Repair needed if all operations are quorum - cassandra

Is Repair really needed if all operations execute at quorum.
Repair is generally needed to ensure all nodes are in sync, but quorum already ensures success is only returned when the quorum is in sync.
So if all operations execute at quorum, then do we need repair?
In our use-case, we never update records, we simply add then delete the record. (If we see the message after a 'delete' failure is ok, it is not disastrous).In fact - a repair could bring the record back to life..that would be undesirable (but not disastrous)
I would think with this situation, unless there was corruption of one of the nodes, we would not need repair.
I would also argue with this setup, even if delete succeeded, and we saw the record again, it would not be a 'big-deal'. As such I think we could in fact set gc_grace=0, if the quroum operation succeeded, then only 2 would be left..which would never give us quorum against those 'offending nodes, as such we would never see those records anyways (unless..a node dies).
So if a node dies post delete (assume 5 nodes 3 for quorum),
then we have 'stale-mate' 2vs2 and cannot achieve quorum, however hint-repair would kick if one of those records were read again (I'm not clear if this WILL run, or only runs the configured chance amount I.E. 10% is the default if we had quorum failure?).
Either with if gc_grace=0, it would likely come back to life after the delete, so maybe having gc_grace=24 hours (to allow read-repair to correct) would reduce the chance of seeing the record again.
Thoughts?

Your basic thought process is sound - if you write with quorum and read with quorum and never overwrite, then yes, you can likely get by without repair.
You MAY need to run repair if you have to rebuild a failed node, as it's possible that the replacement could miss one of the replicas, and you'd be left with one of three, which may be missed upon read. If that happens, having run incremental repairs previously would make subsequent repairs faster, but it's not strictly necessary.
Your final two paragraphs aren't necessarily accurate - your logic in those is flawed (with 5 nodes and 1 dying, there is no 2v2 stalemate for quorum, that's fundamentally misunderstanding how quorum works). Hints are also best effort and only within a limited window, and read repair isn't guaranteed unless you change read repair to non-default settings.

Related

Cassandra repairs on TWCS

We have a 13 nodes Cassandra cluster (version 3.10) with RP 2 and read/write consistency of 1.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set with TWCS with read-repair disabled, and we don't run full repairs on them
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster that it belongs. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons I'm sure, some of which include: networking/issues, hardware overload (I/O, CPU, etc. - which can cause dropped mutations), cassandra/dse being unavailable for whatever reasons, etc.
If none of your nodes have not been "off-line" for at least a few hours (whether it be dse or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you since last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously with log rotation so I can go back and check a few things out if our nodes start "mis-behaving". It could help.
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level=1 can create a problem sometimes due to many reasons like if data is not replicating to the cluster properly due to mutations or cluster/node overload or high CPU or high I/O or network problem so in this case you can suffer data inconsistency however read repair handles this problem some times if it is enabled. you can go with manual repair to ensure consistency of the cluster but you can get some zombie data too for your case.
I think, to avoid this kind of issue you should consider CL at least Quorum for write or you should run manual repair within GC_grace_period(default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra run repair in background for chunk of data. For more details you can refer below link
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html

Why do tables get out of sync over time when Write Consistency ALL is used?

Iam running a cassandra 3.11.4 cluster with 1 data center, 2 racks and 11 nodes. My keyspaces and the tables are set to replication 2. I use the Prometheus-Grafana-Combo to monitor the cluster.
Observation: During (massive) inserts using Write-Consistency Level ALL (i.e. 2 nodes) the affected tables/nodes get slowly out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I use ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: If I dare to use write consistency ANY I get exactly that- and even though all nodes are online Cassandra does not even seem attempt to write to all nodes. In any case (ANY or ALL) if have to perform incremental repairs.
First of all, your expectation is correct: Writes, regardless of what the consistency-level is (ALL or ONE or ANY or whatever), do make every attempt to write to all replicas. The different write-consistency levels only differ on when "success" is reported to the client: ALL waits until all writes were done, while ONE waits for just one (and does the other ones in the background). So unless one of your nodes goes down, or severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save for it the writes it missed, and replay them later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that already went through a repair process) and "unrepaired" data - new data that still did not yes pass through repair. This does not mean that the new data is inconsistent or differs between nodes - it just that nobody checked that yet! To mark this new data "repaired" you'd need to run an (incremental) repair - it will realize the data does not differ between nodes, and mark it as "repaired".

Write consistency = ALL: do I still need weekly repairs to avoid zombie records?

As far as I understood, the problem of deleted data reappearing in Cassandra is as follows:
A delete is issued with consistency < ALL (e.g. QUORUM)
The delete succeeds, but some nodes in the replication set were not reachable during the delete
A tombstone is written to all the reached nodes, nothing in the others
10 days pass, tombstone are eligible to be expired
Compactions happen, tombstones are actually removed
A read is issued: the nodes which received the delete reply with "no data"; the nodes which were unavailable during the delete reply with the old data; a zombie is produced
Now my question is: if the original delete was issued with consistency = ALL, all the nodes would either have the tombstone (before expiry&compaction) or no data at all (after expiry&compaction). No zombies should then be produced, even if we did not issue a repair before tombstone expiry.
Is this correct?
Yes you still need to run repairs even with CL.ALL on the delete if you want to guarantee no resurrected data. You just decrease likelihood of it occurring without you noticing it.
If a node is unavailable for the delete, the delete will fail for the client (because cl.all) but the other nodes all still received the delete. Even if your app will retry the delete theres a chance of it failing (ie your app's server hit by a meteor). So then you have a delete that has been seen by 2 of your 3 replicas. If you lowered your gc_grace and don't run repairs the other anti-entropy measures (hints, read repairs) may not ensure the tombstone (they are best effort not guarantee) was seen by the 3rd node before the tombstone is compacted away. The next read touches 3rd node which has the original data, and no tombstone exists to say it was deleted so you resurrect the data as its read repaired to other replicas.
What you can do is log a statement somewhere to point when there is a cl.all timeout or failure. This is not a guarantee since your app can die before the log, and a failure does not actually mean that the write did not get to all replicas - just that it may of failed to write. That said I would strongly recommend just using quorum (or local_quorum). That way you can have some host failures without losing availability since you need the repairs for the guarantee anyway.
When issuing queries with Consistency=ALL, every node having the token range of that particular record has to acknowledge. So if one of the NODE was down during this process, the DELETE will fail as it can't achieve the required consistency=ALL.
So consistency=ALL, might end up being a scenario where every node in the cluster has to stay up otherwise queries will fail. That's why people recommend to use lesser stronger consistency like QUORUM. So you are sacrificing high availability for REPAIRs if you want to perform queries at CONSISTENCY=ALL

If all replicas will sync up eventually, what's the point of read repairs?

If all replicas will sync up eventually, what's the point of read repairs?
Wouldn't you have cases where if you have a write that's being sent to all replicas, then a read repair happens before the write, wouldn't you essentially be duplicating that same write twice?
Theres a few things, blocking read repairs, async read repairs, and if either are needed.
Blocking read repairs: Quorum reads are monotonically consistent for awhile now. If you read it twice you should get the same answer. People tend to use QUORUM reads as wanting stronger consistency, so the blocking read repairs prevent reads from going back in time. If this behavior ends it would cause potential surprises to existing applications. However the latency impact of these repairs have been causing issues and it may still be changed in very near future. You cannot currently disable this behavior and it will always be on.
Async read repairs: Repairs in background can be disabled or happen only a small percentage of time, or (recommended) only within a DC. This reduces background impact as there isnt as much cross DC traffic. This is controlled by the dc_local and global read repair chance settings. When you do a ONE or LOCAL_ONE etc query it will depending on that chance wait for the rest of the responses and compare results for a read repair.
Statistically your far more likely to be having unnecessary work with async read repairs as on a normal functioning perfect system they are not needed. Hinted Handoff however is not perfect and there are cases where hints are lost. In these situations the consistency will not be met until a anti-entropy repair is run (should be weekly or even daily depending on how repairs run, inc or full etc).
So other than for the sake of monotonic consistency (blocking on QUORUM+ requests), read repairs are not really critical or needed. Its something people have used to statistically put cluster in a more consistent state faster (maybe). Lots of people run without async read repairs (you cannot currently disable the read repair mechanism fwiw), and theres even serious talk of removing options around it completely due to misunderstandings.
One scenario that makes sense to me is this:
You write the data to a node (or a subset of the cluster)
You read the data (with Quorum), and one of the nodes has the fresher data.
because you specified QUROUM, several nodes are being asked for the value before the response is sent to the client. But because the data is fresher on one of the nodes, a blocking read-repair must first happen, to update all of them.
in this case, the read-repair needs to happen because the "eventual update" has yet to happen.
In highly dynamic applications with many nodes, there are times when an eventually consistent write doesn't make it to the node PRIOR to a read request for that piece of data on that node. This is common in environments with heavy load on an undersized cluster, unknown hardware issues and other reasons. Its likely also where write consistency is set to one, while read consistency is set to local_quorum.
Example 1: random & sporadic network drops due to an unknown network switch failing that affects the write to the node but doesn't affect the read.
Example 2: the write occurs during a peak load time period, and as a result doesn't make it to the overloaded node, prior to the read request.

would Cassandra write be successful in the following situation?

I have two datacentres with 3 machines each. the replication factor is DC1:3, DC2:3 and all the inserts are made with write consistency = all.
So all data exists on all nodes (this is done to get reads to be the fastest).
But are there other problems with this set up that I might be missing? (except for writes being slow which im fine with)
For example, if a single node is down would all my writes fail? (Or can cassandra note down the writes for the failed node somewhere and bring it up to speed once its up?)
If a single node were down, then all your writes would fail. The consistency level specifies how many replicas you require for the write to be successful. So if you say ALL, and every node is a replica, then all the nodes would need to be up for it to succeed.
Usually you would do your writes with a lower consistency, like ONE. Cassandra will still write the data to all the nodes if they are up. If some of them are down, then the data may still get written to them (once they are back up) via hinted handoffs, read repair chance, and scheduled repairs.

Resources