There are 2 DCs, each with 3 nodes; the RF used for writes is 2, and reads use each_quorum. A lightweight transaction (LWT) is used to ensure consistency of updates across DCs. For certain records, hundreds (maybe thousands) of LWT updates hit the cluster at around the same time, and all of them fail with "Operation timed out - received only 0 responses". Not even one attempt manages to change the status of that one record, so every other attempt fails as well. Ideally the first attempt would go through and change the values, so that subsequent LWT updates would be rejected because their conditions are no longer satisfied. Is there any way to achieve this?
I tried increasing the cas_contention timeout, but this did not help; it only made all the transactions wait longer before failing. Using "local consistency" made the LWTs run faster, but that does not help in our case since we want strong consistency across both DCs. Any alternatives?
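To make the desired outcome concrete, here is a minimal in-memory sketch of compare-and-set semantics: many contenders issue the same conditional update, exactly one is applied, and the rest are rejected because the condition no longer holds. The `Record` class, the `update_if` method and the status values are hypothetical names for illustration only; the lock merely stands in for the serialization an LWT provides and does not model Paxos contention or timeouts.

```python
import threading


class Record:
    """In-memory stand-in for one row guarded by a conditional update."""

    def __init__(self, status):
        self.status = status
        self._lock = threading.Lock()  # stands in for LWT serialization

    def update_if(self, expected, new):
        """Apply the update only if the current status equals `expected`.

        Returns True when applied, False when the condition failed.
        """
        with self._lock:
            if self.status != expected:
                return False
            self.status = new
            return True


def run_contenders(n):
    """Fire n concurrent conditional updates at the same record."""
    record = Record("PENDING")
    results = []
    results_lock = threading.Lock()

    def attempt():
        applied = record.update_if("PENDING", "DONE")
        with results_lock:
            results.append(applied)

    threads = [threading.Thread(target=attempt) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return record.status, results


status, results = run_contenders(200)
print(status, results.count(True), results.count(False))  # DONE 1 199
```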


Will Cassandra reach eventual consistency without manual repair if there is no read for that data during gc.grace.seconds?

Assume the following
Replication factor is 3
A delete was issued with consistency 2
One of the replicas was busy (not down), so it dropped the request
The other two replicas add the tombstone and send the response. So currently the record is marked for deletion in only two replicas.
No read repair happened, as there was no read for that data during gc.grace.seconds
Will this data be resurrected when a read happens for that record after gc.grace.seconds if there was no manual repair?
(I am not talking about replica being down for more than gc.grace.seconds)
One of the replicas was busy (not down), so it dropped the request
In this case, the coordinator node realizes that the replica could not be written and stores it as a hint. Once the overwhelmed node starts taking requests again, the hint is "replayed" to get the replica consistent.
However, hints are only kept (by default) for 3 hours. After that time, they are dropped. So, if the busy node does not recover within that 3 hour window, then it will not be made consistent. And yes, in that case a query at consistency-ONE could allow that data to "ghost" its way back.
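The reasoning above can be written down as a small check. This is a simplified sketch, not a model of Cassandra internals: it assumes a hint is replayed if and only if the busy replica recovers within the hint window, and the function name and parameters are made up for illustration.

```python
HINT_WINDOW_SECONDS = 3 * 60 * 60  # default hint window mentioned above (3 hours)


def tombstone_may_be_missed(busy_seconds, repaired_before_gc_grace,
                            hint_window_seconds=HINT_WINDOW_SECONDS):
    """Return True if the busy replica can still lack the tombstone
    by the time the other replicas are allowed to purge it.

    Assumption: a hint is replayed only if the replica recovers within
    the hint window (hints are best effort in practice).
    """
    hint_can_replay = busy_seconds <= hint_window_seconds
    return not (hint_can_replay or repaired_before_gc_grace)


# Replica busy for 30 minutes, no repair: the hint covers it.
print(tombstone_may_be_missed(30 * 60, repaired_before_gc_grace=False))    # False
# Replica busy for 5 hours, no repair: the deleted data can come back.
print(tombstone_may_be_missed(5 * 3600, repaired_before_gc_grace=False))   # True
# Replica busy for 5 hours, but a repair ran within gc_grace_seconds.
print(tombstone_may_be_missed(5 * 3600, repaired_before_gc_grace=True))    # False
```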

Write consistency = ALL: do I still need weekly repairs to avoid zombie records?

As far as I understood, the problem of deleted data reappearing in Cassandra is as follows:
A delete is issued with consistency < ALL (e.g. QUORUM)
The delete succeeds, but some nodes in the replication set were not reachable during the delete
A tombstone is written to all the reached nodes, nothing in the others
10 days pass, and the tombstones become eligible for expiry
Compactions happen, tombstones are actually removed
A read is issued: the nodes which received the delete reply with "no data"; the nodes which were unavailable during the delete reply with the old data; a zombie is produced
Now my question is: if the original delete was issued with consistency = ALL, all the nodes would either have the tombstone (before expiry&compaction) or no data at all (after expiry&compaction). No zombies should then be produced, even if we did not issue a repair before tombstone expiry.
Is this correct?
Yes, you still need to run repairs even with CL.ALL on the delete if you want to guarantee no resurrected data. You just decrease the likelihood of it occurring without you noticing it.
If a node is unavailable for the delete, the delete will fail for the client (because of CL.ALL), but the other nodes will all still have received it. Even if your app retries the delete, there's a chance of it failing (e.g. your app's server is hit by a meteor). So then you have a delete that has been seen by 2 of your 3 replicas. If you lowered your gc_grace and don't run repairs, the other anti-entropy measures (hints, read repairs) may not ensure that the tombstone was seen by the 3rd node before it is compacted away (they are best effort, not a guarantee). The next read touches the 3rd node, which has the original data, and no tombstone exists to say it was deleted, so you resurrect the data as it is read repaired to the other replicas.
What you can do is log a statement somewhere whenever there is a CL.ALL timeout or failure. This is not a guarantee, since your app can die before the log is written, and a failure does not actually mean that the write did not get to all replicas, just that it may have failed to. That said, I would strongly recommend just using quorum (or local_quorum). That way you can tolerate some host failures without losing availability, since you need the repairs for the guarantee anyway.
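The sequence described above can be replayed with a toy model of three replicas. This is only a sketch under simplifying assumptions: each replica is a plain dict, a tombstone is a cell whose value is `None`, and reads reconcile by highest timestamp. The `Cell`, `reconcile` and `purge_tombstones` names are hypothetical and not part of any Cassandra API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: Optional[str]  # None marks a tombstone
    timestamp: int


def reconcile(cells):
    """Last-write-wins merge of replica responses; replicas with no cell are ignored."""
    present = [c for c in cells if c is not None]
    if not present:
        return None
    return max(present, key=lambda c: c.timestamp)


def purge_tombstones(replica, key):
    """Mimic compaction after gc_grace: drop the tombstone entirely."""
    cell = replica.get(key)
    if cell is not None and cell.value is None:
        del replica[key]


# All three replicas hold the row, written at timestamp 1.
replicas = [{"k": Cell("v", 1)} for _ in range(3)]

# The delete (timestamp 2) reaches replicas 0 and 1 only.
for r in replicas[:2]:
    r["k"] = Cell(None, 2)

# While the tombstones exist, a read across all replicas sees the delete.
before = reconcile([r.get("k") for r in replicas])
print(before)  # Cell(value=None, timestamp=2)

# gc_grace passes without repair and compaction removes the tombstones.
for r in replicas:
    purge_tombstones(r, "k")

# Now only replica 2 answers with data, and nothing contradicts it.
after = reconcile([r.get("k") for r in replicas])
print(after)  # Cell(value='v', timestamp=1)
```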
When issuing queries with Consistency=ALL, every node owning the token range of that particular record has to acknowledge. So if one of those nodes is down during this process, the DELETE will fail, as it can't achieve the required consistency=ALL.
So consistency=ALL can end up meaning that every node in the cluster has to stay up, otherwise queries will fail. That's why people recommend using a weaker consistency level like QUORUM. If you perform queries at CONSISTENCY=ALL, you are trading away high availability in order to avoid repairs.
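The availability cost can be put in numbers. The sketch below uses the usual quorum size of floor(RF / 2) + 1 and counts how many replicas of a record can be down before a request at a given level fails; the function names are only illustrative.

```python
def quorum(rf):
    """Quorum size for a replication factor: floor(rf / 2) + 1."""
    return rf // 2 + 1


def required_acks(level, rf):
    """Replica acknowledgements needed at a given consistency level."""
    return {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]


def tolerated_failures(level, rf):
    """How many replicas of a record may be down before requests fail."""
    return rf - required_acks(level, rf)


for level in ("ONE", "QUORUM", "ALL"):
    print(level, tolerated_failures(level, rf=3))
# ONE 2
# QUORUM 1
# ALL 0
```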

Dealing with eventual consistency in Cassandra

I have a 3 node cassandra cluster with RF=2. The read consistency level, call it CL, is set to 1.
I understand that with CL=1, a read repair happens when a read performed against Cassandra returns inconsistent data. I like the idea of having CL=1 instead of setting it to 2, because then my system would keep running fine even if a node goes down. In terms of the CAP theorem, I'd like my system to be AP rather than CP.
The read requests are infrequent (more like 2-3 per second), but are very important to the business. They are performed against log-like data (which is immutable, and hence never updated). My temporary fix is to run the query more than once, say 3 times, instead of running it once. This way, I can be sure that even if I don't get my data in the first read request, the system would trigger read repairs, and I would eventually get my data during the 2nd or 3rd read request. Of course, these 3 queries happen one after the other, without any blocking.
Is there any way that I can direct Cassandra to perform read repairs in the background without having the need to actually perform a read request in order to trigger a repair?
Basically, I am looking for ways to tune my system in a way as to circumvent the 'eventual consistency' model, by which my reads would have a high probability of succeeding.
Help would be greatly appreciated.
reads would have a high probability of succeeding
Look at DowngradingConsistencyRetryPolicy. This policy allows queries to be retried with a lower CL than the initial one. With this policy your queries will have strong consistency when all nodes are available, and you will not lose availability if some node fails.
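The idea behind such a policy can be sketched in plain Python. This is not the driver's API: `Unavailable`, `fake_read` and `read_with_downgrade` are hypothetical stand-ins that simulate a read requiring a number of replicas, and a single retry at whatever the live replicas can still satisfy. Keep in mind that a downgraded read no longer carries the stronger guarantee of the original level.

```python
class Unavailable(Exception):
    """Raised when fewer replicas are alive than the read requires."""

    def __init__(self, required, alive):
        super().__init__(f"need {required} replicas, {alive} alive")
        self.required = required
        self.alive = alive


def fake_read(required_replicas, alive_replicas):
    """Simulated read: fails if not enough replicas are alive."""
    if alive_replicas < required_replicas:
        raise Unavailable(required_replicas, alive_replicas)
    return f"row read from {required_replicas} replica(s)"


def read_with_downgrade(required_replicas, alive_replicas):
    """Try the requested level first, then retry once at the highest
    level the live replicas can still satisfy."""
    try:
        return fake_read(required_replicas, alive_replicas)
    except Unavailable as exc:
        if exc.alive < 1:
            raise  # nothing left to downgrade to
        return fake_read(exc.alive, alive_replicas)


# RF=2 as in the question: both replicas up, then one replica down.
print(read_with_downgrade(2, alive_replicas=2))  # row read from 2 replica(s)
print(read_with_downgrade(2, alive_replicas=1))  # row read from 1 replica(s)
```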

Is Repair needed if all operations are quorum

Is Repair really needed if all operations execute at quorum?
Repair is generally needed to ensure all nodes are in sync, but quorum already ensures success is only returned when the quorum is in sync.
So if all operations execute at quorum, then do we need repair?
In our use-case, we never update records; we simply add and then delete the record. (If we see the message again after a failed 'delete', that is OK; it is not disastrous.) In fact, a repair could bring the record back to life, which would be undesirable (but not disastrous).
I would think with this situation, unless there was corruption of one of the nodes, we would not need repair.
I would also argue that with this setup, even if the delete succeeded and we saw the record again, it would not be a 'big deal'. As such, I think we could in fact set gc_grace=0: if the quorum operation succeeded, then only 2 replicas would be left with the record, which would never give us quorum from those 'offending' nodes, so we would never see those records anyway (unless a node dies).
So if a node dies post delete (assume 5 nodes, 3 for quorum),
then we have a 'stalemate' of 2 vs 2 and cannot achieve quorum. However, hint-repair would kick in if one of those records were read again (I'm not clear whether this WILL run, or only runs with the configured chance, i.e. the 10% default, if we had a quorum failure?).
Either way, with gc_grace=0 it would likely come back to life after the delete, so maybe having gc_grace=24 hours (to allow read-repair to correct it) would reduce the chance of seeing the record again.
Your basic thought process is sound - if you write with quorum and read with quorum and never overwrite, then yes, you can likely get by without repair.
You MAY need to run repair if you have to rebuild a failed node, as it's possible that the replacement could miss one of the replicas, and you'd be left with one of three, which may be missed upon read. If that happens, having run incremental repairs previously would make subsequent repairs faster, but it's not strictly necessary.
Your final two paragraphs aren't necessarily accurate - your logic in those is flawed (with 5 nodes and 1 dying, there is no 2v2 stalemate for quorum, that's fundamentally misunderstanding how quorum works). Hints are also best effort and only within a limited window, and read repair isn't guaranteed unless you change read repair to non-default settings.
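A short brute-force check illustrates the quorum point. It assumes RF=5 (the "5 nodes, 3 for quorum" case from the question) and the usual quorum size of floor(RF / 2) + 1; the function names are made up for the example. It enumerates every possible write quorum and read quorum to confirm they always share a replica, and checks whether a quorum is still reachable with some replicas down.

```python
from itertools import combinations


def quorum(rf):
    """Quorum size for a replication factor: floor(rf / 2) + 1."""
    return rf // 2 + 1


def quorums_always_overlap(rf):
    """True if every write quorum shares at least one replica with every read quorum."""
    replicas = range(rf)
    q = quorum(rf)
    return all(
        set(w) & set(r)
        for w in combinations(replicas, q)
        for r in combinations(replicas, q)
    )


def quorum_reachable(rf, down):
    """True if enough replicas remain alive to answer a quorum request."""
    return rf - down >= quorum(rf)


print(quorum(5))                    # 3
print(quorums_always_overlap(5))    # True
print(quorum_reachable(5, down=1))  # True: 4 replicas alive, 3 needed
print(quorum_reachable(5, down=3))  # False: 2 replicas alive, 3 needed
```

With one replica down, four remain and any three of them form a quorum, so there is no 2 vs 2 split to resolve.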

Using Cassandra as a Queue

Using Cassandra as Queue:
Is it really that bad?
Setup: 5 node cluster, all operations execute at quorum
Using DateTieredCompaction should significantly reduce the cost of tombstones, and allow entire SSTables to be dropped at once.
We add all messages to the queue with the same TTL
We partition messages based on time (say 1 minute intervals), and keep track of the read-position.
Messages consumed will be explicitly deleted. (only 1 thread extracts messages)
Some messages may be explicitly deleted prior to being read, i.e. we may have tombstones after the read-position (the TTL initially used is an upper limit).
gc_grace would probably be set to 0, as quorum reads will do blocking repair (i.e. we can have repair turned off, as messages only reside in 1 cluster (DC), and all operations are at quorum).
Messages can be added/deleted only, no updates allowed.
In our use case, if a tombstone does not replicate it's not a big deal; it's OK for us to see the same message multiple times occasionally. (Also, we would likely not run Repair on a regular basis, as all operations are executing at quorum.)
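For the time-based partitioning in the setup above, a minimal sketch of the bucketing could look like this. The 1 minute interval comes from the description; the function names, the use of UTC, and treating the bucket start time as the partition key are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

BUCKET_SECONDS = 60  # 1 minute partitions, as in the setup above


def bucket_start(ts, bucket_seconds=BUCKET_SECONDS):
    """Round a timestamp down to the start of its bucket (the assumed partition key)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)


def buckets_between(read_position, now, bucket_seconds=BUCKET_SECONDS):
    """List the buckets a consumer has to scan, from the read-position up to now."""
    current = bucket_start(read_position, bucket_seconds)
    last = bucket_start(now, bucket_seconds)
    result = []
    while current <= last:
        result.append(current)
        current += timedelta(seconds=bucket_seconds)
    return result


position = datetime(2024, 1, 1, 12, 34, 56, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 37, 10, tzinfo=timezone.utc)

print(bucket_start(position))               # 2024-01-01 12:34:00+00:00
print(len(buckets_between(position, now)))  # 4
```

The consumer only has to scan the partitions between the read-position and the current time, which keeps each read bounded to a handful of small partitions.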
Generally, it is an anti-pattern; this link discusses the impact of tombstones in detail:
My opinion is, try to avoid that if possible, but if you really understand the performance impact, and it is not an issue in your architecture, of course you could do that.
Another reason to avoid it if possible is that the Cassandra data structure is not designed for queues; it will always look ugly, UGLY!
I strongly suggest considering Redis or RabbitMQ before making your final decision.
