I have a question about deleted data reappearing ("zombie" data) in Cassandra when we do aggressive compaction and use a low gc_grace_seconds.
Based on the articles I have read, if we get rid of tombstones quickly by using a lower gc_grace_seconds and other parameters, the following can happen: say we have a replication factor of 3 and only 2 of the replicas are up and acknowledge the tombstone when the delete is written. Because of aggressive compaction, those tombstones, along with the data they shadow, would be removed quickly on those two replicas.
Now when the replica that was down comes back up, it is not aware of the tombstone or of the data having been wiped out on the other two replicas. When a read repair happens, the data that was never removed on this replica comes back to life and is replicated to the other two replicas.
But my question is: shouldn't hinted handoff take care of this? When the replica comes back up, shouldn't it receive the hint and fix/delete the data on its end? The default window for hinted handoff is 3 hours. So do these articles assume the replica comes back up after the hinted handoff window has expired, or do they account for the fact that hinted handoff doesn't happen immediately when the replica comes back up (the replica polls for hints every 10 minutes, or learns about them through gossip among nodes, which takes some time)?
The assumptions you have made are incorrect and hints alone won't be able to prevent deleted data from getting resurrected. I'll try to clarify some of the misconceptions you have.
You should not "do aggressive compactions". Forcing a major compaction can cause more problems than the one you are trying to solve, as I've explained in this post -- https://community.datastax.com/questions/6396/.
A low GC grace is a band-aid solution. It only temporarily masks a bigger problem -- you need to address why your table has a lot of tombstones in the first place. You should not set GC grace to a value so low that it isn't practical for you to replace a dead node within it. Let's say a node has a hardware failure and is not recoverable: if it normally takes you 2 days to fix a server, you should set GC grace to more than 2 days.
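For illustration, here is a minimal sketch (using the Python cassandra-driver; the contact point, keyspace and table names are hypothetical) of setting GC grace comfortably above a 2-day node-replacement window:

```python
# Minimal sketch: raise gc_grace_seconds above an assumed 2-day node-replacement window.
# Contact point, keyspace and table names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# 3 days in seconds, i.e. longer than the 2 days it takes to replace a dead node
session.execute("ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 259200")
```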
Coordinators for write requests store hints for replicas which are down. When the replica comes back online, the coordinator replays the hints to the replica. This is what's known as hinted handoff. A downed replica does not "read the hint" itself.
There is no "default expiration" on hints. The 3 hours is max_hint_window_in_ms, which is the amount of time that hints will be stored for a replica that is down. After this window, hints are no longer stored by the coordinator and you need to repair the replica manually.
Hints older than GC grace are expired and do not get handed off to a replica. This is another danger of setting GC grace too low.
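If it helps, here is a rough sanity-check sketch (Python cassandra-driver; it assumes max_hint_window_in_ms in cassandra.yaml is still the 3-hour default, and the contact point is hypothetical) that flags tables whose gc_grace_seconds is below the hint window:

```python
# Rough sanity check: flag tables whose gc_grace_seconds is shorter than the hint window.
# Assumes max_hint_window_in_ms is the 3-hour default; contact point is hypothetical.
from cassandra.cluster import Cluster

MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # 3 hours, the default max_hint_window_in_ms

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

rows = session.execute(
    "SELECT keyspace_name, table_name, gc_grace_seconds FROM system_schema.tables")
for row in rows:
    if row.gc_grace_seconds * 1000 < MAX_HINT_WINDOW_MS:
        print(f"{row.keyspace_name}.{row.table_name}: gc_grace_seconds={row.gc_grace_seconds} "
              "is below the hint window, so hints for it may expire before being replayed")
```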
If you need to set GC grace to a low value and manually run compactions, it indicates to me that there's something wrong with your data model that generates so many tombstones. You need to review your data model and address the root cause. Chances are you are using Cassandra as a queue or storing queue-like datasets, which is an anti-pattern since it generates a lot of tombstones, as discussed in this blog post.
Ryan Svihla wrote Understanding Deletes in Cassandra where he proposes an alternative model for handling queue-like data that partially avoids the tombstone issue. Cheers!
3 hours is the default window for hinted handoff, but if gc_grace_seconds is lower than that, hints expire before the hinted handoff window is reached. There is a good article from The Last Pickle that explains how these things relate to each other.
Related
I'm using LCS and a relatively large TTL of 2 years for all inserted rows, and I'm concerned about when C* will drop the corresponding tombstones (neither explicit deletes nor updates are being performed).
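For context, a minimal sketch of the kind of table I mean (Python cassandra-driver; the keyspace, table and columns are made up):

```python
# Minimal sketch of the table shape described above: LCS plus a 2-year default TTL.
# Keyspace, table and column names are made up.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")

TWO_YEARS = 2 * 365 * 24 * 3600  # 63072000 seconds

session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id uuid PRIMARY KEY,
        payload text
    ) WITH default_time_to_live = %d
      AND compaction = {'class': 'LeveledCompactionStrategy'}
""" % TWO_YEARS)
```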
From Missing Manual for Leveled Compaction Strategy, Tombstone Compactions in Cassandra and Deletes Without Tombstones or TTLs I understand that
All levels except L0 contain non-overlapping SSTables, but a partition key may be present in one SSTable in each level (aka distributed in all levels).
For a compaction to be able to drop a tombstone, it must be sure that it is compacting all the SSTables that contain the shadowed data, to prevent zombie data (this is checked using bloom filters). It also considers gc_grace_seconds.
So, for my particular use case (2-year TTL and a write-heavy load) I can conclude that TTLed data will be in the highest levels, so I'm wondering when those SSTables with TTLed data will be compacted with the SSTables that contain the corresponding tombstones.
The main question is: where are tombstones (from TTLs) created? Are they created at level 0, so that it will take a long time until they end up in the highest levels (and hence disk space will take a long time to be freed)?
In a comment on About deletes and tombstones, Alain says that:
Yet using TTLs helps, it reduces the chances of having data being fragmented between SSTables that will not be compacted together any time soon. Using any compaction strategy, if the delete comes relatively late in the row history, as it use to happen, the 'upsert'/'insert' of the tombstone will go to a new SSTable. It might take time for this tombstone to get to the right compaction "bucket" (with the rest of the row) and for Cassandra to be able to finally free space.
My understanding is that with TTLs the tombstones is created in-place, thus it is often and for many reasons easier and safer to get rid of a TTLs than from a delete.
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
I'm not sure what it means with "in-place" since SSTables are immutable.
(I also have some doubts about what it says of using default_time_to_live that I've asked in How default_time_to_live would delete rows without tombstones in Cassandra?).
My guess is that it refers to tombstones being created in the same level (but in different SSTables) as the TTLed data, during a compaction triggered by one of the following reasons:
"Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread" The Missing Manual for Leveled Compaction Strategy
"If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions" The Missing Manual for Leveled Compaction Strategy
"When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable." CASSANDRA-7019
Since tombstones are created during compaction, I think it may be using SSTable metadata to estimate droppable tombstones.
So, compactions (2) and (3) should be creating/dropping tombstones in the highest levels, and hence using LCS with a large TTL should not be an issue per se.
By creating/dropping I mean that the same kinds of compactions will be creating tombstones for expired data and/or dropping tombstones if the GC grace period has already passed.
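For what it's worth, single-SSTable tombstone compactions (reason 3 above) can be encouraged via the compaction subproperties. A minimal sketch (Python cassandra-driver; the names and values are hypothetical, not a recommendation):

```python
# Minimal sketch: make single-SSTable tombstone compactions (reason 3 above) more likely
# by tuning the LCS tombstone subproperties. Names and values are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    ALTER TABLE my_ks.events
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'tombstone_threshold': '0.2',
        'tombstone_compaction_interval': '86400',
        'unchecked_tombstone_compaction': 'true'
    }
""")
```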
A link to source code that clarifies this situation would be great, thanks.
Alain Rodriguez's answer from the mailing list:
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
As explained on a parallel thread, this is wrong, mea culpa. I believe the rest of my comment still stands (hopefully :)).
I'm not sure what it means with "in-place" since SSTables are immutable.
My guess is that it refers to tombstones being created in the same level [...]
Yes, I believe during the next compaction following the expiration date, the entry is 'transformed' into a tombstone, and lives in the SSTable that is the result of the compaction, on the level/bucket this SSTable is put into. That's why I said 'in-place', which is indeed a bit weird for immutable data.
As a side idea for your problem, on 'modern' versions of Cassandra (I don't remember the version, that's what 'modern' means ;-)), you can run 'nodetool garbagecollect' regularly (not necessarily frequently) during off-peak periods. That might use the cluster resources when you don't need them to reclaim some disk space. Also, making sure that a 2-year-old record is not being updated regularly by design would definitely help. In the extreme case of writing data once (never updated) and with a TTL, for example, I see no reason for a 2-year-old record not to be evicted correctly. As long as the disk can grow, it should be fine.
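A minimal sketch of what such an off-peak run could look like (wrapping nodetool from Python, e.g. from cron on each node; the keyspace and table names are hypothetical, and the garbagecollect subcommand only exists on recent Cassandra versions):

```python
# Minimal sketch: run 'nodetool garbagecollect' for one table during an off-peak window,
# e.g. invoked from cron on each node. Keyspace/table names are hypothetical, and the
# subcommand requires a recent Cassandra version.
import subprocess

subprocess.run(["nodetool", "garbagecollect", "my_ks", "events"], check=True)
```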
I would not be too scared about it, as there is 'always' a way to remove tombstones. Yet it's good to think about the design beforehand; generally, it helps if you can rotate partitions over time and not reuse old partitions, for example.
As far as I understood, the problem of deleted data reappearing in Cassandra is as follows:
A delete is issued with consistency < ALL (e.g. QUORUM)
The delete succeeds, but some nodes in the replication set were not reachable during the delete
A tombstone is written to all the reached nodes, nothing in the others
10 days (gc_grace_seconds) pass; the tombstones are eligible to be expired
Compactions happen, tombstones are actually removed
A read is issued: the nodes which received the delete reply with "no data"; the nodes which were unavailable during the delete reply with the old data; a zombie is produced
Now my question is: if the original delete was issued with consistency = ALL, all the nodes would either have the tombstone (before expiry & compaction) or no data at all (after expiry & compaction). No zombies should then be produced, even if we did not issue a repair before tombstone expiry.
Is this correct?
Yes, you still need to run repairs even with CL.ALL on the delete if you want to guarantee no resurrected data. You just decrease the likelihood of it occurring without you noticing it.
If a node is unavailable for the delete, the delete will fail for the client (because of CL.ALL), but the other nodes still received the delete. Even if your app retries the delete, there's a chance of it failing (e.g. your app's server being hit by a meteor). So then you have a delete that has been seen by 2 of your 3 replicas. If you lowered your gc_grace and don't run repairs, the other anti-entropy measures (hints, read repairs) may not ensure the tombstone reaches the 3rd replica before it is compacted away (they are best effort, not a guarantee). The next read touches the 3rd replica, which has the original data, and no tombstone exists to say it was deleted, so the data is resurrected as it gets read-repaired to the other replicas.
What you can do is log a statement somewhere when there is a CL.ALL timeout or failure. This is not a guarantee, since your app can die before writing the log, and a failure does not actually mean that the write did not reach all replicas -- just that it may have failed to. That said, I would strongly recommend just using QUORUM (or LOCAL_QUORUM). That way you can tolerate some host failures without losing availability, since you need the repairs for the guarantee anyway.
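A minimal sketch of that approach with the Python driver (the contact point, keyspace and table are hypothetical; it logs coordinator-reported timeouts/failures so you know a repair may be needed):

```python
# Minimal sketch: delete at QUORUM and log coordinator-reported problems so you know
# when a repair may be needed. Contact point, keyspace and table are hypothetical.
import logging

from cassandra import ConsistencyLevel, Unavailable, WriteFailure, WriteTimeout
from cassandra.cluster import Cluster

log = logging.getLogger("deletes")

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")

delete = session.prepare("DELETE FROM events WHERE id = ?")
delete.consistency_level = ConsistencyLevel.QUORUM

def delete_row(row_id):
    try:
        session.execute(delete, [row_id])
    except (Unavailable, WriteTimeout, WriteFailure) as exc:
        # The tombstone may still have reached some replicas; flag this key for repair.
        log.warning("delete of %s did not complete cleanly: %s", row_id, exc)
        raise
```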
When issuing queries with consistency = ALL, every node owning the token range of that particular record has to acknowledge. So if one of those nodes is down during this process, the DELETE will fail, as it can't achieve the required consistency of ALL.
So consistency = ALL can end up being a scenario where every relevant node in the cluster has to stay up, otherwise queries will fail. That's why people recommend using a less strict consistency level like QUORUM. In other words, you are sacrificing high availability instead of relying on repairs if you want to perform queries at consistency = ALL.
Data we store in Cassandra is pure time series with no manual deletes. Data gets deleted only by TTL.
For such use cases, is repair really needed? What is the impact of not running repair?
Tombstoned data is really deleted only after gc_grace_seconds + compaction. If a table with tombstoned data is not compacted, you will be stuck with this data, and it will cause performance degradation.
If you don't run repair within the gc_grace period, dead data can live again. Here's the DataStax article on this (and on why you need to run repairs regularly):
https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
EDIT:
TTLed data isn't tombstoned at the time it expires, but only when a compaction process runs (at least in 3.9). You will not see expired data in results, even when there are no tombstones yet.
So, if there is a problem with a node and TTLed data doesn't get its tombstone during one compaction, it will get one on the next compaction, or will simply be deleted. Given this, and the fact that your data is never deleted explicitly but only expires, and that you don't have any overwrites to the same key, you don't have to run repairs for data consistency.
That said, regarding all of the above, I would still recommend running repairs once in a while (with a much longer interval between them), in case something was accidentally written outside your normal write path.
If you set a TTL, Cassandra will mark the data with a tombstone after the time is exceeded. If you don't run repair regularly, a huge number of tombstones will build up and affect Cassandra's performance.
After the number of seconds since the column's creation exceeds the TTL value, the data is considered expired and is no longer included in results. Expired data is marked with a tombstone on the next read, but it remains for a maximum of gc_grace_seconds. After this amount of time, the tombstoned data is automatically removed during the normal compaction and repair processes.
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
We are running a Titan Graph DB server backed by Cassandra as a persistent store and are running into an issue with reaching the limit on Cassandra tombstone thresholds that is causing our queries to fail / timeout periodically as data accumulates. It seems like the compaction is unable to keep up with the number of tombstones being added.
Our use case supports:
High read / write throughputs.
High sensitivity to reads.
Frequent updates to node values in Titan, causing rows to be updated in Cassandra.
Given the above use cases, we are already optimizing Cassandra to aggressively do the following (expressed as table options in the sketch after this list):
Aggressive compaction by using the leveled compaction strategy.
Using a tombstone_compaction_interval of 60 seconds.
Using a tombstone_threshold of 0.01.
Setting gc_grace_seconds to 1800.
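For reference, this is roughly how those settings map onto table options (a sketch using the Python cassandra-driver; the keyspace name is a placeholder, not our real schema):

```python
# Roughly how the settings listed above map to table options.
# Keyspace name is a placeholder; shown for reference, not as a recommendation.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    ALTER TABLE my_keyspace.graphindex
    WITH gc_grace_seconds = 1800
    AND compaction = {
        'class': 'LeveledCompactionStrategy',
        'tombstone_compaction_interval': '60',
        'tombstone_threshold': '0.01'
    }
""")
```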
Despite these optimizations, we are still seeing warnings in the Cassandra logs similar to:
[WARN] (ReadStage:7510) org.apache.cassandra.db.filter.SliceQueryFilter: Read 0 live and 10350 tombstoned cells in .graphindex (see tombstone_warn_threshold). 8001 columns was requested, slices=[00-ff], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
Occasionally, as time progresses, we also see the failure threshold breached, which causes errors.
Our cassandra.yaml has tombstone_warn_threshold set to 10000 and tombstone_failure_threshold set much higher than recommended, at 250000, with no real noticeable benefit.
Any help that can point us to the correct configurations would be greatly appreciated if there is room for further optimizations. Thanks in advance for your time and help.
Sounds like the root of your problem is your data model. You've done everything you can to mitigate getting TombstoneOverwhelmingException. Since your data model requires such frequent updates, which cause tombstone creation, an eventually consistent store like Cassandra may not be a good fit for your use case. When we've experienced these types of issues, we had to change our data model to fit better with Cassandra's strengths.
About deletes http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252 (slides 34-39)
Tombstones are not compacted away until the gc_grace_seconds configured on the table has elapsed for a given tombstone. So even if you increase your compaction frequency, your tombstones will not be removed until gc_grace_seconds has elapsed, the default being 10 days. You could try tuning gc_grace_seconds down to a lower value and running repairs more frequently (usually you want to schedule repairs to happen every gc_grace_seconds_in_days - 1 days).
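A minimal sketch of such a schedule, wrapping nodetool from Python (e.g. invoked from cron on each node so that every node is repaired within gc_grace; the keyspace name is hypothetical):

```python
# Minimal sketch: run a primary-range repair on this node; schedule it (e.g. via cron)
# so every node completes a repair within gc_grace_seconds. Keyspace name is hypothetical.
import subprocess

subprocess.run(["nodetool", "repair", "-pr", "my_ks"], check=True)
```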
So everyone here is right. If you repair and compact frequently you can reduce your gc_grace_seconds number.
It may also be worth considering that inserting nulls is equivalent to a delete, which will increase your number of tombstones. Instead, you'll want to insert UNSET_VALUE if you're using prepared statements. Probably too late for you, but it may help anyone else who comes here.
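A minimal sketch with the Python driver (the table and columns are made up):

```python
# Minimal sketch: bind UNSET_VALUE instead of None for columns you don't want to touch,
# so the insert does not write a null (which is a tombstone). Names are made up.
from cassandra.cluster import Cluster
from cassandra.query import UNSET_VALUE

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")

insert = session.prepare("INSERT INTO users (id, name, email) VALUES (?, ?, ?)")

# email is unknown here: UNSET_VALUE leaves the column untouched, None would tombstone it
session.execute(insert, [42, "alice", UNSET_VALUE])
```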
The variables you've tuned help expire tombstones, but it's worth noting that while tombstones cannot be purged before gc_grace_seconds, Cassandra makes no guarantee that tombstones WILL be purged at gc_grace_seconds. Indeed, a tombstone is not removed until the SSTable containing it is compacted, and even then it will not be eliminated if another SSTable contains a cell that it shadows.
This means tombstones can persist for a very long time, especially if you're using SSTables that are infrequently compacted (say, very large STCS SSTables). To address this, tools exist such as the JMX endpoint forceUserDefinedCompaction -- and if you're not adept at using JMX endpoints, tools that do this for you automatically exist, such as http://www.encql.com/purge-cassandra-tombstones/
This is a two-part question regarding nodetool repair and garbage collection.
Let's consider a replication factor of 3 for all tables, and suppose reads and writes require two confirmations of success to succeed. Based on my understanding of Cassandra, a successful write or delete would never be in danger of being missed as long as a read requires at least two responses, accepting only the latest timestamp. This makes sense to me, but is it correct?
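For reference, the overlap condition I'm relying on can be stated as the usual quorum inequality (not from any particular source, just the standard argument):

```latex
% With replication factor N, write consistency W and read consistency R,
% every read quorum intersects the most recent successful write (or its tombstone) when
R + W > N
% e.g. reads and writes requiring 2 acknowledgements at RF = 3: 2 + 2 = 4 > 3
```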
As a closely related question, if I configure Cassandra never to perform GC, but still perform nodetool repair periodically, will this suffice to garbage-collect old tombstones? Intuitively, a successfully repaired key range should not need to keep tombstones, so they could in theory be discarded when a repair is performed. Is this the case?
If my above two hypotheses are correct, it seems like we can achieve the following:
Consistent reads and writes with no resurrected data (due to quorum reads and writes and avoiding GC completely)
No unbounded growth in stale tombstones (due to periodically running nodetool repair, which hopefully performs GC if my above hypothesis is correct)
This post explains that quorum doesn't guarantee consistency: Read Operation in Cassandra at Consistency level of Quorum?
Assuming "GC" means compaction, I don't think nodetool repair will suffice to delete tombstones or take care of other compaction tasks. https://issues.apache.org/jira/browse/CASSANDRA-6602 describes a compaction-less scenario that sounds like what you're considering. If this is what you're doing, the recommended solution is to use DateTieredCompactionStrategy (DTCS) to store data written within a certain period of time in the same SSTable. DTCS was released in Cassandra 2.1.1 today and is described here: http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/tabProp.html?scroll=tabProp__moreCompaction