We do logged batch inserts and selects against Cassandra, always against the same partition, and we don't set any null columns, so we don't need tombstones for recovery. The data has a TTL, so it always expires on each node. So we set gc_grace_seconds to 0, but now we get a lot of warnings from logged batches. We want to suppress only this one warning without suppressing all warnings. Is there any way to do so?
On the other hand, I have read that batch replays also use gc_grace_seconds if it is shorter than max_hint_window_ms; is that also correct for inserted data? Is there any way we can end up in a situation where one node does not have the new rows after recovery? The link below says the only risk of gc_grace_seconds being 0 is losing deleted data, but we don't delete data, so is there still any risk?
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlCreateTable.html#cqlTableProperties__Gc_grace_seconds
Thanks for the help,
So we set gc_grace_seconds to 0
Setting gc_grace_seconds to zero is a really bad idea. Unless you're running a one-node cluster, you'll eventually start to see old data "zombie" its way back from a TTL.
I have read that batch replays also use gc_grace_seconds if it is shorter than max_hint_window_ms; is that also correct for inserted data?
Yes.
Is there any way we can end up in a situation where one node does not have the new rows after recovery?
Yes. You could also see TTL'd data come back.
The link below says the only risk of gc_grace_seconds being 0 is losing deleted data, but we don't delete data, so is there still any risk?
TTL'd data still uses the tombstone mechanism. Those tombstones also need to be replicated. When they're not replicated (node-down scenarios), that's when you'll see old data come back.
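If you do raise gc_grace_seconds again, a minimal sketch in CQL (hypothetical keyspace/table names; the 10800-second figure assumes the default 3-hour hint window):

-- Keep gc_grace_seconds at or above the hint window so hints and batchlog
-- replays for TTL'd writes are not expired before they can be delivered.
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 10800;  -- 3 hours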
Could you point me to an official Cassandra or DataStax document about how hints use gc_grace_seconds, or why a node needs tombstones in order to expire its own TTL'd data?
The official docs have this one covered:
Apache Cassandra Documentation - Compaction: https://cassandra.apache.org/doc/4.1/cassandra/operating/compaction/index.html#ttl
"Once the TTL has expired the data is converted to a tombstone which stays around for at least gc_grace_seconds."
The nuances of tombstones, TTLs, and gc_grace_seconds are further discussed in these posts -
Hinted Handoff and GC Grace Seconds Demystified (TLP is now a part of DataStax) by Radovan Zvoncek: https://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html
Tombstones and Ghost Data Don't Have to be Scary! (I wrote this for DS last year): https://medium.com/building-the-open-data-stack/tombstones-and-ghost-data-dont-have-to-be-scary-with-these-tips-and-tricks-from-datastax-48f3c275b05a
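To make the quoted sentence concrete, a small illustrative sketch (hypothetical keyspace/table and TTL value):

CREATE TABLE IF NOT EXISTS my_ks.events (
    pk text,
    ck timeuuid,
    payload text,
    PRIMARY KEY (pk, ck)
);

-- This row expires after 60 seconds; the expired cells then behave as
-- tombstones and are only purged by a compaction once gc_grace_seconds
-- has also elapsed.
INSERT INTO my_ks.events (pk, ck, payload) VALUES ('site-1', now(), 'data') USING TTL 60;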
Related
I have a question about zombie (deleted) data reappearing in Cassandra when we do aggressive compaction and use a low gc_grace_seconds.
The articles I have read say that if we get rid of tombstones quickly, using a lower gc_grace_seconds and other parameters, problems can arise. Say we have a replication factor of 3, and during the delete only 2 of the replicas are up and acknowledge the tombstone. Because of aggressive compaction, those tombstones would quickly be removed, along with the shadowed data, on those two replicas.
Now, when the replica that was down comes back up, it will not be aware of the tombstone or of the data being wiped out on the other two replicas. When read repair happens, the data that wasn't removed on this replica will come back to life and be replicated to the other two replicas.
But my question is: shouldn't hinted handoff take care of this? When the replica comes back up, shouldn't it read the hint and fix/delete the data on its end? The default hinted handoff window is 3 hours. So does this scenario assume the replica only comes back up after the hint window has expired, or does it account for the fact that hinted handoff doesn't happen immediately when the replica comes back up? The replica polls for hints every 10 minutes, or learns of them through gossip among the nodes, which will take some time.
The assumptions you have made are incorrect and hints alone won't be able to prevent deleted data from getting resurrected. I'll try to clarify some of the misconceptions you have.
You should not "do aggressive compactions". Forcing a major compaction can cause more problems than you are trying to solve as I've explained in this post -- https://community.datastax.com/questions/6396/.
A low GC grace is a band-aid solution. It is only temporarily masking a bigger problem -- you need to address why your table has a lot of tombstones. You should not set GC grace to a value so low that it isn't practical for you to replace a dead node. Let's say a node has a hardware failure and is not recoverable: if it normally takes you 2 days to fix a server, you should set GC grace higher than 2 days.
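As a hedged sketch of that rule of thumb (hypothetical keyspace/table; 2 days is 172800 seconds, so 259200 seconds leaves a day of headroom):

-- GC grace set above the time it takes you to replace a dead node
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 259200;  -- 3 days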
Coordinators for write requests store hints for replicas which are down. When the replica comes back online, the coordinator replays the hints to the replica. This is what's known as hinted handoff. A downed replica does not "read the hint" itself.
There is no "default expiration" on hints. The 3 hours is max_hint_window_in_ms, which is the amount of time that hints will be stored for a replica that is down. After this window, hints will no longer be stored by the coordinator and you will need to repair the replica manually.
Hints older than GC grace are expired and do not get handed off to a replica. This is another danger of setting GC grace too low.
If you need to set GC grace to a low value and manually run compactions, it indicates to me that there's something wrong with your data model for it to generate so many tombstones. You need to review your data model and address the root cause. Chances are you are using Cassandra as a queue, or storing queue-like datasets, which is an anti-pattern since they generate a lot of tombstones, as discussed in this blog post.
Ryan Svihla wrote Understanding Deletes in Cassandra where he proposes an alternative model for handling queue-like data that partially avoids the tombstone issue. Cheers!
3 hours is the default hinted handoff window, but if your gc_grace_seconds is lower than that, hints expire before reaching the time set for hinted handoff. There is a good article from The Last Pickle that explains how these things relate to each other.
I have a table where I insert data with a TTL of 1 minute, and I get a warning in DSE OpsCenter about the high number of tombstones in that table. That does make sense, since on average 80 records per minute are inserted into this table.
So, for example, over a full day that is 80 * 60 * 24 = 115,200 records inserted and TTL'd.
My question is what should I do in order to decrease the number of tombstones in this table?
I've been looking into tombstone_compaction_interval and gc_grace_seconds, and this is where it gets a bit confusing, as I'm having trouble understanding the exact impact of these properties on tombstones (even after reading the documentation provided by DataStax - http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html and http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html).
I've also been looking into LeveledCompactionStrategy (https://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra), since it also seems to affect tombstone compaction, although I don't fully understand why.
So I'm hoping someone will be able to help me better understand how this all works, or even just let me know if I'm going in the right direction.
Please read this http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html. Very good read.
Overall: the gc_grace_seconds parameter is the minimum time that tombstones are kept on disk after data has been deleted. We need to make sure that all replicas received the delete and have the tombstones stored, in order to avoid zombie data issues. By default it is 10 days.
tombstone_compaction_interval: this property was introduced as part of CASSANDRA-4781 (https://issues.apache.org/jira/browse/CASSANDRA-4781), for the case where the tombstone ratio was high enough to trigger a single-SSTable compaction, but the tombstones could not be evicted because of overlapping SSTables.
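As a sketch of where these sub-properties live (hypothetical keyspace/table; the values shown are the documented defaults, 86400 seconds and 0.2):

ALTER TABLE my_ks.metrics
  WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_compaction_interval': '86400',
    'tombstone_threshold': '0.2'
  };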
I am not sure about your current data model, but here are my suggestions.
You probably have to change your data model. Please read https://academy.datastax.com/resources/getting-started-time-series-data-modeling and "Time series modelling (with start & end date) in Cassandra".
Change your write pattern.
Change your read pattern: try to read only active data. (With your current data model, reads have to walk through tombstone cells in order to reach the active cells.)
Try TimeWindowCompactionStrategy and tune it for your workload (http://thelastpickle.com/blog/2017/01/10/twcs-part2.html); see the sketch after this list.
If you are setting the TTL per write (i.e., in the INSERT or UPDATE statement), see if you can move it to the table level (default_time_to_live) instead.
If you are using STCS and want to change compaction sub-properties, you could set unchecked_tombstone_compaction=true and min_threshold=3 (a little bit aggressive); this is also shown in the sketch below.
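A hedged sketch of the TWCS and STCS suggestions above (hypothetical keyspace/table names; the 10-minute window is an assumption to tune for your workload, and the 60-second table TTL mirrors the TTL in the question):

-- Option A: switch to TWCS with a table-level TTL, so whole SSTables can be
-- dropped once everything in a time window has expired.
ALTER TABLE my_ks.sensor_raw
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'MINUTES',
    'compaction_window_size': '10'
  }
  AND default_time_to_live = 60;

-- Option B: stay on STCS but make single-SSTable tombstone compactions a
-- little more aggressive, as suggested above.
ALTER TABLE my_ks.sensor_raw
  WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'unchecked_tombstone_compaction': 'true',
    'min_threshold': '3'
  };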
I am using the awesome Cassandra DB (3.7.0) and I have questions about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and gets new data every second. Another processor then reads a row and removes it.
It seems like this raw_data table becomes slow at reading and writing after several days of running.
Is this because the deleted rows stay around as tombstones? The table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than the 10-day default to remove tombstones more quickly? (By the way, I am running a single-node DB.)
Thank you in advance.
Deleting your data is one way to get tombstone problems. TTL is the other.
It is pretty normal for a Cassandra cluster to get slower and slower as deletes accumulate, and eventually it will refuse to read data from this table.
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use: in order to actually remove tombstones, a compaction is needed.
I'd reconsider the single-node cluster and go with the standard minimum of 3 nodes with RF=3. Then I'd design the project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure you have plenty of IOPS, because compaction is very IO intensive.
In short, tombstones are used by Cassandra to mark data as deleted, and they are replicated to the other nodes so the deleted data doesn't reappear. These tombstones are stored in Cassandra until gc_grace_seconds has elapsed. Creating a lot of tombstones can slow down reads on your table. Since you are using a single-node Cassandra, you don't have to replicate anything to other nodes, so you can lower gc_grace_seconds to 1 day without ill effect. If you plan to add new nodes or data centers in the future, revisit this gc_grace_seconds setting.
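For the single-node case above, a minimal sketch (hypothetical keyspace name; 86400 seconds = 1 day):

-- Check the current settings first...
SELECT gc_grace_seconds, default_time_to_live
FROM system_schema.tables
WHERE keyspace_name = 'my_ks' AND table_name = 'raw_data';

-- ...then lower the grace period; tombstones still need a compaction
-- to actually be purged.
ALTER TABLE my_ks.raw_data WITH gc_grace_seconds = 86400;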
We are running a Titan Graph DB server backed by Cassandra as a persistent store, and we are hitting Cassandra's tombstone thresholds, which causes our queries to fail or time out periodically as data accumulates. It seems like compaction is unable to keep up with the number of tombstones being added.
Our use case involves:
High read/write throughput.
High sensitivity to read latency.
Frequent updates to node values in Titan, causing rows to be updated in Cassandra.
Given the above, we are already tuning Cassandra aggressively with the following:
Aggressive compaction using LeveledCompactionStrategy.
Setting tombstone_compaction_interval to 60 seconds.
Setting tombstone_threshold to 0.01.
Setting gc_grace_seconds to 1800.
Despite these optimizations, we are still seeing warnings in the Cassandra logs similar to:
[WARN] (ReadStage:7510) org.apache.cassandra.db.filter.SliceQueryFilter: Read 0 live and 10350 tombstoned cells in .graphindex (see tombstone_warn_threshold). 8001 columns was requested, slices=[00-ff], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
Occasionally, as time progresses, we also see the failure threshold breached, which causes errors.
Our cassandra.yaml has tombstone_warn_threshold set to 10000 and tombstone_failure_threshold set much higher than recommended, at 250000, with no real noticeable benefit.
Any help that can point us to the correct configurations would be greatly appreciated if there is room for further optimizations. Thanks in advance for your time and help.
Sounds like the root of your problem is your data model. You've done everything you can to mitigate TombstoneOverwhelmingException. Since your data model requires such frequent updates that cause tombstone creation, an eventually consistent store like Cassandra may not be a good fit for your use case. When we've experienced these types of issues, we had to change our data model to fit better with Cassandra's strengths.
About deletes http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252 (slides 34-39)
Tombstones are not compacted away until gc_grace_seconds has elapsed for a given tombstone. So however you tune your compaction interval, tombstones will not be removed until gc_grace_seconds has elapsed, with the default being 10 days. You could try tuning gc_grace_seconds down to a lower value and running repairs more frequently (usually you want to schedule repairs to happen every gc_grace_seconds, expressed in days, minus 1 day).
So everyone here is right. If you repair and compact frequently, you can reduce your gc_grace_seconds value.
It may also be worth considering that inserting nulls is equivalent to a delete, which will increase your number of tombstones. Instead, you'll want to insert UNSET_VALUE if you're using prepared statements. Probably too late for you, but noted for anyone else who lands here.
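A small CQL-only sketch of the null-versus-omitted distinction (hypothetical table; UNSET_VALUE itself is a driver-side concept for prepared statements, so it is not shown here):

CREATE TABLE IF NOT EXISTS my_ks.users (
    id uuid PRIMARY KEY,
    name text,
    email text
);

-- Explicitly writing null creates a cell tombstone for email...
INSERT INTO my_ks.users (id, name, email) VALUES (uuid(), 'alice', null);

-- ...while simply omitting the column writes no tombstone at all.
INSERT INTO my_ks.users (id, name) VALUES (uuid(), 'bob');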
The variables you've tuned are helping you expire tombstones, but it's worth noting that while tombstones cannot be purged before gc_grace_seconds has elapsed, Cassandra makes no guarantee that tombstones WILL be purged at gc_grace_seconds. Indeed, tombstones are not removed until the SSTable containing them is compacted, and even then a tombstone will not be eliminated if another SSTable contains a cell that it shadows.
This means tombstones can potentially persist for a very long time, especially if they sit in SSTables that are infrequently compacted (say, very large STCS SSTables). To address this, tools exist such as the forceUserDefinedCompaction JMX endpoint; if you're not adept at using JMX endpoints, there are tools that do this for you automatically, such as http://www.encql.com/purge-cassandra-tombstones/
I am always inserting data with PRIMARY KEY ((site_name, date), time, id). While site_name and date can be the same, the time (a timestamp column) and the id (a uuid) are always different, so I only ever add new data. Data is inserted with a TTL (currently 3 days). So, since I don't delete or update, can I disable compaction, considering the TTL is there? Would it affect anything? Also, as no record is ever deleted, can I disable gc_grace? I want to put as little load on the servers as possible. Much appreciated if anyone can help.
TTLs create tombstones. As such, compaction is required. If your data is time series data, you might consider the new date tiered compaction: http://www.datastax.com/dev/blog/datetieredcompactionstrategy .
If you use TTLs and set gc_grace to 0, you're asking for trouble unless your cluster is a single-node one. The grace period is the amount of time to wait before collecting tombstones; if it's 0, Cassandra won't wait. This may sound good, but in reality it means the "deletion" might not propagate across the cluster, and the deleted data may reappear (because other nodes may still have it, and the last value present will "win"). This type of data is called zombie data. Zombies are bad. Don't feed the zombies.
You can disable auto-compaction: http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsDisableAutoCompaction.html . But again, I doubt you'll gain much from this. Again, look at date-tiered compaction; a sketch follows below.
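As a sketch of the date-tiered suggestion (hypothetical keyspace/table names; the 3-day table-level TTL mirrors the TTL mentioned in the question):

ALTER TABLE my_ks.readings
  WITH compaction = {'class': 'DateTieredCompactionStrategy'}
  AND default_time_to_live = 259200;  -- 3 days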
You can permanently disable autocompaction on individual tables (column families), like this (CQL):
ALTER TABLE <tablename> WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};
(use whatever compaction class the table already has; 'enabled': 'false' is what turns autocompaction off). This disables autocompaction on that table until you re-enable it, but you can still run a manual compaction whenever you like with the 'nodetool compact' command.
You can set gc_grace to 0, but I would not turn off compaction. That said, if you truly never delete or update, you might be able to get away with turning compaction off.
Edit:
There are optimizations in C* 2.0 and onwards for exactly this case:
https://issues.apache.org/jira/browse/CASSANDRA-4917
About TTL, tombstones and GC Grace
http://mail-archives.apache.org/mod_mbox/cassandra-user/201307.mbox/%3CCALY91SNy=cxvcGJh6Cp171GnyXv+EURe4uadso1Kgb4AyFo09g#mail.gmail.com%3E