I am using the awesome Cassandra DB (3.7.0) and I have questions about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and gets new data every second. Another processor then reads one row at a time and removes it.
It seems like this raw_data table becomes slow at reading and writing after several days of running.
Is this because the deleted rows are staying around as tombstones? This table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than 10 days (the default value) to remove tombstones more quickly? (By the way, I am running a single-node DB.)
Thank you in advance.
Deleting your data is one way to get tombstone problems. TTL is the other way.
It is pretty normal for a Cassandra cluster to become slower and slower as deletes accumulate, and your cluster will eventually refuse to read data from this table (a read is aborted once a single query scans more tombstones than tombstone_failure_threshold).
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use, because tombstones are only removed when a compaction runs.
I'd change my mind about my single-node cluster and I'd go with the minimum standard 3 nodes with RF=3. Then I'd design my project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure to have plenty of IOPS, because compaction is very IO intensive.
In short, tombstones are used by Cassandra to mark data as deleted and to replicate that deletion to other nodes, so the deleted data doesn't reappear. These tombstones are stored in Cassandra until gc_grace_seconds has elapsed. Creating more tombstones might slow down your table. As you are using a single-node Cassandra you don't have to replicate anything to other nodes, hence you can lower gc_grace_seconds to 1 day without any ill effect. If you plan to add new nodes and data centers in the future, change gc_grace_seconds back accordingly.
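As a rough illustration of the rule described above, here is a minimal Python sketch that estimates when an expired cell stops being protected by gc_grace_seconds. It is a simplified model, not Cassandra's purge logic: it assumes the tombstone is kept for the full grace period after the TTL expires, and it ignores the fact that a compaction still has to run and can be blocked by overlapping SSTables. The write time and the function name are made up for the example.

```python
from datetime import datetime, timedelta, timezone

def conservative_purge_time(written_at, ttl_seconds, gc_grace_seconds):
    """Estimate the earliest moment a compaction may drop an expired cell.

    Simplified model (an assumption, not Cassandra's code): the cell expires
    at written_at + TTL and is then kept for a further gc_grace_seconds.
    """
    return written_at + timedelta(seconds=ttl_seconds + gc_grace_seconds)

# Hypothetical write time; TTL of 1 hour as in the question.
written = datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc)

default_gc = conservative_purge_time(written, ttl_seconds=3600, gc_grace_seconds=864000)
one_day_gc = conservative_purge_time(written, ttl_seconds=3600, gc_grace_seconds=86400)

print("default gc_grace (10 days):", default_gc.isoformat())
print("gc_grace of 1 day:         ", one_day_gc.isoformat())
```

With the default grace period the row written at noon on 1 January lingers until 11 January; with one day it becomes eligible on 2 January.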
I am facing a problem with Cassandra compaction on a table that stores event data. These events are generated by sensors and have an associated TTL. By default each event has a TTL of 1 day. A few events have a different TTL, such as 7/10/30 days, which is a business requirement. A few events can have a TTL of 5 years if the event needs to be stored. More than 98% of rows have a TTL of 1 day.
Although minor compaction is triggered from time to time, disk usage is constantly increasing. This is because of how the SizeTiered compaction strategy (STCS) works, i.e. it chooses SSTables of similar size for compaction. This creates a few huge SSTables which aren't compacted for a long time. The presence of a few large SSTables increases the average SSTable size, and compaction runs less frequently. It looks like STCS is not the right choice. In a load-test environment, I added data to the tables and switched to the leveled compaction strategy (LCS). With LCS, disk space was reclaimed up to a certain point and then disk usage stayed constant. CPU usage was also lower compared to STCS. However, the time window compaction strategy (TWCS) looks more promising as it works well for time series data with TTLs. I am going to test TWCS with my dataset. Meanwhile I am trying to find answers to a few questions for which I either found nothing or what I found was not clear to me.
In my use case, an event is added to the table with an associated TTL. Then there are 5 more updates to the same event within the next minute. Updates are not made to a single column; instead the complete row is rewritten with a new TTL (which is the same for all columns). This new TTL is likely to be slightly less than the previous TTL. For example, an event is created with a TTL of 86400 seconds. If it is updated after 5 seconds, the new TTL would be 86395. A further update would use a new TTL slightly less than 86395. After 4-5 updates, no further update is made to more than 99% of rows. 1% of rows would be rewritten with a TTL of 5 years.
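The TTL arithmetic in this update pattern can be sketched in Python. The snippet only illustrates the numbers from the paragraph above (86400, then 86395 five seconds later); the calendar date is hypothetical and the 5-year TTL is approximated as 5 * 365 days. It shows that a rewrite with a correspondingly shorter TTL expires at the same instant as the original write, while the 5-year rewrite moves the expiry far into the future.

```python
from datetime import datetime, timedelta, timezone

def expires_at(written_at, ttl_seconds):
    """Absolute expiry instant of a write: write time plus its TTL."""
    return written_at + timedelta(seconds=ttl_seconds)

# Hypothetical creation time for the event.
created = datetime(2017, 9, 5, 10, 0, 0, tzinfo=timezone.utc)

first_write = expires_at(created, 86400)
# Rewritten 5 seconds later with a TTL that is 5 seconds shorter.
rewrite = expires_at(created + timedelta(seconds=5), 86395)
# Rewritten days later with a 5-year TTL (approximated as 5 * 365 days).
promoted = expires_at(created + timedelta(days=5), 5 * 365 * 86400)

print("first write expires:", first_write.isoformat())
print("rewrite expires:    ", rewrite.isoformat())
print("5-year rewrite:     ", promoted.isoformat())
```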
From what I read: TWCS is for data inserted with immutable TTL. Does
this mean I should not use TWCS?
Also out of order writes are not well handled by TWCS. If event is
created at 10 AM on 5th Sep with 1 day TTL and same event row is
re-written with TTL of 5 years on 10th or 12th Sep, would that be
out of order write? I suppose out of order would be when I am
setting timestamp on data while adding data to DB or something that
would be caused by read repair.
Any guidance/suggestion will be appreciated!
NOTE: I am using Cassandra 2.2.8, so I'll be building a jar for TWCS and then using it.
TWCS is a great option under certain circumstances. Here are the things to keep in mind:
1) One of the big benefits of TWCS is that merging/reconciliation among sstables does not occur. The oldest one is simply "lopped" off. Because of that, you don't want to have rows/cells span multiple "buckets/windows".
For example, suppose you insert a single column during one window and then, in the next window, you insert a different column (i.e. an update of the same row but a different column at a later time). Instead of compaction creating a single row with both columns, TWCS would lop one of the columns off (the oldest). Actually I am not sure TWCS will even allow this to occur; I was just giving you an example of what would happen if it did. In this example, I believe TWCS will disallow the removal of either SSTable until both windows expire. Not 100% sure though. Either way, avoid this scenario.
2) TWCS has similar problems when out-of-time writes occur (overlap). There is a great article by The Last Pickle that explains this:
https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Overlap can occur by repair or from an old compaction (i.e. if you were using STCS and then switched to TWCS, some of the sstables may overlap).
If there is overlap, say, between 2 SSTables, you have to wait for both SSTables to completely expire before TWCS can remove either of them, and when it does, both will be removed.
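The overlap rule described above can be sketched as a small Python model. This is a simplified illustration, not Cassandra's implementation: each SSTable is reduced to a write-timestamp range plus a single point in time at which all of its data has expired and passed GC grace, and an expired SSTable is treated as blocked whenever a not-yet-expired SSTable overlaps its timestamp range. The class, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SSTable:
    name: str
    min_ts: int      # oldest write timestamp in the file (seconds)
    max_ts: int      # newest write timestamp in the file (seconds)
    expires_at: int  # when all data has passed TTL + GC grace (seconds)

def overlaps(a, b):
    """True if the write-timestamp ranges of the two SSTables intersect."""
    return a.min_ts <= b.max_ts and b.min_ts <= a.max_ts

def droppable(sstables, now):
    """Names of fully expired SSTables not blocked by an overlapping live one."""
    expired = [s for s in sstables if s.expires_at <= now]
    live = [s for s in sstables if s.expires_at > now]
    return [s.name for s in expired
            if not any(overlaps(s, other) for other in live)]

tables = [
    SSTable("a", min_ts=0, max_ts=3600, expires_at=100_000),
    SSTable("b", min_ts=3000, max_ts=7200, expires_at=200_000),    # overlaps "a"
    SSTable("c", min_ts=7201, max_ts=10_800, expires_at=110_000),  # no overlap
]

print(droppable(tables, now=150_000))  # "a" is expired but blocked by "b"
print(droppable(tables, now=250_000))  # everything has expired
```

In this model "a" has to wait until "b" expires as well, which is the behaviour the answer warns about.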
If you avoid both scenarios described above, TWCS is very efficient due to the nature of how it cleans things up - no more merging sstables. Simply remove the oldest window.
When you do set up TWCS, remember that the oldest window only gets removed after the TTLs have expired and GC grace has passed as well, so don't forget to add that part. Having varying TTLs among rows, as you have described, may delay windows from getting removed. If you want to see what is blocking TWCS from removing an SSTable, or what the SSTables look like, you can use sstableexpiredblockers or the script in the URL mentioned above (which is essentially sstablemetadata with some fancy scripting).
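To see how a single long TTL can delay a whole window, here is a minimal Python sketch under simple assumptions: a window can be dropped only once its longest-lived row has expired and GC grace has passed, and the window end is used as an upper bound for the write time. The function name and the exact TTL mix are illustrative (the 5-year TTL is approximated as 5 * 365 days).

```python
DAY = 86400  # seconds

def window_drop_time(window_end, ttls_in_window, gc_grace_seconds):
    """Earliest time (seconds) the whole window can be dropped.

    Assumption: the window is held back by its longest TTL, then GC grace.
    """
    return window_end + max(ttls_in_window) + gc_grace_seconds

gc_grace = 10 * DAY  # default gc_grace_seconds

# A window that only contains rows with a 1-day TTL.
short_only = window_drop_time(0, [1 * DAY] * 98, gc_grace)

# The same window with one 30-day row and one 5-year row mixed in.
mixed = window_drop_time(0, [1 * DAY] * 98 + [30 * DAY, 5 * 365 * DAY], gc_grace)

print("1-day TTLs only: droppable after", short_only // DAY, "days")
print("with a 5-year row: droppable after", mixed // DAY, "days")
```

Under these assumptions the uniform window is gone after 11 days, while the mixed window stays on disk for 1835 days even though 98% of its rows expired after one day.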
Hopefully that helps.
-Jim
We are using Cassandra 3.10 with 6 nodes cluster.
Lately, we noticed that our data volume has increased drastically, by approximately 4GB per day on each node.
We want to implement a more aggressive retention policy in which we change the compaction strategy to TWCS with a 1-hour window size and set a TTL of a few days. This can be achieved via the table properties.
Since the ETL should be a slow process in order to lighten the Cassandra workload, it is possible that it will not finish extracting all the data before the TTL expires. So I wanted to know: is there a way for the ETL process to set TTL=0 on an entire SSTable once it is done extracting it?
TTL=0 is read as a tombstone. When next compacted, it would be written as a tombstone or purged, depending on your gc_grace_seconds. Other than the overhead of writing the tombstones, it might be easier just to do a delete, or to create SSTables that contain the necessary tombstones, than to rewrite all the existing SSTables. Whether it's more efficient to use range or point tombstones will depend on your version and schema.
The easiest option might be to use a different compaction strategy altogether, or a custom one like https://github.com/protectwise/cassandra-util/tree/master/deleting-compaction-strategy. You can then just purge data that has been processed when compactions run. This still depends quite a bit on your schema and on how hard it would be to mark what has been processed and what has not.
You should set TTL 0 at the table level and at the query level as well. Once the TTL expires, the data is converted to tombstones. Depending on the gc_grace_seconds value, the next compaction will clear the tombstones. You may also run a major compaction to clear tombstones, but depending on the compaction strategy this is not recommended in Cassandra. With STCS, at least 50% free disk space is required to run a healthy compaction.
I have a Cassandra table with a TTL of 60 seconds, and I have a few questions about it:
1) I am getting the following warning
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM xx.yy WHERE token(y) >= token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see tombstone_warn_threshold)
What does this mean?
2) As far as I understand, a tombstone is a flag set when a TTL expires (it will be deleted after gc_grace_seconds).
i) Does this mean that it won't be deleted for 10 days?
ii) What is the consequence of waiting for 10 days?
iii) Why is it as long as 10 days?
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
gc_grace_seconds 864000 [10 days] The number of seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage-collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_period. The default value allows a great deal of time for Cassandra to maximize consistency prior to deletion. For details about decreasing this value, see garbage collection below.
3) I read that performing compaction and repair using nodetool will delete the tombstones. How frequently do we need to run this in the background, and what are the consequences?
This means that your query returned 76 "live" or non-deleted/non-obsoleted rows of data, and that it had to sift through 1324 tombstones (deletion markers) to accomplish that.
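As a small illustration of how to read that warning, the following Python sketch pulls the two counters out of the log line quoted in the question and computes what share of the scanned entries were tombstones. The regular expression and the function name are assumptions made for this example; they are written against this one message format only.

```python
import re

# The warning exactly as quoted in the question.
WARNING = (
    "Read 76 live rows and 1324 tombstone cells for query "
    "SELECT * FROM xx.yy WHERE token(y) >= "
    "token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 "
    "(see tombstone_warn_threshold)"
)

PATTERN = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")

def tombstone_stats(line):
    """Return (live, tombstones, tombstone_share) or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    live, tombstones = int(match.group(1)), int(match.group(2))
    return live, tombstones, tombstones / (live + tombstones)

live, tombstones, share = tombstone_stats(WARNING)
print(f"{live} live rows, {tombstones} tombstones ({share:.1%} of what was scanned)")
```

For the quoted message, roughly 95% of what the query had to scan was deletion markers rather than live data.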
In the world of distributed databases, deletes are hard. After all, if you delete a piece of data from one node, and you expect that deletion to happen on all of your nodes, how would you know if it worked? Quite literally, how do you replicate nothing? Tombstones (delete markers) are the answer to that question.
i. The data is gone (obsoleted, rather). The tombstone(s) will remain for gc_grace_seconds.
ii. The "consequence" is that you'll have to put up with those tombstone warning messages for that duration, or find a way to run your query without having to scan over the tombstones.
iii. The idea behind the 10 days is that if the tombstones are collected too early, your deleted data will "ghost" its way back up to some nodes. 10 days gives you enough time to run a weekly repair, which ensures your tombstones are properly replicated before removal.
Compaction removes tombstones. Repair replicates them. You should run repair once per week. While you can run compaction on-demand, don't. Cassandra has its own thresholds (based on number and size of SSTable files) to figure out when to run compaction, and it's best not to get in its way. If you do, you'll be manually running compaction from there on out, as you'll probably never reach the compaction conditions organically.
The consequences are that both repair and compaction take compute resources, and can reduce a node's ability to serve requests. But they need to happen. You want them to happen. If compaction doesn't run, your SSTable files will grow in number and size, eventually causing rows to exist across multiple files, and queries for them will get slow. If repair doesn't run, your data is at risk of not being in sync.
I have a Cassandra 2.1 cluster where we insert data through Java with a TTL, as the requirement is to persist the data for 30 days.
But this causes a problem, as the files with old data and tombstones are kept on disk. This results in disk space being occupied by data which is not required. Repairs take a lot of time to clear this data (up to 3 days on a single node).
Is there a better way to delete the data?
I have come across this on datastax
Cassandra allows you to set a default_time_to_live property for an entire table. Columns and rows marked with regular TTLs are processed as described above; but when a record exceeds the table-level TTL, Cassandra deletes it immediately, without tombstoning or compaction. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html?hl=tombstone
Will the data be deleted more efficiently if I set the TTL at the table level instead of setting it each time while inserting?
Also, documentation is for Cassandra 3, so will I have to upgrade to newer version to get any benefits?
Setting default_time_to_live applies the default TTL to all rows and columns in your table, and if no individual TTL is set (and Cassandra has correct NTP time on all nodes), Cassandra can easily drop that data safely.
But keep some things in mind: your application is still able to set a specific TTL for a single row in your table, and then normal processing will apply. On top of that, even if the data has a TTL it won't get deleted immediately. SSTables are still immutable, but tombstones will be dropped during compaction.
What could really help you a lot (just guessing) would be an appropriate compaction strategy:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/dml/dmlHowDataMaintain.html#dmlHowDataMaintain__twcs-compaction
TimeWindowCompactionStrategy (TWCS)
Recommended for time series and expiring TTL workloads.
The TimeWindowCompactionStrategy (TWCS) is similar to DTCS with
simpler settings. TWCS groups SSTables using a series of time windows.
During compaction, TWCS applies STCS to uncompacted SSTables in the
most recent time window. At the end of a time window, TWCS compacts
all SSTables that fall into that time window into a single SSTable
based on the SSTable maximum timestamp. Once the major compaction for
a time window is completed, no further compaction of the data will
ever occur. The process starts over with the SSTables written in the
next time window.
This helps a lot, provided you choose your time windows correctly. All data in the last compacted SSTable will have roughly equal TTL values (hint: don't do out-of-order inserts or manual TTLs!). Cassandra keeps the latest expiration time in the SSTable metadata, and when that time has passed Cassandra simply deletes the entire SSTable, as all data in it is now obsolete. No need for compaction.
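The grouping step from the quoted documentation can be illustrated with a short Python sketch. It is a simplified model, not TWCS itself: each SSTable is represented only by its maximum timestamp, and a window is identified by rounding that timestamp down to a multiple of the window size. The SSTable names and timestamps are hypothetical.

```python
from collections import defaultdict

HOUR = 3600  # window size in seconds

def window_start(timestamp, window_seconds):
    """Round a timestamp down to the start of its time window."""
    return timestamp - (timestamp % window_seconds)

def group_by_window(sstable_max_timestamps, window_seconds):
    """Map window start -> names of SSTables whose max timestamp falls in it."""
    windows = defaultdict(list)
    for name, max_ts in sstable_max_timestamps.items():
        windows[window_start(max_ts, window_seconds)].append(name)
    return dict(windows)

# Hypothetical SSTables with their maximum write timestamps (seconds).
sstables = {"a": 100, "b": 3500, "c": 3700, "d": 7300}

for start, names in sorted(group_by_window(sstables, HOUR).items()):
    print(f"window starting at {start}: {names}")
```

Here "a" and "b" share the first one-hour window and would end up in one SSTable, while "c" and "d" each sit in a window of their own.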
How do you run your repair? Incremental? Full? Reaper? How big in terms of nodes and data is your cluster?
The quick answer is yes. The way it is implemented is by deleting the SSTable(s) directly from disk. Deleting an SSTable without the need to compact will free up disk space faster. But you need to be sure that all the data in a specific SSTable is "older" than the globally configured TTL for the table.
This is the feature referred to in the paragraph you quoted. It was implemented for Cassandra 2.0, so it should be part of 2.1.
I have a table where I insert data with a TTL of 1 minute, and I get a warning in DSE OpsCenter about the high number of tombstones in that table. This makes sense, since on average 80 records per minute are inserted into this table.
So for example for a full day 80 * 60 * 24 = 115200 records inserted and TTL'ed in one day.
My question is what should I do in order to decrease the number of tombstones in this table?
I've been looking into tombstone_compaction_interval and gc_grace_seconds, and this is where it gets a bit confusing, as I'm having trouble understanding the exact impact of these properties on the tombstones (even after reading the documentation provided by DataStax - http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html and http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html).
I've also been looking into LeveledCompactionStrategy (https://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra), since it also seems to affect tombstone compaction, although I don't fully understand why.
So I'm hoping someone will be able to help me better understand how this all works, or even just let me know if I'm going in the right direction.
Please read this http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html. Very good read.
Overall: the gc_grace_seconds parameter is the minimum time that tombstones will be kept on disk after data has been deleted. We need to make sure that all the replicas have received the delete and have all tombstones stored, to avoid zombie data issues. By default it's 10 days.
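To put numbers on this for the workload in the question (80 records per minute, each with a 1-minute TTL), here is a rough Python estimate. It is an upper bound under a simplifying assumption: every expired record stays on disk for the full gc_grace_seconds, ignoring when compactions actually run. The function names are made up for the example.

```python
def records_per_day(per_minute):
    """Records inserted (and later expired) per day."""
    return per_minute * 60 * 24

def expired_backlog(per_minute, gc_grace_seconds):
    """Upper bound on expired records still inside their GC grace period."""
    return per_minute * gc_grace_seconds // 60

rate = 80  # records per minute, from the question

print("inserted per day:", records_per_day(rate))
print("backlog with 10-day gc_grace:", expired_backlog(rate, 864000))
print("backlog with 1-hour gc_grace:", expired_backlog(rate, 3600))
```

With the default grace period, more than a million expired records can sit in the table at any time, while only about 80 are live.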
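tombstone_compaction_interval: this property was introduced as part of this JIRA (https://issues.apache.org/jira/browse/CASSANDRA-4781).
It addresses the case where the tombstone ratio was high enough to trigger a single-SSTable compaction, but the tombstones were not evicted because of overlapping SSTables, so the same SSTable kept getting compacted over and over. The property sets the minimum age an SSTable must reach before it is considered for such a tombstone compaction.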
I am not sure about your current datamodel but here are my suggestions.
You probably have to change your data model (DM). Please read https://academy.datastax.com/resources/getting-started-time-series-data-modeling and "Time series modelling (with start & end date) in cassandra".
Change write pattern.
Change read pattern. Try to read only active data. (With your current DM, a read has to go through tombstone cells in order to reach the active cells.)
Try to use TimeWindowCompactionStrategy and tune it as per your workload. (http://thelastpickle.com/blog/2017/01/10/twcs-part2.html)
If you are using a TTL while inserting (e.g. in the INSERT or UPDATE statement), see if you can move it to the table level.
If you are using STCS and want to change compaction sub-properties, you could probably set
unchecked_tombstone_compaction=true and min_threshold=3 (a little bit aggressive)
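As an illustration of the last suggestion, this Python sketch builds the CQL statement that would apply those two sub-properties. The keyspace and table names (my_ks, my_table) are placeholders, and the helper only formats a string; it does not talk to a cluster. Review the generated statement before running it in cqlsh.

```python
def alter_compaction_cql(keyspace, table, options):
    """Format an ALTER TABLE statement that sets the compaction options map."""
    body = ", ".join(f"'{key}': '{value}'" for key, value in options.items())
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{body}}};"

# Sub-properties suggested above; keyspace and table names are placeholders.
statement = alter_compaction_cql(
    "my_ks",
    "my_table",
    {
        "class": "SizeTieredCompactionStrategy",
        "unchecked_tombstone_compaction": "true",
        "min_threshold": "3",
    },
)
print(statement)
```

Note that the compaction map replaces the existing one as a whole, so the class has to be repeated even when only sub-properties change.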