How are Cassandra Tombstones deleted in old SSTables? - cassandra

If I have compaction enabled, e.g. SizeTieredCompactionStrategy, my SSTables get compacted until a certain size level is reached. When I "delete" an old entry that sits in an SSTable which is quite old and won't be compacted again in the near future, when does the deletion actually take place?
Imagine you delete 100 entries and all are part of a really old SSTable that was compacted several times, has no hot data and is already quite big. It will take ages until it's compacted again and tombstones are removed, right?

When the tombstone is merged with the data in a compaction, the data is deleted from disk. When that happens depends on the rate at which new data is being added and on your compaction strategy. The tombstones themselves are not purged until after gc_grace_seconds, to prevent data resurrection (make sure repairs complete within this period of time).
If you overwrite or delete data a lot and are not OK with a lot of obsolete data on disk, you should probably use LeveledCompactionStrategy instead (I would recommend always defaulting to LCS if using SSDs). With STCS it can take a long time for the largest SSTables to get compacted. STCS is more suited to constantly appended data (like logs or events). If the entries expire over time and you rely heavily on TTLs, you will probably want to use TimeWindowCompactionStrategy (TWCS).
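The gc_grace_seconds rule mentioned above can be sketched as a simple time check. This is only an illustrative model, not Cassandra code: the function name is hypothetical, times are plain epoch seconds, and it ignores the additional requirement that the shadowed data takes part in the same compaction.

```python
def tombstone_is_purgeable(deletion_time: int, gc_grace_seconds: int, now: int) -> bool:
    """Return True once the tombstone is older than gc_grace_seconds.

    Simplified model: a compaction may drop the tombstone only if it was
    created before (now - gc_grace_seconds).
    """
    gc_before = now - gc_grace_seconds
    return deletion_time < gc_before


# Example with the default gc_grace_seconds of 10 days (864000 seconds).
GC_GRACE = 864_000
deleted_at = 1_000

one_hour_later = tombstone_is_purgeable(deleted_at, GC_GRACE, now=deleted_at + 3_600)
after_grace = tombstone_is_purgeable(deleted_at, GC_GRACE, now=deleted_at + GC_GRACE + 1)
print(one_hour_later, after_grace)
```

Even when this check passes, nothing is freed until a compaction actually rewrites the SSTable that holds the tombstone, which is why the compaction strategy matters so much here.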

Related

How to set TTL on Cassandra sstable

We are using Cassandra 3.10 with a 6-node cluster.
Lately, we noticed that our data volume has increased drastically, by approximately 4 GB per day on each node.
We want to implement a more aggressive retention policy: change the compaction strategy to TWCS with a 1-hour window size and set a TTL of a few days. This can be achieved via the table properties.
Since the ETL should be a slow process in order to lighten the Cassandra workload, it is possible that it will not finish extracting all the data before the TTL expires. So I wanted to know: is there a way for the ETL process to set TTL=0 on an entire SSTable once it is done extracting it?
TTL=0 is read as a tombstone. When next compacted, it would be written as a tombstone or purged, depending on your gc_grace_seconds. Given the overhead of writing the tombstones, it might be easier just to do a delete, or to create SSTables that contain the necessary tombstones, than to rewrite all the existing SSTables. Whether range or point tombstones are more efficient will depend on your version and schema.
An option that might be easiest is to use a different compaction strategy altogether, or a custom one like https://github.com/protectwise/cassandra-util/tree/master/deleting-compaction-strategy. You can then just purge, during compaction, the data that has been processed. How hard it would be to mark what has or has not been processed still depends quite a bit on your schema.
You should set TTL 0 at the table level and at the query level as well. Once the TTL expires, the data is converted to tombstones. Depending on the gc_grace_seconds value, the next compaction will clear the tombstones. You can also run a major compaction to clear tombstones, but whether that is advisable in Cassandra depends on the compaction strategy: with STCS, at least 50% free disk space is required for compaction to run healthily.

Are TTL tombstones in Cassandra using LCS created in the same level as the TTLed data?

I'm using LCS and a relatively large TTL of 2 years for all inserted rows and I'm concerned about the moment at which C* would drop the corresponding tombstones (neither explicit deletes nor updates are being performed).
From Missing Manual for Leveled Compaction Strategy, Tombstone Compactions in Cassandra and Deletes Without Tombstones or TTLs I understand that
All levels except L0 contain non-overlapping SSTables, but a partition key may be present in one SSTable in each level (aka distributed in all levels).
For a compaction to be able to drop a tombstone, it must be sure that it is compacting all the SSTables that contain the data, to prevent zombie data (this is done by checking bloom filters). It also considers gc_grace_seconds.
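The two points above can be put together in a small model. This is a hedged sketch under stated assumptions, not Cassandra's implementation: the SSTable class and can_drop_tombstone function are hypothetical, the bloom filter is replaced by an exact key set, and the overlap rule is simplified to "no SSTable outside the compaction may hold the partition with data at least as old as the tombstone".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    level: int
    keys: frozenset      # stand-in for the bloom filter (no false positives here)
    min_timestamp: int   # oldest write timestamp stored in this SSTable


def can_drop_tombstone(key, tombstone_ts, deletion_time, compacting,
                       all_sstables, gc_grace_seconds, now):
    """Simplified check for whether a compaction may drop a tombstone."""
    # Rule 1: the tombstone must be older than gc_grace_seconds.
    if deletion_time >= now - gc_grace_seconds:
        return False
    # Rule 2: no SSTable outside this compaction may still hold data
    # for the partition that the tombstone is supposed to shadow.
    compacting_names = {s.name for s in compacting}
    for sstable in all_sstables:
        if sstable.name in compacting_names:
            continue
        if key in sstable.keys and sstable.min_timestamp <= tombstone_ts:
            return False
    return True


# The tombstone for "k" lives in a level-1 SSTable, the old data in level 3.
l1 = SSTable("l1-a", 1, frozenset({"k"}), min_timestamp=900)
l3 = SSTable("l3-a", 3, frozenset({"k", "j"}), min_timestamp=100)
args = dict(key="k", tombstone_ts=900, deletion_time=1_000,
            all_sstables=[l1, l3], gc_grace_seconds=500, now=2_000)

alone = can_drop_tombstone(compacting=[l1], **args)
together = can_drop_tombstone(compacting=[l1, l3], **args)
print(alone, together)
```

In this model the tombstone survives a compaction of the level-1 SSTable alone and can only be dropped once the level-3 SSTable holding the shadowed data is part of the same compaction.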
So, for my particular use case (2-year TTL and write-heavy load), I can conclude that TTLed data will be in the highest levels, so I'm wondering when the SSTables with TTLed data will be compacted with the SSTables that contain the corresponding tombstones.
The main question is: where are the tombstones (from TTLs) created? Are they created at level 0, so that it will take a long time until they end up in the highest levels (and hence a long time until disk space is freed)?
In a comment from About deletes and tombstones Alain says that
Yet using TTLs helps, it reduces the chances of having data being fragmented between SSTables that will not be compacted together any time soon. Using any compaction strategy, if the delete comes relatively late in the row history, as usually happens, the 'upsert'/'insert' of the tombstone will go to a new SSTable. It might take time for this tombstone to get to the right compaction "bucket" (with the rest of the row) and for Cassandra to be able to finally free space.
My understanding is that with TTLs the tombstone is created in-place, thus it is often and for many reasons easier and safer to get rid of a TTLed entry than of a delete.
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
I'm not sure what it means with "in-place" since SSTables are immutable.
(I also have some doubts about what it says of using default_time_to_live that I've asked in How default_time_to_live would delete rows without tombstones in Cassandra?).
My guess is that it is referring to tombstones being created in the same level (but in different SSTables) as the TTLed data, during a compaction triggered by one of the following reasons:
(1) "Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread" The Missing Manual for Leveled Compaction Strategy
(2) "If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions" The Missing Manual for Leveled Compaction Strategy
(3) "When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable." CASSANDRA-7019
Since tombstones are created during compaction, I think it may be using SSTable metadata to estimate droppable tombstones.
So, compactions (2) and (3) should be creating/dropping tombstones in the highest levels, hence using LCS with a large TTL should not be an issue per se.
By creating/dropping I mean that the same kind of compaction will create tombstones for expired data and/or drop tombstones if the gc period has already passed.
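The single-SSTable trigger quoted from CASSANDRA-7019 can be illustrated with a rough model. This is a sketch under assumptions, not the real code: the function names are hypothetical, the list of drop times stands in for whatever estimate Cassandra keeps in SSTable metadata, and the 0.2 default mirrors the tombstone_threshold compaction option.

```python
def droppable_tombstone_ratio(drop_times, total_cells, gc_before):
    """Fraction of cells that are tombstones old enough to be dropped.

    drop_times holds, per tombstone or expiring cell, the time at which it
    was deleted (or at which its TTL runs out).
    """
    if total_cells == 0:
        return 0.0
    droppable = sum(1 for t in drop_times if t < gc_before)
    return droppable / total_cells


def worth_single_sstable_compaction(drop_times, total_cells, now,
                                    gc_grace_seconds, tombstone_threshold=0.2):
    """True if the estimated droppable ratio exceeds the threshold."""
    gc_before = now - gc_grace_seconds
    ratio = droppable_tombstone_ratio(drop_times, total_cells, gc_before)
    return ratio > tombstone_threshold


# An SSTable with 10 cells, 3 of which expire at t=100, 200 and 300.
drop_times = [100, 200, 300]
early = worth_single_sstable_compaction(drop_times, 10, now=200, gc_grace_seconds=50)
late = worth_single_sstable_compaction(drop_times, 10, now=400, gc_grace_seconds=50)
print(early, late)
```

With a long TTL and no updates, whole SSTables tend to expire at roughly the same time, so the ratio for an old SSTable jumps from near zero to well above the threshold once its TTL plus gc_grace_seconds has passed.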
A link to source code that clarifies this situation will be great, thanks.
Alain Rodriguez's answer from the mailing list:
> Another clue to explore would be to use the TTL as a default value if
> that's a good fit. TTLs set at the table level with 'default_time_to_live'
> should not generate any tombstone at all in C*3.0+. Not tested on my hand,
> but I read about this.
As explained on a parallel thread, this is wrong, mea culpa. I believe the rest of my comment still stands (hopefully :)).
> I'm not sure what it means with "in-place" since SSTables are immutable.
> My guess is that is referring to tombstones being created in the same
Yes, I believe during the next compaction following the expiration date,
the entry is 'transformed' into a tombstone, and lives in the SSTable that
is the result of the compaction, on the level/bucket this SSTable is put
into. That's why I said 'in-place' which is indeed a bit weird for
immutable data.
As a side idea for your problem, on 'modern' versions of Cassandra (I don't
remember the version, that's what 'modern' means ;-)), you can run
'nodetool garbagecollect' regularly (not necessarily frequently) during the
off-peak period. That might use the cluster resources when you don't need
them to claim some disk space. Also making sure that a 2 years old record
is not being updated regularly by design would definitely help. In the
extreme case of writing a data once (never updated) and with a TTL for
example, I see no reason for a 2 years old data not to be evicted
correctly. As long as the disk can grow, it should be fine.
I would not be too much scared about it, as there is 'always' a way to
remove tombstones. Yet it's good to think about the design beforehand
indeed, generally, it's good if you can rotate the partitions over time,
not to reuse old partitions for example.

When does Cassandra remove data after it has been deleted?

Recently I have been trying to familiarize myself with Cassandra, but I don't quite understand when data is removed from disk after it has been deleted. The use case I'm particularly interested in is expiring time series data with DTCS. As an example, consider the following table:
CREATE TABLE metrics (
    metric_id text,
    time timestamp,
    value double,
    PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND
    default_time_to_live = 86400 AND
    gc_grace_seconds = 3600 AND
    compaction = {
        'class': 'DateTieredCompactionStrategy',
        'timestamp_resolution':'MICROSECONDS',
        'base_time_seconds':'3600',
        'max_sstable_age_days':'365',
        'min_threshold':'4'
    };
I understand that Cassandra will create a tombstone for all rows inserted into this table after 24 hours (86400 seconds). These tombstones will first be written to an in-memory Memtable and then flushed to disk as an SSTable when the Memtable reaches a certain size. My question is when will the data that is now expired be removed from disk? Is it the next time the SSTable which contains the data gets compacted? So, with DTCS and min_threshold set to four, we would wait until at least three other SSTables are in the same time window as the expired data, and then those SSTables will be compacted into a SSTable. Is it during this compaction that the data will be removed? It seems to me that this would require Cassandra to maintain some metadata on which rows have been deleted since the newer tombstones would likely not be in the older SSTables that are being compacted.
Alternatively, do the SSTables which contain the tombstones have to be compacted with the SSTables which contain the expired data for the data to be removed? It seems to me that this could result in Cassandra holding the expired data long after it has expired since it's waiting for the new tombstones to be compacted with the older expired data.
Finally, I was also unsure when the tombstones themselves are removed. I know Cassandra does not delete them until after gc_grace_seconds but it can't delete the tombstones until it's sure the expired data has been deleted right? Otherwise it would see the expired data as being valid. Consequently, it seems to me that the question of when tombstones are deleted is intimately tied to the questions above. Thanks!
If it helps I've been experimenting with version 2.0.15 myself.
There are two ways data is definitively removed in Cassandra.
1: When gc_grace_seconds expires. In your table, gc_grace_seconds is set to 3600, which means that when you execute a delete statement on a row, you will have to wait 3600 seconds before the data is definitively removed from the whole cluster.
2: When a compaction runs. During a compaction, Cassandra looks for all the data marked with a tombstone and simply ignores it when writing the new SSTable, to ensure that the new SSTable does not contain already deleted data.
However, a node might be down for longer than gc_grace_seconds, or go down during a compaction. You'll find more information about these cases in the Cassandra documentation.
After some further research and help from others I've realized that I had some misconceptions in my original questions. Specifically: "Data deleted by TTL isn’t the same as issuing a delete – each expiring cell internally has a ttl/timestamp at which it will be converted into a tombstone. There is no tombstone added to the memtable, or flushed to disk – it just treats the expired cells as tombstones once they’re past that timestamp."
Furthermore, Cassandra will check if it can drop SSTables containing only expired data when a memtable is flushed to disk and a minor compaction runs, no more than once every ten minutes though (see this issue). Hope that helps if you had the same questions as me!
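The behaviour described in the quoted explanation can be modelled as a cell that changes state purely as a function of time. This is an illustrative sketch only: the ExpiringCell class and its state names are hypothetical, and the numbers reuse default_time_to_live = 86400 and gc_grace_seconds = 3600 from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpiringCell:
    written_at: int  # write time, in seconds
    ttl: int         # time to live, in seconds

    @property
    def expires_at(self) -> int:
        return self.written_at + self.ttl

    def state(self, now: int, gc_grace_seconds: int) -> str:
        """No tombstone is ever written; the cell is just read differently."""
        if now < self.expires_at:
            return "live"
        if now <= self.expires_at + gc_grace_seconds:
            return "tombstone"   # treated as deleted, but must be kept on disk
        return "purgeable"       # the next compaction that rewrites it may drop it


cell = ExpiringCell(written_at=0, ttl=86_400)
timeline = [cell.state(t, gc_grace_seconds=3_600)
            for t in (86_399, 86_400, 90_000, 90_001)]
print(timeline)
```

Note that "purgeable" only means a compaction is allowed to drop the cell; the bytes stay on disk until the SSTable is rewritten or dropped as a whole.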

When does Cassandra remove data from an SSTable

In Cassandra 2.x when I delete one or multiple columns, they receive a tombstone in the Memtable but the data is not removed. At some point, the Memtable is flushed to an SSTable including the deleted data and the tombstone. When compaction is running, it will retain the tombstone with the specified grace period. What happens to the data? I have deleted a bunch of columns last week - less than gc_grace_seconds ago. I am not sure compaction has run yet. I haven't seen any change on disk size used yet, so I was wondering at which point is the data physically removed from disk?
> In Cassandra 2.x when I delete one or multiple columns, they receive a tombstone in the Memtable but the data is not removed. At some point, the Memtable is flushed to an SSTable including the deleted data and the tombstone. When compaction is running, it will retain the tombstone with the specified grace period.
True.
> What happens to the data?
The data will remain on disk at least for gc_grace_seconds. Next minor compaction right after gc_grace_seconds may remove it, but real timing depends mostly on your dataset and workload type.
> I have deleted a bunch of columns last week - less than gc_grace_seconds ago. I am not sure compaction has run yet. I haven't seen any change on disk size used yet, so I was wondering at which point is the data physically removed from disk?
If you want to free some disk space, you can either:
- wait for gc_grace_seconds to pass and let a normal minor compaction do the work, or
- run nodetool compact, which triggers a major compaction on the current node and frees disk space right away.

What does cassandra do during compaction?

I know that Cassandra merges SSTables and row keys, removes tombstones, and so on.
But I am really interested to know how it performs compaction.
As SSTables are immutable, does it copy all the relevant data to a new file, and discard the tombstone-marked data while writing to this new file?
I know what compaction does, but I want to know how it makes this happen.
I hope this thread helps, provided you follow all the posts and comments in it:
http://comments.gmane.org/gmane.comp.db.cassandra.user/10577
AFAIK:
Whenever a memtable is flushed from memory to disk, its contents are written to a newly created SSTable (appended, not updated in place), sorted by row key.
SSTable merges [updates] take place only during compaction.
Until then, the read path reads from every SSTable that has the key you look up, and the results from them are merged to build the reply.
There are two types of compaction: minor and major.
Minor compaction:
- Is triggered automatically whenever a new SSTable is created.
- May remove tombstones.
- Compacts SSTables of equal size into one [initially the memtable flush size] when the minor compaction threshold is reached [4 by default].
Major compaction:
- Is triggered manually using nodetool.
- Can be applied to one column family at a time.
- Compacts all the SSTables of a CF into one.
- Compacts the SSTables and marks the now-unneeded SSTables for deletion. GC takes care of freeing up that space.
Regards,
Tamil
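The "SSTables of equal size" rule for minor compaction can be sketched as a bucketing step. This is a simplified model of size-tiered bucketing, not the real implementation: the function names are hypothetical, sizes are plain numbers in MB, the 0.5 and 1.5 factors mirror the bucket_low and bucket_high options, and the special handling of very small SSTables is left out.

```python
def bucket_by_size(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # Join the bucket if the size is close to the bucket's average.
            if average * bucket_low < size < average * bucket_high:
                bucket.append(size)
                break
        else:
            # No similar bucket found: this SSTable starts its own bucket.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    """Buckets holding at least min_threshold SSTables are eligible."""
    return [b for b in bucket_by_size(sizes) if len(b) >= min_threshold]


sstable_sizes_mb = [10, 11, 9, 12, 400, 5000]
buckets = bucket_by_size(sstable_sizes_mb)
candidates = compaction_candidates(sstable_sizes_mb)
print(buckets)
print(candidates)
```

In this example the four small SSTables form a bucket that is eligible for compaction, while the 5000 MB SSTable sits alone and would only be compacted once several other SSTables of similar size exist. That is why deleted data in a large, old SSTable can linger for a long time.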
There are two ways to run compaction:
A- Minor compaction. Runs automatically.
B- Major compaction. Run manually.
In both cases it takes x files (per CF) and processes them. In this process it marks the rows with an expired TTL as tombstones and deletes the existing tombstones, and it writes the result to a new file. The tombstones generated in this compaction will be deleted in the next compaction (if the grace period, gc_grace, has passed).
The difference between A and B is the number of files taken and the resulting file.
A takes a few similar files (of similar size) and generates a new file.
B takes ALL the files and generates only one big file.
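The process described above can be sketched as a merge over immutable inputs. This is a toy model under stated assumptions, not Cassandra's code: SSTables are plain dicts, the Cell class and compact function are hypothetical, and the inputs are assumed to contain every version of each key (as in a major compaction), so no overlap check is needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    timestamp: int                             # write timestamp; the newest wins
    value: Optional[str]                       # None marks a tombstone
    local_deletion_time: Optional[int] = None  # delete time, or TTL expiry time


def compact(sstables, now, gc_grace_seconds):
    """Merge immutable SSTables into one new SSTable (a new dict)."""
    gc_before = now - gc_grace_seconds
    merged = {}
    for sstable in sstables:
        for key, cell in sstable.items():
            # Keep only the most recent version of each key.
            if key not in merged or cell.timestamp > merged[key].timestamp:
                merged[key] = cell
    result = {}
    for key, cell in sorted(merged.items()):
        expired = (cell.value is not None
                   and cell.local_deletion_time is not None
                   and cell.local_deletion_time <= now)
        if expired:
            # A cell whose TTL has run out is rewritten as a tombstone.
            cell = Cell(cell.timestamp, None, cell.local_deletion_time)
        if cell.value is None and cell.local_deletion_time < gc_before:
            continue  # tombstone older than the grace period: drop it
        result[key] = cell
    return result


old = {"a": Cell(1, "x"), "b": Cell(1, "y"), "c": Cell(1, "z", local_deletion_time=90)}
new = {"a": Cell(5, None, local_deletion_time=10),
       "b": Cell(6, None, local_deletion_time=95)}

first = compact([old, new], now=100, gc_grace_seconds=20)
second = compact([first], now=200, gc_grace_seconds=20)
print(first)
print(second)
```

In the first pass the old tombstone for "a" disappears together with its data, the recent tombstone for "b" is kept, and the expired cell "c" becomes a tombstone. The second pass, run after the grace period, drops the remaining tombstones, matching the "deleted in the next compaction" behaviour described above.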

Resources