Is it possible to keep tombstones after table compaction? - cassandra

I understand that entries get marked with tombstones when a deletion is requested in C*. This way, a soft deletion is performed, and it is made effective during compaction:
In addition to consolidating SSTables, the compaction process merges
keys, combines columns, discards tombstones, and creates a new index
in the merged SSTable.
Is it possible to avoid discarding tombstones in order to keep them stored forever? I know that would be against efficiency. I just wonder if it is possible.

You can use a higher gc_grace_seconds value in your table properties to increase the time until tombstones are effectively deleted.
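For illustration, a minimal CQL sketch (keyspace and table names are hypothetical) that pushes the grace period to an effectively unbounded value:

    -- gc_grace_seconds is a 32-bit int; its maximum is roughly 68 years
    ALTER TABLE my_keyspace.my_table
    WITH gc_grace_seconds = 2147483647;

Keep in mind that tombstones are only purged by compactions that run after gc_grace_seconds has passed, so a very large value effectively keeps them around, at the cost of disk space and read performance.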

How to set TTL on Cassandra sstable

We are using Cassandra 3.10 with a 6-node cluster.
Lately, we noticed that our data volume increased drastically, by approximately 4 GB per day on each node.
We want to implement a more aggressive retention policy in which we will change the compaction to TWCS with a 1-hour window size and set a TTL of a few days; this can be achieved via the table properties, for example as sketched below.
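For reference, those table properties might look like the following CQL sketch (keyspace, table, and the exact TTL are hypothetical):

    ALTER TABLE my_keyspace.events
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': 1
    }
    AND default_time_to_live = 259200;  -- a few days, here 3, in seconds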
Since the ETL should be a slow process in order to lighten Cassandra's workload, it is possible that it will not finish extracting all the data before the TTL expires, so I wanted to know: is there a way for the ETL process to set TTL=0 on an entire SSTable once it is done extracting it?
TTL=0 is read as a tombstone. When next compacted, it would either be written as a tombstone or purged, depending on your gc_grace. Aside from the overhead of writing the tombstones, it might be easier just to issue deletes, or to create SSTables that contain the necessary tombstones, than to rewrite all the existing SSTables. Whether range or point tombstones are more efficient will depend on your version and schema.
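As a hedged sketch of the delete-based approach (the time-series schema is hypothetical; range deletes on clustering columns require Cassandra 3.0+):

    -- A single range tombstone covers everything already extracted from this partition
    DELETE FROM my_keyspace.events
    WHERE source_id = 'sensor-1' AND event_time < '2020-01-01 00:00:00+0000';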
An option that might be easiest is to actually use a different compaction strategy altogether, or a custom one like https://github.com/protectwise/cassandra-util/tree/master/deleting-compaction-strategy. You can then just purge data that has been processed during compactions. How hard it would be to mark what has been processed still depends quite a bit on your schema.
You can set the TTL at both the table and query level. Once the TTL expires, data will be converted to tombstones. Based on the gc_grace_seconds value, the next compaction will clear all the tombstones. You may also run a major compaction to clear tombstones, but this is not recommended in Cassandra, depending on the compaction strategy; with STCS, at least 50% free disk is required to run a healthy compaction.
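A minimal CQL sketch of both levels (keyspace, table, and columns are hypothetical):

    -- Table-level default TTL, in seconds
    ALTER TABLE my_keyspace.events WITH default_time_to_live = 259200;

    -- Query-level TTL; overrides the table default for this write
    INSERT INTO my_keyspace.events (id, payload)
    VALUES (uuid(), 'data') USING TTL 259200;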

Are TTL tombstones in Cassandra using LCS created in the same level as the TTLed data?

I'm using LCS and a relatively large TTL of 2 years for all inserted rows and I'm concerned about the moment at which C* would drop the corresponding tombstones (neither explicit deletes nor updates are being performed).
From The Missing Manual for Leveled Compaction Strategy, Tombstone Compactions in Cassandra, and Deletes Without Tombstones or TTLs, I understand that:
All levels except L0 contain non-overlapping SSTables, but a partition key may be present in one SSTable in each level (i.e., distributed across all levels).
For a compaction to be able to drop a tombstone, it must be sure that it is compacting all the SSTables that contain the data, to prevent zombie data (this is checked using bloom filters). It also considers gc_grace_seconds.
So, for my particular use case (2-year TTL and write-heavy load) I can conclude that TTLed data will be in the highest levels, so I'm wondering when those SSTables with TTLed data will be compacted with the SSTables that contain the corresponding tombstones.
The main question is: where are tombstones (from TTLs) created? Are they created at Level 0, so that it will take a long time until they end up in the highest levels (and hence disk space will take a long time to be freed)?
In a comment from About deletes and tombstones, Alain says:
Yet using TTLs helps; it reduces the chances of having data fragmented between SSTables that will not be compacted together any time soon. Using any compaction strategy, if the delete comes relatively late in the row history, as it usually happens, the 'upsert'/'insert' of the tombstone will go to a new SSTable. It might take time for this tombstone to get to the right compaction "bucket" (with the rest of the row) and for Cassandra to be able to finally free space.
My understanding is that with TTLs the tombstone is created in-place; thus it is often, and for many reasons, easier and safer to get rid of TTLed data than of deleted data.
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
I'm not sure what "in-place" means, since SSTables are immutable.
(I also have some doubts about what it says of using default_time_to_live that I've asked in How default_time_to_live would delete rows without tombstones in Cassandra?).
My guess is that it refers to tombstones being created in the same level (but different SSTables) as the TTLed data, during a compaction triggered by one of the following reasons:
"Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread" The Missing Manual for Leveled Compaction Strategy
"If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions" The Missing Manual for Leveled Compaction Strategy
"When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable." CASSANDRA-7019
Since tombstones are created during compaction, I think it may be using SSTable metadata to estimate droppable tombstones.
So, compactions (2) and (3) should be creating/dropping tombstones in the highest levels, hence using LCS with a large TTL should not be an issue per se.
By creating/dropping I mean that the same kind of compactions will be creating tombstones for expired data and/or dropping tombstones if the gc grace period has already passed.
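Reason (3) is exposed as compaction subproperties that can be tuned; a hedged CQL sketch under LCS (table name hypothetical; the values shown are the defaults):

    ALTER TABLE my_keyspace.events
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'tombstone_threshold': '0.2',             -- single-SSTable compaction above 20% droppable tombstones
        'tombstone_compaction_interval': '86400'  -- but at most once a day per SSTable
    };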
A link to source code that clarifies this situation will be great, thanks.
Alain Rodriguez's answer from the mailing list:
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
As explained on a parallel thread, this is wrong, mea culpa. I believe the rest of my comment still stands (hopefully :)).
I'm not sure what "in-place" means, since SSTables are immutable.
My guess is that it refers to tombstones being created in the same...
Yes, I believe that during the next compaction following the expiration date, the entry is 'transformed' into a tombstone and lives in the SSTable that is the result of the compaction, on the level/bucket this SSTable is put into. That's why I said 'in-place', which is indeed a bit weird for immutable data.
As a side idea for your problem, on 'modern' versions of Cassandra (I don't remember the version, that's what 'modern' means ;-)), you can run 'nodetool garbagecollect' regularly (not necessarily frequently) during the off-peak period. That might use the cluster resources when you don't need them to reclaim some disk space. Also, making sure that a 2-year-old record is not being updated regularly by design would definitely help. In the extreme case of writing data once (never updated) and with a TTL, for example, I see no reason for 2-year-old data not to be evicted correctly. As long as the disk can grow, it should be fine.
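For reference, that command (available since Cassandra 3.10; keyspace and table names are hypothetical) looks like:

    # Run during off-peak hours to reclaim space from deleted/expired data
    nodetool garbagecollect my_keyspace my_table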
I would not be too scared about it, as there is 'always' a way to remove tombstones. Yet it's good to think about the design beforehand; generally, it's good if you can rotate the partitions over time, and not reuse old partitions, for example.

Is there any way to have Cassandra safely remove tombstones *before* gc_grace_seconds has elapsed?

I know early removal of tombstones is dangerous because it can cause deleted data to be resurrected, but if all replicas have confirmed deletion then such removal should be safe. For example, if a table has replication factor 3 and all 3 nodes containing the key have confirmed that they have the appropriate tombstone, it should be safe to perform a compaction in which the tombstones are removed because there would be no lingering copies of the data.
Is such safe removal of tombstones possible in Cassandra?
I would rather set gc_grace_seconds to infinity and rely on this type of safe compaction of tombstones than worry about the timing of nodetool repair and gc_grace_seconds.
No, it's not possible to remove tombstones without changing your gc_grace_seconds.
There are operations where all 3 replicas may ack the tombstone, remove it, then need it later on. Consider the case where you need to stream an earlier SSTable back into a cluster.
This type of manual tombstone removal would be significantly worse performance-wise, since you'd only be doing it periodically rather than constantly. You'd be reading extra data unnecessarily, as well as repeatedly compacting tombstones that should already have been removed.
My recommendation is to set your gc_grace_seconds to something reasonable (10 days is fine) and schedule repairs using OpsCenter or cassandra-reaper.
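In CQL, that reasonable setting would look like the following sketch (table name hypothetical; 864000 seconds is also the default):

    ALTER TABLE my_keyspace.my_table
    WITH gc_grace_seconds = 864000;  -- 10 days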

Cassandra Rolling Tombstones

I am doing some simple operations in Cassandra; to keep things simple, I am using a single node. I have one single row and I add 10,000 columns to it; next I go and delete these 10,000 columns; after a while I add 10,000 more columns and then delete them after some time, and so on. Each delete removes all the columns in that one row.
Here's the thing I don't understand: even though I delete them, I see the size of the database increase. My GCGracePeriod is set to 0 and I am using the Leveled Compaction Strategy.
If I understand tombstones correctly, they should be deleted after the first major compaction, but it appears that they are not, even after running the nodetool compact command.
I read on some mailing list that these are rolling tombstones (created when you frequently update and delete the same row) and are not handled by major compaction. So my question is: when are they deleted? If they never are, the data will just grow, which I personally think is bad. To make matters worse, I could not find any documentation about this particular effect.
First, as you're discovering, this isn't a really good idea. At the very least you should use row-level deletes, not individual column deletes (see the CQL sketch below).
Second, there is no such thing as a major compaction with LCS; nodetool compact is a no-op.
Finally, Cassandra 1.2 improves compaction a lot for workloads that generate a lot of tombstones: https://issues.apache.org/jira/browse/CASSANDRA-3442
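To illustrate the first point, a minimal CQL analogue of the question's wide row (schema is hypothetical):

    CREATE TABLE my_keyspace.wide_row (
        pk  text,
        col int,
        val text,
        PRIMARY KEY (pk, col)  -- one partition holding many clustering "columns"
    );

    -- Row-level delete: a single partition tombstone covers everything
    DELETE FROM my_keyspace.wide_row WHERE pk = 'row1';

    -- Per-column delete: 10,000 of these produce 10,000 individual tombstones
    DELETE FROM my_keyspace.wide_row WHERE pk = 'row1' AND col = 42;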

What does Cassandra do during compaction?

I know that Cassandra merges SSTables and row keys, removes tombstones, and so on.
But I am really interested to know how it performs the compaction.
As SSTables are immutable, does it copy all the relevant data to a new file, and while writing to this new file, discard the tombstone-marked data?
I know what compaction does, but I want to know how it makes this happen.
I hope this thread helps, provided you follow all the posts and comments in it: http://comments.gmane.org/gmane.comp.db.cassandra.user/10577
AFAIK
Whenever a memtable is flushed from memory to disk, its contents are simply appended [not updated] to a newly created SSTable, sorted by row key.
SSTable merging [updating] takes place only during compaction.
Until then, the read path reads from every SSTable containing the key you look up, and the results from them are merged to build the reply.
Two types: minor and major.
Minor compaction is triggered automatically whenever a new SSTable is created.
It may remove all tombstones.
It compacts SSTables of equal size into one [initially the memtable flush size] when the minor compaction threshold is reached [4 by default].
Major compaction is manually triggered using nodetool (see the sketch below).
It can be applied to one column family at a time.
It compacts all the SSTables of a CF into one.
It compacts the SSTables and marks the unneeded SSTables for deletion; GC takes care of freeing up that space.
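For reference, a major compaction is triggered like this (keyspace and table names hypothetical; as noted in an earlier answer, it is a no-op under LCS):

    nodetool compact my_keyspace my_table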
Regards,
Tamil
There are two ways to run compaction:
A. Minor compaction, run automatically.
B. Major compaction, run manually.
In both cases, compaction takes x files (per CF) and processes them. In this process it marks the rows with expired TTLs as tombstones and deletes the existing tombstones, generating a new file. The tombstones generated in this compaction will be deleted in the next compaction (once the grace period, gc_grace, has elapsed).
The difference between A and B is the number of files taken and the resulting file.
A takes a few similar files (of similar size) and generates a new file.
B takes ALL the files and generates only one big file.
