Does Cassandra track the number of deletions in SSTables to trigger a compaction?

I wonder whether Cassandra triggers a compaction (STCS or LCS) based on the number of deletions in SSTables. In LCS, as far as I know, Cassandra compacts SSTables into the next level only when a level is full. But a deletion record is usually small. If only the SSTable size is considered when deciding whether a level is full, it may take a long time for a tombstone to be reclaimed.
I know RocksDB triggers compaction based on the number of deletions in SSTables, which helps to reduce tombstones.

Yes, Cassandra's compaction can be triggered by the number of deletions (a.k.a. tombstones).
Have a look at the common options for all the compaction strategies, and specifically this parameter:
tombstone_threshold
How much of the sstable should be tombstones for us to consider doing a single sstable compaction of that sstable.
See doc here: https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/index.html
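To make the idea concrete, here is a minimal sketch of the kind of check tombstone_threshold implies. It is not Cassandra's code: the SSTableStats class and the function name are hypothetical, and the defaults used (0.2 for tombstone_threshold, 86400 seconds for tombstone_compaction_interval) are what I recall from the docs, so verify them for your version. The real strategy also looks at overlap with other SSTables, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class SSTableStats:
    """Hypothetical summary of one SSTable (not a Cassandra class)."""
    name: str
    droppable_tombstone_ratio: float  # estimated fraction of droppable tombstones, 0.0 to 1.0
    age_seconds: int                  # time since the SSTable was written


def tombstone_compaction_candidates(sstables,
                                    tombstone_threshold=0.2,
                                    tombstone_compaction_interval=86400):
    """Return names of SSTables that would qualify for a single-SSTable compaction.

    Assumption: an SSTable qualifies when its droppable tombstone ratio exceeds
    the threshold and it is at least tombstone_compaction_interval seconds old.
    """
    return [
        s.name
        for s in sstables
        if s.droppable_tombstone_ratio > tombstone_threshold
        and s.age_seconds >= tombstone_compaction_interval
    ]


stats = [
    SSTableStats("a", 0.05, 200_000),  # too few tombstones
    SSTableStats("b", 0.35, 200_000),  # qualifies
    SSTableStats("c", 0.90, 3_600),    # many tombstones, but too recent
]
print(tombstone_compaction_candidates(stats))
```

So a small SSTable that is mostly tombstones can be picked up on its own, without waiting for a level or size tier to fill.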

Related

Why is forcing major compaction on a table not ideal?

Consider a scenario where a table has partitions with thousands of deleted rows. When reading from the table, Cassandra has to scan over thousands of deleted rows before it gets to the live rows.
A common workaround is to manually run a compaction on a node to forcibly get rid of tombstones.
What are the downsides of forcing major compaction on a table (with nodetool compact) and what is the best practice recommendation?
Background
When forcing a major compaction on a table configured with the SizeTieredCompactionStrategy (STCS), all the SSTables on the node get compacted together into a single large SSTable. Due to its size, the resulting SSTable will likely never get compacted out since similar-sized SSTables are not available as compaction candidates. This creates additional issues for the nodes since tombstones do not get evicted and keep accumulating, affecting the cluster's performance.
Caveats
We understand that cluster administrators use major compaction as a way of evicting tombstones which have accumulated as a result of high-delete workloads, which in most cases are caused by an incorrect data model.
The recommendation in this post does NOT constitute a solution to the underlying issue users face. It should not be considered a long-term fix to the data model problem.
Recommendation
In Apache Cassandra 2.2, CASSANDRA-7272 introduced a huge improvement for tables using STCS: it splits the output of nodetool compact into multiple files which are 50%, then 25%, then 12.5% of the original table size, and so on until the smallest chunk is 50MB.
When using major compaction as a last resort for evicting tombstones, use the --split-output option (or its shorthand -s) to take advantage of this feature:
$ nodetool compact --split-output -- <keyspace> <table>
NOTE - This feature is only available from Cassandra 2.2 and newer versions.
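As a rough illustration of the size progression described above (a sketch of the arithmetic only, not the actual Cassandra writer), the function below halves the chunk size until it would drop below 50MB and puts whatever is left into a final file. The function name and the handling of the remainder are my assumptions.

```python
def split_output_sizes(total_mb, min_chunk_mb=50.0):
    """Approximate output file sizes (in MB) for a split major compaction.

    Assumption: chunks are 1/2, 1/4, 1/8, ... of the total for as long as a
    chunk is at least min_chunk_mb; the remainder goes into one last file.
    """
    sizes = []
    remaining = total_mb
    chunk = total_mb / 2
    while chunk >= min_chunk_mb:
        sizes.append(chunk)
        remaining -= chunk
        chunk /= 2
    if remaining > 0:
        sizes.append(remaining)
    return sizes


print(split_output_sizes(800))
```

For 800MB of data this gives files of 400, 200, 100, 50 and 50 MB, so the smaller files can still find similar-sized partners in later minor compactions.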
Also see How to split large SSTables on another server. Cheers!

Can we do a major compaction before flushing a keyspace/table?

I know compaction merges SSTables, but what if I don't flush the keyspace/table before performing a major compaction? In that case, how does the compaction work?
Compaction works on all levels, and a major compaction is executed differently depending on your compaction strategy.
In STCS, for example, you will have 2 SSTables after the process: one with all the data from the beginning of the process, and a second one with the data written during the process.
Here's an article about it:
http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataMaintain.html#dmlHowDataMaintain__dml-compaction

Does nodetool compact move everything into one SSTable

The Cassandra compaction process reduces the number of SSTables (data files on disk) used to store data. Minor compactions occur automatically. You can tell Cassandra to perform a major compaction using the nodetool compact command.
Does running nodetool compact merely perform one round of compaction, reducing the number of SSTables, but perhaps still resulting in there being several SSTables? Or does it always compact all the SSTables (of a column family) into one SSTable?
It would depend on the compaction strategy you set for the table.
For DateTieredCompactionStrategy and LeveledCompactionStrategy, by definition I don't think even a major compaction would combine all the SSTables since that would go against the structure of SSTables they aim to create.
For the default SizeTieredCompactionStrategy, anecdotally it appears a major compaction will combine the SSTables into a single table. I ran cassandra-stress -write and watched the SSTables for a while. I could see the minor compactions combining SSTables of similar sizes, but not collapsing dissimilar sizes into one.
Then when I'd run a nodetool compact on the table, it would combine SSTables of dissimilar sizes into a single table. I'm not sure if that would be true in all cases.
Taking a quick look at the source, in CompactionManager.java it calls cfStore.getCompactionStrategy().getMaximalTask(gcBefore), which returns a list of tasks that it executes, so that kind of implies it will compact everything, but I didn't drill down any deeper than that.

Do we need to run manual compaction with Leveled compaction strategy and SizeTiered compaction strategy?

We have a couple of tables with Leveled compaction strategy and SizeTiered compaction strategy. How often do we need to run compaction? Thanks in advance
TL;DR
Compaction runs on its own (as long as you did not disable autocompaction in the yaml).
Compaction - what is it?
Per the cassandra write path, we flush memtables to disk periodically into SSTables (sorted string tables) which are immutable. When you update an existing cell, it eventually gets written in an sstable. Possibly a different one than the original record. When we read, sometimes C* has to scan across various sstables (with some optimizations, see bloom filters) to find the latest version of a cell. In Cassandra, last write wins.
Compaction takes sstables and compacts them together removing duplicate data, to optimize reads. This is an automatic operation, though you can tune compactions to run more or less often.
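Here is a toy sketch of that merge, assuming each SSTable is just a dict of key to (write timestamp, value), with None standing in for a tombstone. The compact function and this data layout are made up for illustration; gc_grace_seconds is a real table option (default 864000, i.e. 10 days), but real Cassandra applies further checks before dropping a tombstone, for example whether it still shadows data in SSTables outside the compaction.

```python
def compact(sstables, now, gc_grace_seconds=864_000):
    """Merge SSTables, keeping the newest version of each key (last write wins).

    Each SSTable is a dict: key -> (write_timestamp_seconds, value).
    A value of None marks a tombstone. Tombstones older than
    gc_grace_seconds are dropped from the output (simplified rule).
    """
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            # Keep only the cell with the highest write timestamp.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {
        key: (ts, value)
        for key, (ts, value) in merged.items()
        if not (value is None and now - ts > gc_grace_seconds)
    }


older = {"k1": (100, "v1"), "k2": (100, "v2"), "k3": (100, "v3")}
newer = {"k1": (200, "v1b"), "k2": (150, None), "k3": (999_000, None)}
print(compact([older, newer], now=1_000_000))
```

In this example k1 keeps only its newer value, k2 disappears entirely because its tombstone is past gc_grace_seconds, and k3 keeps its tombstone because it is still too recent to purge.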
Some useful details on Compaction
Size tiered compaction is the default compaction strategy in Cassandra. It looks for sstables that are the same size and compacts them together when it finds enough (4 by default). Size tiered is less IO intensive than leveled and will work better in general when you have smaller boxes and rotational drives.
Leveled compaction is optimized for reads. When you have read-heavy workloads or tight read SLAs with lots of updates, leveled may make sense. Leveled compaction is more IO and CPU intensive because you are spending more cycles optimizing for reads, but the reads themselves should be faster and hit fewer sstables. Keep an eye on IO wait and on pending compactions in nodetool compactionstats when you first enable these or if your workload grows.
Compaction Tunables / Levers
multi threaded compaction - turn it off, the overhead is bigger than the benefit. To the point where it's been removed in C* 2.1.
concurrent compactors - now defaults to 2, used to default to the number of cores, which is a bad default. If you're on the 2.0 branch and not running the latest DSE, check this default and consider decreasing it to 2. This is the number of simultaneous compaction tasks you can run (different column families).
Compaction throttling - a way of limiting the amount of resources that compactions take up. You can tune this on the fly with nodetool getcompactionthroughput and nodetool setcompactionthroughput. You want to tune this to a point where you are not accumulating pending tasks. 0 --> unlimited. Unlimited is, unintuitively, not usually the fastest setting as the system may get bogged down.

Cassandra control SSTable size

Is there a way I could control max size of a SSTable, for example 100 MB so that when there is actually more than 100MB of data for a CF, then Cassandra creates next SSTable?
Unfortunately the answer is not so simple. The sizes of your SSTables will be influenced by your compaction strategy, and there is no direct way to control your max SSTable size.
SSTables are initially created when memtables are flushed to disk. The size of these tables initially depends on your memtable settings and the size of your heap (memtable_total_space_in_mb being a large influencer). Typically these SSTables are pretty small. SSTables get merged together as part of a process called compaction.
If you use Size-Tiered Compaction Strategy you have an opportunity to have really large SSTables. STCS will combine SSTables in a minor compaction when there are at least min_threshold (default 4) sstables of the same size by combining them into one file, expiring data and merging keys. This has the possibility to create very large SSTables after a while.
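To show how "same size" grouping leads to ever larger files, here is a simplified bucketing sketch. It is not the real STCS code: the function name is made up, and bucket_low=0.5 / bucket_high=1.5 are the STCS option defaults as I remember them. Options such as min_sstable_size and max_threshold are ignored.

```python
def size_tiered_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group SSTable sizes into buckets of similar size.

    An SSTable joins the first bucket whose average size is within
    [avg * bucket_low, avg * bucket_high]; otherwise it starts a new bucket.
    Only buckets with at least min_threshold members are returned, since
    those are the ones a minor compaction would pick up.
    """
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size.
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]


print(size_tiered_buckets([10, 11, 9, 12, 100, 400]))
```

Here only the four small SSTables form a compactable bucket. The 100MB and 400MB files sit alone until three more files of similar size show up, which is also why one huge SSTable from a major compaction can linger for a long time.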
Using Leveled Compaction Strategy there is a sstable_size_in_mb option that controls a target size for SSTables. In general SSTables will be less than or equal to this size unless you have a partition key with a lot of data ('wide rows').
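For a feel of how sstable_size_in_mb relates to a level being "full", here is a back-of-the-envelope sketch. It assumes each level from L1 upward holds ten times as much as the previous one and that the default sstable_size_in_mb is 160. Both figures are from memory, and the function name is hypothetical, so check the documentation for your version.

```python
def level_capacity_mb(level, sstable_size_in_mb=160, fanout=10):
    """Approximate target capacity of an LCS level in MB.

    Assumption: level N (N >= 1) targets fanout**N SSTables of
    sstable_size_in_mb each. L0 is handled differently and is not modelled.
    """
    if level < 1:
        raise ValueError("only levels >= 1 are modelled in this sketch")
    return sstable_size_in_mb * fanout ** level


for lvl in (1, 2, 3):
    print(lvl, level_capacity_mb(lvl), "MB")
```

With these numbers L1 targets about 1.6GB, L2 about 16GB and L3 about 160GB, so the SSTable size stays bounded while the number of SSTables per level grows.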
I haven't experimented much with Date-Tiered Compaction Strategy yet, but it works similarly to STCS in that it merges files of the same size. It also keeps data together in time order and has a configuration option to stop compacting old data (max_sstable_age_days), which could be interesting.
The key is to find the compaction strategy which works best for your data and then tune the properties around what works best for your data model / environment.
You can read more about the configuration settings for compaction here and read this guide to help understand whether STCS or LCS is appropriate for you.
