Consider a scenario where a table has partitions with thousands of deleted rows. When reading from the table, Cassandra has to scan over thousands of deleted rows (tombstones) before it gets to the live rows.
A common workaround is to manually run a compaction on a node to forcibly get rid of tombstones.
What are the downsides of forcing major compaction on a table (with nodetool compact) and what is the best practice recommendation?
Background
When forcing a major compaction on a table configured with the SizeTieredCompactionStrategy (STCS), all the SSTables on the node get compacted together into a single large SSTable. Due to its size, the resulting SSTable will likely never be compacted again, since similar-sized SSTables are not available as compaction candidates. This creates additional issues for the node: the tombstones in that SSTable never get evicted and keep accumulating, affecting the cluster's performance.
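A quick way to spot this condition on a node is to look at the table's data files and stats; the data directory path below is illustrative and depends on your installation:
$ ls -lh /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db    # one huge Data.db file is the telltale sign
$ nodetool tablestats <keyspace>.<table>                           # check the "SSTable count" (nodetool cfstats on pre-2.2 versions)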
Caveats
We understand that cluster administrators use major compaction as a way of evicting tombstones that have accumulated as a result of high-delete workloads, which in most cases point to an incorrect data model.
The recommendation in this post does NOT constitute a solution to the underlying issue users face. It should not be considered a long-term fix to the data model problem.
Recommendation
In Apache Cassandra 2.2, CASSANDRA-7272 introduced a significant improvement for tables using STCS: it splits the output of nodetool compact into multiple SSTables sized at 50%, then 25%, then 12.5% of the original table size, and so on, until the smallest chunk is 50 MB.
When using major compaction as a last resort for evicting tombstones, pass the --split-output flag (or its shorthand -s) to take advantage of this new feature:
$ nodetool compact --split-output -- <keyspace> <table>
NOTE - This feature is only available in Cassandra 2.2 and newer versions.
Also see How to split large SSTables on another server. Cheers!
Related
I have a cluster (tested with both 2.1.14 and 3.0.17) containing a table that uses TWCS (TimeWindowCompactionStrategy). All SSTables stay in their correct time windows just fine until I remove a node from the cluster (in the same DC); at that moment it seems all SSTables are treated as one pool for normal size-tiered compaction, causing SSTables from different time periods to be joined. Since my cluster is 400 nodes spread over 6 datacenters, node removal is quite common.
I did not find any bug report about this. Is this expected behavior? Having all the SSTables handled together causes a major problem space-wise, since new and old data end up in the same SSTable, causing the old data to remain on disk much longer.
(On 2.1, TWCS is achieved using a jar from Jeff Jirsa's GitHub.)
Have you disabled read repair on the TWCS tables? It can inject out-of-sequence timestamps. TWCS itself will do size-tiered compaction, but only within the current window, and only if it falls behind on compaction.
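For reference, read repair can be disabled per table with CQL on the 2.x/3.x lines where these options still exist; the keyspace and table names below are only placeholders:
$ cqlsh -e "ALTER TABLE my_ks.events WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;"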
The Cassandra compaction process reduces the number of SSTables (data files on disk) used to store data. Minor compactions occur automatically. You can tell Cassandra to perform a major compaction using the nodetool compact command.
Does running nodetool compact merely perform one round of compaction, reducing the number of SSTables, but perhaps still resulting in there being several SSTables? Or does it always compact all the SSTables (of a column family) into one SSTable?
It would depend on the compaction strategy you set for the table.
For DateTieredCompactionStrategy and LeveledCompactionStrategy, by definition I don't think even a major compaction would combine all the SSTables since that would go against the structure of SSTables they aim to create.
For the default SizeTieredCompactionStrategy, anecdotally it appears a major compaction will combine the SSTables into a single table. I ran cassandra-stress write and watched the SSTables for a while. I could see the minor compactions combining SSTables of similar sizes, but not collapsing dissimilar sizes into one.
Then when I'd run a nodetool compact on the table, it would combine SSTables of dissimilar sizes into a single table. I'm not sure if that would be true in all cases.
Taking a quick look at the source, in CompactionManager.java it calls cfStore.getCompactionStrategy().getMaximalTask(gcBefore), which returns a list of tasks that it executes, so that kind of implies it will compact everything, but I didn't drill down any deeper than that.
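A rough sketch of that experiment, assuming a disposable test node with the default stress schema (keyspace1.standard1) and the default data directory:
$ cassandra-stress write n=1000000                                          # populate keyspace1.standard1
$ watch 'ls -lh /var/lib/cassandra/data/keyspace1/standard1-*/*-Data.db'    # watch minor compactions merge similar-sized files
$ nodetool compact keyspace1 standard1                                      # major compaction: with STCS the remaining files collapse into one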
We have a couple of tables with Leveled compaction strategy and SizeTiered compaction strategy. How often do we need to run compaction? Thanks in advance
TL;DR
Compaction runs on its own, as long as you have not disabled autocompaction on the node or the table.
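For reference, automatic (minor) compaction can be paused and resumed per table with nodetool; the keyspace and table below are placeholders:
$ nodetool disableautocompaction <keyspace> <table>
$ nodetool enableautocompaction <keyspace> <table>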
Compaction - what is it?
Per the Cassandra write path, memtables are periodically flushed to disk as SSTables (sorted string tables), which are immutable. When you update an existing cell, the new value eventually gets written to an SSTable, possibly a different one than the original record. On reads, Cassandra sometimes has to scan across several SSTables (with some optimizations, see bloom filters) to find the latest version of a cell. In Cassandra, last write wins.
Compaction takes sstables and compacts them together removing duplicate data, to optimize reads. This is an automatic operation, though you can tune compactions to run more or less often.
Some useful details on Compaction
Size-tiered compaction is the default compaction strategy in Cassandra. It looks for SSTables that are roughly the same size and compacts them together once it finds enough of them (4 by default). Size-tiered is less IO intensive than leveled and will generally work better when you have smaller boxes and rotational drives.
Leveled compaction is optimized for reads: when you have read-heavy workloads or tight read SLAs with lots of updates, leveled may make sense. Leveled compaction is more IO and CPU intensive because you spend more cycles optimizing for reads, but the reads themselves should be faster and hit fewer SSTables. Keep an eye on IO wait and on pending compactions in nodetool compactionstats when you first enable it or if your workload grows.
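The strategy is set per table. As a sketch (the keyspace, table name and threshold are only examples), switching or tuning it looks like this:
$ cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4};"
$ cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {'class': 'LeveledCompactionStrategy'};"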
Compaction Tunables / Levers
Multi-threaded compaction - turn it off; the overhead is bigger than the benefit, to the point where it was removed in C* 2.1.
Concurrent compactors - now defaults to 2; it used to default to the number of cores, which is a bad default. If you're on the 2.0 branch and not running the latest DSE, check this default and consider decreasing it to 2. This is the number of simultaneous compaction tasks you can run (on different column families).
Compaction throttling - a way of limiting the amount of resources that compactions take up. You can tune this on the fly with nodetool getcompactionthroughput and nodetool setcompactionthroughput (see the example below). You want to tune it to the point where you are not accumulating pending tasks. 0 means unlimited, and unlimited is, unintuitively, not usually the fastest setting, as the system may get bogged down.
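As a sketch, checking and adjusting the throttle on a live node looks like this (64 MB/s is just an example value):
$ nodetool getcompactionthroughput          # show the current limit in MB/s
$ nodetool setcompactionthroughput 64       # change it on the fly; 0 means unlimited
# concurrent_compactors, by contrast, is set in cassandra.yaml and requires a restart to change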
We have Cassandra 1.1.1 servers with Leveled Compaction Strategy.
The system serves both read and delete operations. Roughly every half a year we delete approximately half of the data while new data keeps coming in. Sometimes disk usage goes up to 75%, while we know the real data takes about 40-50%; the other space is occupied by tombstones. To avoid disk overflow we force compaction of our tables by dropping all SSTables to level 0: we remove the .json manifest file and restart the Cassandra node. (The gc_grace option does not help, since compaction starts only after a level fills up.)
Starting from Cassandra 2.0 the manifest was moved into the SSTable files themselves: https://issues.apache.org/jira/browse/CASSANDRA-4872
We are considering migrating to Cassandra 2.x, but we are afraid we will no longer have a way to force leveled compaction like this.
My question is: how could we enforce a disk space limit for our table, e.g. 150GB, such that when the limit is exceeded compaction is triggered automatically? The question is mostly about Cassandra 2.x, but alternative solutions for Cassandra 1.1.1 are also welcome.
It seems like I've found the answers myself.
Starting from the 2.x versions there is a tool, sstablelevelreset, which does a similar level reset to deleting the manifest file. The tool is located in the tools directory of the Cassandra distribution, e.g. apache-cassandra-2.1.2/tools/bin/sstablelevelreset.
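As a sketch of its usage (the node must be stopped first, since this is an offline tool; double-check the flag against the usage output of your version):
$ tools/bin/sstablelevelreset --really-reset <keyspace> <table>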
Starting from Cassandra 1.2 (https://issues.apache.org/jira/browse/CASSANDRA-4234) there is tombstone removal support for the Leveled Compaction Strategy via the tombstone_threshold option. It lets you set the maximum ratio of tombstones allowed in an SSTable before that SSTable is compacted on its own to purge them.
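For example (the keyspace, table and ratio below are only placeholders), the threshold is set as a compaction sub-option:
$ cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {'class': 'LeveledCompactionStrategy', 'tombstone_threshold': 0.2};"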
Is there a way I could control max size of a SSTable, for example 100 MB so that when there is actually more than 100MB of data for a CF, then Cassandra creates next SSTable?
Unfortunately the answer is not so simple: the sizes of your SSTables are influenced by your compaction strategy, and there is no direct way to control the maximum SSTable size.
SSTables are initially created when memtables are flushed to disk. The size of these tables initially depends on your memtable settings and the size of your heap (memtable_total_space_in_mb being a large influencer). Typically these SSTables are pretty small. SSTables then get merged together as part of a process called compaction.
If you use the Size-Tiered Compaction Strategy you can end up with really large SSTables. STCS combines SSTables in a minor compaction when there are at least min_threshold (default 4) SSTables of the same size, merging them into one file, expiring data and merging keys. This can produce very large SSTables after a while.
Using the Leveled Compaction Strategy there is an sstable_size_in_mb option that controls the target size for SSTables. In general, SSTables will be less than or equal to this size unless you have a partition key with a lot of data ('wide rows').
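As a sketch (the keyspace and table are placeholders; 160 MB is the default target in recent versions):
$ cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};"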
I haven't experimented much with the Date-Tiered Compaction Strategy yet, but it works similarly to STCS in that it merges files of the same size, while keeping data together in time order; it also has a setting to stop compacting old data (max_sstable_age_days), which could be interesting.
The key is to find the compaction strategy which works best for your data and then tune the properties around what works best for your data model / environment.
You can read more about the configuration settings for compaction here and read this guide to help understand whether STCS or LCS is appropriate for you.