What is the effect of number of levels in levelled compaction? - cassandra

I know how levelled compaction works in DBs like Cassandra, RocksDB, etc. Some have a maximum of 4 levels and some have 7. How does this number affect the compaction process? Why can't I have just 2 levels: a first one holding flushed memtable data (overlap possible between files) and a second one containing non-overlapping SSTables?
If there is any doc or duplicate question, please redirect.
Edit-1: Duplicate data increases when the number of levels goes up.

LCS was introduced to solve STCS’s space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request).
Leveled compaction divides the small sstables (“fragments”) into levels:
Level 0 (L0) is the new sstables, recently flushed from memtables. As their number grows (and reads slow down), our goal is to move sstables out of this level to the next levels.
Each of the other levels, L1, L2, L3, etc., is a single run of an exponentially increasing size: L1 is a run of 10 sstables, L2 is a run of 100 sstables, L3 is a run of 1000 sstables, and so on. (Factor 10 is the default setting in both Scylla and Apache Cassandra).
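To make the exponential structure concrete, here is a minimal sketch of the level capacities (the 160MB sstable size is Cassandra's default; showing four levels is just for illustration):

```python
# Minimal sketch of LCS level capacities with the default fan-out of 10.
# sstable_size_mb matches Cassandra's default; the level count shown is
# illustrative only.
sstable_size_mb = 160
fanout = 10

for level in range(1, 5):
    sstables = fanout ** level
    print(f"L{level}: {sstables} sstables, ~{sstables * sstable_size_mb} MB")
# L1: 10 sstables, ~1600 MB
# L2: 100 sstables, ~16000 MB
# L3: 1000 sstables, ~160000 MB
# L4: 10000 sstables, ~1600000 MB
```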
While solving, or at least significantly improving, the space amplification problem, LCS makes another problem, write amplification, worse.
"Write amplification” is the amount of bytes we had to write to the disk for each one byte of newly flushed sstable data. Write amplification is always higher than 1.0 because we write each piece of data to the commit-log, and then write it again to an sstable, and then each time compaction involves this piece of data and copies it to a new sstable, that’s another write.
Read more about it here:
https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/
https://docs.scylladb.com/kb/compaction/
https://docs.scylladb.com/architecture/compaction/compaction-strategies/

Leveled compaction works in Scylla very similarly to how it works in Cassandra and RocksDB (with some small differences). If you want a short overview of how leveled compaction works in Scylla and why, I suggest you read my blog post https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/.
Your specific question - why two levels (L0 of recently flushed sstables, Ln of disjoint-range sstables) are not enough - is a very good one:
The main problem is that a single flushed memtable (an sstable in L0), containing a random collection of writes, will often intersect all of the sstables in Ln. This means rewriting the entire database every time a new memtable is flushed, and the result is a super-huge amount of write amplification, which is completely unacceptable.
One way to reduce this write amplification significantly (but perhaps not enough) is to introduce a cascade of intermediate levels, L0, L1, ..., Ln. The end result is that we have L(n-1), which is (say) 1/10th the size of Ln, and we merge L(n-1) - not a single sstable - into Ln. This is the approach that the leveled compaction strategy (LCS) uses in all the systems you mentioned.
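A back-of-the-envelope comparison of the two designs (all sizes are made-up, and this ignores real-world details such as partial overlaps and L0 behavior):

```python
import math

# Bytes rewritten per flushed byte: two-level design vs a cascade of levels.
# All numbers are illustrative assumptions, not measurements.
memtable_mb = 100
db_mb = 1_000_000   # ~1 TB in the last level
fanout = 10

# Two levels: each flushed memtable intersects all of Ln, so we rewrite
# the whole database per flush.
two_level_wa = db_mb / memtable_mb                        # 10000x

# Cascade: each byte crosses ~log10(db/memtable) levels, and merging into
# a level ~10x larger rewrites it roughly `fanout` times per level.
levels = round(math.log(db_mb / memtable_mb, fanout))     # 4 levels
cascade_wa = levels * fanout                              # ~40x

print(two_level_wa, cascade_wa)
```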
A completely different approach could be not to merge a single sstable into Ln, but rather to collect a large amount of data first, and only then merge it into Ln. We can't just collect 1,000 sstables in L0, because that would make reads very slow. Rather, to collect this large amount of data, one could use size-tiered compaction (STCS) inside L0. In other words, this approach is a "mix" of STCS and LCS with two "levels": L0 uses STCS on new sstables, and Ln contains a run of sstables (sstables with disjoint ranges). When L0 reaches (say) 1/10th the size of Ln, L0 is compacted into Ln. Such a mixed approach could have lower write amplification than LCS, but because most of the data is in a run in Ln, it would have the same low space and read amplification as LCS. As far as I know, none of the mentioned databases (Scylla, Cassandra, or RocksDB) supports such a "mixed" compaction.
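The same kind of rough arithmetic for this hypothetical mixed design (the tier count is an assumption):

```python
# Rough write amplification for the hypothetical "mixed" STCS+LCS design.
# Assumptions: STCS inside L0 copies each byte once per size tier it
# climbs, and L0 is merged into Ln once it reaches 1/10 of Ln's size.
stcs_tiers = 4            # assumed number of size tiers inside L0
stcs_wa = stcs_tiers      # one copy per tier
merge_wa = 1 + 10         # merging L0 (size S) into Ln (size 10S) writes 11S

print(stcs_wa + merge_wa)  # ~15x, versus ~40x for the cascade above
```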

Related

Cassandra hard disk requirement with SizeTieredCompactionStrategy

I was going through Cassandra's SizeTieredCompactionStrategy and found out that it can sometimes double the size of the dataset's largest table during the compaction process, but I couldn't find any information about when this can happen. Does anyone know?
This requirement arises from the fact that the compaction process needs enough space to take all the SSTables being compacted, read the data from them, and write the new SSTable to the same disk. In the worst case, a table consists entirely of SSTables that should be compacted together, their total size is 50% of the available disk space, and no data can be thrown away - the compaction process will then write a single SSTable equal in size to the input data. And if the input data occupies more than 50% of the disk space, compaction won't have enough room to write the new version.
In a real situation, you need enough space to compact the biggest SSTables of your biggest table, across the N compaction threads running at the same time. If you have many tables of similar size, this restriction is not so strict...
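Following that reasoning, a quick headroom sanity check might look like this (the sizes are placeholders):

```python
# Worst-case STCS headroom check: during compaction the inputs and the
# rewritten output coexist on disk until the inputs are deleted.
def has_headroom(compaction_input_mb: float, free_disk_mb: float) -> bool:
    # Worst case: nothing is purged, so the output is as big as the input.
    return free_disk_mb >= compaction_input_mb

# Placeholder sizes: 400GB of input sstables, 350GB free.
print(has_headroom(compaction_input_mb=400_000, free_disk_mb=350_000))  # False
```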

Disk space not decreasing after gc_grace_seconds (10 days) elapsed

I deleted a lot of data (10 billion rows) from my table (I made a small app that queries from LONG.MIN_VALUE up to LONG.MAX_VALUE in the token range and DELETEs some data).
Disk space has not decreased in the 20 days since (I also ran nodetool repair on 1 node out of 6 in total), but the number of keys (estimate) has decreased accordingly.
Will the space decrease naturally in the future, or is there some utility from Cassandra I need to run to reclaim the space?
In general, yes, the space will decrease accordingly (once compaction runs). Depending on the compaction strategy chosen for that table, it could take some time. Size Tiered Compaction Strategy, for example, requires by default that 4 sstables be the same size before being compacted. If you have very large sstables, they may not get compacted for quite some time, or indefinitely if there are not 4 of the same size.
A manual compaction would fix that situation, but it would put everything in a single sstable, which is not recommended either. If the resulting sstable of a manual compaction is very small, it won't hurt you. If it ends up compacting to a "large" sstable, then you have sacrificed "now" for "later" (again, because you now have only a single large sstable, it may take a very long time for it to participate in compaction). You can split the sstable after a manual compaction to remedy the situation you've created, but you'll have to take your node offline to do it. Anyway, the short answer is that over time the table should shrink accordingly - when depends on the compaction strategy chosen.
Try running "nodetool garbagecollect", as this will trigger a compaction and remove deleted data. You can verify its status with "nodetool compactionstats".

LeveledCompactionStrategy : what is the impact of tuning the sstable_size_in_mb?

To enhance read performance, I try to have fewer underlying SSTables with LCS, so I set sstable_size_in_mb to 1280MB, as suggested by some articles which pointed out that the 160MB default value was picked by the Cassandra core team a long time ago, on a server that is pretty old by now, with only 2GB of RAM. However, my concern is about the implications of a higher value of sstable_size_in_mb.
What I understand is that LCS regularly compacts all the SSTables in L0 together with all the SSTables in L1, replacing the entire content of L1. So each time L1 is replaced, the CPU/RAM requirements and write amplification may be higher with a higher value of sstable_size_in_mb: if sstable_size_in_mb = 1280MB, then 10 tables of 1280MB in L1 have to be merged each time with all the L0 tables. And maybe there are also implications at higher levels, even if the number of SSTables to replace seems lower (one L1 SSTable is merged with 10 L2 SSTables, and those 10 L2 SSTables are replaced).
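For concreteness, here is a rough sketch of my concern (assuming L1 holds ~10 sstables, as with the default fan-out, and that all of L1 is rewritten):

```python
# Rough bytes rewritten when L0 is compacted into L1, for two sstable
# sizes, assuming L1 holds ~10 sstables and is rewritten entirely.
for sstable_size_mb in (160, 1280):
    l1_mb = 10 * sstable_size_mb
    print(f"sstable_size_in_mb={sstable_size_mb}: rewrite ~{l1_mb} MB of L1")
# sstable_size_in_mb=160: rewrite ~1600 MB of L1
# sstable_size_in_mb=1280: rewrite ~12800 MB of L1
```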
Questions:
A higher value of sstable_size_in_mb may increase read performance by lowering the number of SSTables that make up a CQL table. However, what are the other implications of such a higher value (like 1280MB) for sstable_size_in_mb?
With a higher value, is there any corresponding configuration to tune (garbage collector, chunk cache, ...) to allow better performance when compacting those larger SSTables, and to reduce GC activity?
A more subjective question: what is the typical value of sstable_size_in_mb you use in your deployments?
To answer your first question, I'd like to quote some original text from Jonathan Ellis in CASSANDRA-5727 when the community initially looked into the sstable_size_in_mb (and subsequently decided on the 160 number).
"larger files mean that each level contains more data, so reads will
have to touch less sstables, but we're also compacting less unchanged
data when we merge forward." (Note: I suspect there was a typo and he meant "we're also compacting more unchanged data when we merge forward", which aligns with what you stated in your second paragraph, and what he meant by larger file impacting "compaction efficiency".)
As for any other implication: it might push the envelope on the LCS node density upper bound, as it would allow much higher density for the same number of SSTables per node.
To answer your second question, compaction does create a lot of churn in the heap, as it creates many short-lived objects from SSTables. Given the much bigger SSTables involved in compaction when you use the 1280MB size, you should pay attention to your gc.log and watch out for "Humongous Allocation" messages (if you use G1GC). If they turn out to happen a lot, you can increase the region size to avoid costly collections of humongous objects, using the -XX:G1HeapRegionSize option.
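A minimal way to spot those messages (the log path and the exact message text are assumptions; both vary with your JVM version and logging setup):

```python
# Count humongous-allocation lines in a G1 gc.log.
# Path and message wording are assumptions; adjust for your setup.
count = 0
with open("/var/log/cassandra/gc.log") as log:
    for line in log:
        if "humongous" in line.lower():
            count += 1
print(count, "humongous allocation log lines")
```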
For your third question, as far as I know, many have used the 160MB default value for a long time, as we don't yet have a comprehensive published analysis of the impact/benefit of larger SSTable sizes benchmarked on modern hardware (I attempted to run some quick tests, but got busy with other things and didn't finish that effort, sorry). However, I do think that if people are interested in achieving higher node density with LCS, this SSTable size is a parameter worth exploring.

Cassandra Leveled Compaction vs TimeWindowCompactionStrategy

The idea behind TimeWindowCompactionStrategy is that each SSTable holds records from only a particular time window, instead of records from different time windows getting mixed with each other.
Doesn't Leveled Compaction result in something similar? SSTables are compacted with other SSTables from the same level, which are all from the same time window (i.e., SSTables at higher levels are always older). This looks very similar to DateTieredCompactionStrategy, except that the SSTable grouping is determined by a maximum size in MB instead of a time window.
LeveledCS groups SSTables by size in a multi-level structure, while TimeWindowCS produces same-interval SSTables (thus a single-level structure) and has a limit on the number of buckets, so tables using TWCS require a TTL on all rows.
You are correct about the difference between DTCS and LCS.
P.S. I recommend going through the slides from the presentation by the author of TWCS to get the reasoning behind it.

How does the Leveled Compaction Strategy ensure 90% of reads are from one sstable

I am trying to understand how the Leveled Compaction Strategy in Cassandra works such that it guarantees 90% of all reads will be satisfied from a single sstable.
From DataStax Doc:
new sstables are added to the first level, L0, and immediately compacted with the sstables in L1. When L1 fills up, extra sstables are promoted to L2. Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap.
LeveledCompactionStrategy (LCS) in Cassandra implements the internals of LevelDB. You can check the exact implementation details in the LevelDB implementation doc.
In order to give you a simple explanation take into account the following points:
Every SSTable is created when a fixed (relatively small) size limit is reached. By default L0 gets 5MB files, and each subsequent level is 10x the size (in L1 you'll have 50MB of data, in L2 500MB, and so on).
SSTables within a level (above L0) are created with the guarantee that they don't overlap
When a level fills up, a compaction is triggered and sstables from level L are promoted to level L+1. So, in L1 you'll have 50MB in ~10 files, in L2 500MB in ~100 files, etc.
Those are the relevant details that justify 90% of reads being served from the same file (SSTable). Let's do the math together and everything will become clearer.
Imagine you have keys A, B, C, D, E in L0, and each key takes 1MB of data.
Next we insert key F. Because level 0 is full, a compaction will create a file with [A,B,C,D,E] in level 1, and F will remain in level 0.
That's ~83% of data in 1 file in L1.
Next we insert G, H, I, J and K. L0 fills up again, so L1 gets a new sstable with [F,G,H,I,J].
By now we have K in L0, [A,B,C,D,E] and [F,G,H,I,J] in L1
And that's ~90% of data in L1.
If we continue inserting keys we will get roughly the same behavior, and that's why you get ~90% of reads served from the same file/SSTable.
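You can reproduce that arithmetic with a tiny simulation (the 5-key L0 capacity mirrors the example above and is illustrative only):

```python
# Toy model of the example above: flush one key at a time into L0 and
# compact L0 into a new L1 run whenever L0 reaches 5 keys.
l0, l1_runs = [], []

for key in "ABCDEFGHIJK":
    if len(l0) == 5:              # L0 full: compact it into L1
        l1_runs.append(sorted(l0))
        l0 = []
    l0.append(key)

l1_keys = sum(len(run) for run in l1_runs)
total = l1_keys + len(l0)
print(l1_runs, l0)                # [[A..E], [F..J]] and [K]
print(f"{l1_keys / total:.0%} of keys live in L1")   # 91%
```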
More in-depth and detailed information (what happens with updates and tombstones) is given in this paragraph of the link I mentioned (the sizes for compaction election are different because they are LevelDB's defaults, not Cassandra's):
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).
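The "pick all overlapping files from level L+1" rule from that quote can be sketched like this (illustrative only, not LevelDB's actual code):

```python
# Given a level-L file's key range, select every level-(L+1) file whose
# key range overlaps it. Files are (smallest_key, largest_key) pairs.
def overlapping(input_range, next_level_files):
    lo, hi = input_range
    return [f for f in next_level_files if not (f[1] < lo or hi < f[0])]

l2_files = [("a", "c"), ("d", "f"), ("g", "k"), ("m", "z")]
print(overlapping(("e", "h"), l2_files))  # [('d', 'f'), ('g', 'k')]
```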
