Cassandra hard disk requirement with SizeTieredCompactionStrategy - cassandra

I was going through Cassandra's SizeTieredCompactionStrategy and found out that it can sometimes double the size of the dataset's largest table during the compaction process. But I didn't get any information regarding when this can happen? Does anyone know about this?

This requirement arises from the fact that compaction process should have enough space to take all SSTables that should be compacted, read data from them, and write new SSTable to the same disk. In the worst case, if you have table consisting of all SSTables that should be compacted, their total size is 50% of available disk space, and no data will be thrown away - in this case, compaction process will write a single SSTable that is equal to size of input data. And if you have input data occupying more than 50% of disk space, compaction won't have enough space for writing a new version.
In real situation, you need to have enough space to compact biggest SSTables in your biggest table performed by N compaction threads at the same time. If you have many tables of similar size, then this restriction is not so strong...

Related

What is the effect of number of levels in levelled compaction?

I know how levelled compaction works in DBS like Cassandra, rocksdb etc. Some have max number of levels 4 and some have 7. How does this number affect compaction process? Why can't I have just 2 levels, 1st one which has flushed mem-table data (overlap possible between files) and 2nd one which contains nonoverlapping SSTs?
If there is any doc or duplicate question, please redirect.
Edit-1: Duplicate data increases when the number of levels goes up.
LCS comes to solves STCS’s space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request).
Leveled compaction divides the small sstables (“fragments”) into levels:
Level 0 (L0) is the new sstables, recently flushed from memtables. As their number grows (and reads slow down), our goal is to move sstables out of this level to the next levels.
Each of the other levels, L1, L2, L3, etc., is a single run of an exponentially increasing size: L1 is a run of 10 sstables, L2 is a run of 100 sstables, L3 is a run of 1000 sstables, and so on. (Factor 10 is the default setting in both Scylla and Apache Cassandra).
While solving, or at least significantly improving, the space amplification problem, LCS makes another problem, write amplification, worse.
"Write amplification” is the amount of bytes we had to write to the disk for each one byte of newly flushed sstable data. Write amplification is always higher than 1.0 because we write each piece of data to the commit-log, and then write it again to an sstable, and then each time compaction involves this piece of data and copies it to a new sstable, that’s another write.
Read more about it here:
https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/
https://docs.scylladb.com/kb/compaction/
https://docs.scylladb.com/architecture/compaction/compaction-strategies/
Leveled compaction works Scylla very similarly to how it works in Cassandra and Rocksdb (with some small differences). If you want a short overview on how leveled compaction works in Scylla and why, I suggest that you read my blog post https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/.
Your specific question on why two levels (L0 of recently flushed sstables, Ln of disjoint-range sstables) are not enough - is a very good question:
The main problem is that a single flushed memtable (sstable in L0), containing a random collection of writes, will often intersect all of the sstables in Ln. This means rewriting the entire database every time there's a new memtable flushed, and the result is a super-huge amount of write amplification, which is completely unacceptable.
One way to reduce this write amplification significantly (but perhaps not enough) is to introduce a cascade of intermediate levels, L0, L1, ..., Ln. The end result is that we have L(n-1) which is 1/10th (say) the size of Ln, and we merge L(n-1) - not a single sstable - into Ln. This is the approach that leveled compaction strategy (LCS) uses in all systems you mentioned.
A completely different approach could be not to merge a single sstable into Ln, but rather try to collect a large amount of data first, and only then merge it into Ln. We can't just collect 1,000 tables in L0 because this would make reads very slow. Rather, to collect this large amount of data, one could use size-tiered compaction (STCS) inside L0. In other words, this approach is a "mix" of STCS and LCS with two "levels": L0 uses STCS on new sstables, Ln contains a run of sstables (sstables with disjoint ranges). When L0 reaches 1/10th (say) the size of Ln, L0 is compacted into Ln. Such a mixed approach could have lower write amplification than LCS, but because most of the data is in a run in Ln, it would have same low space and read amplifications as in LCS. None of the mentioned databases (Scylla, Cassandra, or Rocksdb) has such "mixed" compaction supported, as far as I know.

Disk space not decreasing after gc_grace_seconds (10 days) elapsed

I deleted a lot of data(10 billions rows) from my table (made a small app that query from LONG.MIN_VALUE up to LONG.MAX_VALUE in token range and DELETE some data).
Disk space did not decrease after 20 days from then (also I run nodetool repair on 1 node from total of 6), but number of keys(estimate) have decrease accordingly.
Will the space decrease in the future in a natural way, or there is some utility from cassandra I need to run to reclaim the space?
In general, yes, the space will decrease accordingly (once compaction runs). Depending on the compaction strategy chosen for that table, it could take some time. Size Tiered Compaction Strategy for example requires, by default, that 4 sstables be the same size before being compacted. If you have very large SSTABLES then they may not get compacted for quite some time, or indefinitely if there are not 4 of the same size. A manual compaction would fix that situation, but it would put everything in a single sstable, which is not recommended either. If the resulting sstable of a manual compaction is very small, then it won't hurt you. If it ends up compacting to a "large" SSTABLE, then you have sacrificed "now" for "later" (again, because you now have only a single large sstable, it may take a very long time for it to participate in compaction). You can split the sstable after a manual compaction to remidy the situation you've created, but you'll have to take your node off-line to do it. Anyway, short answer is that over time the table should shrink accordingly - when depends on the compaction strategy chosen.
Try running "nodetool garbagecollect" as this will trigger compaction and removes deleted data. which you can verify running status by "nodetool compacationstats"

Cassandra Compacting wide rows large partitions

I have been searching some docs online to get good understanding of how to tackle large partitions in cassandra.
I followed a document on the below link:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", below is metioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The Definition in the cassandra.yaml goes as below:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have searching lot on what exactly the setting in_memory_compaction_limit_in_mb does.
It mentions some compaction is done in memory and some compaction is done on disk.
As per my understanding goes, When Compaction process runs:
SSTABLE is being read from disk---->(compared,tombstones removed,stale data removed) all happens in memory--->new sstable written to disk-->old table being removed
This operations accounts to high Disc space requirements and Disk I/O(Bandwidth).
Do help me with,if my understanding of compaction is wrong. Is there anything in compaction that happens in memory.
In my environment the
in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a 2 pass compaction so can be ignored. You don't have to do the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition that occurs in memory. You can increase the column_index_size_in_kb to reduce the size of that index (at cost of more IO during reads, but likely insignificant compared to the deserialization). Also if you use a newer version (3.11+) the index is lazy loaded after exceeding a certain size which improves things quite a bit.

Cassandra Leveled Compaction vs TimeWindowCompactionStrategy

The idea behind TimeWindowCompactionStrategy is each SSTable has records from only a particular time window, instead of records from different time windows getting mixed with each other.
Doesn't Leveled Compaction result in something similar? SSTables are compacted with other SSTables from the same level, which are all from the same time window. (aka SSTables at higher levels are always older). This looks very similar to DateTieredCompactionStrategy, except that the SSTable size is determined by max size in MBs instead of a time window.
LeveledCS is grouping SSTables by size in a multilevel structure, while TimeWindowCS is making same-interval SSTables (thus it's a single level structure) and has limitations on number of buckets so tables with TWCS requires TTL for all rows.
You are correct about difference between DTCS and LCS.
P.S. I recommend to watch the slides from presentation by the author of TWCS to get the reasoning behind it.

Cassandra cfstats: differences between Live and Total used space values

For about 1 month I'm seeing the following values of used space for the 3 nodes ( I have replication factor = 3) in my Cassandra cluster in nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254Gb) between Live and Total space. It seems I have a lot garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note, that total value staying for month on all of the three nodes. I relied Cassandra normalize data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to cleanup a garbage and free space?
Ok, I have a solution. It looks like Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra perform some re-analysing of SStables during node starting. It removes the SStables marked as compacted, performs recalculation of used space, and do some other staff.
So, what I did is restarted the 3 problem nodes. The Total and Live values have become equals immediately after restart was completed and then Compaction process has been started and used space is reducing now.
Leveled compaction creates sstables of a fixed, relatively small size,
in your case it is 100Mb that are grouped into “levels”. Within each
level, sstables are guaranteed to be non-overlapping. Each level is
ten times as large as the previous.
So basically from this statement provided in cassandra doc, we can conclude that may be in your case ten time large level background is not formed yet, resulting to no compaction.
Coming to second question, since you have kept the replication factor as 3, so data has 3 duplicate copies, for which you have this anomaly.
And finally 25% difference between Live and Total space, as you know its due over deletion operation.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100MB is going to cause you a lot of needless disk IO and it will cause it to take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2. Especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even running off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.

Resources