LeveledCompactionStrategy: what is the impact of tuning sstable_size_in_mb? - cassandra

To enhance read performance, I try to keep the number of underlying SSTables low with LCS, so I set sstable_size_in_mb to 1280MB, as suggested by some articles which pointed out that the 160MB default value was picked by the Cassandra core team a long time ago, on what is by now a pretty old server with only 2GB of RAM. However, my concern is about the implications of having a higher value of sstable_size_in_mb.
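For reference, sstable_size_in_mb is a per-table compaction sub-option, so a change like the one described would be applied roughly as follows (my_keyspace/my_table are placeholders, not my real schema):

    -- 1280 is the value under discussion; the current default is 160
    ALTER TABLE my_keyspace.my_table
      WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 1280};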
What I understand is that LCS regularly compacts all the SSTables in L0 together with all the SSTables in L1, then replaces the entire content of L1. So each time L1 is replaced, the hardware requirements (CPU/RAM) and write amplification may be higher with a higher value of sstable_size_in_mb. Indeed, if sstable_size_in_mb = 1280MB, then 10 tables of 1280MB each in L1 have to be merged with all the L0 tables every time. And maybe there are also implications at higher levels, even if the number of SSTables to replace seems lower (one L1 SSTable is merged with 10 L2 SSTables, and those 10 L2 SSTables are then replaced).
Questions:
Having a higher value of sstable_size_in_mb may increase read performance by lowering the number of SSTables involved in a CQL table. However, what are the other implications of such a higher value (like 1280MB) for sstable_size_in_mb?
With a higher value, are there any corresponding configurations to tune (garbage collector, chunk cache, ...) to allow better performance for compactions of those larger SSTables, and less GC activity?
A more subjective question: what is the typical value of sstable_size_in_mb you use in your deployments?

To answer your first question, I'd like to quote some original text from Jonathan Ellis in CASSANDRA-5727 when the community initially looked into the sstable_size_in_mb (and subsequently decided on the 160 number).
"larger files mean that each level contains more data, so reads will
have to touch less sstables, but we're also compacting less unchanged
data when we merge forward." (Note: I suspect there was a typo and he meant "we're also compacting more unchanged data when we merge forward", which aligns with what you stated in your second paragraph, and what he meant by larger file impacting "compaction efficiency".)
As for any other implication: it might push the envelope on the LCS node density upper bound, as it would allow much higher density for the same number of SSTables per node.
To answer your second question, compaction does create a lot of churn in the heap, as it creates many short-lived objects from SSTables. Given the much bigger SSTables involved in compaction when you use the 1280MB size, you should pay attention to your gc.log and watch out for "Humongous Allocation" messages (if you use G1GC). If they turn out to happen a lot, you can increase the region size to avoid costly collections of humongous objects by using the -XX:G1HeapRegionSize option.
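As an illustration only (not a recommendation), that flag goes with the other JVM settings, e.g. in jvm.options (the exact file name varies a little between Cassandra versions), and the value below is just an example to be derived from your own gc.log:

    # JVM options file (illustrative values only)
    -XX:+UseG1GC
    # raise the region size if gc.log shows frequent "Humongous Allocation" events
    -XX:G1HeapRegionSize=16m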
For your third question, as far as I know, many have used the 160MB default value for a long time, as we don't yet have a comprehensive published analysis of the impact/benefit of larger SSTable sizes on modern hardware (I attempted to run some quick tests, but got busy with other things and didn't finish that effort, sorry). However, I do think that if people are interested in achieving higher node density with LCS, this SSTable size is a parameter worth exploring.

Related

What is the effect of number of levels in levelled compaction?

I know how levelled compaction works in DBs like Cassandra, RocksDB, etc. Some have a maximum of 4 levels and some have 7. How does this number affect the compaction process? Why can't I have just 2 levels, the 1st one holding flushed memtable data (overlap possible between files) and the 2nd one containing non-overlapping SSTs?
If there is any doc or duplicate question, please redirect.
Edit-1: Duplicate data increases when the number of levels goes up.
LCS comes to solve STCS's space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request).
Leveled compaction divides the small sstables (“fragments”) into levels:
Level 0 (L0) is the new sstables, recently flushed from memtables. As their number grows (and reads slow down), our goal is to move sstables out of this level to the next levels.
Each of the other levels, L1, L2, L3, etc., is a single run of an exponentially increasing size: L1 is a run of 10 sstables, L2 is a run of 100 sstables, L3 is a run of 1000 sstables, and so on. (Factor 10 is the default setting in both Scylla and Apache Cassandra).
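As a rough worked example, using the 160MB default sstable size discussed in the previous question and the default fanout of 10:

    L1 ≈ 10 sstables    x 160MB ≈ 1.6GB
    L2 ≈ 100 sstables   x 160MB ≈ 16GB
    L3 ≈ 1,000 sstables x 160MB ≈ 160GB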
While solving, or at least significantly improving, the space amplification problem, LCS makes another problem, write amplification, worse.
"Write amplification” is the amount of bytes we had to write to the disk for each one byte of newly flushed sstable data. Write amplification is always higher than 1.0 because we write each piece of data to the commit-log, and then write it again to an sstable, and then each time compaction involves this piece of data and copies it to a new sstable, that’s another write.
Read more about it here:
https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/
https://docs.scylladb.com/kb/compaction/
https://docs.scylladb.com/architecture/compaction/compaction-strategies/
Leveled compaction works in Scylla very similarly to how it works in Cassandra and RocksDB (with some small differences). If you want a short overview of how leveled compaction works in Scylla and why, I suggest that you read my blog post https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/.
Your specific question on why two levels (L0 of recently flushed sstables, Ln of disjoint-range sstables) are not enough is a very good one:
The main problem is that a single flushed memtable (sstable in L0), containing a random collection of writes, will often intersect all of the sstables in Ln. This means rewriting the entire database every time there's a new memtable flushed, and the result is a super-huge amount of write amplification, which is completely unacceptable.
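As a back-of-the-envelope illustration (the sizes here are assumed purely for the example): with ~160MB per flushed sstable and, say, 1TB already sitting in Ln, merging each flush straight into Ln would rewrite the whole 1TB for every 160MB of new data:

    new data per flush:        ~160MB (one L0 sstable)
    data rewritten per flush:  ~1TB   (all of Ln, since the flush overlaps everything)
    write amplification:       1TB / 160MB ≈ 6,250x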
One way to reduce this write amplification significantly (but perhaps not enough) is to introduce a cascade of intermediate levels, L0, L1, ..., Ln. The end result is that we have L(n-1) which is 1/10th (say) the size of Ln, and we merge L(n-1) - not a single sstable - into Ln. This is the approach that leveled compaction strategy (LCS) uses in all systems you mentioned.
A completely different approach could be not to merge a single sstable into Ln, but rather to try to collect a large amount of data first, and only then merge it into Ln. We can't just collect 1,000 tables in L0 because this would make reads very slow. Rather, to collect this large amount of data, one could use size-tiered compaction (STCS) inside L0. In other words, this approach is a "mix" of STCS and LCS with two "levels": L0 uses STCS on new sstables, Ln contains a run of sstables (sstables with disjoint ranges). When L0 reaches 1/10th (say) the size of Ln, L0 is compacted into Ln. Such a mixed approach could have lower write amplification than LCS, but because most of the data is in a run in Ln, it would have the same low space and read amplification as LCS. None of the mentioned databases (Scylla, Cassandra, or RocksDB) supports such a "mixed" compaction, as far as I know.

memtable_flush_writers significance and uses

Can anyone explain the use case and significance of memtable_flush_writers? And in what situations should we tune it from the default value? I have already read the DataStax docs, but the actual uses and benefits are not clear to me.
By default, memtable_cleanup_threshold is computed as: 1 / ( memtable_flush_writers + 1)
There is some guidance in the YAML about how to set this value, as Mehul pointed out. Contrary to that, I would never set it to the number of cores, regardless of whether or not you're using SSDs.
The problems come when memtable_flush_writers is set too high: your node can become overwhelmed with small flushes that trigger compaction. This has the unfortunate side effect of causing your commitlog to fill up, eventually getting to a point where it cannot keep up with the flush frequency.
If that happens, you can force a flush manually using nodetool flush. But if you see your commitlog filling your disk, lowering your memtable_flush_writers is a good thing to try.
Note: As with all "tuning"-like changes in Cassandra, I'd make incremental changes over time, as opposed to one drastic change, just to be on the safe side.
memtable_cleanup_threshold: When the total amount of memory used by all non-flushing memtables exceeds this ratio, Cassandra flushes the largest memtable to disk.
memtable_flush_writers: This defines the number of memtable flush writer threads. The threads write to disk (sstables) in parallel. Changing this parameter is suggested when solid-state drives (SSDs) are used.
Note: If your data directories are backed by SSDs, increase this setting to the number of cores.
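A minimal cassandra.yaml sketch tying the two settings together (values are illustrative only, not a recommendation):

    # cassandra.yaml (illustrative values)
    memtable_flush_writers: 4
    # if left unset, memtable_cleanup_threshold defaults to
    # 1 / (memtable_flush_writers + 1) = 1 / 5 = 0.2
    # memtable_cleanup_threshold: 0.2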
I hope this solves your query.

Cassandra Compacting wide rows large partitions

I have been searching through some docs online to get a good understanding of how to tackle large partitions in Cassandra.
I followed a document on the below link:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", below is metioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The Definition in the cassandra.yaml goes as below:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have searched a lot on what exactly the setting in_memory_compaction_limit_in_mb does.
It mentions some compaction is done in memory and some compaction is done on disk.
As per my understanding, when the compaction process runs:
SSTables are read from disk ----> (compared, tombstones removed, stale data removed) all happens in memory ---> the new sstable is written to disk --> the old sstables are removed.
These operations account for high disk space requirements and disk I/O (bandwidth).
Do help me if my understanding of compaction is wrong. Is there anything in compaction that happens in memory?
In my environment the in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a 2-pass compaction, so it can be ignored. You don't have to do the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition, which occurs in memory. You can increase column_index_size_in_kb to reduce the size of that index (at the cost of more IO during reads, but likely insignificant compared to the deserialization). Also, if you use a newer version (3.11+), the index is lazily loaded after exceeding a certain size, which improves things quite a bit.
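For illustration only, that knob lives in cassandra.yaml; 64 is the long-standing default, and the larger value shown is just an example of the trade-off described above:

    # cassandra.yaml
    # default granularity of the per-partition index
    column_index_size_in_kb: 64
    # e.g. raising it to 256 shrinks the index that must be deserialized,
    # at the cost of reading larger chunks per seek
    # column_index_size_in_kb: 256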

TTL tombstones in Cassandra using LCS: are they created in the same level as the TTLed data?

I'm using LCS and a relatively large TTL of 2 years for all inserted rows and I'm concerned about the moment at which C* would drop the corresponding tombstones (neither explicit deletes nor updates are being performed).
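For context, a setup like the one described would look roughly as follows (the table and column names are placeholders, not my actual schema):

    -- every write carries a ~2-year TTL (63072000 seconds)
    INSERT INTO my_keyspace.events (id, payload)
    VALUES (uuid(), 'example')
    USING TTL 63072000;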
From Missing Manual for Leveled Compaction Strategy, Tombstone Compactions in Cassandra and Deletes Without Tombstones or TTLs I understand that
All levels except L0 contain non-overlapping SSTables, but a partition key may be present in one SSTable in each level (aka distributed in all levels).
For a compaction to be able to drop a tombstone, it must be sure that it is compacting all the SSTables that contain the data, to prevent zombie data (this is done by checking bloom filters). It also considers gc_grace_seconds.
So, for my particular use case (2-year TTL and write-heavy load) I can conclude that TTLed data will be in the highest levels, so I'm wondering when those SSTables with TTLed data will be compacted with the SSTables that contain the corresponding tombstones.
The main question is: where are tombstones (from TTLs) being created? Are they created at level 0, so that it will take a long time until they end up in the highest levels (and hence disk space will take a long time to be freed)?
In a comment from About deletes and tombstones Alain says that
Yet using TTLs helps, it reduces the chances of having data being fragmented between SSTables that will not be compacted together any time soon. Using any compaction strategy, if the delete comes relatively late in the row history, as it use to happen, the 'upsert'/'insert' of the tombstone will go to a new SSTable. It might take time for this tombstone to get to the right compaction "bucket" (with the rest of the row) and for Cassandra to be able to finally free space.
My understanding is that with TTLs the tombstones is created in-place, thus it is often and for many reasons easier and safer to get rid of a TTLs than from a delete.
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
I'm not sure what it means with "in-place" since SSTables are immutable.
(I also have some doubts about what it says of using default_time_to_live that I've asked in How default_time_to_live would delete rows without tombstones in Cassandra?).
My guess is that it is referring to tombstones being created in the same level (but different SSTables) as the TTLed data, during a compaction triggered by one of the following reasons:
"Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread" The Missing Manual for Leveled Compaction Strategy
"If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions" The Missing Manual for Leveled Compaction Strategy
"When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable." CASSANDRA-7019
Since tombstones are created during compaction, I think it may be using SSTable metadata to estimate droppable tombstones.
So, compactions (2) and (3) should be creating/dropping tombstones in the highest levels, hence using LCS with a large TTL should not be an issue per se.
With creating/dropping I mean that the same kind of compactions will be creating tombstones for expired data and/or dropping tombstones if the gc period has already passed.
A link to source code that clarifies this situation will be great, thanks.
Alain Rodriguez's answer from mailing list
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
As explained on a parallel thread, this is wrong, mea culpa. I believe the rest of my comment still stands (hopefully :)).
I'm not sure what it means with "in-place" since SSTables are immutable.
My guess is that is referring to tombstones being created in the same
Yes, I believe during the next compaction following the expiration date, the entry is 'transformed' into a tombstone, and lives in the SSTable that is the result of the compaction, on the level/bucket this SSTable is put into. That's why I said 'in-place', which is indeed a bit weird for immutable data.
As a side idea for your problem, on 'modern' versions of Cassandra (I don't remember the version, that's what 'modern' means ;-)), you can run 'nodetool garbagecollect' regularly (not necessarily frequently) during the off-peak period. That might use the cluster resources when you don't need them to claim some disk space. Also making sure that a 2 years old record is not being updated regularly by design would definitely help. In the extreme case of writing a data once (never updated) and with a TTL for example, I see no reason for a 2 years old data not to be evicted correctly. As long as the disk can grow, it should be fine.
I would not be too much scared about it, as there is 'always' a way to remove tombstones. Yet it's good to think about the design beforehand indeed; generally, it's good if you can rotate the partitions over time, not to reuse old partitions for example.
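For reference, the 'nodetool garbagecollect' suggestion above would look roughly like this (keyspace/table names are placeholders; check nodetool help garbagecollect on your version for the exact options):

    # run during off-peak hours; removes deleted/expired data from the given table
    nodetool garbagecollect my_keyspace my_table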

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
If you allow a partition to grow unbounded you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and on read times. These practical limitations are constantly increasing from version to version. This practical limitation is not fixed but varies with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what's too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb for reads to prevent unnecessary deserialization, but that reduces heap space and fills old gen. You can increase the column index size, but it will increase worst-case IO costs on reads. There are also many different settings for CMS and G1 to tune the impact of the huge spike in object allocations when reading these big partitions. There are active efforts on improving this, so in the future it might no longer be the bottleneck.
Repairs also only go down to (in the best-case scenario) the partition level. So if, say, you are constantly appending to a partition, and the hash of that partition on 2 nodes is compared at not exactly the same time (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you're still streaming massive amounts of data and fluctuating disk usage significantly, which will then need to be compacted together unnecessarily.
You can probably keep adding corner cases and scenarios that have issues to this list. Many times large partitions are possible to read, but the tuning and corner cases involved in them are not really worth it; it's better to just design the data model to be friendly with how Cassandra expects it. I would recommend targeting 100MB, but you can go far beyond that comfortably. Into the GBs, and you will need to start considering tuning for it (depending on data model, use case, etc.).
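As a side note, an easy way to see where your partitions stand relative to that 100MB guideline is nodetool (keyspace/table names are placeholders; on older versions the command is cfhistograms):

    # prints partition size percentiles and max for one table
    nodetool tablehistograms my_keyspace my_table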
