I've read that you can set compaction strategy per table in Cassandra/Scylla, as described here https://docs.scylladb.com/operating-scylla/procedures/config-change/change_compaction/
The default compaction strategy is the Size-Tiered Compaction Strategy (STCS).
But is there a way to change it somehow, in the settings, such that each table that's created uses another compaction strategy by default?
Thanks.
The compaction strategy is a sub-property of each table's compaction configuration, so you will need to use the CQL ALTER TABLE command to choose a compaction strategy other than the default.
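For example, a minimal sketch of switching one table to LCS (the keyspace and table names here are hypothetical):

    ALTER TABLE my_keyspace.my_table
        WITH compaction = {'class': 'LeveledCompactionStrategy'};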
In almost all cases, the SizeTieredCompactionStrategy (STCS) is the right choice, and so it is the default. There are very limited cases where you would choose a different compaction strategy.
The most common situation where you would change it is a time-series use case, where TimeWindowCompactionStrategy (TWCS) is recommended. LeveledCompactionStrategy (LCS) is only valid for workloads where there are very few writes and your app is almost exclusively doing reads.
So unless you fit into these narrow use cases, STCS should be your choice of compaction strategy. Cheers!
I have these 2 particular use cases:
Streaming jobs, writing 30 MB every 5 seconds
Batch jobs, writing 500 GB every morning
The TTL of my tables is 1.5 years.
These writes can contain many updates, so, according to this table right here:
I should use the SizeTieredCompactionStrategy. However, how do I choose the correct parameters for it?
It has several parameters:
bucket_high
bucket_low
min_sstable_size
min_threshold
max_threshold
As a general proposition, it is very rare for operators to have to configure the size-tiered compaction sub-properties.
Unless you're very experienced with Cassandra, there just isn't any reason to reconfigure the defaults for STCS. That is why it is the default compaction strategy out-of-the-box and is suitable for the majority of workloads.
The exceptions are using TWCS for true time-series use cases and LCS for very read-heavy workloads with hardly any writes. Cheers!
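For reference, here is a sketch of what explicitly spelling out the STCS sub-properties looks like, using what I believe are the stock defaults (double-check the values against the docs for your version; the keyspace and table names are hypothetical):

    ALTER TABLE my_keyspace.my_table
        WITH compaction = {
            'class': 'SizeTieredCompactionStrategy',
            'min_threshold': 4,       -- min similar-sized SSTables needed to trigger a compaction
            'max_threshold': 32,      -- max SSTables compacted in one go
            'bucket_low': 0.5,        -- lower bound of the size-similarity bucket
            'bucket_high': 1.5,       -- upper bound of the size-similarity bucket
            'min_sstable_size': 50    -- in MB; SSTables smaller than this share one bucket
        };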
In my use case I have data that will be updated frequently during one day after its insertion and after that will be used rarely for reads only. What would be the best compaction strategy for that? TWCS or DTCS?
TWCS was created because DTCS requires a lot of tuning and operational care in order to achieve & maintain good performance. TWCS provides a similar level of performance to DTCS and is much easier to work with, so it is definitely the one to use for 99% of cases where time-series data is involved and there will be no inserts/updates after the first window.
Take a look at CASSANDRA-9666 for the details.
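As a rough sketch, enabling TWCS looks like this (the table name and window size are illustrative; a common rule of thumb is to size the window so the data's lifetime spans a few dozen windows):

    ALTER TABLE metrics.sensor_readings
        WITH compaction = {
            'class': 'TimeWindowCompactionStrategy',
            'compaction_window_unit': 'DAYS',
            'compaction_window_size': 1
        };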
TWCS is the best compaction strategy for time-series data if you have frequent updates and relatively few reads.
http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Consider a counter ColumnFamily. Since it holds only counters, I expect to see a large number of updates in this table.
Following http://www.datastax.com/dev/blog/when-to-use-leveled-compaction, I am considering using compaction = LeveledCompactionStrategy.
Is this a good idea? If yes, I would have expected counter ColumnFamilies to have compaction=LeveledCompactionStrategy by default, which seems not to be the case.
All tables (ColumnFamilies) are set to the size-tiered compaction strategy by default. There is no distinction based on the type. This is because choosing LCS requires thoughtful consideration. The URL you link to is the standard reference for determining whether or not you should enable LCS. In general, it is not a good idea to do so unless your storage is on SSDs or your use case is read-heavy.
As is mentioned in the article, you can test the impact of LCS using write-survey mode. I would urge you to make use of this feature before making the switch.
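If you do decide to go ahead after testing, a sketch of enabling LCS on the counter table might look like this (the table name and SSTable target size are illustrative; sstable_size_in_mb is optional and its default varies by version):

    ALTER TABLE my_keyspace.page_counters
        WITH compaction = {
            'class': 'LeveledCompactionStrategy',
            'sstable_size_in_mb': 160
        };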
I use Cassandra for gathering time-series measurements. To enable nice partitioning, besides device-id I added day-from-UTC-beginning and a bucket derived from the written measurement. The time is added as a clustering key. The final key can be written as
((device-id, day-from-UTC-beginning, bucket), measurement-uuid)
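Roughly, the table looks like this (the column names and types here are approximate):

    CREATE TABLE measurements (
        device_id uuid,
        day_from_utc int,
        bucket int,
        measurement_uuid timeuuid,
        payload blob,
        PRIMARY KEY ((device_id, day_from_utc, bucket), measurement_uuid)
    );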
Queries against this schema in the majority of cases take whole rows with the given device-id and day-from-UTC-beginning, using IN for buckets. Because of this query pattern, Leveled Compaction looked like a perfect match, as it ensures with high probability that a row is held by a single SSTable.
Running incremental repair was fine when appending to the table was disabled. Once the repair was run under write pressure, lots of streaming was involved. It looked like more data was streamed than had been appended since the last repair.
I've tried using multiple tables, one for each day. When a day ended and no further writes were made to a given table, repair ran smoothly. I'm aware of the overhead of thousands of tables, though it looks like the only feasible solution.
What's the correct way of combining Leveled Compaction with incremental repairs under heavy write scenario?
Leveled Compaction is not a good idea when you have a write-heavy workload. It is better for a mixed read/write workload where read latency matters. Also, if your cluster is already pressed for I/O, switching to leveled compaction will almost certainly only worsen the problem. So ensure you have SSDs.
At this time, size-tiered is the better choice for a write-heavy workload. There are some improvements in 2.1 for this, though.
I have Cassandra table
CREATE TABLE schema1 (
key bigint,
lowerbound bigint,
upperbound bigint,
data blob,
PRIMARY KEY (key, lowerbound, upperbound)
) WITH COMPACT STORAGE;
I want to perform a range query by using CQL
SELECT lowerbound, upperbound FROM schema1 WHERE key = (some key) AND lowerbound <= 123 ORDER BY lowerbound DESC LIMIT 1 ALLOW FILTERING;
Any suggestion regarding the compaction strategy, please?
Note: my read:write ratio is 1:1.
Size-tiered compaction is the default, and should be appropriate for most use-cases. In 2012 DataStax posted an article titled When To Use Leveled Compaction, in which it specified three (main) conditions for which leveled compaction was a good idea:
High Sensitivity to Read Latency (your queries need to meet a latency SLA in the 99th percentile).
High Read/Write Ratio
Rows Are Frequently Updated
It also identifies three scenarios when leveled compaction is not a good idea:
Your Disks Can’t Handle the Compaction I/O
Write-heavy Workloads
Rows Are Write-Once
Note how none of the six scenarios I mentioned above are specific to range queries.
My question would be "what problem are you trying to fix?" You mentioned "performing better," but I have found that query performance issues tend to be more tied to data model design. Switching the compaction strategy isn't going to help much if you're running with an inefficient primary key strategy. By virtue of the fact that your query requires ALLOW FILTERING, I would say that changing compaction strategy isn't going to help much.
The DataStax docs contain a section on Slicing over partition rows, which appears to be somewhat similar to your query. Give it a look and see if it helps.
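For what it's worth, a single-partition slice on the first clustering column in the spirit of that doc section might look like the following (using a bind marker instead of a literal key; depending on your Cassandra version this form may not require ALLOW FILTERING at all):

    SELECT lowerbound, upperbound
    FROM schema1
    WHERE key = ?
      AND lowerbound <= 123
    ORDER BY lowerbound DESC
    LIMIT 1;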
Leveled compaction will mean fewer SSTables are involved in your queries on a key, but it requires extra I/O. Also, during compaction it uses about 10% more disk than the data, while for size-tiered compaction you need double. Which is better depends on your setup, queries, etc. Are you experiencing performance problems? If not, and if I could deal with the extra I/O, I might choose leveled, as it means I don't have to keep 50+% of headroom in terms of disk space for compaction. But again, there's no "one right way".
Perhaps read this:
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
When Rows Are Frequently Updated
From the DataStax article:
Whether you’re dealing with skinny rows where columns are overwritten frequently (like a “last access” timestamp in a Users column family) or wide rows where new columns are constantly added, when you update a row with size-tiered compaction, it will be spread across multiple SSTables. Leveled compaction, on the other hand, keeps the number of SSTables that the row is spread across very low, even with frequent row updates.