Cassandra constant tombstone compaction of table

I have a couple of Cassandra tables on which tombstone compaction is constantly being run and I believe this is the reason behind high CPU usage by the Cassandra process.
Settings I have include:
compaction = {'tombstone_threshold': '0.01', 'tombstone_compaction_interval': '1', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND default_time_to_live = 1728000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
I write to one of the tables every minute. Because of the TTL that is set, a whole set of rows also expires every minute.
Is the constant compaction due to the low tombstone_threshold and tombstone_compaction_interval?
Can someone give a detailed explanation of tombstone_threshold and tombstone_compaction_interval? The Cassandra documentation doesn't explain them very well.

A tombstone compaction can fire once an SSTable is at least as old as tombstone_compaction_interval (in seconds; the default is 86400, one day). New SSTables are created constantly as data is flushed and compacted. tombstone_threshold is the fraction of an SSTable that must be droppable tombstones before Cassandra compacts that single SSTable just to purge tombstones, instead of merging it with other SSTables (the default is 0.2). With 0.01 and 1, nearly every SSTable qualifies almost as soon as it is written, so those settings are very likely behind the constant compaction.
It looks like you are using leveled compaction with a 20 day TTL. You will be doing a ton of regular compactions as well as tombstone compactions just to keep up. Of the default compaction strategies, leveled is the best at making sure old tombstones don't eat up disk space.
If this is time-series data, which it sounds like it is, you may want to consider TWCS (TimeWindowCompactionStrategy) instead. It creates time "buckets", each of which ends up as a single SSTable once compacted, so once the TTL for all the data in that SSTable expires the compactor can drop the whole SSTable, which is much more efficient.
For 2.1, TWCS is available as a jar that you need to add to the classpath; we currently use it in production. It has also been added to the 3.x series of Cassandra.
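To make the two options concrete, here is a minimal Python sketch of the eligibility check described above. It is a simplified model, not Cassandra's actual code: the function name and the example SSTable are hypothetical, and the real check also looks at things such as overlapping SSTables. The default values (0.2 and 86400 seconds) are the documented defaults.

```python
def eligible_for_tombstone_compaction(droppable_ratio, created_at, now,
                                      tombstone_threshold=0.2,
                                      tombstone_compaction_interval=86400):
    """Simplified model: an SSTable qualifies for a single-SSTable tombstone
    compaction when it is old enough AND its estimated ratio of droppable
    tombstones exceeds the threshold."""
    old_enough = (now - created_at) >= tombstone_compaction_interval
    return old_enough and droppable_ratio > tombstone_threshold


now = 1_000_000  # arbitrary "current time" in seconds
# Hypothetical SSTable: written 10 minutes ago, 5% droppable tombstones.
droppable_ratio, created_at = 0.05, now - 600

with_defaults = eligible_for_tombstone_compaction(droppable_ratio, created_at, now)
with_question_settings = eligible_for_tombstone_compaction(
    droppable_ratio, created_at, now,
    tombstone_threshold=0.01, tombstone_compaction_interval=1)

ttl_days = 1728000 / 86400  # default_time_to_live from the question, in days

print(with_defaults, with_question_settings, ttl_days)
```

Under these assumptions the same SSTable is left alone with the defaults but is picked up immediately with the settings from the question.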

Related

Does Cassandra track the number of deletions in SSTables to trigger a compaction?

I wonder whether Cassandra triggers a compaction (STCS or LCS) based on the number of deletions in SSTables. In LCS, as far as I know, Cassandra compacts SSTables into the next level only when a level is full. But a deletion record is usually small, so if only SSTable size is considered when deciding whether a level is full, it may take a long time for a tombstone to be reclaimed.
I know RocksDB triggers compaction based on the number of deletions in SSTables, which helps to reduce tombstones.
Yes, Cassandra's compaction can be triggered by the number of deletions (a.k.a. tombstones).
Have a look at the common options for all the compaction strategies, and specifically this parameter:
tombstone_threshold
How much of the sstable should be tombstones for us to consider doing a single sstable compaction of that sstable.
See doc here: https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/index.html

Read latency in Cassandra cluster - too many SSTables

We are facing read latency issues on our Cassandra cluster. One of the reasons I have read about is too many SSTables being used per read query. According to documents available online, 1-3 SSTables should be queried at the 99th percentile of read queries. However, in our case up to 20 SSTables are used.
(I have already worked on tuning other parameters like read-ahead, concurrent-read threads etc)
Here is the output of the nodetool tablehistograms command for one of the tables.
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         10.00     51.01          43388.63      179             3
75%         14.00     73.46          62479.63      642             12
95%         17.00     126.93         107964.79     5722            124
98%         20.00     152.32         129557.75     14237           310
99%         20.00     182.79         129557.75     24601           535
Min         0.00      14.24          51.01         51              0
Max         24.00     74975.55       268650.95     14530764        263210
At first I thought compaction might be lagging, but that is not the case: there are always 0 pending tasks in the output of the nodetool compactionstats command. I increased the compaction throughput and concurrent compactors just to be on the safe side.
CPU usage, memory usage, and disk IO/IOPS are under control.
We are using the default compaction strategy. Here is the table metadata.
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 7776000
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Also, according to the compaction history, compaction happens once a day on some tables and once every 3 days on another.
It looks like the SSTable sizes are not similar enough for compaction to be triggered.
Can you please suggest what can be done to reduce the number of SSTables?
You can make compaction a bit more aggressive by changing the min_threshold parameter of the compaction settings. In the default configuration, Cassandra waits until at least 4 files of similar size are available and only then triggers a compaction. Start with 3; maybe you can lower it to 2, but you really need to track resource consumption so that compaction doesn't add a lot of overhead.
Check this document from the DataStax field team, who did a lot of tuning for DataStax customers.
You can try moving to the leveled compaction strategy (LCS); it's a better fit if you have lots of updates. Another option is to force a major compaction.
In ScyllaDB we have an incremental compaction strategy which combines the best of size-tiered and leveled compaction.
You need to make sure that your queries retrieve data from a single partition; otherwise they will significantly affect the performance of your cluster.
If your queries target only one partition but still need to read from 20 SSTables, it indicates to me that you are constantly inserting into or updating partitions, so the data gets fragmented across multiple SSTables and Cassandra has to retrieve all the fragments and coalesce them to return the results to the client.
If the SSTables are very small (only a few kilobytes), there's a good chance that your cluster is overloaded with writes and nodes are constantly flushing memtables to disk, so you end up with tiny SSTables.
If the SSTables are not getting compacted together, it means that the file sizes differ widely. By default, SSTables get merged together if their sizes are within 0.5-1.5x the average size of the SSTables in the bucket.
Alex Ott's suggestion to reduce min_threshold to 2 will help speed up compactions. But you really need to address the underlying issue. Don't be tempted to run nodetool compact without understanding the consequences and tradeoffs, as I've discussed in this post. Cheers!
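The bucketing rule described in these answers can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's implementation: the function names and the SSTable sizes are hypothetical, and the min_sstable_size option is ignored. bucket_low, bucket_high and min_threshold use the defaults mentioned above (0.5, 1.5 and 4).

```python
def stcs_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size: a size joins a
    bucket when it lies within bucket_low..bucket_high times the bucket's
    current average, otherwise it starts a new bucket."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low < size < avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    """Buckets holding at least min_threshold SSTables can be compacted."""
    return [b for b in stcs_buckets(sizes) if len(b) >= min_threshold]


# Hypothetical SSTable sizes in MB with widely different sizes.
sizes_mb = [10, 11, 12, 100, 105, 400, 1500]

buckets = stcs_buckets(sizes_mb)
for threshold in (4, 3, 2):
    print(threshold, compaction_candidates(sizes_mb, min_threshold=threshold))
```

With these made-up sizes nothing is eligible at the default min_threshold of 4, one bucket becomes eligible at 3, and two at 2, which is why lowering min_threshold makes compaction more aggressive when file sizes differ widely.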

New cassandra node should finish compaction before joining ring

I’d like to know if there is a way to have a Cassandra node join the ring only after it has finished streaming and compaction. When I add a node to my cluster, it streams data from the other nodes and then joins the ring. At that point it begins a lot of compactions, which take a very long time to complete (more than a day). During this time CPU utilization on that node is nearly 100%, and the bloom filter false positive ratio is very high as well, which is relevant to my use case. This causes the whole cluster to experience an increase in read latency, with the newly joined node in particular having 10x the typical read latency.
I read this post http://www.datastax.com/dev/blog/bootstrapping-performance-improvements-for-leveled-compaction which has this snippet about one way to possibly improve read latency when adding a node.
“Operators usually avoid this issue by passing -Dcassandra.join_ring=false when starting the new node and wait for the bootstrap to finish along with the followup compactions before putting the node into the ring manually with nodetool join”
The documentation on the join_ring option is pretty limited but after experimenting with it it seems that streaming data and the later compaction can’t be initiated until after I run nodetool join for the new host, so I’d like to know how or if this can be achieved.
Right now my use case is just deduplicating records processed by a Kafka consumer application. The table in Cassandra is very simple, just a primary key, and the queries only insert new keys with a TTL of several days and check for the existence of a key. The cluster needs to perform 50k reads and 50k writes per second at peak traffic.
I’m running Cassandra 3.7. My cluster is in EC2, originally on 18 m3.2xlarge hosts. Those hosts were running at very high (~90%) CPU utilization during compactions, which was the impetus for trying to add new nodes to the cluster. I’ve since switched to c3.4xlarge to get more CPU without actually adding hosts, but it would be helpful to know at what CPU threshold I should add new hosts, since waiting until 90% is clearly not safe and adding new hosts exacerbates the CPU issue on the newly added host.
CREATE TABLE dedupe_hashes (
app int,
hash_value blob,
PRIMARY KEY ((app, hash_value))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '90PERCENTILE';
One thing to check that may help avoid the 100% CPU due to compactions is the compaction_throughput_mb_per_sec setting.
The default is 16 MB/s, but check whether throttling has been disabled (set to 0):
nodetool getcompactionthroughput
You can also set the value dynamically:
nodetool setcompactionthroughput 16
(Set it in cassandra.yaml as well so that it persists after a restart.)
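Throttling trades CPU and I/O pressure for time, so it can help to estimate how long a throttled compaction backlog would take. The Python sketch below is back-of-the-envelope arithmetic only: the function name and the 500 GB figure are hypothetical, 1 GB is taken as 1024 MB, and the result is a lower bound that ignores everything except the throughput cap.

```python
def min_compaction_hours(data_gb, throughput_mb_per_sec):
    """Lower bound on the time needed to rewrite data_gb of SSTables under
    a compaction throughput cap. Returns None when throttling is disabled
    (0), because the cap then says nothing about the duration."""
    if throughput_mb_per_sec == 0:
        return None
    seconds = data_gb * 1024 / throughput_mb_per_sec
    return seconds / 3600


# Hypothetical node that has to compact 500 GB of streamed data.
at_default = min_compaction_hours(500, 16)
at_double = min_compaction_hours(500, 32)
unthrottled = min_compaction_hours(500, 0)

print(at_default, at_double, unthrottled)
```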

Cassandra low read performance with high SSTable count

I am building an application which processes a very large amount of data (more than 3 million records). I am new to Cassandra and I am using a 5-node Cassandra cluster to store the data. I have two column families:
Table 1 : CREATE TABLE keyspace.table1 (
partkey1 text,
partkey2 text,
clusterKey1 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey1)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Table 2 : CREATE TABLE keyspace.table2 (
partkey1 text,
partkey2 text,
clusterKey2 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey2)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Note: clusterKey1 and clusterKey2 are randomly generated UUIDs.
My concern is with the output of nodetool cfstats.
I am getting good throughput on Table1 with stats :
SSTable count: 2
Space used (total): 365189326
Space used by snapshots (total): 435017220
SSTable Compression Ratio: 0.2578485727722293
Memtable cell count: 18590
Memtable data size: 3552535
Memtable switch count: 171
Local read count: 0
Local read latency: NaN ms
Local write count: 2683167
Local write latency: 1.969 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 352
whereas for table2 I am getting very bad read performance, with stats:
SSTable count: 33
Space used (live): 212702420
Space used (total): 212702420
Space used by snapshots (total): 262252347
SSTable Compression Ratio: 0.1686948750752438
Memtable cell count: 40240
Memtable data size: 24047027
Memtable switch count: 89
Local read count: 24027
Local read latency: 0.580 ms
Local write count: 1075147
Local write latency: 0.046 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 688
I was wondering why table2 is creating 33 SSTables and why is the read performance very low in it. Can anyone help me figure out what I am doing wrong here?
This is how I query the table :
BoundStatement selectStamt;
if (selectStamt == null) {
    PreparedStatement prprdStmnt = session
            .prepare("select * from table2 where clusterKey1 = ? and partkey1=? and partkey2=?");
    selectStamt = new BoundStatement(prprdStmnt);
}
synchronized (selectStamt) {
    res = session.execute(selectStamt.bind("clusterKey", "partkey1", "partkey2"));
}
In another thread, I am doing some update operations on this table, on different data, in the same way.
To measure throughput, I count the number of records processed per second, and it is processing only 50-80 records.
When you have a lot of SSTables, the distribution of your data among those SSTables is very important. Since you are using SizeTieredCompactionStrategy, SSTables get compacted and merged roughly when there are 4 SSTables of similar size.
If you are updating data within the same partition frequently and at different times, it's likely your data is spread across many SSTables which is going to degrade performance as there will be multiple reads of your SSTables.
In my opinion, the best way to confirm this is to execute cfhistograms on your table:
nodetool -h localhost cfhistograms keyspace table2
Depending on the version of cassandra you have installed, the output will be different, but it will include a histogram of number of SSTables read for a given read operation.
If you are updating data within the same partition frequently and at different times, you could consider using LeveledCompactionStrategy (When to use Leveled Compaction). LCS will keep data from the same partition together in the same SSTable within a level which greatly improves read performance, at the cost of more Disk I/O doing compaction. In my experience, the extra compaction disk I/O more than pays off in read performance if you have a high ratio of reads to writes.
EDIT: With regards to your question about your throughput concerns, there are a number of things that are limiting your throughput.
A possible big issue is that, unless you have many threads making that same query at a time, you are making your requests serially (one at a time). This severely limits your throughput, as another request cannot be sent until you get a response from Cassandra. Also, since you are synchronizing on selectStamt, even if this code were executed by multiple threads, only one request could be in flight at a time anyway. You can dramatically improve throughput by having multiple worker threads make the requests for you (if you aren't already doing this) or, even better, use executeAsync to execute many requests asynchronously. See Asynchronous queries with the Java driver for an explanation of how the request flow works in the driver and how to use the driver effectively to make many queries.
If you are executing this same code each time you make a query, you are creating an extra round trip by calling session.prepare each time to create your PreparedStatement. session.prepare sends a request to Cassandra to prepare your statement. You only need to do this once, and you can reuse the PreparedStatement each time you make a query. You may be doing this already, given your null check on the statement (I can't tell without more code).
Instead of reusing selectStamt and synchronizing on it, just create a new BoundStatement from the single PreparedStatement each time you make a query. This way no synchronization is needed at all.
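The pattern in this edit (prepare once, bind a fresh statement per request, keep many requests in flight) can be illustrated without a cluster. The Python sketch below is only a model: FakeSession and FakePreparedStatement are hypothetical stand-ins rather than any driver's API, the 10 ms sleep simulates a network round trip, and the query text and keys are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakePreparedStatement:
    """Hypothetical stand-in for a driver's prepared statement."""

    def __init__(self, cql):
        self.cql = cql

    def bind(self, *values):
        # Every call returns a new, independent bound statement,
        # so callers never need to synchronize on shared state.
        return (self.cql, values)


class FakeSession:
    """Hypothetical stand-in for a driver session."""

    def __init__(self, latency=0.01):
        self.latency = latency
        self.prepare_calls = 0

    def prepare(self, cql):
        self.prepare_calls += 1  # a real driver pays a round trip here
        return FakePreparedStatement(cql)

    def execute(self, bound):
        time.sleep(self.latency)  # simulated network round trip
        return bound[1]


session = FakeSession()
# Prepare exactly once, outside the request path.
prepared = session.prepare(
    "select * from table2 where partkey1 = ? and partkey2 = ? and clusterKey2 = ?")

keys = [("pk1-%d" % i, "pk2-%d" % i, "ck-%d" % i) for i in range(50)]


def run_query(key):
    # Bind a fresh statement per request; nothing is shared or locked.
    return session.execute(prepared.bind(*key))


# Ten requests in flight at a time instead of one.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_query, keys))

print(len(results), session.prepare_calls)
```

With a real driver the asynchronous API would normally replace the thread pool, but the shape is the same: one prepared statement, many independent bound statements.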
Aside from switching compaction strategies (this is expensive; you will compact hard for a while after the change), which as Andy suggests will certainly help your read performance, you can also tune your current compaction strategy to try to get rid of some of the fragmentation:
1. If you have pending compactions (nodetool compactionstats), try to catch up by raising the compaction throughput limit. Keep concurrent compactors to half of your CPU cores to keep compaction from hogging all your cores.
2. Increase the bucket size (raise bucket_high, lower bucket_low). This dictates how similar SSTables have to be in size to be compacted together.
3. Lower the compaction threshold (min_threshold). This dictates how many SSTables must fit in a bucket before compaction occurs.
For details on 2 and 3, check out the compaction subproperties.
Note: do not use nodetool compact. This will put the whole table in one huge SSTable and you'll lose the benefits of compacting slices at a time.
In case of emergency, use JMX (force user defined compaction) to force minor compactions.
You have many SSTables and slow reads. The first thing you should do is find out how many SSTables are read per SELECT.
The easiest way is to inspect the corresponding MBean: in the MBean domain "org.apache.cassandra.metrics" you find your keyspace, below it your table, and then the SSTablesPerReadHistogram MBean. Cassandra records min, max, mean and also percentiles.
A very good value for the 99th percentile in SSTablesPerReadHistogram is 1, which means you normally read from only a single SSTable. If the number is about as high as the number of SSTables, Cassandra is inspecting all SSTables. In the latter case you should double-check your SELECT to see whether you are selecting on the whole primary key or not.
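What the SSTables-per-read number measures can be shown with a toy model in Python. Everything here is hypothetical: each set stands for the partition keys present in one SSTable, and the function simply counts how many SSTables hold a fragment of the partition being read. It ignores bloom filter false positives and everything else a real read path does.

```python
def sstables_touched(sstables, partition_key):
    """Count the SSTables that hold a fragment of the given partition."""
    return sum(1 for keys in sstables if partition_key in keys)


# Hypothetical contents: partition "a" was updated across three flushes,
# while "b" and "e" were each written once.
sstables = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"e"}]

per_read = {key: sstables_touched(sstables, key) for key in ("a", "b", "e")}
print(per_read)
```

A partition that is updated repeatedly over time ends up spread over several SSTables, which is the situation the histogram reveals.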

Cassandra control SSTable size

Is there a way I could control max size of a SSTable, for example 100 MB so that when there is actually more than 100MB of data for a CF, then Cassandra creates next SSTable?
Unfortunately the answer is not so simple. The sizes of your SSTables are influenced by your compaction strategy, and there is no direct way to control the maximum SSTable size.
SSTables are initially created when memtables are flushed to disk. Their initial size depends on your memtable settings and the size of your heap (memtable_total_space_in_mb being a large influence). Typically these SSTables are pretty small. SSTables get merged together as part of a process called compaction.
If you use Size-Tiered Compaction Strategy (STCS), you can end up with really large SSTables. In a minor compaction, STCS combines SSTables when there are at least min_threshold (default 4) SSTables of similar size, merging them into one file, expiring data and merging keys. This can create very large SSTables after a while.
With Leveled Compaction Strategy (LCS) there is an sstable_size_in_mb option that controls a target size for SSTables. In general SSTables will be less than or equal to this size unless you have a partition key with a lot of data ('wide rows').
I haven't experimented much with Date-Tiered Compaction Strategy (DTCS) yet, but it works similarly to STCS in that it merges files of similar size, while keeping data together in time order, and it has a setting to stop compacting old data (max_sstable_age_days), which could be interesting.
The key is to find the compaction strategy which works best for your data and then tune the properties around what works best for your data model / environment.
You can read more about the configuration settings for compaction here and read this guide to help understand whether STCS or LCS is appropriate for you.
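For the LCS case, a rough Python estimate shows how sstable_size_in_mb relates to the number of SSTables and levels. This is a sketch under stated assumptions: the function name and the 100 GB data size are hypothetical, the default target size is taken as 160 MB, each level is assumed to hold ten times as many SSTables as the previous one starting with 10 in L1, and L0 and wide partitions are ignored.

```python
import math


def lcs_estimate(data_mb, sstable_size_in_mb=160, fanout=10):
    """Estimate how many SSTables of the target size are needed for
    data_mb of data, and how many levels they would fill."""
    count = math.ceil(data_mb / sstable_size_in_mb)
    levels, capacity = 0, 0
    while capacity < count:
        levels += 1
        capacity += fanout ** levels  # L1 holds 10, L2 holds 100, ...
    return count, levels


# Hypothetical table holding 100 GB per node.
default_target = lcs_estimate(100 * 1024)
asker_target = lcs_estimate(100 * 1024, sstable_size_in_mb=100)

print(default_target, asker_target)
```

A smaller target size gives more, smaller SSTables for the same data, which is the closest thing to the 100 MB cap asked about in the question.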
