New cassandra node should finish compaction before joining ring - cassandra

I’d like to know if there is a way to have a Cassandra node join the ring only after it has finished streaming and compaction. The issue I’m experiencing is that when I add a node to my cluster, it streams data from the other nodes then joins the ring, at this point it begins a lot of compactions, and the compactions take a very long time to complete (greater than a day), during this time CPU utilization on that node is nearly 100%, and bloom filter false positive ratio is very high as well which happens to be relevant to my use case. This causes the whole cluster to experience an increase in read latency, with the newly joined node in particular having 10x the typical latency for reads.
I read this post http://www.datastax.com/dev/blog/bootstrapping-performance-improvements-for-leveled-compaction which has this snippet about one way to possibly improve read latency when adding a node.
“Operators usually avoid this issue by passing -Dcassandra.join_ring=false when starting the new node and wait for the bootstrap to finish along with the followup compactions before putting the node into the ring manually with nodetool join”
The documentation on the join_ring option is pretty limited but after experimenting with it it seems that streaming data and the later compaction can’t be initiated until after I run nodetool join for the new host, so I’d like to know how or if this can be achieved.
Right now my use case is just deduping records being processed by a kafka consumer application. The table in cassandra is very simple, just a primary key, and the queries are just inserting new keys with a ttl of several days and checking existence of a key. The cluster needs to perform 50k reads and 50k writes per second at peak traffic.
I’m running cassandra 3.7 My cluster is in EC2 originally on 18 m3.2xlarge hosts. Those hosts were running at very high (~90%) CPU utilization during compactions which was the impetus for trying to add new nodes to the cluster, I’ve since switched to c3.4xlarge to give more CPU without having to actually add hosts, but it’d be helpful to know at what CPU threshold I should be adding new hosts since waiting until 90% is clearly not safe, and adding new hosts exacerbates the CPU issue on the newly added host.
CREATE TABLE dedupe_hashes (
app int,
hash_value blob,
PRIMARY KEY ((app, hash_value))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '90PERCENTILE';

One thing to check that may help avoid the 100% CPU due to compactions is the setting for "compaction_throughput_mb_per_sec".
By default it should be 16MB but check if you have this disabled (set to 0).
nodetool getcompactionthroughput
You can also set the value dynamically
nodetool setcompactionthroughput 16
(set in cassandra.yaml to persist after restart).

Related

Read latency in Cassandra cluster - too many SSTables

We are facing read latency issues on our Cassandra cluster. One of the reason, I read about, is too many SSTables used in read query. As per documents available online, 1-3 SSTables should be queried for 99%ile read queries. However in my case, we are using upto 20 SSTables.
(I have already worked on tuning other parameters like read-ahead, concurrent-read threads etc)
Here is the output of tablehistogram command for one of the table.
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 51.01 43388.63 179 3
75% 14.00 73.46 62479.63 642 12
95% 17.00 126.93 107964.79 5722 124
98% 20.00 152.32 129557.75 14237 310
99% 20.00 182.79 129557.75 24601 535
Min 0.00 14.24 51.01 51 0
Max 24.00 74975.55 268650.95 14530764 263210
First, I thought maybe compaction is lagging, but that is not the case. I checked and there are always 0 pending tasks in the output of compactionstatus command. I increased the compaction throughput and concurrent compactors just to be on the safer side.
CPU usage, memory usage, and disk IO/IOPS are under control.
We are using the default compaction strategy. Here are the table metadata.
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 7776000
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Also, as per compaction history, I can see compaction happening on some tables once in a day, once in 3 days for another table.
Looks like, the SSTable size is not matching to perform the compaction.
Can you please suggest what can be done here to reduce the number of SSTables?
You can make compaction a bit more aggressive by changing min_threshold parameter of the compaction setting. In the default configuration it's waiting until there are at least 4 files of similar size available, and only after that, trigger compaction. Start with 3, maybe you can lower it to 2, but you really need to track resource consumption so compaction won't add a lot of overhead.
Check this document from the DataStax field team who did a lot of tuning for DataStax customers.
You can try to move to level compaction strategy, it's a better one if you have lots of updates. Another option is to force a major compaction.
In ScyllaDB we have incremental compaction strategy which combines the best of size tiered and level triggered.
You need to make sure that your queries are retrieving data from a single partition otherwise they will significantly affect the performance of your cluster.
If your queries target one partition only but still need to retrieve data from 20 SSTables, it indicates to me that you are constantly inserting/updating partitions and the data gets fragmented across multiple SSTables so Cassandra has to retrieve all the fragments and coalesce them to return the results to the client.
If the size of the SSTables are very small (only a few kilobytes) then there's a good chance that your cluster is getting overloaded with writes and nodes are constantly flushing memtables to disk so you end up with tiny SSTables.
If the SSTables are not getting compacted together, it means that the file sizes are widely different. By default, SSTables get merged together if their size in kilobytes are within 0.5-1.5x the average size of SSTables.
Alex Ott's suggestion to reduce min_threshold to 2 will help speed up compactions. But you really need to address the underlying issue. Don't be tempted to run nodetool compact without understanding the consequences and tradeoffs as I've discussed in this post. Cheers!

Cassandra constant tombstone compaction of table

I have a couple of Cassandra tables on which tombstone compaction is constantly being run and I believe this is the reason behind high CPU usage by the Cassandra process.
Settings I have include:
compaction = {'tombstone_threshold': '0.01',
'tombstone_compaction_interval': '1', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
default_time_to_live = 1728000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
In one of the tables I write data to it every minute. Because of the TTL that is set, a whole set of rows expire every minute too.
Is the constant compaction due to the low tombstone_threshold and tombstone_compaction_interval ?
Can someone give a detailed explanation of tombstone_threshold and tombstone_compaction_interval. The Cassandra document doesn't explain it too well.
So the tombstone compaction can fire assuming the SSTable is as least as old as the compaction interval. SStables are created as things are compacted. the threshold is how much of the sstable is tombstones before compacting just for tombstones instead of joining sstables.
You are using leveled and have a 20 day ttl it looks like. You will be doing a ton of compactions as well as tombstone compactions just to keep up. Leveled will be the best to make sure you don't have old tombstone eating up disk space of the default compactors.
If this data is time-series which is sounds like it is you may want to consider using TWCS instead. This will create "buckets" which are each an sstable once compacted so once the ttl for the data in that table expires the compactor can drop the whole sstable which is much more efficient.
TWCS is available as a jar you need to add to the classpath for 2.1 and we use it currently in production. It has been added in the 3.x series of Cassandra as well.

Cassandra taking too much memory for the data being written

We are using Apache Cassandra 3.0.7 version and off late we see that 90% of memory is occupied on almost all nodes, even though disk is hardly used. We have a cluster of 5 nodes with 15 GB memory, 4 cores, 200 GB SSD each.
We tried all kind of configurations through both YAML as well as table level properties but none seem to help. Memory usage constantly increases almost in direct proportion to data.
Considering the fact that our application is a write-intensive one, we are okay with reduced read performance but would like to utilize as less memory as possible. To do this, our idea was to disable all caches possible or avoid keeping anything not-necessary in memory. But nothing so far seem to help.
​Here's our yaml: http://pastebin.com/yeRGcHRt
and here's our table configuration:
CREATE KEYSPACE if not exists test_ks WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '1'}; CREATE TABLE if not
exists test_ks.test_cf (id bigint PRIMARY KEY,key_str text,value1
int,value2 int,update_ts bigint) WITH bloom_filter_fp_chance = 1 AND
comment = '' AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'} AND compression =
{'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance
= 1.0 AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 10240 AND
memtable_flush_period_in_ms = 3600000 AND min_index_interval = 10240
AND read_repair_chance = 0.0 AND speculative_retry = '99PERCENTILE'
AND caching = {'keys': 'NONE', 'rows_per_partition': 'NONE'};
We have seen that most of the consumption is on off-heap, heap memory is capped at 4.5 G. So out of total 14 G on a node, only 4.5G is consumed by heap.
Has anyone tried such configuration before? Please let us know if disabling cache would help us in this situation. And if yes, how we can we disable cache completely. Looking forward to your help.
We are experiencing a similar problem. After upgrading from Cassandra 2.x to 3.11.0, Cassandra is using <2GB on-heap and >10GB off-heap, on a use case that didn't have any problems before. This results in the (Windows) machine staying pegged at 99.5% memory usage continually. Heap memory is similarly capped at 2GB.
Most caching values are left to the defaults; in particular the row cache is disabled.
EDIT: I have a better answer. It appears (still testing) that the slowness in our case was because Windows' page file was not disabled. Cassandra recommends disabling the swap file on Linux or the page file on Windows. It also outputs a warning on startup if a swap or page file is detected.
Cassandra's off-heap memory, at least on Windows, is mostly due to memory-mapped IO of files, which is apparently (from reading the Cassandra issue tracker) significantly faster. However, if a swap/page file is enabled, things are forced out of physical memory by mmapped files and experience a huge slowdown swapping to disk. Disabling the page file on Windows in our testing appears to mitigate this significantly. Cassandra is still using lots of memory for mmapped files, but as no memory is being swapped to disk, some combination of Cassandra and the OS properly free up the mmapped files so that other processes can run smoothly. I used this tool to confirm the presence of mmapped files on Windows.
Try set -XX:MaxDirectMemorySize. It will limit the use of off-heap memory
To decrease used memory try to set next parameters
MAX_HEAP_SIZE, HEAP_NEWSIZE
in cassandra-env.sh to values you want

Cassandra timeout during read query at consistency LOCAL_ONE

I have a single-node cassandra cluster, 32 cores CPU, 32GB memory and RAID of 3 SSDs, totally around 2.5TB. and i also have another host with 32 cores and 32GB memory, on which i run a Apache Spark.
I have a huge history data in cassandra, maybe 600GB. There're approx more than 1 million new records every day which come from Kafka. And I need to query these new rows every day. But Cassandra failed. I'm confused.
My scheme of Cassandra table is:
CREATE TABLE rainbow.activate (
rowkey text,
qualifier text,
act_date text,
info text,
log_time text,
PRIMARY KEY (rowkey, qualifier)
) WITH CLUSTERING ORDER BY (qualifier ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activate_act_date_idx ON rainbow.activate (act_date);
CREATE INDEX activate_log_time_idx ON rainbow.activate (log_time);
cause the source data maybe contains some duplicative data, so i need to use a primary key to drop the duplicative records. there're two index on this table, the act_date is a date string like '20151211', the log_time is a datetime string like '201512111452', that is the log_time separates records more finer.
if i select records using log_time, cassandra works. but it fails using act_date.
at the first, spark job exit with an error:
java.io.IOException: Exception during execution of SELECT "rowkey", "qualifier", "info" FROM "rainbow"."activate" WHERE token("rowkey") > ? AND token("rowkey") <= ? AND log_time = ? ALLOW FILTERING: All host(s) tried for query failed (tried: noah-cass01/192.168.1.124:9042 (com.datastax.driver.core.OperationTimedOutException: [noah-cass01/192.168.1.124:9042] Operation timed out))
i try to increase the spark.cassandra.read.timeout_ms to 60000. But the job post another error as follow:
java.io.IOException: Exception during execution of SELECT "rowkey", "qualifier", "info" FROM "rainbow"."activate" WHERE token("rowkey") > ? AND token("rowkey") <= ? AND act_date = ? ALLOW FILTERING: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
i don't know how to solve this problem, i read the docs on spark-cassandra-connector but i don't find any tips.
so would you like to give some advise to help me solve this problem.
thanks very much!
Sounds like an unusual setup. If you have two machines it would be more efficient to configure Cassandra as two nodes and run Spark on both nodes. That would spread the data load and you'd generate a lot less traffic between the two machines.
Ingesting so much data every day and then querying arbitrary ranges of it sounds like a ticking time bomb. When you start getting frequent time out errors it is usually a sign of an inefficient schema where Cassandra cannot do what you are asking in an efficient way.
I don't see the specific cause of the problem, but I'd consider adding another field to the partition key, such as the day so that you could restrict your queries to a smaller subset of your data.

Cassandra low read performance with high SSTable count

I am building an application which process very large data(more that 3 million).I am new to cassandra and I am using 5 node cassandra cluster to store data. I have two column families
Table 1 : CREATE TABLE keyspace.table1 (
partkey1 text,
partkey2 text,
clusterKey text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey1)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Table 2 : CREATE TABLE keyspace.table2 (
partkey1 text,
partkey2 text,
clusterKey2 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey2)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
note : clusterKey1 and clusterKey2 are randomly generated UUID's
My concern is on nodetool cfstats
I am getting good throughput on Table1 with stats :
SSTable count: 2
Space used (total): 365189326
Space used by snapshots (total): 435017220
SSTable Compression Ratio: 0.2578485727722293
Memtable cell count: 18590
Memtable data size: 3552535
Memtable switch count: 171
Local read count: 0
Local read latency: NaN ms
Local write count: 2683167
Local write latency: 1.969 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 352
where as for table2 I am getting very bad read performance with stats :
SSTable count: 33
Space used (live): 212702420
Space used (total): 212702420
Space used by snapshots (total): 262252347
SSTable Compression Ratio: 0.1686948750752438
Memtable cell count: 40240
Memtable data size: 24047027
Memtable switch count: 89
Local read count: 24027
Local read latency: 0.580 ms
Local write count: 1075147
Local write latency: 0.046 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 688
I was wondering why table2 is creating 33 SSTables and why is the read performance very low in it. Can anyone help me figure out what I am doing wrong here?
This is how I query the table :
BoundStatement selectStamt;
if (selectStamt == null) {
PreparedStatement prprdStmnt = session
.prepare("select * from table2 where clusterKey1 = ? and partkey1=? and partkey2=?");
selectStamt = new BoundStatement(prprdStmnt);
}
synchronized (selectStamt) {
res = session.execute(selectStamt.bind("clusterKey", "partkey1", "partkey2"));
}
In another thread, I am doing some update operations on this table on different data the same way.
In case of measuring throughput, I measuring number of records processed per sec and its processing only 50-80 rec.
When you have a lot of SSTables, the distribution of your data among those SSTables is very important. Since you are using SizeTieredCompactionStrategy, SSTables get compacted and merged approximately when there are 4 SSTables the same size.
If you are updating data within the same partition frequently and at different times, it's likely your data is spread across many SSTables which is going to degrade performance as there will be multiple reads of your SSTables.
In my opinion, the best way to confirm this is to execute cfhistograms on your table:
nodetool -h localhost cfhistograms keyspace table2
Depending on the version of cassandra you have installed, the output will be different, but it will include a histogram of number of SSTables read for a given read operation.
If you are updating data within the same partition frequently and at different times, you could consider using LeveledCompactionStrategy (When to use Leveled Compaction). LCS will keep data from the same partition together in the same SSTable within a level which greatly improves read performance, at the cost of more Disk I/O doing compaction. In my experience, the extra compaction disk I/O more than pays off in read performance if you have a high ratio of reads to writes.
EDIT: With regards to your question about your throughput concerns, there are a number of things that are limiting your throughput.
A possible big issue is that unless you have many threads making that same query at a time, you are making your request serially (one at a time). By doing this, you are severely limiting your throughput as another request can not be sent until you get a response from Cassandra. Also, since you are synchronizing on selectStmt, even if this code were being executed by multiple threads, only one request could be executed at a time anyways. You can dramatically improve throughput by having multiple worker threads that make the request for you (if you aren't already doing this), or even better user executeAsync to execute many requests asynchronously. See Asynchronous queries with the Java driver for an explanation on how the request process flow works in the driver and how to effectively use the driver to make many queries.
If you are executing this same code each time you make a query, you are creating an extra roundtrip by calling 'session.prepare' each time to create your PreparedStatement. session.prepare sends a request to cassandra to prepare your statement. You only need to do this once and you can reuse the PreparedStatement each time you make a query. You may be doing this already given your statement null-checking (can't tell without more code).
Instead of reusing selectStmt and synchronizing on it, just create a new BoundStatement off of the single PreparedStatement you are using each time you make a query. This way no synchronization is needed at all.
Aside from switching compaction strategies (this is expensive, you will compact hard for a while after the change) which as Andy suggests will certainly help your read performance, you can also tune your current compaction strategy to try to get rid of some of the fragmentation:
If you have pending compactions (nodetool compactionstats) -- then try to catch up by increasing compactionthrottling. Keep concurrent compactors to 1/2 of your CPU cores to avoid compaction from hogging all your cores.
Increase bucket size (increase bucket_high, drop bucket low)- dictates how similar sstables have to be in size to be compacted together.
Drop Compaction threshold - dictates how many sstables must fit in a bucket before compaction occurs.
For details on 2 and 3 check out compaction subproperties
Note: do not use nodetool compact. This will put the whole table in one huge sstable and you'll loose the benefits of compacting slices at a time.
In case of emergencies use JMX --> force user defined compaction to force minor compactions
You have many SSTable's and slow reads. The first thing you should do is to find out how many SSTable's are read per SELECT.
The easiest way is to inspect the corresponding MBean: In the MBean domain "org.apache.cassandra.metrics" you find your keyspace, below it your table and then the SSTablesPerReadHistorgram MBean. Cassandra records min, max, mean and also percentiles.
A very good value for the 99th percentile in SSTablesPerReadHistorgram is 1, which means you normally read only from a single table. If the number is about as high as the number of SSTable's, Cassandra is inspecting all SSTable's. In the latter case you should double-check your SELECT, whether you are doing a select on the whole primary key or not.

Resources