Cassandra 3.11.2 memory consumption

I have a Cassandra 3.11.2 cluster with three nodes (cassandra1-3) in GCP (Google Cloud), running CentOS 7. Each node is an n1-highmem-2 instance with a 5TB PD disk attached.
I used this blog as a guideline:
System Memory Guidelines for Cassandra AWS
They have been running for more than 90 days.
I let Cassandra manage its memory without any intervention (no Xms/Xmx settings), but I use the G1 settings.
Since the cluster was created, it has gone through 2-3 major compactions.
The three nodes were balanced (memory and disk space) until a few days ago.
Then the memory usage of node #3 dropped from 70% to 52%. Its disk usage dropped as well.
Everything works normally and I haven't seen anything strange in the log of #3.
Will #1 and #2 eventually balance out to match #3? Should I do something?

It seems that forcing tombstone removal helps:
ALTER TABLE foo.bar WITH compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4', 'unchecked_tombstone_compaction': 'true', 'tombstone_threshold': '0.1'};
After a short while, all 3 nodes started a compaction. Before the ALTER, the data partition was:
/dev/sdb 4.5T 2.9T 1.7T 63% /cassandra/data
After the compaction:
/dev/sdb 4.5T 1.8T 2.8T 39% /cassandra/data

Related

Read latency in Cassandra cluster - too many SSTables

We are facing read latency issues on our Cassandra cluster. One of the reasons I read about is that too many SSTables are touched per read query. According to the documentation available online, 99th-percentile reads should touch 1-3 SSTables. In my case, however, reads touch up to 20 SSTables.
(I have already tuned other parameters such as read-ahead, concurrent read threads, etc.)
Here is the output of the tablehistograms command for one of the tables.
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         10.00     51.01          43388.63      179             3
75%         14.00     73.46          62479.63      642             12
95%         17.00     126.93         107964.79     5722            124
98%         20.00     152.32         129557.75     14237           310
99%         20.00     182.79         129557.75     24601           535
Min         0.00      14.24          51.01         51              0
Max         24.00     74975.55       268650.95     14530764        263210
First, I thought compaction might be lagging, but that is not the case. I checked, and there are always 0 pending tasks in the output of the compactionstats command. I increased the compaction throughput and the number of concurrent compactors just to be on the safe side.
CPU usage, memory usage, and disk IO/IOPS are under control.
We are using the default compaction strategy. Here is the table metadata:
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 7776000
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Also, according to the compaction history, compaction happens once a day on some tables and once every 3 days on another.
It looks like the SSTable sizes are not similar enough to trigger compaction.
Can you please suggest what can be done to reduce the number of SSTables?
You can make compaction a bit more aggressive by lowering the min_threshold parameter of the compaction settings. With the default configuration, Cassandra waits until at least 4 files of similar size are available and only then triggers compaction. Start with 3; you may be able to lower it to 2, but you really need to track resource consumption so that compaction doesn't add a lot of overhead.
Check this document from the DataStax field team, who did a lot of tuning for DataStax customers.
You can try moving to the leveled compaction strategy; it's a better fit if you have lots of updates. Another option is to force a major compaction.
In ScyllaDB we have the incremental compaction strategy, which combines the best of size-tiered and leveled compaction.
You need to make sure that your queries are retrieving data from a single partition otherwise they will significantly affect the performance of your cluster.
If your queries target one partition only but still need to retrieve data from 20 SSTables, it indicates to me that you are constantly inserting/updating partitions and the data gets fragmented across multiple SSTables so Cassandra has to retrieve all the fragments and coalesce them to return the results to the client.
If the SSTables are very small (only a few kilobytes), then there's a good chance that your cluster is getting overloaded with writes and the nodes are constantly flushing memtables to disk, so you end up with tiny SSTables.
If the SSTables are not getting compacted together, it means that the file sizes are widely different. By default, SSTables get merged together if their sizes are within 0.5-1.5x the average size of the SSTables.
Alex Ott's suggestion to reduce min_threshold to 2 will help speed up compactions. But you really need to address the underlying issue. Don't be tempted to run nodetool compact without understanding the consequences and tradeoffs as I've discussed in this post. Cheers!
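The interplay between the 0.5-1.5x size rule and min_threshold described in the answers above can be illustrated with a small Python sketch. This is a simplified model, not Cassandra's actual implementation: the function names are made up for illustration, the SSTable sizes are hypothetical, and details such as the min_sstable_size option are ignored.

```python
def stcs_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size (simplified sketch)."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # An SSTable joins a bucket if its size is within
            # bucket_low..bucket_high times the bucket's current average.
            if bucket_low * avg < size < bucket_high * avg:
                bucket.append(size)
                break
        else:
            # No existing bucket fits, so this SSTable starts a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4, max_threshold=32):
    """Return the buckets that hold enough SSTables to be compacted."""
    return [b[:max_threshold] for b in stcs_buckets(sizes) if len(b) >= min_threshold]


if __name__ == "__main__":
    # Hypothetical SSTable sizes in MB, not taken from the question.
    sizes_mb = [10, 11, 12, 100, 110, 1000]
    print(stcs_buckets(sizes_mb))
    for threshold in (4, 3, 2):
        print(threshold, compaction_candidates(sizes_mb, min_threshold=threshold))
```

With these hypothetical sizes no bucket reaches the default min_threshold of 4, so nothing would be compacted, while a threshold of 3 or 2 makes one or two buckets eligible. Widely different file sizes keep the buckets small, which matches the behaviour described in the question.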

Cassandra compaction: does replication factor have any influence?

Let’s assume that the total disk usage of all keyspaces is 100GB before replication. The replication factor is 3, which makes the total physical disk usage 100GB x 3 = 300GB.
We use the default compaction strategy (size-tiered), and let’s assume the worst case, where Cassandra needs as much free space as the data itself to complete the compaction. Does Cassandra need 100GB (before replication) or 300GB (100GB x 3 with replication)?
In other words, when Cassandra needs free disk space to perform compaction, does the replication factor have any influence?
Compaction in Cassandra is local to a node.
Now let's say you have a 3 node cluster, replication factor is also 3, and the original data size is 100GB. This means that each node has 100GB worth of data.
Hence on each node, I will need 100GB free space to compact the data present on that node.
TL;DR: In the worst case, the free space required for compaction is equal to the total data present on the node.
Because data is replicated between the nodes, every node will need to have up to 100GB of free space, so it's 300GB in total, but not on one node.
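The arithmetic in these answers can be written down as a short Python sketch. It assumes that replicas are spread evenly across the nodes, which is an idealization, and the function names are made up for illustration.

```python
def data_per_node_gb(unique_data_gb, replication_factor, node_count):
    """Data stored on each node, assuming replicas are spread evenly."""
    return unique_data_gb * replication_factor / node_count


def worst_case_free_space_gb(unique_data_gb, replication_factor, node_count):
    """Worst-case STCS headroom per node: as much free space as local data."""
    return data_per_node_gb(unique_data_gb, replication_factor, node_count)


if __name__ == "__main__":
    # Numbers from the answers above: 100GB of data, RF 3, 3 nodes.
    print(worst_case_free_space_gb(100, 3, 3))  # 100.0
    # Hypothetical larger cluster: same data and RF on 6 nodes.
    print(worst_case_free_space_gb(100, 3, 6))  # 50.0
```

The replication factor therefore matters only through the amount of data that ends up on each node; the compaction itself never looks beyond the local disk.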

STCS : how I can improve compaction performance?

I have a six-node Cassandra cluster which hosts a large column family (CQL table) that is immutable (because it's a kind of history table from an application point of view). The table holds about 400GB of compressed data, which is not that much!
After truncating the table and then ingesting the app history data into it, I trigger nodetool compact on it on each node, in order to get the best read performance by reducing the number of SSTables. The compaction strategy is STCS.
After running nodetool compact, I run nodetool compactionstats to follow the compaction progress:
id   compaction type  keyspace    table    completed  total     unit   progress
xxx  Compaction       mykeyspace  mytable  3.65 GiB   1.11 TiB  bytes  0.32%
After hours, I have this on that same node:
id   compaction type  keyspace    table    completed  total     unit   progress
xxx  Compaction       mykeyspace  mytable  4.08 GiB   1.11 TiB  bytes  0.36%
So the compaction process seems to work, but it's terribly slow.
Even with nodetool setcompactionthreshold -- 0, the compaction remains terribly slow. Moreover, the CPU seems to be at 100% because of that compaction.
Questions :
What configuration parameters can I tune to try to boost compaction performance?
Could the 100% CPU usage during compaction be related to GC pressure?
If compaction is too slow, is it relevant to add more nodes, or to add more CPU/RAM to each node? Could it help?
Compaction performance depends on the underlying hardware, for example on what kind of disks are used. But it also depends on how many compaction threads are allowed to run and what throughput is configured for them. From the command line, compaction throughput is configured with nodetool setcompactionthroughput, not nodetool setcompactionthreshold as you used. The number of concurrent compactors is set with nodetool setconcurrentcompactors (but it's available in 3.1, IIRC). You can also configure default values in cassandra.yaml.
So if you have enough CPU power and good SSD disks, you can bump the compaction throughput and the number of compactors.
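To put the two compactionstats snapshots from the question into perspective, the observed rate and the remaining time can be estimated with a few lines of Python. The question only says "after hours", so the elapsed time of 3 hours used below is purely a hypothetical placeholder, and the function names are made up for illustration.

```python
GIB_PER_TIB = 1024
MIB_PER_GIB = 1024


def compaction_rate_mib_s(done_before_gib, done_after_gib, elapsed_hours):
    """Average compaction rate in MiB/s between two compactionstats snapshots."""
    progressed_mib = (done_after_gib - done_before_gib) * MIB_PER_GIB
    return progressed_mib / (elapsed_hours * 3600)


def remaining_hours(done_gib, total_tib, rate_mib_s):
    """Hours left at the given rate, based on the 'completed' and 'total' columns."""
    remaining_mib = (total_tib * GIB_PER_TIB - done_gib) * MIB_PER_GIB
    return remaining_mib / rate_mib_s / 3600


if __name__ == "__main__":
    # 3.65 GiB -> 4.08 GiB of 1.11 TiB are from the question;
    # the 3 hours between snapshots is an assumption.
    rate = compaction_rate_mib_s(3.65, 4.08, elapsed_hours=3)
    print(f"rate: {rate:.4f} MiB/s")
    print(f"remaining: {remaining_hours(4.08, 1.11, rate):.0f} hours")
```

Whatever the real interval was, the resulting rate can be compared with the value reported by nodetool getcompactionthroughput to see whether the throttle is the limiting factor at all.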

New cassandra node should finish compaction before joining ring

I’d like to know if there is a way to have a Cassandra node join the ring only after it has finished streaming and compaction. The issue I’m experiencing is that when I add a node to my cluster, it streams data from the other nodes and then joins the ring. At this point it begins a lot of compactions, which take a very long time to complete (more than a day). During this time, CPU utilization on that node is nearly 100% and the bloom filter false positive ratio is very high as well, which happens to be relevant to my use case. This causes the whole cluster to experience an increase in read latency, with the newly joined node in particular having 10x the typical read latency.
I read this post http://www.datastax.com/dev/blog/bootstrapping-performance-improvements-for-leveled-compaction which has this snippet about one way to possibly improve read latency when adding a node.
“Operators usually avoid this issue by passing -Dcassandra.join_ring=false when starting the new node and wait for the bootstrap to finish along with the followup compactions before putting the node into the ring manually with nodetool join”
The documentation on the join_ring option is pretty limited, but after experimenting with it, it seems that streaming data and the later compaction can’t be initiated until after I run nodetool join for the new host, so I’d like to know how or if this can be achieved.
Right now my use case is just deduping records being processed by a Kafka consumer application. The table in Cassandra is very simple, just a primary key, and the queries just insert new keys with a TTL of several days and check for the existence of a key. The cluster needs to perform 50k reads and 50k writes per second at peak traffic.
I’m running Cassandra 3.7. My cluster is in EC2, originally on 18 m3.2xlarge hosts. Those hosts were running at very high (~90%) CPU utilization during compactions, which was the impetus for trying to add new nodes to the cluster. I’ve since switched to c3.4xlarge to get more CPU without having to actually add hosts, but it would be helpful to know at what CPU threshold I should be adding new hosts, since waiting until 90% is clearly not safe and adding new hosts exacerbates the CPU issue on the newly added host.
CREATE TABLE dedupe_hashes (
app int,
hash_value blob,
PRIMARY KEY ((app, hash_value))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '90PERCENTILE';
One thing to check that may help avoid the 100% CPU usage due to compactions is the "compaction_throughput_mb_per_sec" setting.
By default it should be 16 (MB/s), but check whether you have throttling disabled (set to 0):
nodetool getcompactionthroughput
You can also set the value dynamically:
nodetool setcompactionthroughput 16
(Set it in cassandra.yaml to persist after a restart.)

Cassandra taking too much memory for the data being written

We are using Apache Cassandra 3.0.7, and of late we see that 90% of memory is occupied on almost all nodes, even though the disk is hardly used. We have a cluster of 5 nodes with 15 GB memory, 4 cores, and a 200 GB SSD each.
We tried all kinds of configurations, through both the YAML and table-level properties, but none seems to help. Memory usage constantly increases, almost in direct proportion to the data.
Since our application is write-intensive, we are okay with reduced read performance but would like to use as little memory as possible. To do this, our idea was to disable all possible caches and avoid keeping anything unnecessary in memory. But nothing so far seems to help.
Here's our yaml: http://pastebin.com/yeRGcHRt
and here's our table configuration:
CREATE KEYSPACE if not exists test_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE if not exists test_ks.test_cf (
id bigint PRIMARY KEY,
key_str text,
value1 int,
value2 int,
update_ts bigint
) WITH bloom_filter_fp_chance = 1
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 10240
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 10240
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE'
AND caching = {'keys': 'NONE', 'rows_per_partition': 'NONE'};
We have seen that most of the consumption is off-heap; heap memory is capped at 4.5 GB. So out of a total of 14 GB on a node, only 4.5 GB is consumed by the heap.
Has anyone tried such a configuration before? Please let us know if disabling the caches would help us in this situation and, if so, how we can disable caching completely. Looking forward to your help.
We are experiencing a similar problem. After upgrading from Cassandra 2.x to 3.11.0, Cassandra is using <2GB on-heap and >10GB off-heap, on a use case that didn't have any problems before. This results in the (Windows) machine staying pegged at 99.5% memory usage continually. Heap memory is similarly capped at 2GB.
Most caching values are left to the defaults; in particular the row cache is disabled.
EDIT: I have a better answer. It appears (still testing) that the slowness in our case was because Windows' page file was not disabled. Cassandra recommends disabling the swap file on Linux or the page file on Windows. It also outputs a warning on startup if a swap or page file is detected.
Cassandra's off-heap memory, at least on Windows, is mostly due to memory-mapped IO of files, which is apparently (from reading the Cassandra issue tracker) significantly faster. However, if a swap/page file is enabled, things are forced out of physical memory by mmapped files and experience a huge slowdown from swapping to disk. Disabling the page file on Windows appears, in our testing, to mitigate this significantly. Cassandra is still using lots of memory for mmapped files, but as no memory is being swapped to disk, some combination of Cassandra and the OS properly frees up the mmapped files so that other processes can run smoothly. I used this tool to confirm the presence of mmapped files on Windows.
Try setting -XX:MaxDirectMemorySize. It will limit the use of off-heap memory.
To decrease memory usage, try setting the following parameters
MAX_HEAP_SIZE, HEAP_NEWSIZE
in cassandra-env.sh to the values you want.

Resources