CPU 100% due to thousands of Pending Compaction - cassandra

Recently we inserted millions of records and deleted millions of records from a table, a table of size 10 GB was truncated.
We are running with 2 nodes with SizeTieredCompactionStrategy, currently CPU utilization is 100% and pending compaction is increasing , currently pending compaction is 293144
Any pointers to reduce CPU utilization and get the compaction done quickly.

reduce CPU utilization and get the compaction done quickly.
These two things are orthogonal. You can either accelerate the compaction (by using more resources) or limit the resources for the compactions so that your writes aren't affected but have it take longer.
If you have an ingest running against your cassandra cluster, I would try to ensure that it is not affected by your compactions. As long as the # of pending compactions is decreasing over time it's just a matter of time.
If you don't have reads or writes coming in (I.E. downtime or you're bootstrapping) it's okay to let compactions use up all your resources and finish fast.
How?
The levers are:
1) get/set compaction throughput (nodetool)-- only kicks in for the next available compaction. This is how fast the compaction will occur. Default is 16 mb/s if you have resources available, you can increase this to a larger number.
2) concurrent compactors -- there are 2 values you have to set in JMX. you can do this on the fly using jmxsh or jconsole, etc. This is the number of compactions you can run at a time (number of cores).
Monitoring
Watch nodetool compactionstats or OpsCenter (you can also chart pending compactions and set alerts) to find out the progress for the current compactions or nodetool comactionhistory for completed compactions.
Other things
a table of size 10 GB was truncated.
Truncates are free, no compaction needed.

Related

Cassandra read latency increases while writing

I have a cassandra cluster, its read latency increases during writes. The writes mostly happen via spark jobs during the night time. The writes happen in huge bursts, is there a way to reduce read latency during the writes. The writes happen using LOCAL_QUORUM and reads happen using LOCAL_ONE. Is there a way to reduce read latency when writes are happening?
Cassandra Cluster Configs
10 Node cassandra cluster (5 in DC1, 5 in DC2)
CPU: 8 Core
Memory: 32GB
Grafana Metrics
I can give some advice:
Use LCS compaction strategy.
Prefer round-robin load balancing policy for reads.
Choose partition_key wisely so that requests are not bombarded on a single partition.
Partition size also play a good role. Cassandra recommends to have smaller partition size. However, I have tested with Partitions of 10000 rows each with each row having size of 800 bytes. It worked better than with 3000 rows(or even 1 row). Very tiny partitions tend to increase CPU usage when data stored is large in terms of row count. However, very large partitions should be avoided even.
Replication Factor should be chosen strategically . Write consistency level should be decided considering the replication of all keyspaces.

STCS : how I can improve compaction performance?

I have six nodes Cassandra cluster, which host a large columnfamily (cql table) that is immuable (because it's a kind of an history table from an application point of view). Such table is about 400Go of compressed data, which is not that much!
So after truncating the table, then ingest the app history data in it, I trigger nodetool compact on it on each node, in order to have the best read performance, by reducing down the number of SSTables. The compaction strategy is STCS.
After running nodetool compact, I trigger nodetool compactionstats to follow the compaction progress :
id compaction type keyspace table completed total unit progress
xxx Compaction mykeyspace mytable 3.65 GiB 1.11 TiB bytes 0.32%
After hours I have on that same node :
id compaction type keyspace table completed total unit progress
xxx Compaction mykeyspace mytable 4.08 GiB 1.11 TiB bytes 0.36%
So the compaction process seems to work, but it's terribly slow.
Even with nodetool setcompactionthreshold -- 0, the compaction remains terribly slow. Moreover, CPU seems to be used to 100% because of that compaction.
Questions :
What are configurations parameters that I can tune to try to boost compaction performance ?
Could the 100% CPU when compaction occurs be related to GC pressure ?
If compaction is too slow, it is relevant to add more nodes, or add more CPU/RAM to each nodes ? Could it help ?
Performance of compaction depends on the underlying hardware - its performance depends on what kind of disks is used, etc. But it also depends on how many compaction threads are allowed to run, and what throughput is configured for compaction threads. From command line compaction throughput is configured by nodetool setcompactionthroughput, not the nodetool setcompactionthreshold as you used. And number of concurrent compactors is set with nodetool setconcurrentcompactors (but it's available in 3.1, IIRC). You can also configure default values in the cassandra.yaml.
So if you have enough CPU power, and good SSD disks, then you can bump compaction throughput, and number of compactors.

Do we need to run manual compaction with Leveled compaction strategy and SIzeTiered compaction strategy

We have a couple of tables with Leveled compaction strategy and SizeTiered compaction strategy. How often do we need to run compaction? Thanks in advance
TL;DR
Compaction runs on its own (as long as you did not disable autocompaction in the yaml).
Compaction - what is it?
Per the cassandra write path, we flush memtables to disk periodically into SSTables (sorted string tables) which are immutable. When you update an existing cell, it eventually gets written in an sstable. Possibly a different one than the original record. When we read, sometimes C* has to scan across various sstables (with some optimizations, see bloom filters) to find the latest version of a cell. In Cassandra, last write wins.
Compaction takes sstables and compacts them together removing duplicate data, to optimize reads. This is an automatic operation, though you can tune compactions to run more or less often.
Some useful details on Compaction
Size tiered compaction is the default compaction strategy in cassandra, it looks for sstables that are the same size and compacts them together when it finds enough (4 by default). Size tiered is less IO intensive than leveled and will work better in general when you have smaller boxes and rotational drives.
Leveled compaction is optimized for reads, when you have read heavy workloads or tight read SLA's with lots of updates leveled may make sense. Leveled compaction is more IO and CPU intensive because you are spending more cycles optimizing for reads, but the reads themselves should be faster and hit fewer SStables. Keep an eye on io wait and on pending compactions in nodetool compaction stats when you first enable these or if your workload grows.
Compaction Tunables / Levers
multi threaded compaction - turn it off, the overhead is bigger than the benefit. To the point where it's been removed in C* 2.1.
concurrent compactors - now defaults to 2, used to default to number of cores which is a bad default. If you're on the 2.0 branch and not running the latest DSE check this default and consider decreasing it to 2. this is the number of simultaneous compaction tasks you can run (different column families).
Compaction throttling - a way of limiting the amount of resources that compactions take up. You can tune this on the fly with nodetool getcompactionthreshold and nodetool setcompactionthreshold. You want to tune this to a point where you are not accumulating pending tasks. 0 --> unlimited. Unlimited is, unintuitively, not usually the fastest setting as the system may get bogged down.

Cassandra high number of SSTables

After launching some long running write jobs (batch insert from an Apache Spark Job with Spark Cassandra Connector), Cassandra (v. 2.1) created thousands of SSTables for the target table (more than 4500).
The minor compaction thresholds are set to the default values (4 to 32). This means that, in theory, a lot of minor compaction tasks should be scheduled automatically.
I checked the status and nodetool indicated that no tasks were being scheduled. I stopped doing any operation for few hours. Then I restarted the cluster multiple times. Waited some more time. Disabled and re-enabled autocompaction. Waited. Increased the throughput to 999 MB/s. Waited.
During these tests, just few minor compaction were randomly started in some nodes for a limited period of time. Most of the nodes have been doing nothing for an entire day.
Then, I decided to manually launch a Major compaction (it is going to take days... Amazon EBS).
Why is Cassandra not doing any minor auto-compaction, even if the number of SSTables is 100 times greater than the threshold (32) ?
The answer is in the documentation:
By default, a minor compaction can begin any time Cassandra creates four SSTables on disk for a column family. A minor compaction must begin before the total number of SSTables reaches 32.
The total number of my SSTables is fairly greater than 32...

Explanation required for a statement in Cassandra documentation

I was going through the DataStax documentation and found an interesting statement.
It claimed "Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound".
Can someone explain about how this claim is made? and what might be causing this behavior of Cassandra??
Thanks.
For different workloads, Cassandra clusters can be CPU, memory, I/O or (occasionally) network bound. The claim in the documentation is, if you start a new cluster and make lots of inserts, the cluster will initially be CPU bound but after a while it becomes bottlenecked on memory.
To process an insert, Cassandra needs to deserialize the messages from the clients, find which nodes should store the data and send messages to those nodes. Those nodes then store the data in an in memory data structure called a Memtable.
This is almost always CPU bound initially. However, as more data is inserted, the memtables grow large and are flushed to disk and new (empty) memtables are created. The flushed memtables are stored in files known as SSTables. There is an ongoing background process called compaction that merges SSTables together into progressively larger and larger files.
There are a few reasons why more memory will help at this stage:
If Cassandra is low on heap space, it will flush memtables when they are smaller. This creates smaller SSTables so more work to compact them.
If the workload involves overwrites or inserts to the same row at different times, it is much cheaper to do this if the row is still in a current memtable. If not, the overwrite and new column is stored in a new memtable, then flushed and merged during compaction. So again, less memory means more compaction work.
Your OS uses memory to buffer reads and writes during compaction. If the OS can't then there will be extra I/O, slowing down memtable flushing and compaction.
Inserts into Cassandra consume lots of Java objects so create work for the garbage collector. If the heap is too small inserts may be paused while GC runs to make some free heap. (On the other hand, if the heap is too large, inserts may be paused for a few seconds during stop-the-world GC.)
So inserts may become memory bound, but they could also become I/O bound. If there isn't enough I/O to flush memtables then inserts will become blocked once the memtable flush queue is full. So I think the claim could be a bit more accurate:
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory or I/O bound.

Resources