After launching some long-running write jobs (batch inserts from an Apache Spark job using the Spark Cassandra Connector), Cassandra (v2.1) created thousands of SSTables for the target table (more than 4500).
The minor compaction thresholds are set to the default values (4 to 32). This means that, in theory, a lot of minor compaction tasks should be scheduled automatically.
I checked the status and nodetool indicated that no tasks were being scheduled. I stopped doing any operations for a few hours. Then I restarted the cluster multiple times. Waited some more time. Disabled and re-enabled autocompaction. Waited. Increased the compaction throughput to 999 MB/s. Waited.
During these tests, only a few minor compactions started sporadically on some nodes, and only for a limited period of time. Most of the nodes did nothing for an entire day.
Then I decided to manually launch a major compaction (it is going to take days... Amazon EBS).
Why is Cassandra not doing any minor auto-compaction, even though the number of SSTables is more than 100 times greater than the threshold (32)?
The answer is in the documentation:
By default, a minor compaction can begin any time Cassandra creates four SSTables on disk for a column family. A minor compaction must begin before the total number of SSTables reaches 32.
The total number of my SSTables is far greater than 32...
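For reference, the thresholds and knobs mentioned above can be inspected and toggled per table with nodetool, roughly like this (my_keyspace / my_table are placeholders):

# Show the min/max SSTable-count thresholds configured for the table
nodetool getcompactionthreshold my_keyspace my_table

# Re-enable autocompaction for the table and raise the throughput cap
# (999 MB/s matches the value tried above; 0 removes the cap entirely)
nodetool enableautocompaction my_keyspace my_table
nodetool setcompactionthroughput 999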
Related
We have an 18-node Cassandra cluster in production, and I need to reduce repair time using Reaper. I have scheduled an incremental repair with Reaper version 2.2.3 using the following values:
Segment count per node 16
Intensity 0.94
Repair threads 3
Each node has 4 CPU cores, so I can't increase the number of repair threads further.
In the config file of Reaper (cassandra-reaper.yaml) I can see the following values:
segmentCountPerNode: 32
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 15
hangingRepairTimeoutMins: 240
incrementalRepair: true
maxParallelRepairs: 2
Can I change the value of the above parameters to reduce the timing of the whole repair process?
Since I used incremental repair, my expectation was that repairing each node would take less than an hour, not more than 3 hours!
One aspect that weighs in here is the amount of data on each node. That comes into play if you're bottlenecked on either network or disk I/O, and it makes a HUGE impact when it comes to streaming data (for repairs).
So if you have (for example) 18 nodes @ 500GB each, doubling your node count to 36 nodes @ 250GB each should help. Yes, it should take the exact same amount of time. But repair streams on smaller nodes are much less likely to hang.
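For checking where each node stands, the per-node data size the answer refers to shows up in the Load column of nodetool status:

# The Load column shows the on-disk data size per node, listed per datacenter
nodetool status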
We are using Apache Cassandra v3.0.9 and have 3 DCs. We are experiencing continual trouble while running nodetool repair, and most of the time the repair process causes big outages. We have 3 different datacenters consisting of 4, 4 & 15 nodes. The total data is around 200 GB at RF=3 and we are using LCS. The RAM is 16 GB, of which 6 GB is dedicated to the heap. Most of the time when we try to run a full repair, the repair process fails with long GC pauses and nodes becoming unresponsive. Outside of repair, our nodes are fine on heap and GC pauses are hardly 300 ms. I have the following doubts.
Is it still required to run a full repair within gc_grace_seconds, or are incremental repairs alone good enough in Apache Cassandra v3.0.9?
Do I need to run incremental sequential repairs on every node of the cluster, on one node of each datacenter, or on just any one node of the whole cluster? One by one or concurrently?
What are the downsides of a repair failing because some nodes became unresponsive/died during the repair process? Are there any steps to take care of before starting another repair session?
What are the downsides of not scheduling repairs at all?
We started our Cassandra deployment straight away on version 3.0.9. Is the migration mentioned in the Apache Cassandra documentation still required?
Full repair is still needed. Incremental repair splits SSTables into "repaired" and "unrepaired" parts, and the "repaired" part will not be repaired again later; that's why incremental repair is more efficient. However, if there is data corruption in the "repaired" SSTables, only a full repair can fix it. Our experience is to run incremental repair every day and full repair only once per grace period. Also, when you use incremental repair, you can make the grace period longer.
Better to run incremental repair on each node one by one; you can set up a cron job or code a simple scheduler to do that (a cron sketch follows this list).
If a repair fails, just run it again; there are no side effects I know of.
If you don't do repairs, your data consistency is at risk as time goes on. Cassandra follows the eventual consistency model, which means it doesn't guarantee strong consistency when you write data unless you explicitly request it. Repair is very important to guarantee that the data in the background is kept up to date and consistent.
If you have already run full repairs in your cluster, you shouldn't need to migrate explicitly.
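To illustrate the cron idea above, a minimal sketch (the schedule, keyspace name and log path are assumptions, not recommendations):

# Nightly incremental repair on this node (incremental is the default repair mode in 3.x)
0 2 * * * nodetool repair my_keyspace >> /var/log/cassandra/repair.log 2>&1
# Full repair once per gc_grace period, e.g. weekly
0 4 * * 0 nodetool repair --full my_keyspace >> /var/log/cassandra/repair.log 2>&1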
Recently we inserted millions of records into a table and deleted millions of records from it, and a table of size 10 GB was truncated.
We are running 2 nodes with SizeTieredCompactionStrategy. CPU utilization is currently at 100% and pending compactions keep increasing; the count is currently 293144.
Any pointers to reduce CPU utilization and get the compaction done quickly?
reduce CPU utilization and get the compaction done quickly.
These two things are orthogonal. You can either accelerate the compaction (by using more resources) or limit the resources for the compactions so that your writes aren't affected but have it take longer.
If you have an ingest running against your Cassandra cluster, I would try to ensure that it is not affected by your compactions. As long as the number of pending compactions is decreasing over time, it's just a matter of time.
If you don't have reads or writes coming in (i.e. downtime or you're bootstrapping), it's okay to let compactions use up all your resources and finish fast.
How?
The levers are:
1) Compaction throughput (nodetool getcompactionthroughput / setcompactionthroughput) -- only kicks in for the next available compaction. This is how fast the compaction will occur. The default is 16 MB/s; if you have resources available, you can increase this to a larger number.
2) Concurrent compactors -- there are two values you have to set via JMX; you can do this on the fly using jmxsh, jconsole, etc. This is the number of compactions you can run at a time (bounded by the number of cores).
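A rough sketch of those two levers (the throughput value is illustrative; verify the MBean details against your version):

# Lever 1: compaction throughput cap in MB/s (0 = unthrottled)
nodetool getcompactionthroughput
nodetool setcompactionthroughput 64

# Lever 2: concurrent compactors -- adjust the CoreCompactorThreads and
# MaximumCompactorThreads attributes of the
# org.apache.cassandra.db:type=CompactionManager MBean via jconsole or jmxsh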
Monitoring
Watch nodetool compactionstats or OpsCenter (you can also chart pending compactions and set alerts) to follow the progress of the current compactions, or nodetool compactionhistory for completed compactions.
Other things
a table of size 10 GB was truncated.
Truncates are free, no compaction needed.
We have a couple of tables with Leveled compaction strategy and SizeTiered compaction strategy. How often do we need to run compaction? Thanks in advance
TL;DR
Compaction runs on its own (as long as you did not disable autocompaction in the yaml).
Compaction - what is it?
Per the Cassandra write path, we flush memtables to disk periodically into SSTables (sorted string tables), which are immutable. When you update an existing cell, it eventually gets written to an SSTable, possibly a different one than the original record. When we read, C* sometimes has to scan across various SSTables (with some optimizations; see bloom filters) to find the latest version of a cell. In Cassandra, the last write wins.
Compaction takes SSTables and compacts them together, removing duplicate data, to optimize reads. This is an automatic operation, though you can tune compactions to run more or less often.
Some useful details on Compaction
Size-tiered compaction is the default compaction strategy in Cassandra. It looks for SSTables of a similar size and compacts them together when it finds enough (4 by default). Size-tiered is less IO-intensive than leveled and will generally work better when you have smaller boxes and rotational drives.
Leveled compaction is optimized for reads; when you have read-heavy workloads or tight read SLAs with lots of updates, leveled may make sense. Leveled compaction is more IO- and CPU-intensive because you are spending more cycles optimizing for reads, but the reads themselves should be faster and hit fewer SSTables. Keep an eye on iowait and on pending compactions in nodetool compactionstats when you first enable it or if your workload grows.
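For reference, the strategy is a per-table setting, so switching an existing table between the two is just a schema change; a minimal sketch via cqlsh (keyspace/table names are placeholders):

# Move a table to leveled compaction
cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {'class': 'LeveledCompactionStrategy'};"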
Compaction Tunables / Levers
Multi-threaded compaction - turn it off; the overhead is bigger than the benefit, to the point where it has been removed in C* 2.1.
Concurrent compactors - now defaults to 2; it used to default to the number of cores, which is a bad default. If you're on the 2.0 branch and not running the latest DSE, check this default and consider decreasing it to 2. This is the number of simultaneous compaction tasks you can run (on different column families).
Compaction throttling - a way of limiting the amount of resources that compactions take up. You can tune this on the fly with nodetool getcompactionthroughput and nodetool setcompactionthroughput. You want to tune this to a point where you are not accumulating pending tasks. 0 means unlimited. Unlimited is, unintuitively, not usually the fastest setting, as the system may get bogged down.
I have a 4-node Brisk cluster with 2 Cassandra nodes in the Cassandra DC and 2 Brisk nodes in the Brisk DC. I stress tested this setup for 10 million writes using the stress tool that ships with Cassandra.
On executing
$ ./nodetool -h x.x.x.x compactionstats
pending tasks: 17
compaction type keyspace column family bytes compacted bytes total progress
Major Keyspace1 Standard1 45172473 60278166 74.94%
AFAIK a major compaction is manually triggered from nodetool, but I can see that it has been triggered automatically.
Is this a desired behavior? If so what are all the situations this may occur?
Regards,
Tamil
From the doc:
Compactions are triggered when at least N SStables have been flushed to disk, where N is tunable and defaults to 4.
"Minor" compactions merge sstables of similar size; "major" compactions merge all sstables in a given ColumnFamily.
Again from the doc:
A major compaction is triggered either via nodeprobe, or automatically:
Nodeprobe sends TreeRequest messages to all neighbors of the target node: when a node receives a TreeRequest, it will perform a readonly compaction to immediately validate the column family. Automatic compactions will also validate a column family and broadcast TreeResponses, but since TreeRequest messages are not sent to neighboring nodes, repairs will only occur if two nodes happen to perform automatic compactions within TREE_STORE_TIMEOUT of one another.
You may find more info here and here
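For reference, the manual trigger mentioned above is just a nodetool call against the stressed column family:

# Manually run a major compaction on the keyspace/column family from the stress run
nodetool -h x.x.x.x compact Keyspace1 Standard1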