Major compaction in Cassandra

I have a 4-node Brisk cluster with 2 Cassandra nodes in the Cassandra DC and 2 Brisk nodes in the Brisk DC. I stress tested this setup with 10 million writes using the stress tool that ships with Cassandra.
On executing:
$ ./nodetool -h x.x.x.x compactionstats
pending tasks: 17
compaction type   keyspace    column family   bytes compacted   bytes total   progress
Major             Keyspace1   Standard1       45172473          60278166      74.94%
AFAIK a major compaction is triggered manually via nodetool, yet here it appears to have been triggered automatically.
Is this the expected behavior? If so, in what situations can it occur?
Regards,
Tamil

From the doc:
Compactions are triggered when at least N SStables have been flushed
to disk, where N is tunable and defaults to 4.
"Minor" compactions merge sstables of similar size; "major" compactions merge all sstables in a given ColumnFamily.
Again from the doc:
A major compaction is triggered either via nodeprobe, or automatically:
Nodeprobe sends TreeRequest messages to all neighbors of the target
node: when a node receives a TreeRequest, it will perform a readonly
compaction to immediately validate the column family.
Automatic compactions will also validate a column family and broadcast
TreeResponses, but since TreeRequest messages are not sent to
neighboring nodes, repairs will only occur if two nodes happen to
perform automatic compactions within TREE_STORE_TIMEOUT of one
another.
You may find more info here and here
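For reference, a major compaction can also be started by hand with nodetool; a minimal sketch, reusing the Keyspace1/Standard1 column family from the stress run above (the host is whatever node you want to compact):
# Trigger a major compaction of one column family on one node
$ ./nodetool -h x.x.x.x compact Keyspace1 Standard1
# Follow its progress, as in the output shown in the question
$ ./nodetool -h x.x.x.x compactionstats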

Related

Why do nodes in a fully replicated Cassandra cluster have different data sizes?

I have a 3-node Cassandra cluster (version 3.11.11) with replication factor 3. Only 2 of the nodes receive requests, and Node3 only syncs with the other 2 nodes.
In theory, each node should have the same data size. But in practice, I end up with nodes with different data sizes as shown in the picture.
We run nodetool repair daily; operations like compaction are done automatically with the default settings.
What can be the reason for the size difference?
It ultimately comes down to how the data gets compacted in the long run. Compaction is a local process, and how SSTables stack up on each node cannot be guaranteed, so I don't see any aberration here. In theory all nodes hold the same data logically, but physically the sizes may vary. For example, Node3 may have old SSTables that are not getting compacted because of their size (if you are using STCS), while the other nodes have compacted theirs and reduced their on-disk size.
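If you want to see where the difference comes from, a rough sketch is to compare the per-node load and the SSTable layout for the table in question (the keyspace/table name below is a placeholder, not from the question):
# Per-node on-disk load as reported by the cluster
$ nodetool status
# On each node: SSTable count and live space for the suspect table
$ nodetool tablestats mykeyspace.mytable | grep -E 'SSTable count|Space used'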

Cassandra: Are SSTables created for replicated data?

Suppose we have a 3-node Cassandra cluster and the replication factor is 2.
During compaction on a node, does the replica data it holds for other primary nodes also get compacted?
For example, if node 1 owns the partition range 1-10 and nodes 2 and 3 are replicas for node 1, when we initiate compaction on node 2, will the SSTables on node 2 include the data replicated from node 1?
TIA
Data replication happens either in real time when you write data, via hints if a node was offline, or via an explicit repair operation (although repair uses a special type of compaction).
The compaction process is independent on every node: different factors can lead to flushes happening at different times, different data sizes, etc. Compacting data on one node won't send data to other nodes, unless it's a validation compaction triggered via repair.
I recommend reading the DSE Architecture guide, which explains how data is replicated, etc.
P.S. In your example, if you have RF=2, then only one node will be a replica for Node1...
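As a small sketch of how replica placement is declared and inspected (the keyspace, table, datacenter name, and key below are hypothetical):
# RF=2: each partition is stored on two nodes in total
$ cqlsh -e "CREATE KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};"
# After creating a table in it, list the two nodes holding a given partition key
$ nodetool getendpoints demo mytable somekey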

STCS: how can I improve compaction performance?

I have a six-node Cassandra cluster which hosts a large column family (CQL table) that is immutable (it's a kind of history table from the application's point of view). The table is about 400 GB of compressed data, which is not that much!
So after truncating the table and then ingesting the application's history data into it, I trigger nodetool compact on it on each node, in order to get the best read performance by reducing the number of SSTables. The compaction strategy is STCS.
After running nodetool compact, I run nodetool compactionstats to follow the compaction progress:
id    compaction type   keyspace     table     completed   total      unit    progress
xxx   Compaction        mykeyspace   mytable   3.65 GiB    1.11 TiB   bytes   0.32%
Hours later, on that same node, I have:
id    compaction type   keyspace     table     completed   total      unit    progress
xxx   Compaction        mykeyspace   mytable   4.08 GiB    1.11 TiB   bytes   0.36%
So the compaction process seems to work, but it's terribly slow.
Even with nodetool setcompactionthreshold -- 0, compaction remains terribly slow. Moreover, CPU usage seems to hit 100% because of that compaction.
Questions:
What configuration parameters can I tune to try to boost compaction performance?
Could the 100% CPU during compaction be related to GC pressure?
If compaction is too slow, is it relevant to add more nodes, or to add more CPU/RAM to each node? Could that help?
Compaction performance depends on the underlying hardware, in particular on what kind of disks are used, etc. But it also depends on how many compaction threads are allowed to run and what throughput is configured for the compaction threads. From the command line, compaction throughput is configured with nodetool setcompactionthroughput, not nodetool setcompactionthreshold as you used. And the number of concurrent compactors is set with nodetool setconcurrentcompactors (but that's only available since 3.1, IIRC). You can also configure the default values in cassandra.yaml.
So if you have enough CPU power and good SSD disks, you can bump up the compaction throughput and the number of compactors.
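A minimal sketch of those knobs (the values are illustrative, not recommendations):
# Raise compaction throughput at runtime, in MB/s (0 means unthrottled)
$ nodetool setcompactionthroughput 128
$ nodetool getcompactionthroughput
# Persistent defaults live in cassandra.yaml (restart required), e.g.:
#   compaction_throughput_mb_per_sec: 128
#   concurrent_compactors: 4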

What does Cassandra nodetool repair exactly do?

From http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
But how does it fix the inconsistencies? It's written that it uses Merkle trees, but those are for comparison, not for fixing 'broken' data.
How can the data become 'broken'? Any common cases besides hard drive failure?
A side question: it's compaction that evicts tombstones, right? So is the requirement to run nodetool repair more frequently than gc_grace_seconds only there to ensure that all data is spread to the appropriate replicas? Shouldn't that be the usual scenario anyway?
The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
The reason you need to run repair within gc grace seconds is so that tombstones will be sync'd to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.
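As a sketch of the usual routine: run a primary-range repair on each node on a schedule shorter than gc_grace_seconds (the keyspace/table names below are placeholders; gc_grace_seconds defaults to 10 days):
# Repair only the primary token ranges owned by this node, for one keyspace
$ nodetool repair -pr mykeyspace
# Check the table's gc_grace_seconds in the schema
$ cqlsh -e "DESCRIBE TABLE mykeyspace.mytable;" | grep gc_grace_seconds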

Cassandra high number of SSTables

After launching some long-running write jobs (batch inserts from an Apache Spark job with the Spark Cassandra Connector), Cassandra (v. 2.1) created thousands of SSTables for the target table (more than 4500).
The minor compaction thresholds are set to the default values (4 to 32). This means that, in theory, a lot of minor compaction tasks should be scheduled automatically.
I checked the status and nodetool indicated that no tasks were being scheduled. I stopped doing any operations for a few hours. Then I restarted the cluster multiple times. Waited some more time. Disabled and re-enabled autocompaction. Waited. Increased the throughput to 999 MB/s. Waited.
During these tests, only a few minor compactions started sporadically on some nodes for a limited period of time. Most of the nodes did nothing for an entire day.
Then, I decided to manually launch a Major compaction (it is going to take days... Amazon EBS).
Why is Cassandra not doing any minor auto-compaction, even though the number of SSTables is 100 times greater than the threshold (32)?
The answer is in the documentation:
By default, a minor compaction can begin any time Cassandra creates four SSTables on disk for a column family. A minor compaction must begin before the total number of SSTables reaches 32.
The total number of my SSTables is far greater than 32...
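For what it's worth, the thresholds and the auto-compaction state can be checked and adjusted per table with nodetool; a sketch with placeholder keyspace/table names:
# Show the current min/max SSTable thresholds for the table
$ nodetool getcompactionthreshold mykeyspace mytable
# Reset them to the defaults (min 4, max 32)
$ nodetool setcompactionthreshold mykeyspace mytable 4 32
# Make sure auto-compaction is actually enabled for the table
$ nodetool enableautocompaction mykeyspace mytable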
