Cassandra Compaction vs Repair vs Cleanup - cassandra

After posting a question and reading this and that articles, I still do not understand the relations between those three operations-
Cassandra compaction tasks
nodetool repair
nodetool cleanup
Is repair task can be processed while compaction task is running, or cleanup while compaction task is running? Is cleanup is a operation that need to be executed weekly as repair? Why repair operation need to be executed manually and it is not in Cassandra default behavior?
What is the ground rules for healthy cluster maintenance?

A cleanup is a compaction that just removes things outside the nodes token range(s). A repair has a "Validation Compaction" to build a merkle tree to compare with the other nodes, so part of nodetool repair will have a compaction.
Is repair task can be processed while compaction task is running, or cleanup while compaction task is running?
There is a shared pool of for the compactions across normal compactions, repairs, cleanups, scrubs etc. This is the concurrent_compactors setting in the cassandra.yaml that defaults to a combination of the number of cores and data directories: https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L572
Is cleanup is a operation that need to be executed weekly as repair?
no, only after topology changes really.
Why repair operation need to be executed manually and it is not in Cassandra default behavior?
Its manual because its requirements can differ a lot on what your data and gc_grace requirements are. https://issues.apache.org/jira/browse/CASSANDRA-10070 is bringing it inside Cassandra though so in the future it will be automatic.
What is the ground rules for healthy cluster maintenance?
I would (opinion) say:
Regular backups (depending on requirements, and acceptable data loss
this can be anything from weekly/daily to constantly with incremental).
This is just as much for "internal" mistakes ("Opps i deleted a customer") as outages. Even with strong multi-dc replication you want some minimum backups.
Making sure a Repair completes for all tables that have deletes at least once within the gc_grace time of those tables.
Metric and log storage pretty important if you want to be able to debug issues.

Related

How often should I run nodetool compact and repair in Cassandra?

We have 14 node cassandra cluster v 3.5.
Can someone enlighten with compact & repair ?
If I am running from one of node, does this needs to be runs from all the nodes in cluster
nodetool compact
I see it is very slow, how often this supposed to be run ?
Same question regarding nodetool repair ( All nodes or certain nodes in cluster)
nodetool repair or
nodetool repair -pr
how often this supposed to be run ?
Compactions are part of the normal operation of Cassandra nodes. They run automatically in the background (otherwise known as minor compactions) and get triggered by each table's defined compaction strategy based on any combination of configured thresholds and compaction sub-properties. This video extract from the DS201 Cassandra Foundations course at the DataStax Academy talks about compactions in more detail.
It is not necessary for an operator/administrator to manually kick off compactions with nodetool compact. In fact, it is not recommended to trigger user-defined compactions (otherwise known as major compactions) because they can create problems down the track like the one I explained in this post -- https://community.datastax.com/questions/6396/.
Repairs on the other hand is something that needs to be managed by a cluster administrator. Since Cassandra has a distributed architecture, it is necessary to run repairs to keep the copies of the data consistent between replicas (Cassandra nodes).
Repairs need to be run at least once every gc_grace_seconds (configured per table). By default, GC grace is 10 days (864000 seconds) so most DB admins run a repair on each node once a week. This short video from the DS210 Cassandra Operations course provides a good overview of Cassandra repairs.
Running a partitioner range repair (with -pr flag) on a node repairs only the data that a node owns so it is necessary to run nodetool repair -pr on each node, one node at a time, until all nodes in the cluster have been repaired. This blog post by Jeremiah Jordan is a good explanation of why this is necessary.
If you're interested, datastax.com/dev has free resources for learning Cassandra. The Cassandra Fundamentals series in particular is a good place to start. It is a collection of short online tutorials where you can quickly learn the basic concepts for free. Cheers!
If I am running from one of node, does this needs to be runs from all the nodes in cluster nodetool compact I see it is very slow, how often this supposed to be run ?
You should not run nodetool compact command generally. Compactions are by default meant to run automatically behind the scene if not disabled. Running compaction manually may create more problems and should be avoided for most of the cases. Auto compactions which run behind the scene should be able to handle your compactions. If you feel your compactions are slow you can tune your compactions by looking after the parameters related to compaction here (Mostly concurrent_compactors and compaction_throughput_mb_per_sec)
Same question regarding nodetool repair ( All nodes or certain nodes
in cluster) nodetool repair or nodetool repair -pr how often this
supposed to be run ?
Repair is a maintenance task which should be run on all then nodes once before each gc_grace_seconds period. For example default gc_grace_seconds is equal to 10 days so it is required to run repair on all the nodes once in this 10 day period. You should schedule your repair to run regularly once in gc_grace_seconds period. Regarding which option to use for running repair. If you are doing it by yourself you should run nodetool repair -pr on all the nodes one by one.

Guideline regarding nodetool repair in apache cassandra v3.0.9

We are using Apache-Cassandra v3.0.9 and have 3 DC. We are experiencing continuous troubles while running nodetool repair and most of the time the repair process causes big outages. We have 3 different datacenters consisting of 4, 4 & 15 nodes. The total data is around 200 GB at RF=3 and we are using LCS. The RAM is 16 GB, out of which 6 GB is dedicated as heap. Most of the times we try to run full repair the repair process fails with long GC pauses and node becoming unresponsive. Other than at the time of repair our nodes are good on heap and GC pauses are hardly 300 ms. I have following doubts.
Is it still required to run full repair before gc_grace_seconds or just the incremental repairs are good enough in apache cassandra v3.0.9
Do I need to run incremental sequential repairs on every node of the cluster, any one node of each of the datacenters or just any node of the whole cluster? One-by-one or concurrently?
What are the downsides of repair failing because some nodes became unresponsive/died during the repair process, any steps to take care of before starting another repair session.
What are the downsides of not scheduling repairs at all?
We started our cassandra deployment straight away on version 3.0.9. Is the migration as mentioned on Apache Cassandra documentation still required?
full repair is still needed. incremental repair would split SSTables into "repaired" and "unrepaired" parts and the "repaired" part will not be repaired later and that's why incremental repair is more efficient. However, if there're data corruption in the "repaired" sstables, only full repair can fix that; our experience is to run incremental repair every day and full repair only once per grace period. Also, when you have incremental repair, you can make the grace period longer.
better run incremental repair on each node one by one; you can have a cron job or code a simple scheduler to do that;
repair failure, just run it again; no side effect I know of.
if you don't do repair, as time goes, you data consistency is at danger; Cassandra takes the eventual consistency concept, which means that it doesn't guarantee strong consistency when you write data to it unless you explicitly specify that. repair is very important to guarantee the data in the background are all kept up to date and consistent;
if you run full repair already in your cluster, you shouldn't need to migrate explicitly.

Is it recommended to do periodic cassandra repair

We recently had a disk fail in one of our Cassandra node (its a 5 Cassandra 2.2 cluster with replication factor of 3). It took about a week or more to perform a full repair on that node. Each node contains 3/5 of the data and doing nodetool repair repaired 3/5 of the token ranges across all nodes. Now that its been repaired it will most likely repair faster since it did a incremental repair. I am wondering if its a good idea to perform periodic repairs on all nodes using nodetool repair -pr (We are at 2.2 and I think incremental repair is default in 2.2).
I think its a good idea because if performed periodically it will take less time to repair as it only needs to repair non repaired SStables. We also might've had instances where the nodes may've been down for more than the hinted handoff window and we probably didn't do anything about it.
Yes, its good practice to run scheduled incremental repair. Run repair frequently enough that every node is repaired before reaching the time specified in the gc_grace_seconds setting.
Also, it would be great if you run incremental repair on a frequent basis, combined with full repair less frequently like once per month/week. incremental repair would repair the SSTable which was not marked as repaired before, but full repair could take care more comprehensive case like SSTable rotting. check the reference from datastax:https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesWhen.html

What does Cassandra nodetool repair exactly do?

From http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
but how does it fix the inconsistencies? It's written it uses Merkle trees - but that's for comparison not for fixing 'broken' data.
How the data can be 'broken'? Any common cases despite hard drive failure?
Question aside: it's compaction which evicts tombstones, right? So the requirement for running nodetool repair more frequently than gc_grace seconds is only to ensure that all data is spread to appropriate replicas? Shouldn't be that the usual scenario?
The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
The reason you need to run repair within gc grace seconds is so that tombstones will be sync'd to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.

What options are there to speed up a full repair in Cassandra?

I have a Cassandra datacenter which I'd like to run a full repair on. The datacenter is used for analytics/batch processing and I'm willing to sacrifice latencies to speed up a full repair (nodetool repair). Writes to the datacenter is moderate.
What are my options to make the full repair faster? Some ideas:
Increase streamthroughput?
I guess I could disable autocompation and decrase compactionthroughput temporarily. Not sure I'd want to that, though...
Additional information:
I'm running SSDs but haven't spent any time adjusting cassandra.yaml for this.
Full repairs are run sequentially by default. The state and differences of the nodes' datasets are stored in binary trees. Recreating these is the main factor here. According to this datastax blog entry, "Every time a repair is carried out, the tree has to be calculated, each node that is involved in the repair has to construct its merkle tree from all the sstables it stores making the calculation very expensive."
The only way I see to significantly increase the speed of a full repair is to run it in parallel or repair subrange by subrange. Your tag implies that you run Cassandra 2.0.
1) Parallel full repair
nodetool repair -par, or --parallel, means carry out a parallel repair.
According to the nodetool documentation for Cassandra 2.0
Unlike sequential repair (described above), parallel repair constructs the Merkle tables for all nodes at the same time. Therefore, no snapshots are required (or generated). Use a parallel repair to complete the repair quickly or when you have operational downtime that allows the resources to be completely consumed during the repair.
2) Subrange repair
nodetool accepts start and end token parameters like so
nodetool repair -st (start token) -et (end token) $keyspace $columnfamily
For simplicity sake, check out this python script that calculates tokens for you and executes the range repairs:
https://github.com/BrianGallew/cassandra_range_repair
Let me point out two alternative options:
A) Jeff Jirsa pointed to incremental repairs.
These are available starting with Cassandra 2.1. You will need to perform certain migration steps before you can use nodetool like this:
nodetool repair -inc, or --incremental means do an incremental repair.
B) OpsCenter Repair Service
For the couple of clusters at my company itembase.com, we use the repair service in DataStax OpsCenter which is executing and managing small range repairs as a service.

Resources