What does Cassandra nodetool repair exactly do? - cassandra

From http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
but how does it fix the inconsistencies? It's written it uses Merkle trees - but that's for comparison not for fixing 'broken' data.
How the data can be 'broken'? Any common cases despite hard drive failure?
Question aside: it's compaction which evicts tombstones, right? So the requirement for running nodetool repair more frequently than gc_grace seconds is only to ensure that all data is spread to appropriate replicas? Shouldn't be that the usual scenario?

The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
The reason you need to run repair within gc grace seconds is so that tombstones will be sync'd to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.

Related

Guideline regarding nodetool repair in apache cassandra v3.0.9

We are using Apache-Cassandra v3.0.9 and have 3 DC. We are experiencing continuous troubles while running nodetool repair and most of the time the repair process causes big outages. We have 3 different datacenters consisting of 4, 4 & 15 nodes. The total data is around 200 GB at RF=3 and we are using LCS. The RAM is 16 GB, out of which 6 GB is dedicated as heap. Most of the times we try to run full repair the repair process fails with long GC pauses and node becoming unresponsive. Other than at the time of repair our nodes are good on heap and GC pauses are hardly 300 ms. I have following doubts.
Is it still required to run full repair before gc_grace_seconds or just the incremental repairs are good enough in apache cassandra v3.0.9
Do I need to run incremental sequential repairs on every node of the cluster, any one node of each of the datacenters or just any node of the whole cluster? One-by-one or concurrently?
What are the downsides of repair failing because some nodes became unresponsive/died during the repair process, any steps to take care of before starting another repair session.
What are the downsides of not scheduling repairs at all?
We started our cassandra deployment straight away on version 3.0.9. Is the migration as mentioned on Apache Cassandra documentation still required?
full repair is still needed. incremental repair would split SSTables into "repaired" and "unrepaired" parts and the "repaired" part will not be repaired later and that's why incremental repair is more efficient. However, if there're data corruption in the "repaired" sstables, only full repair can fix that; our experience is to run incremental repair every day and full repair only once per grace period. Also, when you have incremental repair, you can make the grace period longer.
better run incremental repair on each node one by one; you can have a cron job or code a simple scheduler to do that;
repair failure, just run it again; no side effect I know of.
if you don't do repair, as time goes, you data consistency is at danger; Cassandra takes the eventual consistency concept, which means that it doesn't guarantee strong consistency when you write data to it unless you explicitly specify that. repair is very important to guarantee the data in the background are all kept up to date and consistent;
if you run full repair already in your cluster, you shouldn't need to migrate explicitly.

Cassandra Compaction vs Repair vs Cleanup

After posting a question and reading this and that articles, I still do not understand the relations between those three operations-
Cassandra compaction tasks
nodetool repair
nodetool cleanup
Is repair task can be processed while compaction task is running, or cleanup while compaction task is running? Is cleanup is a operation that need to be executed weekly as repair? Why repair operation need to be executed manually and it is not in Cassandra default behavior?
What is the ground rules for healthy cluster maintenance?
A cleanup is a compaction that just removes things outside the nodes token range(s). A repair has a "Validation Compaction" to build a merkle tree to compare with the other nodes, so part of nodetool repair will have a compaction.
Is repair task can be processed while compaction task is running, or cleanup while compaction task is running?
There is a shared pool of for the compactions across normal compactions, repairs, cleanups, scrubs etc. This is the concurrent_compactors setting in the cassandra.yaml that defaults to a combination of the number of cores and data directories: https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L572
Is cleanup is a operation that need to be executed weekly as repair?
no, only after topology changes really.
Why repair operation need to be executed manually and it is not in Cassandra default behavior?
Its manual because its requirements can differ a lot on what your data and gc_grace requirements are. https://issues.apache.org/jira/browse/CASSANDRA-10070 is bringing it inside Cassandra though so in the future it will be automatic.
What is the ground rules for healthy cluster maintenance?
I would (opinion) say:
Regular backups (depending on requirements, and acceptable data loss
this can be anything from weekly/daily to constantly with incremental).
This is just as much for "internal" mistakes ("Opps i deleted a customer") as outages. Even with strong multi-dc replication you want some minimum backups.
Making sure a Repair completes for all tables that have deletes at least once within the gc_grace time of those tables.
Metric and log storage pretty important if you want to be able to debug issues.

Datastax Cassandra Remove and cleanup one column family

After some IT cleanup, we are noticing that we should probably do a full cleanup / restore for one column family. We believe that Cassandra has duplicate data that it is not cleaning up. Is it possible to clear out and just have Cassandra rebuild a single column family from scratch or a snapshot?
During an upgrade some of the nodes decided to rejoin the cluster, rather than just restarting. During that process nodetool netstats showed that nodes where transferring new data file into the original nodes. The cluster is stable, but the disk usage grew substantially. I am thinking that we will migrate to a new ring, but in the mean time I would like to see if I can reduce some disk usage. The ring is stable, and repairs are looking fine.
If we are able to cleanup one cf it would relieve disk space usage a ton.
nodetool cleanup is not reducing the size of the sstables.
If we have a new node join the cluster it is using approximately 50% of the disk space as the other nodes.
We could do the dance of nodetool decommision && nodetool join, but that is not going to be fun :)
We have validated that the data in the ring is consistent, and repairs show that the data is consistent across the ring.
Adding a new node and successfully running repair means the data for the partition range(s) that has(have) been assigned to that node has been streamed to the new node.
If, after this has happened, you run nodetool cleanup, any data from the other nodes that is no longer needed is cleaned up.
If you still see that some of your nodes have more data than others, this may be because you have some wider rows in some of your partitions, or because your nodes are unbalanced. There should not be any data duplication scenario (if you can prove this then it would be jira worthy).
You can run rebalance in OpsCenter or manually re-assign your tokens if you are looking to spread out the data more evenly across your nodes (or design your data model to avoid the aforementioned wide rows).
Use nodetool compact to clean up all the tombstones and compacts all the updated records into single record.
{nodetool compact}

nodetool repair across replicas of data center

Just want to understand the performance of 'nodetool repair' in a multi data center setup with Cassandra 2.
We are planning to have keyspaces with 2-4 replicas in each data center. We may have several tens of data centers. Writes are done with LOCAL_QUORUM/EACH_QUORUM consistency depending on the situation and reads are usually done with LOCAL_QUORUM consistency. Questions:
Does nodetool repair complexity grow linearly with number of replicas across all data centers?
Or does nodetool repair complexity grow linearly with a combination of number of replicas in the current data center, and number of data centers? Vaguely, this model could possibly sync data with each of the individual nodes in current data center, but at EACH_QUORUM-like operation against replicas in other data centers.
To scale the cluster, is it better to add more nodes in an existing data center or add a new data center assuming constant number of replicas as a whole? I ask this question in the context of nodetool repair performance.
To understand how nodetool repair affects the cluster or how the cluster size affects repair, we need to understand what happens during repair. There are two phases to repair, the first of which is building a Merkle tree of the data. The second is having the replicas actually compare the differences between their trees and then streaming them to each other as needed.
This first phase can be intensive on disk io since it will touch almost all data on the disk on the node on which you run the repair. One simple way to avoid repair touching the full disk is to use the -pr flag. When using -pr, it will disksize/RF instead of disksize data that repair has to touch. Running repair on a node also sends a message to all nodes that store replicas of any of these ranges to build merkle trees as well. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data.
The factor which determines how the repair operation affects other data centers is the use of the replica placement strategy. Since you are going to need consistency across data centers (EACH_QOURUM cases) it is imperative that you use a cross-dc replication strategy like the Network Topology strategy in your case. For repair this will mean that you cannot limit yourself to local dc while running the repair since you have some EACH_QUORUM consistency cases. To avoid a repair affecting all replicas in all data centers, you should a) Wrap your replication strategy using Dynamic snitch and configure the badness threshold properly b) Use -snapshot option while running the repair.
What this will do is take a snapshot of your data (snapshots are just hardlinks to existing sstables, exploiting the fact that sstables are immutable, thus making snapshots extremely cheap) and sequentially repair from the snapshot. This means that for any given replica set, only one replica at a time will be performing the validation compaction, allowing the dynamic snitch to maintain performance for your application via the other replicas.
Now we can answer the questions you have.
Does nodetool repair complexity grow linearly with number of replicas across all data centers?
You can limit this by wrapping your replication strategy with Dynamic snitch and pass -snapshot option during repair.
Or does nodetool repair complexity grow linearly with a combination of number of replicas in the current data center, and number of data centers? Vaguely, this model could possibly sync data with each of the individual nodes in current data center, but at EACH_QUORUM-like operation against replicas in other data centers.
The complexity will grow in terms of running time with the number of replicas if you use the approach above. This is because the above approach will do a sequential repair on one replica at a time.
To scale the cluster, is it better to add more nodes in an existing data center or add a new data center assuming constant number of replicas as a whole? I ask this question in the context of nodetool repair performance.
From nodetool repair perspective IMO, this does not make any difference if you take the above approach. Since it depends on the overall number of replicas.
Also, the goal of repair using nodetool is so that deletes do not come back. The hard requirement for routine repair frequency is the value of gc_grace_seconds. In systems that seldom delete or overwrite data, you can raise the value of gc_grace with minimal impact to disk space. This allows wider intervals for scheduling repair operations with the nodetool utility. One of the recommended ways to avoid frequent repairs is to have immutability of records by design. This may be important to you since you need to run on a tens of data centers and ops will otherwise already be painful.

Cassandra node - rebuild v.s. repair

What is the difference between:
a) nodetool rebuild
b) nodetool repair [-pr]
In other words, what exactly do the respective commands do?
nodetool rebuild: is similar to the bootstrapping process (when you add a new node to the cluster) but for a datacenter. The process here is mainly a streaming from the already live nodes to the new nodes (the new ones are empty). So after defining the key ranges for the nodes which is very fast, the rest can be seen as a copy operation.
nodetool repair -pr: is not a copy operation, the node being repaired is not empty, it already contains data but if the replication factor is greater than 1 that data needs to be compared to the data on the rest of the replicas and if there is a difference it will be corrected. The process involves a lot of streaming but it is not data streaming: the node being repaired requests a merkle tree (basically a tree of hashes) in order to verify if the information both nodes have is the same or not, if not it requests a full stream of the section of the data that has any difference (so all the replicas have the same data). Streaming this hashes if faster than streaming the whole data before verification, this works under the assumption that most data will be the same on both nodes except for some differences here and there. This process also removes tombstones created when deleting from the database, defining like a new "checkpoint" after which new tombstones will be created upon deletion of data, but the old ones will not be used anymore.
Hope it helps!

Resources