nodetool repair across replicas of data center - cassandra

Just want to understand the performance of 'nodetool repair' in a multi-data-center setup with Cassandra 2.
We are planning to have keyspaces with 2-4 replicas in each data center. We may have several tens of data centers. Writes are done with LOCAL_QUORUM/EACH_QUORUM consistency depending on the situation and reads are usually done with LOCAL_QUORUM consistency. Questions:
Does nodetool repair complexity grow linearly with number of replicas across all data centers?
Or does nodetool repair complexity grow linearly with a combination of the number of replicas in the current data center and the number of data centers? Vaguely, this model could possibly sync data with each of the individual nodes in the current data center, but perform an EACH_QUORUM-like operation against replicas in other data centers.
To scale the cluster, is it better to add more nodes in an existing data center or add a new data center assuming constant number of replicas as a whole? I ask this question in the context of nodetool repair performance.

To understand how nodetool repair affects the cluster or how the cluster size affects repair, we need to understand what happens during repair. There are two phases to repair, the first of which is building a Merkle tree of the data. The second is having the replicas actually compare the differences between their trees and then streaming them to each other as needed.
This first phase can be intensive on disk I/O, since it touches almost all of the data on the node on which you run the repair. One simple way to stop repair from touching the full disk is to use the -pr flag: with -pr, repair has to touch only disksize/RF of data instead of the full disksize. Running repair on a node also sends a message to all nodes that store replicas of any of its ranges to build Merkle trees as well. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data.
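As a concrete sketch (the keyspace name is illustrative): -pr repairs only the node's primary ranges, so it must be run on every node in turn to cover the whole ring.
nodetool repair -pr my_keyspace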
The factor that determines how the repair operation affects other data centers is the replica placement strategy. Since you need consistency across data centers (the EACH_QUORUM cases), it is imperative that you use a cross-DC replication strategy such as NetworkTopologyStrategy. For repair this means that you cannot limit yourself to the local DC while running the repair, since you have some EACH_QUORUM consistency cases. To avoid a repair affecting all replicas in all data centers, you should a) wrap your replication strategy using the dynamic snitch and configure the badness threshold properly, and b) use the -snapshot option while running the repair.
What this will do is take a snapshot of your data (snapshots are just hardlinks to existing sstables, exploiting the fact that sstables are immutable, thus making snapshots extremely cheap) and sequentially repair from the snapshot. This means that for any given replica set, only one replica at a time will be performing the validation compaction, allowing the dynamic snitch to maintain performance for your application via the other replicas.
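A sketch of the two pieces (keyspace name illustrative; 0.1 is the commonly cited default for the threshold):
# in cassandra.yaml
dynamic_snitch_badness_threshold: 0.1
# then run the repair against a snapshot
nodetool repair -snapshot my_keyspace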
Now we can answer the questions you have.
Does nodetool repair complexity grow linearly with number of replicas across all data centers?
You can limit this by wrapping your replication strategy with Dynamic snitch and pass -snapshot option during repair.
Or does nodetool repair complexity grow linearly with a combination of the number of replicas in the current data center and the number of data centers? Vaguely, this model could possibly sync data with each of the individual nodes in the current data center, but perform an EACH_QUORUM-like operation against replicas in other data centers.
The complexity will grow in terms of running time with the number of replicas if you use the approach above. This is because the above approach will do a sequential repair on one replica at a time.
To scale the cluster, is it better to add more nodes in an existing data center or add a new data center assuming constant number of replicas as a whole? I ask this question in the context of nodetool repair performance.
From a nodetool repair perspective, IMO this does not make any difference if you take the above approach, since the running time depends on the overall number of replicas.
Also, the goal of repair using nodetool is to make sure deletes do not come back. The hard requirement for routine repair frequency is the value of gc_grace_seconds. In systems that seldom delete or overwrite data, you can raise the value of gc_grace with minimal impact on disk space. This allows wider intervals for scheduling repair operations with the nodetool utility. One of the recommended ways to avoid frequent repairs is to make records immutable by design. This may be important to you, since you need to run tens of data centers and ops will otherwise already be painful.
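For instance, raising gc_grace_seconds on a rarely-deleted table is a one-line schema change in CQL (table name illustrative; 1728000 seconds is 20 days, double the 864000-second default):
ALTER TABLE my_keyspace.events WITH gc_grace_seconds = 1728000;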

Related

why full replication Cassandra cluster have node data size difference

I have a 3-node Cassandra cluster (version 3.11.11) with replication factor 3. Only 2 of the nodes receive requests, and Node3 only syncs with the other 2 nodes.
In theory, each node should hold the same amount of data. But in practice, I end up with nodes of different data sizes.
We run nodetool repair daily; operations like compaction are done automatically with default settings.
What can be the reason for the size difference?
It ultimately comes down to how the data gets compacted in the long run. Compaction is a local process, and how the SSTables stack up on each node cannot be guaranteed, so I don't see any aberration here. In theory all nodes hold the same data logically, but physically it may vary. For example, node3 may have old SSTables that are not getting compacted due to their size (if using STCS), while on the other nodes those SSTables have been compacted, reducing the size on those nodes.
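To check whether this is just compaction skew rather than missing data, you can compare the reported load and per-table SSTable counts on each node (keyspace and table names illustrative):
nodetool status my_keyspace
nodetool tablestats my_keyspace.my_table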

How to force Cassandra not to use the same node for replication in a schema with vnodes

Installing Cassandra on a single node to run some tests, we noticed that we were using an RF of 3 and everything was working correctly.
This is of course because that node has 256 vnodes (by default), so the same data can be replicated on the same node in different vnodes.
This is worrying, because if that one node were to fail, you'd lose all your data even though you thought the data was replicated on different nodes.
How can I be sure that in a standard installation (with a ring of several nodes) the same data will not be replicated on the same "physical" node? Is there a setting to prevent Cassandra from using the same node for replicating data?
Replication strategy is schema dependent. You probably used the SimpleStrategy with RF=3 in your schema. That means that each piece of data will be placed on the node determined by the partition key, and successive replicas will be placed on the successive nodes. In your case, the successive node is the same physical node, hence you get 3 copies of your data there.
Increasing the number of nodes solves your problem. In general, your data will be placed on different physical nodes when your replication factor RF is less than or equal to your number of nodes N.
The other solution is to switch replication strategy and use the NetworkTopologyStrategy, usually used in multi-datacenter clusters, where you can specify how many replicas you want in each data center. This strategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
Look at DataStax documentation for more information.
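A minimal sketch of such a keyspace definition (the data center names must match what your snitch reports; DC1/DC2 here are illustrative):
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};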
Without vnodes, each physical node owns a single contiguous token range. With vnodes, each physical node owns multiple non-consecutive token ranges (each such range being a vnode), and furthermore vnodes are randomly assigned to physical nodes.
Which means that even when data gets replicated on the vnodes right next to the primary replica's vnode (i.e. when using SimpleStrategy), the replicas will - with high probability, but not guaranteed - be on different physical nodes.
This random assignment can be seen in the output of nodetool ring.
More info can be found here.
Cassandra stores replicas on different nodes; it would be nonsensical to keep multiple replicas of the same data on one node. If the replication factor exceeds the number of nodes, then the number of nodes effectively becomes your replication factor.
But, why is this not an error? Well, this allows for provisioning more nodes later.
As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
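A sketch of that sequence, assuming SimpleStrategy and an illustrative keyspace name; after raising the factor, run repair on each node so the existing data is copied to its new replicas:
ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
nodetool repair my_keyspace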

What does Cassandra nodetool repair exactly do?

From http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
but how does it fix the inconsistencies? It's written that it uses Merkle trees - but those are for comparison, not for fixing 'broken' data.
How can the data get 'broken'? Any common cases besides hard drive failure?
Question aside: it's compaction that evicts tombstones, right? So is the requirement to run nodetool repair more frequently than gc_grace_seconds only there to ensure that all data has been spread to the appropriate replicas? Shouldn't that be the usual scenario anyway?
The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
The reason you need to run repair within gc_grace_seconds is so that tombstones are synced to all nodes. If a node is missing a tombstone, it won't drop that data during compaction. The nodes that have the tombstone will drop the data during compaction, and when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.
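In practice this means scheduling repair on every node at an interval comfortably shorter than gc_grace_seconds (which defaults to 864000 seconds, i.e. 10 days). A hedged crontab sketch, running a primary-range repair weekly at 02:00 (keyspace name illustrative):
0 2 * * 0 nodetool repair -pr my_keyspace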

What options are there to speed up a full repair in Cassandra?

I have a Cassandra datacenter which I'd like to run a full repair on. The datacenter is used for analytics/batch processing, and I'm willing to sacrifice latencies to speed up a full repair (nodetool repair). Writes to the datacenter are moderate.
What are my options to make the full repair faster? Some ideas:
Increase streamthroughput?
I guess I could disable autocompaction and decrease compactionthroughput temporarily. Not sure I'd want to do that, though...
Additional information:
I'm running SSDs but haven't spent any time adjusting cassandra.yaml for this.
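For reference, both knobs mentioned above can be changed at runtime through nodetool, no restart needed (the values are illustrative; stream throughput follows cassandra.yaml's stream_throughput_outbound_megabits_per_sec, compaction throughput is in MB/s, and 0 disables the cap):
nodetool setstreamthroughput 400
nodetool setcompactionthroughput 0
nodetool disableautocompaction my_keyspace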
Full repairs run sequentially by default. The state and differences of the nodes' datasets are captured in Merkle trees (binary hash trees), and recreating these is the main cost factor here. According to this DataStax blog entry, "Every time a repair is carried out, the tree has to be calculated, each node that is involved in the repair has to construct its merkle tree from all the sstables it stores making the calculation very expensive."
The only way I see to significantly increase the speed of a full repair is to run it in parallel or repair subrange by subrange. Your tag implies that you run Cassandra 2.0.
1) Parallel full repair
nodetool repair -par, or --parallel, means carry out a parallel repair.
According to the nodetool documentation for Cassandra 2.0
Unlike sequential repair (described above), parallel repair constructs the Merkle tables for all nodes at the same time. Therefore, no snapshots are required (or generated). Use a parallel repair to complete the repair quickly or when you have operational downtime that allows the resources to be completely consumed during the repair.
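A minimal invocation (keyspace name illustrative):
nodetool repair -par my_keyspace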
2) Subrange repair
nodetool accepts start and end token parameters like so
nodetool repair -st (start token) -et (end token) $keyspace $columnfamily
For simplicity's sake, check out this Python script that calculates the tokens for you and executes the range repairs:
https://github.com/BrianGallew/cassandra_range_repair
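If you would rather not pull in the script, the core idea fits in a few lines of Python: split the Murmur3 token space into equal chunks and run one bounded repair per chunk. This is only a sketch under that assumption (keyspace/table names and chunk count are illustrative, and it ignores the refinement of aligning chunks to the ranges each node actually owns):
# Split the Murmur3 token space into N subranges and run one
# bounded repair per subrange, serially, to limit cluster load.
import subprocess

MIN_TOKEN = -2**63        # Murmur3Partitioner lower bound
MAX_TOKEN = 2**63 - 1     # Murmur3Partitioner upper bound
CHUNKS = 100              # illustrative; more chunks = smaller Merkle trees

step = (MAX_TOKEN - MIN_TOKEN) // CHUNKS
for i in range(CHUNKS):
    start = MIN_TOKEN + i * step
    end = MAX_TOKEN if i == CHUNKS - 1 else start + step
    subprocess.check_call([
        "nodetool", "repair",
        "-st", str(start), "-et", str(end),
        "my_keyspace", "my_columnfamily",  # illustrative names
    ])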
Let me point out two alternative options:
A) Jeff Jirsa pointed to incremental repairs.
These are available starting with Cassandra 2.1. You will need to perform certain migration steps before you can use nodetool like this:
nodetool repair -inc, or --incremental means do an incremental repair.
B) OpsCenter Repair Service
For the couple of clusters at my company itembase.com, we use the Repair Service in DataStax OpsCenter, which executes and manages small range repairs as a service.

Datastax Cassandra Remove and cleanup one column family

After some IT cleanup, we are noticing that we should probably do a full cleanup / restore for one column family. We believe that Cassandra has duplicate data that it is not cleaning up. Is it possible to clear out and just have Cassandra rebuild a single column family from scratch or a snapshot?
During an upgrade, some of the nodes decided to rejoin the cluster rather than just restart. During that process, nodetool netstats showed that nodes were transferring new data files onto the original nodes. The cluster is stable, but disk usage grew substantially. I am thinking that we will migrate to a new ring, but in the meantime I would like to see if I can reduce some disk usage. The ring is stable, and repairs look fine.
If we are able to clean up one cf, it would relieve a ton of disk space usage.
nodetool cleanup is not reducing the size of the sstables.
If we have a new node join the cluster it is using approximately 50% of the disk space as the other nodes.
We could do the dance of nodetool decommission && nodetool join, but that is not going to be fun :)
We have validated that the data in the ring is consistent, and repairs show that the data is consistent across the ring.
Adding a new node and successfully running repair means that the data for the partition range(s) assigned to that node has been streamed to it.
If, after this has happened, you run nodetool cleanup, any data from the other nodes that is no longer needed is cleaned up.
If you still see that some of your nodes have more data than others, this may be because you have some wider rows in some of your partitions, or because your nodes are unbalanced. There should not be any data duplication scenario (if you can prove one, it would be JIRA-worthy).
You can run rebalance in OpsCenter or manually re-assign your tokens if you are looking to spread out the data more evenly across your nodes (or design your data model to avoid the aforementioned wide rows).
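Cleanup can be targeted at a single keyspace (or a specific table within it) to drop the data a node no longer owns; an illustrative invocation:
nodetool cleanup my_keyspace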
Use nodetool compact to purge tombstones and compact all versions of each updated record down to a single record:
nodetool compact <keyspace> <table>
