How do I enable incremental repair on Cassandra 2.1.13? - cassandra

I want to enable incremental repairs on Cassandra 2.1.13?. This mailing list post states
Yes, it should now be safe to just run a repair with -inc -par to migrate
to incremental repairs
. However, Datastax documentation states a six step checklist is required.
Is it fine to simply do nodetool repair -inc -par? Are there any risks? The checklist worries me since a full repair can take a really long time and I risk getting timeouts if I have many small sstables due to compaction being disabled.

Related

Is there a way to speed up cassandra nodeltool repair ?

I've 10 nodes of Cassandra Cluster and currently installed version is 3.0.13.
How I launched : nodetool repair -j 4 -pr
Would like to know if there are some configuration options to speed up this process, I still see "Anticompaction after repair" is in progress when i check for compactionstats.
The current state of the art way of doing repairs are subrange repairs running all the time. See http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html for some explanations:
While the idea behind incremental repair is brilliant, the implementation still has flaws that can cause severe damage to a production cluster, especially when using LCS and DTCS. The improvements and fixes planned for 4.0 will need to be thoroughly tested to prove they fixed incremental repair and allow it to be safely used as a daily routine.
That beeing said (or quoted), have a look at http://cassandra-reaper.io/ - a simple and easy tool managing your repairs.

nodetool repair taking a long time to complete

I am currently running Cassandra 3.0.9 in a 18 node config. We loaded quite a bit of data and now are running repairs against each node. My nodetool command is scripted to look like:
nodetool repair -j 4 -local -full
Using nodetool tpstats I see the 4 threads for repair but they are repairing very slowly. I have 1000's of repairs that are going to take weeks at this rate. The system log has repair items but also "Redistributing index summaries" listed as well. Is this what is causing my slowness? Is there a faster way to do this?
Repair can take a very long time, sometime days, sometime weeks. You might improve things with the following:
Run primary partition range repair (-pr) This will repair only the primary partition range of each node, which overall, will be faster (you still need to run a repair on each node, one at a time).
Using -j is not necessarily a big winner. Sure, you will repair multiple tables at a time, but you put much more load on your cluster, which can damage your latency.
You might want to prioritize repairing the keyspaces / tables that are most critical to your application.
Make sure you keep your node density reasonable. 1 to 2TB per node.
Focus repairing in priority the nodes that went down for more than 3 hours (assuming max_hint_window_in_ms is set to it's default value)
Focus repairing in priority the tables for which you create tombstones (DELETE statements)

Restoring cassandra from snapshot

So I did something of a test run/disaster recovery practice deleting a table and restoring in Cassandra via snapshot on a test cluster I have built.
This test cluster has four nodes, and I used the node restart method so after truncating the table in question, all nodes were shutdown, commitlog directories cleared, and the current snapshot data copied back into the table directory for each node. Afterwards, I brought each node back up. Then following the documentation I ran a repair on each node, followed by a refresh on each node.
My question is, why is it necessary for me to run a repair on each node afterwards assuming none of the nodes were down except when I shut them down to perform the restore procedure? (in this test instance it was a small amount of data and took very little time to repair, if this happened in our production environment the repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario).
And I assume running the repair would be completely unnecessary on a single node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal by the way) is that it is an expensive operation -- io and CPU intensive -- to generate merkel trees for all your data, compare them with merkel trees from other nodes, and stream any missing / outdated data.
Why run a repair after a restoring from snapshots
Repair gives you a consistency baseline. For Example: If the snapshots weren't taken at the exact same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE
issue for us in a disaster scenario).
While your repair is running, you'll have some risk of reading stale data if your snapshots don't have the same exact data. If they are old snapshots, gc_grace may have already passed for some tombstones giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The coloquial definition of the term repair seems to imply that your system is broken. We think "I have to run a repair? I must have done something wrong to get to this un-repaired state!" This is simply not true. Repair is a normal maintenance operation with Cassandra. In fact, you should be running repair at least every gc_grace seconds to ensure data consistency and avoid zombie data (or use the opscenter repair service).
In my opinion, we should have called it AntiEntropyMaintenence or CassandraOilChange or something rather than Repair : )

Is nodetool repair necessary if Cassandra is configured never to perform gc and all reads and writes are quorum?

This is a two-part question regarding nodetool repair and garbage collection.
Let's consider a replication factor of 3 for all tables, and suppose reads and writes require two confirmations of success to succeed. Based on my understanding of Cassandra, a successful write or delete would never be in danger of being missed as long as a read requires at least two responses, accepting only only the latest timestamp. This makes sense to me, but is it correct?
As a closely related question, if I configure Cassandra never to perform GC, but still perform nodetool repair periodically, will this suffice to garbage-collect old tombstones? Intuitively, a successfully repaired key range should not need to keep tombstones, so they could in theory be discarded when a repair is performed. Is this the case?
If my above two hypotheses are correct, it seems like we can achieve the following:
Consistent reads and writes with no resurrected data (due to quorum reads and writes and avoiding GC completely)
No unbounded growth in stale tombstones (due to periodically running nodetool repair, which hopefully performs GC if my above hypothesis is correct)
This post explains that quorum doesn't guarantee consistency: Read Operation in Cassandra at Consistency level of Quorum?
Assuming "GC" means compaction, I don't think nodetool repair will suffice to delete tombstones or take care of other compaction tasks. https://issues.apache.org/jira/browse/CASSANDRA-6602 describes a compaction-less scenario that sounds like what you're considering. If this is what you're doing, the recommended solution is to use DateTieredCompactionStrategy (DTCS) to store data written within a certain period of time in the same SSTable. DTCS was released in Cassandra 2.1.1 today and is described here: http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/tabProp.html?scroll=tabProp__moreCompaction

Clarifications about nodetool repair -pr

From the documentation:
Using the nodetool repair -pr (–partitioner-range) option repairs only the primary range for that node, the other replicas for that range still have to perform the Merkle tree calculation, causing a validation compaction. Because all the replicas are compacting at the same time, all the nodes may be slow to respond for that portion of the data.
There is probably never a time where I can accept all nodes to be slow for a certain portion of the data. But I wonder: Why does it do that (or is there maybe just a mixup with the "-par" option in the documentation?!), when nodetool repair seems to be smarter:
By default, the repair command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots. For example, if you have RF=3 and A, B and C represents three replicas, this command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots (A<->B, A<->C, B<->C) instead of repairing A, B, and C all at once. This allows the dynamic snitch to maintain performance for your application via the other replicas, because at least one replica in the snapshot is not undergoing repair.
However, the datastax blog addresses this issue:
This first phase can be intensive on disk io, however. You can mitigate this to some degree with compaction throttling (since this phase is what we call a validation compaction.) Sometimes that isn’t enough though, and some people try to mitigate this further by using the -pr (–partitioner-range) option to nodetool repair, which repairs only the primary range for that node. Unfortunately, the other replicas for that range will still have to perform the Merkle tree calculation, causing a validation compaction. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data. Fortunately, there is way around this by using the -snapshot option.
That could be nice, but actually, there is no -snapshot option for nodetool repair (see the manpage, or the documentation) (has this option been removed?!)
So overall,
I cannot use nodetool repair -pr, it seems, because I always need at least to keep the system responsive enough to read/write with consistency ONE, without significant delay. (Note: We have only one data center.) Or am I missing/misunderstanding something?
Why is nodetool repair smart, keeping one node responsive, while nodetool repair -pr makes all nodes slow for a portion of data?
Where is the -snapshot option: Has it been removed, never implemented, or does it now maybe automatically work like that, also when using nodetool repair -pr?
The blog below addresses these issues:
http://www.datastax.com/dev/blog/repair-in-cassandra
A simple nodetool repair will not only kick off a repair on the node itself but also all the nodes that hold replicas if its ranges. While this is ok, it is very expensive and typically not an operation you'll carry out on a busy production system during peak times.
Consequently nodetool repair -pr will carry out a repair of the primary ranges on that node. You will need to run this on every node of the cluster as the blog says. Customers with large production systems will typically use this in a rolling fashion across their cluster.
On another note Datastax OpsCenter offers the repair service which runs smaller sub-range repairs all the time so although you're always repairing its going on in the background all the time at a lower resource level.
As for the snapshots, running a regular repair will invoke a snapshot as you stated, you can also invoke a snapshot yourself using nodetool snapshot
Hope this helps!

Resources