Bringing back a dead data center: repair or rebuild - Cassandra

I had a Cassandra cluster running across two data centers; for some reason, one data center was taken down for a while, and now I'm planning to bring it back. I'm considering two approaches:
One is to start up all the Cassandra nodes in this data center and run "nodetool repair" on each node, one by one. But it looks like repair takes a long time. I once repaired 6 GB of data on a node in a 3-node cluster, and it took 5 hours for that one node. I have much more data on the cluster now and can't imagine how long it would take.
The other is to run a rebuild instead of a repair: delete all the old data in this data center and rebuild it as if I were adding a new data center. But I'm not sure whether that works or how it would perform.
Any ideas? Any suggestion would be appreciated. Thanks in advance.

If the data center was down for more than 10 days, then rebuild is the only option. This has to do with tombstones. I am not 100% sure how this plays out across data centers, but if a server has been down for more than 10 days, any data deleted on the live servers in the meantime has been tombstoned, kept for 10 days, and then removed completely. If your downed server suddenly wakes up holding the deleted data without those tombstones, that data will be repopulated back into the ring via read repair or a regular repair operation.
Another thing to consider is how much data has changed or been deleted since the data center went down. If a lot, then it is obviously less work to rebuild. If not, then repair may be faster.
You can create another data center, add nodes to it with auto_bootstrap: false, and then run nodetool rebuild <live dc name>.
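A minimal sketch of that rebuild sequence, assuming the default data directories and a live data center named DC1 (both are placeholders for your own layout):
# on each node of the data center being rebuilt, with Cassandra stopped:
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# in cassandra.yaml, make sure the node will not bootstrap on startup:
#   auto_bootstrap: false
# start Cassandra, wait for the node to join, then stream everything from the live data center:
nodetool rebuild DC1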
Good luck!

Related

nodetool repair taking a long time to complete

I am currently running Cassandra 3.0.9 in an 18-node configuration. We loaded quite a bit of data and are now running repairs against each node. My nodetool command is scripted to look like:
nodetool repair -j 4 -local -full
Using nodetool tpstats I see the 4 threads for repair, but they are repairing very slowly. I have thousands of repairs that are going to take weeks at this rate. The system log has repair entries, but also "Redistributing index summaries". Is this what is causing my slowness? Is there a faster way to do this?
Repair can take a very long time, sometimes days, sometimes weeks. You might improve things with the following:
Run a primary partition range repair (-pr). This repairs only the primary partition range of each node, which overall will be faster (you still need to run the repair on each node, one at a time); see the sketch at the end of this list.
Using -j is not necessarily a big win. Sure, you will repair multiple tables at a time, but you put much more load on your cluster, which can hurt your latency.
You might want to prioritize repairing the keyspaces / tables that are most critical to your application.
Make sure you keep your node density reasonable: 1 to 2 TB per node.
Prioritize repairing the nodes that were down for more than 3 hours (assuming max_hint_window_in_ms is set to its default value).
Prioritize repairing the tables for which you create tombstones (DELETE statements).
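A rough sketch of the -pr approach from the first point, run from an admin host; the hostnames and keyspace are placeholders, and -local/-full are carried over from the command above (drop -local if your version refuses to combine it with -pr):
for host in cass-node-01 cass-node-02 cass-node-03; do
    # one node at a time, so only one node is building Merkle trees for its primary range
    ssh "$host" nodetool repair -pr -local -full my_keyspace
done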

Cassandra 'nodetool repair -pr' taking way too much time

I am running a cluster with 1 data center (10 nodes) and Cassandra 2.1.7 installed on each node. We are using SimpleStrategy (an old mistake).
The situation is, I have not run any nodetool repair since the beginning, and now there is about 200 GB of data with RF 3.
Running a full repair or an incremental repair amounts to the same thing at this point, so I tried to run a full repair. But this resulted in the coordinator node going down.
So I ended up running primary range repairs (nodetool repair -pr) on each node, one at a time. But this is taking way too much time (15+ hours per node, hence weeks for all nodes).
Am I doing this wrong, or is this supposed to happen? Or is this a version problem?
In the future, if I run a full repair again after finishing this one, will it take weeks as well?
Since a full repair is mainly affected by data size, it should take about the same amount of time.
I suggest moving to incremental repairs; this should save you time and resources.
Here's an article about how to do this in 2.1:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
If your data size is too big, you can use sub-range repair; it's similar to a -pr repair, but it focuses on a sub-range of the ring.
For a more detailed explanation:
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra
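For illustration, a single sub-range repair invocation looks roughly like this; the keyspace name and token boundaries are placeholders you would compute from your own ring (the linked article covers how to split the ranges):
# repair only the token range (start_token, end_token] on this node
nodetool repair -st -9223372036854775808 -et -9200000000000000000 my_keyspace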

Restoring cassandra from snapshot

So I did something of a test run / disaster recovery practice: deleting a table and restoring it in Cassandra via snapshot, on a test cluster I have built.
This test cluster has four nodes, and I used the node restart method, so after truncating the table in question, all nodes were shut down, the commitlog directories cleared, and the current snapshot data copied back into the table directory on each node. Afterwards, I brought each node back up. Then, following the documentation, I ran a repair on each node, followed by a refresh on each node.
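Concretely, the sequence on each node looked roughly like this (keyspace, table and snapshot names are placeholders, and the paths are the defaults):
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo cp /var/lib/cassandra/data/my_ks/my_table-*/snapshots/my_snapshot/* /var/lib/cassandra/data/my_ks/my_table-*/
sudo service cassandra start
nodetool repair my_ks my_table      # per the documentation
nodetool refresh my_ks my_table     # pick up the copied sstables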
My question is, why is it necessary to run a repair on each node afterwards, assuming none of the nodes were down except when I shut them down to perform the restore procedure? (In this test instance it was a small amount of data and took very little time to repair; if this happened in our production environment the repairs would take about 12 hours, so this could be a HUGE issue for us in a disaster scenario.)
And I assume running the repair would be completely unnecessary on a single-node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially, it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal, by the way) is that it is an expensive, I/O- and CPU-intensive operation: it generates Merkle trees for all your data, compares them with the Merkle trees from other nodes, and streams any missing or outdated data.
Why run a repair after restoring from snapshots
Repair gives you a consistency baseline. For example: if the snapshots weren't taken at exactly the same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
"repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario"
While your repair is running, you'll have some risk of reading stale data if your snapshots don't contain exactly the same data. If they are old snapshots, gc_grace may have already passed for some tombstones, giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The colloquial meaning of the term repair seems to imply that your system is broken. We think, "I have to run a repair? I must have done something wrong to get into this un-repaired state!" This is simply not true. Repair is a normal maintenance operation in Cassandra. In fact, you should be running repair at least every gc_grace seconds to ensure data consistency and avoid zombie data (or use the OpsCenter repair service).
In my opinion, we should have called it AntiEntropyMaintenance or CassandraOilChange or something rather than Repair :)
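As a concrete anchor for that gc_grace window: it is set per table, and here is a hedged example of checking and adjusting it from the shell (the keyspace and table names are placeholders; 864000 seconds is the 10-day default):
cqlsh -e "DESCRIBE TABLE my_ks.my_table;"                              # shows the current gc_grace_seconds
cqlsh -e "ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 864000;"  # 10 days; repair every node within this window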

Cassandra Replicas Down during nodetool repair?

I am developing an automated script for nodetool repair which would execute every weekend on all 6 Cassandra nodes. We have 3 in DC1 and 3 in DC2. I just want to understand the worst-case scenario. What would happen if connectivity between DC1 and DC2 is lost, or a couple of replicas go down, before or during a nodetool repair? It could be a network issue, a network upgrade (which usually happens on weekends), or something else. I understand that nodetool repair computes a Merkle tree for each range of data on that node and compares it with the versions on the other replicas. So if there is no connectivity between replicas, how would nodetool repair behave? Will it really repair the nodes? Do I have to rerun nodetool repair after all nodes are up and connectivity is restored? Will there be any side effects from this event? I googled it but couldn't find many details. Any insight would be helpful.
Thanks.
Let's say you are using vnodes, which by default means that each node owns 256 token ranges; if you're not, the idea is the same.
If the network problem happens after nodetool repair has already started, you will see in the logs that some ranges were successfully repaired and others were not. The error will say that the range repair failed because node "192.168.1.1 is dead", or something like that.
If the network error happens before nodetool repair starts, all the ranges will fail with the same error.
In both cases you will need to run another nodetool repair after the network problem is solved.
I don't know the amount of data you have on those 6 nodes, but in my experience, if the cluster can handle it, it is better to run nodetool repair once a week, on a different day of the week for each node. For instance, you can repair node 1 on Sunday, node 2 on Monday, and so on. If you have a small amount of data, or the adds/updates during a day are not too many, you can even run a repair once a day. When you have an already-repaired cluster and you run nodetool repair more often, it will take much less time to finish; but again, if you have too much data that may not be possible.
Regarding side effects, you will only notice a difference in the data if you use consistency level ONE: if you happen to run a query against an "unrepaired" node, the data may differ from that on the "repaired" nodes. You can solve this by increasing the consistency level to TWO, for instance; then again, if 2 nodes are "unrepaired" and the query you run is resolved using those 2 nodes, you will see a difference again. There is a trade-off here, since the best option to avoid this "difference" in queries is to set consistency level = replication factor, which brings another problem: when one of the nodes is down, queries that need all replicas cannot be satisfied and you'll start receiving timeouts.
Hope it helps!
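One hedged way to implement the weekly rotation suggested above is a per-node cron entry, with each node assigned a different weekday (the file path, user and log location are placeholders):
# /etc/cron.d/cassandra-repair on node 1; use dow 1 on node 2, dow 2 on node 3, and so on
# m  h  dom mon dow  user       command
  0  2   *   *   0   cassandra  nodetool repair -pr > /var/log/cassandra/weekly-repair.log 2>&1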
There are multiple repair options available; you can choose one depending on your application's usage. If you are using DSE Cassandra, then I would recommend scheduling the OpsCenter repair service, which does incremental repair, with a duration of less than gc_grace_seconds.
The following are the different options for doing a repair:
Default (none): Will repair all 3 partition ranges owned by the node on which it was run: 1 primary and 2 replica ranges. A total of 5 nodes will be involved: 2 nodes will be fixing 1 partition range, 2 nodes will be fixing 2 partition ranges, and 1 node will be fixing 3 partition ranges.
-par: Will do the above operation in parallel.
-pr : Will fix only primary partition range for the node on which it was run. If you are using write consistency of EACH_QUORUM then use -local option as well to reduce cross DC traffic.
I would suggest going with option 3 if you are already live in production, to avoid any performance impact from repair; see the commands sketched below.
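For reference, the three options map to commands roughly like these (the keyspace name is a placeholder; run option 3 on every node, one at a time):
nodetool repair my_keyspace                 # option 1: all ranges this node replicates, sequentially
nodetool repair -par my_keyspace            # option 2: same ranges, replicas validated in parallel
nodetool repair -pr -local my_keyspace      # option 3: primary range only, confined to the local DC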
If you want to read about repair in more detail, please have a look here.

Cleaning and rejoining same node in cassandra cluster

We have a Cassandra 0.8.2 cluster of 24 nodes with replication factor 2. One of the nodes is quite slow, and most of the sstables on this node are corrupt (we are not able to run compaction, or even scrub).
So is it possible to clean out the data, cache and commitlog directories for this node and restart it with bootstrap=true? Will that get all the data streamed back to this node?
If it is possible, is there anything that could cause issues? What care should be taken to avoid any danger?
As long as you have your replication factor set to 2, you should not have a problem cleaning up and restarting that node. But it will take some time for the data to flow back in; mine took around 4 hours. The best way to watch this visually is to install OpsCenter from DataStax; it's a great tool. There is no danger. Let us know if you succeed!
Also, it is advisable to upgrade to Cassandra 1.0. It is much faster, and you will notice the difference immediately.
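A hedged sketch of the clean-and-rejoin sequence, run on the affected node only (the paths and service name are placeholders for your 0.8-era install):
sudo /etc/init.d/cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# in cassandra.yaml, keep the node's original token and listen_address, and set:
#   auto_bootstrap: true
sudo /etc/init.d/cassandra start
nodetool netstats    # watch the replicas stream the data back in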
