Cassandra DSC repair after a very long time

I have a small Cassandra DSC 2.2.8 cluster with 4 nodes that has been in service for a long time now (more than 6 months). I have never run a repair since and I am afraid that deleted data may be resurrected. Is it now too late for a repair? If I run nodetool repair, the default is parallel mode; do I still need to run it on all 4 nodes one by one?

Nodetool repair is a good way to optimize your node and also improves its performance. It will not resurrect deleted data; in fact, it will perform compaction (which keeps only the latest record in the database). You can perform a repair on a whole DC as well as on an individual node.
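As a rough sketch of what that looks like in practice (flags as in Cassandra 2.2's nodetool; check nodetool help repair for your exact version), you would run a full repair on one node, wait for it to finish, then move on to the next:
$ nodetool repair -full -pr    # full repair of this node's primary token ranges
$ # repeat on each of the 4 nodes, one at a time; add -seq to run sequentially and reduce load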

Related

cassandra enable hints and repair

I am adding a new node to my Cassandra cluster, which currently has 5 nodes. The nodes have hints turned on and I am also running repairs using Cassandra Reaper. When adding the new node, the node addition is taking forever and the other nodes are becoming unresponsive. I am running Cassandra 3.11.13.
Questions:
As I understand it, hints are used to make sure writes are correctly propagated to all replicas.
Cassandra is designed to remain available if one of its nodes is down or unreachable. However, when a node is down or unreachable, it needs to eventually discover the writes it missed. Hints attempt to inform a node of missed writes, but they are a best effort and aren't guaranteed to inform a node of 100% of the writes it missed.
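For instance, hinted handoff can be inspected and toggled per node with standard nodetool commands (shown here just as a sketch):
$ nodetool statushandoff     # is hinted handoff currently enabled on this node?
$ nodetool disablehandoff    # stop storing new hints on this node
$ nodetool enablehandoff     # turn hint storage back on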
repairs do something similar
Repair synchronizes the data between nodes by comparing their respective datasets for their common token ranges, and streaming the differences for any out of sync sections between the nodes.
If I am running repairs with cassandra reaper, do I need to disable hints?
If hints are enabled and repairs are carried out, does that cause double writes of data on the nodes?
Is it okay to carry out a repair while a node is joining?

Cassandra repair after datacenter went down

I have a Cassandra db (version 3.11.2) running in AWS, with 2 datacenters, each in a different AWS region, and 3 nodes in each one.
The replication factor on all keyspaces is 3, so there is a full replica of the data on every node. The size of the data is about 10GB per node.
All of our writes are at LOCAL_QUORUM against one DC (let's call it DC1). Basically the other DC is just a kind of backup and disaster recovery: in case the AWS region for DC1 becomes unavailable, we will redirect traffic to DC2.
My issue is that we had a network disconnection between the two DCs for several hours, and after several days we noticed that there is missing data in DC2. This all makes sense, since the time the DCs were apart was longer than the hinted handoff window (3 hours). So we need to run a repair to bring DC2 back in sync with DC1.
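For reference, that 3-hour window corresponds to the hinted handoff setting in cassandra.yaml (the value below is the default; yours may differ):
max_hint_window_in_ms: 10800000    # 3 hours; hints stop being saved for a node that has been down longer than this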
I went over the cassandra docs, and read countless SO answers and for the life of me I couldn't understand what is the right repair to do...
Do I need to issue a 'nodetool repair --full --sequential' from only one node? Do I need to run it on every node in the cluster? Maybe it's better to run 'nodetool rebuild'?
Executing nodetool rebuild (with datacenter1 as the streaming source) on the nodes in datacenter2 should bring the data back in sync, but depending on the data size affected, this may be a task that takes time and resources. If datacenter2 is only a backup for disaster recovery purposes, it may be easier and quicker to back up the current dc1 cluster and restore it in the second datacenter (more information is available here).
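As a hedged sketch, assuming the source datacenter is named DC1 in the cluster topology, the rebuild would be run on each node in datacenter2, one node at a time:
$ nodetool rebuild -- DC1    # stream this node's data from DC1; run on every DC2 node in turn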

Cassandra reaper - should I repair also reapers database?

So I have installed cassandra-reaper, and I have set up schedules for every Wednesday to repair my project's db. I'm just wondering if there is any need to also schedule a repair for the cassandra-reaper database that was created?
I think no, because Reaper is just a UI to schedule and manage repairs on a Cassandra cluster.
It improves the existing nodetool repair process by
Splitting repair jobs into smaller tunable segments.
Handling back-pressure by monitoring running repairs and pending compactions.
Adding ability to pause or cancel repairs and track progress precisely.
Reaper ships with a REST API, a command line tool and a web UI.

Cassandra's nodetool repair appears to cause outages

I have a 4-node Cassandra cluster that didn't see a repair for about 8 months, in between administrators. It doesn't see much in the way of deletes. I've noticed that when I run nodetool repair, the system will not accept new data, and nobody can connect with cqlsh until the repair is completed. Is it normal for repair to cause downtime?

How do I run a repair only within a certain datacenter?

I want to run a repair for a specific Cassandra datacenter within a larger cluster. How can I do that? nodetool repair -local -pr does not seem to work:
$ nodetool repair -local -pr
Exception in thread "main" java.lang.RuntimeException: Primary range repair should be performed on all nodes in the cluster.
at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1680)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1378)
Per CASSANDRA-7317 you should only use -pr when you mean to run repair -pr on ALL the nodes in your cluster (this includes all your data centers). Otherwise, you may end up missing some token ranges in your repair.
The error message you are seeing was introduced in C* 2.0.9 to prevent users from running -local and -pr together.
If you just want to repair a local dc, don't use the -pr flag.
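So a repair restricted to the local datacenter would simply be the following, run on each node of that DC (my_keyspace below is just a placeholder; omit it to repair all keyspaces):
$ nodetool repair -local my_keyspace    # -local (--in-local-dc) restricts repair to replicas in this DC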
To decrease the impact of running repairs check out these options:
OpsCenter Repair Service
Takes care of your repairs automatically and spreads them out across your gc_grace period so that 1) you don't have to worry about repairs from an operational perspective, and 2) your Cassandra ingest isn't affected by an expensive weekly job (repairs are CPU- and IO-intensive).
Repair Service Alternative
If you're not on DSE, the repair service will be grayed out. You can write and manage your own repair-service-like script. Check out Stump's github for an example of what this might look like.
Note: Keep an eye on this ticket CASSANDRA-6434
The above answer is correct in all aspects, EXCEPT that Stump's github is no longer maintained by Stump and is only for reducing the effects of broken streams on repairs (longer repairs result in a greater chance of a broken stream, which is an unrecoverable failure). Use Gallew's github for a current version.
For an actual repair service, you might try Limelight Network's github.

Resources