So I have installed cassandra-reaper, and I have set up schedules for every Wednesday to repair my projects db. I'm just wondering whether there is any need to also schedule a repair for the database that cassandra-reaper itself created?
I think no, because Reaper is just a UI to schedule and manage repairs on a Cassandra cluster.
It improves the existing nodetool repair process by
Splitting repair jobs into smaller tunable segments.
Handling back-pressure by monitoring running repairs and pending compactions.
Adding ability to pause or cancel repairs and track progress precisely.
Reaper ships with a REST API, a command line tool and a web UI.
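For reference, a schedule can also be created through the REST API. This is only a rough sketch; the endpoint and parameter names (clusterName, keyspace, owner, scheduleDaysBetween, scheduleTriggerTime) are assumptions about a typical Reaper setup, so check the API documentation for your Reaper version:
# Hypothetical sketch: create a weekly repair schedule via Reaper's REST API.
# Endpoint and parameter names are assumptions; verify them for your Reaper version.
curl -X POST "http://reaper-host:8080/repair_schedule?clusterName=my_cluster&keyspace=projects&owner=ops&scheduleDaysBetween=7&scheduleTriggerTime=2024-01-10T02:00:00"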
I'm monitoring a DSE cluster and I see the following problem:
As you can see, it says that the Repair is currently failing, and this value keeps going up over time. Can someone explain what's happening here? In the OpsCenter logs I can only find this error:
Is this related to the problem?
Checked logs and documentation.
In DSE there are two ways to perform anti-entropy repair:
Traditional Cassandra repair using nodetool repair command
NodeSync, which is often faster and more intelligent (see this blog post for more details)
But you can't use traditional repair on tables where NodeSync is enabled. So you need to click on the settings icon for the Repair Service and disable running it on the keyspaces/tables that have NodeSync enabled.
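If you are unsure which tables have NodeSync turned on, you can check or change the table option from cqlsh. This is only a sketch assuming DSE 6.x syntax, and ks_name.table_name is a placeholder:
# Show the table definition, which includes the nodesync option when it is set:
cqlsh -e "DESCRIBE TABLE ks_name.table_name;"
# Disable (or enable) NodeSync on the table:
cqlsh -e "ALTER TABLE ks_name.table_name WITH nodesync = {'enabled': 'false'};"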
To add to Alex Ott's excellent response, NodeSync is a new feature in DataStax Enterprise which runs a repair continuously in the background using the same mechanism as read-repairs and replaces the traditional anti-entropy repairs.
The OpsCenter Repair Service will skip repairs on tables which have NodeSync enabled because it isn't possible to run traditional repairs on them as I've explained in this post -- https://community.datastax.com/questions/3879/.
If NodeSync was enabled on a table while a repair on that same table was already scheduled and running, it would explain why you're seeing error messages.
You can stop the errors from being generated by explicitly excluding the keyspace(s) or table(s) from subrange repairs with:
[repair_service]
ignore_keyspaces=ks_name_1,ks_name_2
ignore_tables=ks_name_3.table_name_1,ks_name_3.table_name_2
I have a small Cassandra DSC 2.2.8 cluster with 4 nodes that has been in service for a long time now (more than 6 months). I have never run repair on it and I am afraid that deleted data may have been resurrected. Is it now too late for a repair? If I run nodetool repair, the default is parallel mode; do I still need to run it on all 4 nodes one by one?
nodetool repair is a good way to keep your nodes healthy, and it also improves the performance of the node. It will not resurrect deleted data; in fact, it performs compaction (which keeps the latest record in the database). You can perform a repair on a whole DC as well as on an individual node.
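If you do want to repair all 4 nodes one at a time, a minimal sketch might look like the following (host names are placeholders and it assumes SSH access to each node):
# Sketch only: run a primary-range repair on each node sequentially.
# -pr repairs only each node's primary ranges, so it must eventually be run
# on every node in the cluster for full coverage.
for host in node1 node2 node3 node4; do
  ssh "$host" nodetool repair -pr
done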
I am adding new nodes to my existing Cassandra cluster, which is running with vnodes on Cassandra version 2.1.16. I had cron jobs scheduled to run repairs on these nodes. Before adding the new nodes I disabled the cron jobs, but I am confused about whether I should re-enable the repairs after both the token moves and cleanups are completed, or whether I can enable them after the token moves but before the cleanups?
You can enable your repair jobs after you do the cleanup. I suggest reading this article, especially the Gotchas section for the Range movement. If the scenario described there applies to you, then you would need to run repair manually on the node, after bootstrapping.
Once your node has been added to your existing cluster and the new node is showing status UN, you can run repair, and after that you can run cleanup as well.
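As a rough sketch of that sequence (the keyspace name is a placeholder; cleanup goes on the pre-existing nodes, since they are the ones holding ranges they no longer own):
# Verify the new node shows state UN, then repair, then clean up the old nodes.
nodetool status
nodetool repair my_keyspace     # run on the relevant nodes
nodetool cleanup my_keyspace    # run on each pre-existing node to drop ranges it no longer owns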
Will any issues arise if I deprioritize the Cassandra "nodetool repair" command using "nice" ? It causes high CPU "user time" load and is having a negative impact on our production systems, causing API timeouts on our Usergrid implementation. I see documentation on limiting network throughput, but iowait does not appear to be the issue. Additionally, are there any good methods for mitigating this problem?
The nodetool command doesn't actually do any work. It just calls a JMX operation in C* to kick off the repair and then listens for updates to print out, so using nice won't make any difference. There are a few main phases to a repair:
build merkle trees (on each node)
stream changes
compactions
Possibly the validation compaction (which on some versions can be controlled with the compaction throttle) or the streams (you can set stream throughput via nodetool or cassandra.yaml) are burning your CPU. If so, you can try using the throttles, but in some versions it won't make a difference.
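For example, both throttles can be adjusted at runtime with nodetool (the values below are illustrative only; tune them for your hardware):
# Compaction throttle in MB/s (cassandra.yaml: compaction_throughput_mb_per_sec)
nodetool setcompactionthroughput 16
# Streaming throttle in megabits/s (cassandra.yaml: stream_throughput_outbound_megabits_per_sec)
nodetool setstreamthroughput 200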
After the repair is completed there are normal compactions that kick off for anti-compaction in incremental repairs, and also for full repairs if there's a lot of differences streamed. Some problems are very version specific, so pay attention to the logs and to when CPU is high to drill down further.
I want to run a repair for a specific Cassandra datacenter within a larger cluster. How can I do that? nodetool repair -local -pr does not seem to work:
$ nodetool repair -local -pr
Exception in thread "main" java.lang.RuntimeException: Primary range repair should be performed on all nodes in the cluster.
at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1680)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1378)
Per CASSANDRA-7317 you should only use -pr when you mean to run repair -pr on ALL the nodes in your cluster (this includes all your data centers). Otherwise, you may end up missing some token ranges in your repair.
The error message you are seeing was introduced in C* 2.0.9 to prevent users from running -local and -pr together.
If you just want to repair a local dc, don't use the -pr flag.
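For example (the keyspace name is a placeholder; depending on your version the flag is spelled -local or --in-local-dc):
# Repair only the replicas in the local datacenter:
nodetool repair -local my_keyspace
# On some versions the long form of the flag is used instead:
nodetool repair --in-local-dc my_keyspace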
To decrease the impact of running repairs check out these options:
OpsCenter Repair Service
Takes care of your repairs automatically and spreads them out across your gc_grace period, so that 1) you don't have to worry about repairs from an operational perspective and 2) your Cassandra ingest isn't affected by an expensive weekly job (repairs are CPU and IO intensive).
Repair Service Alternative
If you're not on DSE, the repair service will be grayed out. You can write and manage your own repair-service-like script. Check out Stump's github for an example of what this might look like.
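As a very rough sketch of that idea, you could stagger cron-driven primary-range repairs across the nodes so the whole ring is repaired well within gc_grace_seconds (the schedule, keyspace name, and log path below are placeholders):
# crontab on node1 (stagger node2, node3, ... to other nights of the week):
0 2 * * 1 nodetool repair -pr my_keyspace > /var/log/repair.log 2>&1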
Note: Keep an eye on this ticket CASSANDRA-6434
The above answer is correct in all aspects, EXCEPT that Stump's github is no longer maintained by Stump and is only for reducing the effects of broken streams on repairs (longer repairs result in a greater chance of a broken stream, which is an unrecoverable failure). Use Gallew's github for a current version.
For an actual repair service, you might try Limelight Network's github.