Cassandra nodetool repair hangs

A Cassandra 2.0.6 cluster consists of two DCs with a few nodes each. There are ongoing writes, but nothing the cluster cannot handle. Under this light load I run nodetool repair -par -pr, which should run the repair in parallel and only for the primary range of the given node. The repair is run on one node at a time. Sometimes an error occurs:
Streaming error occurred java.io.IOException: Connection timed out
but since streaming_socket_timeout_in_ms is set to 10 minutes, I presume the streaming should be retried on its own. Judging by the logs, the repair does eventually finish, as the next entries are only about compactions.
How can I tell whether a node is hanging and should be restarted?
What is going on with nodetool?
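One rough way to tell whether a repair is still doing work on a node is to watch the repair-related thread pool in `nodetool tpstats`. The sketch below assumes the pool is listed as `AntiEntropySessions` with Active and Pending in the second and third columns, as in 2.0.x output; other versions may name the pool differently, and the function name is hypothetical.

```shell
# Hypothetical helper: sum the Active and Pending counts of the
# AntiEntropySessions pool reported by `nodetool tpstats`.
# Assumed row layout: <pool name> <active> <pending> <completed> ...
repair_sessions_busy() {
    host="$1"
    nodetool -h "$host" tpstats |
        awk '$1 == "AntiEntropySessions" { busy = $2 + $3 }
             END { print busy + 0 }'
}

# Usage (not run here): poll once a minute until the pool is idle.
# while [ "$(repair_sessions_busy 127.0.0.1)" -gt 0 ]; do sleep 60; done
```

A count that stays above zero while `nodetool netstats` and `nodetool compactionstats` show no activity may suggest the session is stuck rather than slow.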

Related

nodetool failed, check server logs - error during repair

I am using Cassandra v2.1.13 on a cluster with 21 nodes and 3 DCs. This is our new cluster; the data was migrated using sstableloader and the cluster is online now. No repairs had been run before. When I ran nodetool repair on the seed nodes for the first time after the cluster came online, I got this error:
[2019-01-04 06:36:31,897] Repair command #21 finished
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:294)
at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:206)
The error above says nodetool failed, yet the repair command finished. I checked the logs and there are no WARN/ERROR messages. I ran the repair on 3 seed nodes in 3 DCs, one node at a time, and got the same error each time.
What does this error mean exactly? Is it because nodetool lost its connection after the repair? Did the repair complete successfully on these 3 nodes? Is it safe to continue repairing the other nodes in the cluster despite this error message?
Can someone please help me understand and troubleshoot this issue?

Cassandra DSC repair after very long time

I have a small Cassandra DSC 2.2.8 cluster with 4 nodes that has been in service for a long time (more than 6 months). I have never run repair on it and I am afraid that deleted data may be resurrected. Is it now too late for a repair? If I run nodetool repair, the default is parallel mode; do I still need to run it on all 4 nodes one by one?
Nodetool repair is a good way to optimize your node, and it also improves the node's performance. It will not resurrect the deleted data; in fact, it will perform compaction (which keeps the latest record in the database). You can run repair on a whole DC as well as on an individual node.

Do we need to run nodetool repair on every node in the cluster?

Does the nodetool repair command need to be run on all cluster nodes?
I understand that this command repairs the replicas on a node against the other replicas, and that we need to run it on all nodes to get high consistency.
Running "nodetool repair" on a single node only triggers a repair of that node's token ranges with the other nodes in the cluster. You need to run it on every node sequentially for all the data in the cluster to be repaired.
A good alternative is "nodetool repair -pr". The "-pr" option means that only the primary range of tokens on a given node is repaired. But again, this needs to be run on every node in every DC of the cluster.
The repair command only repairs token ranges on the node being repaired; it doesn't repair the whole cluster. By default, repair will operate on all token ranges replicated by the node you’re running repair on, which will cause duplicate work if you run it on every node. The -pr flag will only repair the “primary” ranges on a node, so you can repair your entire cluster by running nodetool repair -pr on each node in a single datacenter.
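The advice above can be sketched as a loop that runs a primary-range repair on one node at a time. The function name and host addresses are hypothetical; only the `nodetool -h <host> repair -pr` invocation comes from the tool itself. The answers in this thread disagree on how many nodes must be covered, so the host list here is simply assumed to contain every node you intend to repair.

```shell
# Hypothetical helper: run `nodetool repair -pr` on each host in turn,
# stopping at the first failure so it can be investigated.
rolling_pr_repair() {
    for host in "$@"; do
        echo "repairing primary range on $host"
        nodetool -h "$host" repair -pr || {
            echo "repair failed on $host" >&2
            return 1
        }
    done
}

# Usage (hosts are placeholders):
# rolling_pr_repair 10.0.0.1 10.0.0.2 10.0.1.1 10.0.1.2
```

Running the nodes sequentially rather than concurrently keeps at most one repair coordinator active at a time, at the cost of a longer total run.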

Cassandra's nodetool repair appears to cause outages

I have a 4 node Cassandra cluster that didn't see a repair() for about 8 months, in between administrators. It doesn't see much in the way of deletes. I've noticed that when I run nodetool repair, the system will not accept new data, and nobody can connect with cqlsh until the repair is completed. Is it normal for repair to cause downtime?

How do I run a repair only within a certain datacenter?

I want to run a repair for a specific Cassandra datacenter within a larger cluster. How can I do that? nodetool repair -local -pr does not seem to work:
$ nodetool repair -local -pr
Exception in thread "main" java.lang.RuntimeException: Primary range repair should be performed on all nodes in the cluster.
at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1680)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1378)
Per CASSANDRA-7317 you should only use -pr when you mean to run repair -pr on ALL the nodes in your cluster (this includes all your data centers). Otherwise, you may end up missing some token ranges in your repair.
The error message you are seeing was introduced in C* 2.0.9 to prevent users from running -local and -pr together.
If you just want to repair a local dc, don't use the -pr flag.
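As a sketch of that advice, the loop below runs a DC-local repair on each node of one datacenter without `-pr`. The function name, keyspace argument, and hosts are hypothetical; `-local` is the flag used in the question. Because `-pr` is omitted, each node repairs every range it replicates, so some ranges are repaired more than once over the whole loop.

```shell
# Hypothetical helper: repair each node of one datacenter in turn,
# restricting every repair to replicas in that datacenter (-local).
# $1 is an optional keyspace; pass "" to repair all keyspaces.
local_dc_repair() {
    keyspace="$1"
    shift
    for host in "$@"; do
        if [ -n "$keyspace" ]; then
            nodetool -h "$host" repair -local "$keyspace" || return 1
        else
            nodetool -h "$host" repair -local || return 1
        fi
    done
}

# Usage (keyspace and hosts are placeholders):
# local_dc_repair my_keyspace 10.0.0.1 10.0.0.2 10.0.0.3
```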
To decrease the impact of running repairs check out these options:
OpsCenter Repair Service
Takes care of your repairs automatically and spreads them out across your gc_grace period, so that 1) you don't have to worry about repairs from an operational perspective and 2) your Cassandra ingest isn't affected by an expensive weekly job (repairs are CPU and IO intensive).
Repair Service Alternative
If you're not on DSE, the repair service will be grayed out. You can write and manage your own repair-service-like script. Check out Stump's github for an example of what this might look like.
Note: Keep an eye on this ticket CASSANDRA-6434
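To illustrate the idea of spreading repairs across the gc_grace period, the arithmetic below divides the window evenly among the nodes. It is only a sketch: the function name is hypothetical, 864000 seconds is the default `gc_grace_seconds`, and a real schedule would leave headroom for repair duration and failures.

```shell
# Hypothetical helper: seconds to wait between starting repairs on
# consecutive nodes so that every node is repaired once per window.
# $1 = gc_grace_seconds of the tables, $2 = number of nodes
repair_interval_seconds() {
    gc_grace="$1"
    nodes="$2"
    if [ "$nodes" -le 0 ]; then
        echo "node count must be positive" >&2
        return 1
    fi
    echo $(( gc_grace / nodes ))
}

# Example: default gc_grace_seconds (10 days) on a 4-node cluster.
# repair_interval_seconds 864000 4    # prints 216000 (2.5 days)
```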
The above answer is correct in all aspects, EXCEPT that Stump's github is no longer maintained by Stump and is only for reducing the effects of broken streams on repairs (longer repairs result in a greater chance of a broken stream, which is an unrecoverable failure). Use Gallew's github for a current version.
For an actual repair service, you might try Limelight Network's github.
