I am using Cassandra v2.1.13 in a cluster with 21 nodes across 3 DCs. This is our new cluster; the data was migrated using sstableloader and the cluster is online now. No repairs have been run before. When I ran nodetool repair for the first time after it came online, on the seed nodes, I got this error:
[2019-01-04 06:36:31,897] Repair command #21 finished
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:294)
at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:206)
The error above says nodetool failed, but the repair command was reported as finished. I checked the logs and there are no WARN/ERROR messages. I performed the repair on 3 seed nodes in the 3 DCs, one node at a time, and got the same error each time.
What does this error mean exactly? Did nodetool lose its connection after the repair? Did the repair complete successfully on these 3 nodes? Is it safe to continue repairing the other nodes in the cluster despite this error message?
Can someone please help me understand this and troubleshoot the issue?
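One way to cross-check whether the repair sessions actually completed is to look at the server-side system.log rather than the nodetool output. A rough sketch, assuming the default packaged log location /var/log/cassandra/system.log (adjust the path for your install):

# Look for repair session activity and any exceptions around the repair window
grep -i "repair" /var/log/cassandra/system.log | tail -n 50
grep -iE "error|exception" /var/log/cassandra/system.log | tail -n 50

# Confirm the node still sees the rest of the cluster after the repair
nodetool status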
Related
Whenever I restart any Cassandra node in my cluster, after a few minutes the other nodes show as down, and sometimes they hang as well. We have to restart those nodes to bring the services back up.
During a restart the cluster seems unstable: one node after another shows stress and DN status. The JVM and nodetool are running fine, but when we describe the cluster the node shows as unreachable.
We don't have much traffic or load in our environment. Can you please give me any suggestions?
Cassandra version is 3.11.2
Do you see any error/warning in your system.log after the restart of the node?
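As a first check, it may help to compare what the restarted node and one of the nodes marking it down actually log and report right after the restart. A minimal sketch (the log path assumes a package install; adjust as needed):

# Run on the restarted node and on a node that shows it as DN
grep -iE "error|warn" /var/log/cassandra/system.log | tail -n 100
nodetool status
nodetool describecluster   # look for unreachable nodes or more than one schema version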
I have a 6-node Cassandra cluster and I've tested the following scenario:
I turn off 3 nodes, drop a table and re-create it on the remaining 3 nodes, and after the 3 nodes come back up I'm unable to run repair; it says
[Uzbekistan#Gentoo]: nodetool repair --full
Repair command #2 failed with error Got negative replies from endpoints [ ip's of nodes that i turned off ]
and in the logs of a node that I turned off:
ERROR [AntiEntropyStage:1] 2020-08-21 16:13:12,497 RepairMessageVerbHandler.java:177 - Table with id 6a483210-e395-11ea-8da8-990844948c57 was dropped during prepare phase of repair
Why does this happen, and how can I fix it? Thanks.
You have a schema disagreement between the nodes of the cluster. If you run nodetool describecluster you will see it. To resolve it, restart all the nodes and then run nodetool describecluster again. If there is no schema mismatch, you should be able to run the repair.
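A rough outline of that check and fix, assuming you can do a rolling restart (the service command depends on how Cassandra is installed in your environment):

# On any node: healthy clusters report a single schema version here
nodetool describecluster

# Rolling restart, one node at a time; wait for UN in nodetool status before moving on
nodetool drain
sudo systemctl restart cassandra   # or however the service is managed on your hosts
nodetool status

# Re-check; once only one schema version is listed, retry the repair
nodetool describecluster
nodetool repair --full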
I tried removing a node from a cluster by issuing "nodetool decommission"
and watched nodetool netstats to see how much data was being streamed to the other nodes, which all looked fine.
After the node was decommissioned, running nodetool status on a few nodes (not the one I decommissioned) shows some nodes with a status of 'UD' while others show 'UN'.
I'm quite confused about why the status differs like this and is not the same on all nodes after decommissioning the node.
Am I missing any steps before and after?
Any comments/Help is highly appreciated!
If the gossip information is not the same on all nodes, you should do a rolling restart of the cluster. That will reset the gossip state on all nodes.
Was the node you removed a seed node? If it was, don't forget to remove the IP from the cassandra.yaml in all nodes.
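If the removed node was indeed a seed, the seed list has to be updated on every node before the rolling restart; a sketch of the relevant cassandra.yaml fragment (the IPs are placeholders for your remaining seed nodes):

# cassandra.yaml on every node: drop the decommissioned node's IP from the seed list
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"

# After the rolling restart, verify every node reports the same ring view
nodetool status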
I have configured Cassandra 3.0.9 on 3 nodes, but I have to use only 1 node for some time. I disconnected the other 2 nodes from the network and also removed their entries from cassandra.yaml and the rackdc and topology files.
When I check nodetool status it still shows both of the down nodes. When I try to execute any query in cqlsh it gives me the error below:
OperationTimedOut: errors={'127.0.0.1': 'Request timed out while waiting for schema agreement. See Session.execute_async and Cluster.max_schema_agreement_wait.'}, last_host=127.0.0.1
Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
How can I resolve this?
That's not how you remove a node from a Cassandra cluster. In fact, what you're doing is quite dangerous. Typically, you'd use nodetool decommission. If your other two nodes are still intact and just offline, I suggest bringing them back online temporarily and let decommission do its thing.
I'm going to also throw this out there - it's possible you're missing a good portion of your data with the steps you did above unless all keyspaces had RF=3. Cassandra distributes data evenly between the nodes in a respective DC. The decommission step I mention above redistributes the data.
Now if you don't have the other 2 nodes available to run a nodetool decommission, you may have to remove them with nodetool removenode and, in the worst case, nodetool assassinate.
Check these docs for reference and the full steps to removing a node: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddingRemovingNodeTOC.html
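A sketch of that fallback path, assuming the disconnected nodes will never come back (the host ID and IP are placeholders you would read from your own nodetool status output):

# From the surviving node, note the Host ID of each DN node
nodetool status

# Remove a dead node by its Host ID (repeat for the second node)
nodetool removenode <host-id-of-down-node>

# If removal gets stuck, check its progress and, only as a last resort, force the node out of gossip
nodetool removenode status
nodetool assassinate <ip-of-down-node>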
A Cassandra 2.0.6 cluster consists of two DCs with a few nodes. There are ongoing writes, but nothing that the cluster could not handle. Under this small pressure I run nodetool repair -par -pr, which should run the repair in parallel, only for the primary range of the given node. The repair is run on one node at a time. Sometimes an error occurs:
Streaming error occurred java.io.IOException: Connection timed out
but since streaming_socket_timeout_in_ms is set to 10 minutes, I presume the streaming should be retried on its own. Looking through the logs, it looks like the repair eventually finished, as the subsequent entries are only about compactions.
How can I know whether a node is hanging and should be restarted?
What's going on with nodetool?
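A few commands that may help distinguish a hung repair from one that is still making progress (the log path assumes a package install; adjust for your environment):

# Are streams still moving? Run twice, a minute apart, and compare the byte counts
nodetool netstats

# Are validation compactions from the repair still running?
nodetool compactionstats

# Look for streaming or repair exceptions around the time of the error
grep -iE "repair|stream" /var/log/cassandra/system.log | tail -n 100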