Whenever I restart any Cassandra node in my cluster, after a few minutes other nodes show as down, and sometimes other nodes hang as well. We need to restart those nodes to bring the services back up.
During the restart the cluster seems unstable, with one node after another showing stress and DN status. The JVM and nodetool services are running fine, but when we describe the cluster it shows nodes as unreachable.
We don't have much traffic or load in our environment. Can you please give me any suggestions?
The Cassandra version is 3.11.2.
Do you see any errors or warnings in your system.log after the restart of the node?
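For example, assuming a default package install with logs under /var/log/cassandra (the path may differ in your setup), something like this would surface recent problems and show how the restarted node sees its peers:

$ grep -E "ERROR|WARN" /var/log/cassandra/system.log | tail -50
$ nodetool status    # run on the restarted node: which peers does it mark DN?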
I run Cassandra 3.1 in an autoscaling group. Recently one of the machines failed and got replaced. I did not lose any data, but client applications were trying to connect to a node which was marked down.
I am looking for a way to gracefully remove a node from the cluster with a quick command that I would invoke via systemd right before it shuts down Cassandra during the shutdown process (a sketch of such a hook is below).
nodetool decommission involves data streaming and takes a long time.
nodetool removenode and nodetool assassinate can't remove the node they are running on.
Losing data is not my concern. My goal is fully automated node replacement.
Fixing client libraries is out of scope for this question.
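To make the mechanism concrete, a minimal sketch of such a shutdown hook, assuming Cassandra runs under a systemd unit named cassandra.service; the removal script name is a hypothetical placeholder for whatever quick command ends up being used:

$ sudo mkdir -p /etc/systemd/system/cassandra.service.d
$ printf '[Service]\nExecStop=/usr/local/bin/remove-self-from-ring.sh\n' | sudo tee /etc/systemd/system/cassandra.service.d/pre-stop.conf
$ sudo systemctl daemon-reload
# ExecStop entries run when systemd stops the unit, before the Cassandra process is killed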
I have a test cluster of 3 machines, 2 of which are seeds, all running CentOS 7 and all Cassandra 3.4.
Yesterday all was fine, the nodes were chatting away, and I had the "brilliant" idea to... power all those machines off to simulate a power failure.
Newbie that I am, I simply powered the machines back on and expected some kind of supermagic, but here it is: my cluster is not up again, and each individual node refuses to connect.
And yes, my firewalld is disabled.
My question: what damage was done, and how can I recover the cluster to its previous running state?
Since you abruptly shut down your cluster, the nodes were not able to drain themselves.
Don't worry, it is unlikely any data loss happened because of this, as Cassandra maintains commit logs and will replay them when it is restarted.
First, find your seed node IP from cassandra.yaml.
Start your seed node first.
Check the startup logs in cassandra.log and system.log and wait for it to start up completely; it will take some time, as it will read pending writes from the commit log and replay them.
Once it finishes starting up, start the other nodes and tail their log files.
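As a rough sequence, assuming a package install with the config at /etc/cassandra/cassandra.yaml, logs under /var/log/cassandra, and a systemd-managed service (paths and unit name may differ on your machines):

$ grep -A2 "seeds:" /etc/cassandra/cassandra.yaml   # find the seed IPs
$ sudo systemctl start cassandra                     # on the seed node first
$ tail -f /var/log/cassandra/system.log              # watch commit log replay finish
$ nodetool status                                    # seed should show UN before you start the rest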
I have an issue with a Cassandra node in a 10-node cluster.
I first launched a decommission on that node to remove it from the cluster.
The decommission is currently running, but the load on this node is such that it is taking forever, and I would like to go faster.
What I thought of doing was stopping this node and launching a removenode from another one.
The DataStax documentation explains that we should use decommission or removenode depending on the up/down status of the node, but there is no information about running removenode while the targeted node already has a leaving status.
So my question is: is it possible to launch a removenode for a stopped node while it already has a leaving status?
So my question is: is it possible to launch a removenode for a stopped node while it already has a leaving status?
I had to do this last week, so "yes" it is possible.
Just be careful, though. At the time, I was working on bringing up a new DC in a non-production environment, so I didn't care about losing the data that was on the node (or in the DC, for that matter).
What I thought of doing was stopping this node and launching a removenode from another one.
You can do exactly that. Get the Host ID of the node you want to drop, and run:
$ nodetool removenode 2e143c2b-0571-4c5d-22d5-9a2668648710
And if that gets stuck, ctrl-c out of it, and (on the same node) you can run:
$ nodetool removenode force
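For reference, a sketch of the whole sequence, run from a live node; the UUID is just the example Host ID from above, and the Host ID column of nodetool status is where you find the real one:

$ nodetool status                    # Host ID column lists each node's UUID
$ nodetool removenode 2e143c2b-0571-4c5d-22d5-9a2668648710
$ nodetool removenode status         # check progress of the removal
$ nodetool removenode force          # only if the removal is stuck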
Decommissioning a node in Cassandra can only be stopped by restarting that node.
Its status will then change from UL (Up/Leaving) back to UN (Up/Normal).
This approach has been tested, and the Cassandra cluster worked well afterwards.
After following this safer approach, trigger nodetool removenode so the data stays consistent.
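A small sketch of how to watch this, assuming nodetool access to the nodes involved:

$ nodetool netstats    # on the decommissioning node: shows Mode: LEAVING while streaming
$ nodetool status      # from any node: the leaving node shows as UL
# restart Cassandra on that node to abort the decommission; per the answer above it rejoins as UN,
# after which you can shut it down and run nodetool removenode from another node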
We're hosting a Cassandra 2.0.2 cluster on AWS. We've recently started upgrading from regular to SSD drives by bootstrapping new nodes and decommissioning the old ones. It went fairly well, aside from two nodes hanging forever on decommission.
Now that the 6 new nodes are operational, we've noticed that some of our old tools, which use phpcassa, have stopped working. Nothing has changed with the security groups, all TCP/UDP ports are open, telnet can connect via 9160, and cqlsh can 'use' a keyspace and select data; however, 'describe cluster' fails, and in cassandra-cli 'show keyspaces' also fails - and by fail, I mean it never exits back to the prompt and never returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, also using phpcassa, does normal data requests and works fine.
All the Cassandra nodes have the same config, the same version, and were installed from the same package. All nodes were recently restarted due to a seed node change.
Versions:
Connected to ### at ####.compute-1.amazonaws.com:9160.
[cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I've run out of ideas. Any hints would be greatly appreciated.
Update:
After a bit of random investigating, here's a more detailed description.
If I cassandra-cli to any machine, and do "show keyspaces", it works.
If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.
If I cqlsh to a remote Cassandra and do a describe keyspaces, it hangs. After ctrl+c, repeating the same query gets an instant response.
If I cqlsh to a local cassandra, and do a describe keyspaces, it works.
If I cqlsh to a local Cassandra and do a select * from Keyspace limit x, it will return data up to a certain limit. I was able to return data with limit 760; limit 761 would fail.
If I do a consistency all and select the same, it hangs.
If I do a trace, different machines return the data, though sometimes source_elapsed is "null".
Not to forget: applications querying the cluster sometimes do get results, after several attempts.
Update 2
Further experimenting led to failed bootstrapping of two nodes: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other simply failed after 2 days. Repairs wouldn't function and introduced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after executing a repair, we started getting "Read an invalid frame size of 0. Are you using TFramedTransport on the client side?", so..
Solution
Switch rpc_server_type from hsha to sync. All problems gone. We worked with hsha for months without issues.
If someone else stumbles here:
http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
In cassandra.yaml:
Switch rpc_server_type from hsha to sync.
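A minimal sketch of applying that change, assuming a package install with cassandra.yaml at /etc/cassandra/cassandra.yaml and a rolling restart done one node at a time:

$ sudo sed -i 's/^rpc_server_type: hsha/rpc_server_type: sync/' /etc/cassandra/cassandra.yaml
$ grep rpc_server_type /etc/cassandra/cassandra.yaml   # should now read: rpc_server_type: sync
$ sudo service cassandra restart                       # repeat on each node, one at a time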