I need help. I have a 4-node Cassandra cluster with RF 2, and a hardware maintenance activity (total time around 30-40 minutes) is scheduled on one of the nodes.
Please let me know how we can safely do this activity without impacting the live traffic.
Can I use the steps below on the node where the hardware maintenance will take place?
nodetool -h <node IP / hostname> drain
Kill the Cassandra service.
Once the activity is completed, start the Cassandra service.
Kindly let me know if anything else needs to be done.
Thanks in advance.
That's a good start, Dinesh. The shutdown scripts which I write look like this:
nodetool disablegossip
nodetool disablebinary
nodetool drain
The disable commands first take the node out of gossip, and then stop any native binary connections. Once those complete, I drain the node.
Once those have completed, I then stop the service.
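The sequence above could be wrapped in a small script along these lines. This is only a sketch: the NODETOOL and STOP_CMD variables and the graceful_stop function name are placeholders of mine, and the stop command is assumed to be "sudo service cassandra stop", so adjust it for your init system.

```shell
#!/usr/bin/env bash
# Sketch of a graceful shutdown wrapper. NODETOOL and STOP_CMD are
# hypothetical overridable settings, not part of Cassandra itself.
NODETOOL="${NODETOOL:-nodetool}"
STOP_CMD="${STOP_CMD:-sudo service cassandra stop}"

graceful_stop() {
  # Run each step in order and abort at the first failure, so the
  # service is never stopped before the node has drained.
  "$NODETOOL" disablegossip || return 1
  "$NODETOOL" disablebinary || return 1
  "$NODETOOL" drain || return 1
  # Left unquoted on purpose so a multi-word command is split into words.
  $STOP_CMD
}
```

Calling graceful_stop on the node under maintenance runs the three nodetool steps and only then stops the service.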
Related
I run Cassandra 3.1 in an autoscaling group. Recently one of the machines failed and was replaced. I did not lose any data, but client applications kept trying to connect to a node that was marked down.
I am looking for a way to gracefully remove a node from a cluster with a quick command which I would invoke via systemd right before it shuts down cassandra during the shutdown process.
nodetool decommission involves data streaming and takes long time.
nodetool removenode and nodetool assassinate can't remove the node they are running on.
Losing data is not my concern. My goal is fully automated node replacement.
Fixing client libraries is out of scope for this question.
I am running a 29-node cluster spread over 4 DCs in EC2, using C* 3.11.1 on Ubuntu with RF 3. Occasionally I need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
I restart a node like this:
nodetool disablebinary && nodetool disablethrift && nodetool disablegossip && nodetool drain
sudo service cassandra restart
When I do that, I very often get timeouts and errors like this in my nodejs app:
Error: Cannot achieve consistency level LOCAL_ONE
My queries are all pretty much the same, things like: select * from history where ts > {current_time} (along with the partition key in the where clause)
The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
I've tried waiting between steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after nodetool draining the node, there are open connections to other nodes in the cluster (ie looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs.
One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (ie status 'DN'). However, checking nodetool status on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (ie 15-20 mins) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
I have not been able to reproduce this locally using ccm.
What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
I seem to have been able to resolve the issue by issuing nodetool disablegossip as the last step in shutting down. Using the following instead of my initial restart approach seems to work (note that only the order of drain and disablegossip has switched):
nodetool disablebinary
nodetool disablethrift
nodetool drain
nodetool disablegossip
sudo service cassandra restart
While this seems to work, I have no explanation as to why. On the mailing list, someone helpfully pointed out that the drain should take care of everything that disablegossip does, so my hypothesis is that doing the disablegossip first causes the drain to then have problems which only appear after startup.
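Since the symptoms clear once the restarted node sees its peers as up again, one thing a restart script can do is wait for that before moving on to the next node. A rough sketch: count_down_nodes is a hypothetical helper that counts the rows of nodetool status output whose status column starts with D. It assumes the usual two-letter codes (UN, DN, and so on) at the start of each node row.

```shell
# Hypothetical helper: reads `nodetool status` output on stdin and prints
# the number of node rows whose status is Down (DN, DL, DJ, DM).
count_down_nodes() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so force a successful return code.
  grep -c -E '^D[NLJM][[:space:]]' || true
}

# Hypothetical usage on the restarted node: poll until no peer is seen as down.
# while [ "$(nodetool status | count_down_nodes)" -gt 0 ]; do sleep 10; done
```

This does not explain the ordering issue; it only gives the script a way to notice when the node's view of the ring has recovered.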
I'm using Cassandra 2.2.8 with GossipingPropertyFileSnitch.
I'm repairing a node and compacting a large number of SSTables. To reduce the load on that node's CPU, I want to route incoming web traffic to the other nodes in the cluster.
Could you please share how I can route traffic to the other nodes, so that this node can keep its CPU for the major maintenance work?
Thanks in advance.
Provided you have a replication factor and consistency level that can handle a node being down, you can take the node out of client traffic during the compactions:
nodetool disablebinary
nodetool disablethrift
This prevents your client application from sending requests to the node and stops it from acting as coordinator, but it will still receive the mutations from writes, so it won't fall behind. If you want to reduce load further, you can take it out completely with:
nodetool disablebinary
nodetool disablethrift
nodetool disablegossip
But make sure you enable gossip again before max_hint_window_in_ms elapses. This setting is defined in cassandra.yaml (default 3 hours). If you don't, the hints for that node will expire and not be delivered, leading to a consistency issue that will not be resolved without a repair.
Once you reconnect, wait until the pending and active hints are down to 0 before disabling gossip again. Note: pending will always show 1 rather than 0, since there is a regularly scheduled task.
You can check the hint thread pool with OpsCenter, nodetool tpstats, or via JMX with org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=PendingTasks and org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks
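To script the tpstats check, something like the following could work. It is a sketch only: hints_settled is a hypothetical helper, and it assumes the tpstats column order is Pool Name, Active, Pending, with a pool row named HintedHandoff as in the JMX names above (other versions may name the pool differently).

```shell
# Hypothetical helper: reads `nodetool tpstats` output on stdin and succeeds
# only when the HintedHandoff pool shows 0 active and at most 1 pending task
# (1 pending is the regularly scheduled task mentioned above).
# Assumes the column order: Pool Name, Active, Pending, ...
hints_settled() {
  awk '$1 == "HintedHandoff" { found = 1; ok = ($2 == 0 && $3 <= 1) }
       END { if (found && ok) exit 0; exit 1 }'
}

# Hypothetical usage: wait for hints to settle before disabling gossip again.
# until nodetool tpstats | hints_settled; do sleep 30; done
```

The helper fails if the pool row is missing, so a renamed pool shows up as "never settled" rather than as a false all-clear.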
I have a test cluster on 3 machines, 2 of which are seeds, all running CentOS 7 and Cassandra 3.4.
Yesterday all was fine and they were chatting, and then I had the "brilliant" idea to power all those machines off to simulate a power failure.
Being a newbie, I simply powered the machines back on and expected some kind of supermagic, but my cluster is not up again; each node refuses to connect.
And yes, my firewalld is disabled.
My question: what damage was done, and how can I recover the cluster to its previous running state?
Since you shut down your cluster abruptly, the nodes were not able to drain themselves.
Don't worry, it is unlikely that any data was lost because of this: Cassandra maintains commit logs and will read from them when it is restarted.
First, find your seed node ip from cassandra.yaml
Start your seed node first.
Check the startup logs in cassandra.log and system.log and wait for it to start up completely. This will take some time, as it reads the commit log for pending actions and replays them.
Once it finishes starting up, start the other nodes and tail their log files.
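For the first step, the seed list can be pulled out of cassandra.yaml with a one-liner like the one below. This is a sketch: print_seeds is a hypothetical helper, the path to cassandra.yaml depends on how Cassandra was installed, and the pattern assumes the stock layout of the seeds line.

```shell
# Hypothetical helper: prints the seed list from the cassandra.yaml passed
# as $1. Assumes the stock layout, where the seeds appear on a line such as
#     - seeds: "10.0.0.1,10.0.0.2"
print_seeds() {
  # Strip everything up to and including "seeds:", then drop the quotes.
  sed -n 's/^[[:space:]]*-[[:space:]]*seeds:[[:space:]]*//p' "$1" | tr -d "\"'"
}

# Hypothetical usage (the path is an assumption):
# print_seeds /etc/cassandra/conf/cassandra.yaml
```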
There is a fair amount of documentation on how to scale up a Cassandra cluster, but is there a good resource on how to "unscale" Cassandra and remove nodes from the cluster? Is it as simple as turning off a node, letting the cluster sync up again, and repeating?
The reason is a site that expects high spikes of traffic, climbing from the daily few thousand hits to hundreds of thousands over a few days. The site will be "ramped up" beforehand by starting multiple instances of the web server, Cassandra, etc. After the torrent of requests subsides, the goal is to turn off the instances that are no longer used, rather than pay for servers that are just sitting around.
If you just shut the nodes down and rebalance the cluster, you risk losing data that exists only on the removed nodes and hasn't been replicated yet.
A safe cluster shrink can easily be done with nodetool. First, run:
nodetool drain
... on the node being removed, to stop accepting writes and flush memtables, then:
nodetool decommission
... to move the node's data to other nodes. Then shut the node down, and on some other node run:
nodetool removetoken
... to remove the node from the cluster completely (in Cassandra 1.2 and later this command is called nodetool removenode). The detailed documentation can be found here: http://wiki.apache.org/cassandra/NodeTool
From my experience, I'd recommend removing nodes one by one, not in batches. It takes more time, but is much safer in case of network outages or hardware failures.
When you remove nodes you may have to rebalance the cluster, moving some nodes to a new token. In a planned downscale, you need to:
1 - minimize the number of moves.
2 - if you have to move a node, minimize the amount of transferred data.
There's an article about cluster balancing that may be helpful:
Balancing Your Cassandra Cluster
Also, the beginning of this video covers the add-node and remove-node operations and the best strategies for minimizing the cluster impact of each.
Hopefully, these 2 references will give you enough information to plan your downscale.
First, on the node that will be removed, flush the in-memory data (memtables) to SSTables on disk:
nodetool flush
Second, run the command to leave the cluster:
nodetool decommission
This command assigns the ranges that the node was responsible for to other nodes and replicates the data appropriately.
To monitor the process, you can use:
nodetool netstats
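If you want to watch the decommission from a script rather than by eye, the Mode line that nodetool netstats prints can be extracted like this. A sketch only: netstats_mode is a hypothetical helper, and the mode names in the comment are the ones I would expect to see during and after a decommission.

```shell
# Hypothetical helper: reads `nodetool netstats` output on stdin and prints
# the value of its "Mode:" line (for example NORMAL, LEAVING, DECOMMISSIONED).
netstats_mode() {
  awk '$1 == "Mode:" { print $2; exit }'
}

# Hypothetical usage: wait until the node reports it has left the ring.
# until [ "$(nodetool netstats | netstats_mode)" = "DECOMMISSIONED" ]; do sleep 30; done
```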
I found an article on how to remove nodes from Cassandra. It was helpful for me when scaling down Cassandra. All actions are described step by step there.