I have a new Cassandra node joining an existing cluster. Currently, nodetool netstats shows the file transfer status at 25%. My question is: do I have to wait for compaction as well, or is the node joining process considered complete when the file transfer reaches 100%?
Once the newly added node shows status UN (Up/Normal), it will accept read and write requests. Compaction will run in the background and has no impact on the join; however, you can also pause compaction for a while if you want.
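For reference, the usual way to confirm the join has finished and to temporarily pause background compaction looks like this (just a sketch; run on the new node):
nodetool status                  # the new node should show UN (Up/Normal) rather than UJ (Up/Joining)
nodetool netstats                # confirms there are no streams still in flight
nodetool disableautocompaction   # optionally pause automatic compaction across all keyspaces
nodetool enableautocompaction    # re-enable it once things have settled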
The disk read/write rate and CPU usage of my Cassandra DB spike intermittently.
Cassandra was installed with Docker, and node exporter and process exporter are used for monitoring; both exporters also run in Docker.
I checked the process exporter at the time of a spike. The process that consumed the most resources during the spike has Java in its groupname, so I'm guessing there might be a problem with the Cassandra JVM.
There was no unusual traffic coming in at the time of the spike.
The spikes do not match the compaction cycle.
The cluster itself is not broken.
The Cassandra version is 4.0.3.
In Cassandra 4 you have the ability to access Swiss Java Knife (sjk) via nodetool, and one of the things it gives you access to is ttop.
If you run the following in your Cassandra environment while your CPU is spiking, you can see which threads are the top consumers, which then allows you to dial in on those threads specifically to see whether there is an actual problem.
nodetool sjk ttop >> $(hostname -i)_ttop.out
Allow that to run to completion (during a period of reported high CPU), or for at least 5-10 minutes or so if you decide to kill it early. It will collect a new iteration every few seconds, so once it completes, parse the results to see which threads are regularly the top consumers and what percentage of the CPU they are actually using; then you'll have a targeted approach to where to troubleshoot for potential problems in the JVM.
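Once the capture finishes, something along these lines can surface the recurring top consumers. This is only a sketch and assumes the usual ttop line format, where each per-thread line contains a user=...% figure followed by " - <thread name>"; adjust the sed pattern if your output differs:
# count how often each thread name appears among the sampled lines, most frequent first
grep 'user=' $(hostname -i)_ttop.out | grep ' - ' | sed 's/.* - //' | sort | uniq -c | sort -rn | head -20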
If nothing useful turns up, go for a thread dump next for a more complete look; I recommend the following script:
https://github.com/brendancicchi/collect-thread-dumps
I need help. I have a 4-node Cassandra cluster with RF 2, and there is a hardware maintenance activity (total activity time can be 30-40 minutes) scheduled on one of the nodes.
Please let me know how we can safely do this activity without impacting the live traffic.
Can I use the steps below on the node where the hardware maintenance will take place?
nodetool -h <node IP / hostname> drain
Kill the Cassandra service.
Once the activity is completed, start the Cassandra service.
Kindly let me know if anything else needs to be done.
Thanks in advance.
That's a good start, Dinesh. The shutdown scripts which I write look like this:
nodetool disablegossip
nodetool disablebinary
nodetool drain
The disable commands first take the node out of gossip, and then stop any native binary connections. Once those complete, I drain the node.
Once those have completed, I then stop the service.
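Putting the whole maintenance window together, a rough sketch of the sequence might be (the service commands are assumptions; substitute systemctl or whatever your packaging uses):
# on the node going into maintenance
nodetool disablegossip         # take the node out of gossip
nodetool disablebinary         # stop accepting native-protocol (CQL) connections
nodetool drain                 # flush memtables and stop accepting writes
sudo service cassandra stop
# ... perform the hardware maintenance; with default settings hints are kept for up to 3 hours,
# so a 30-40 minute window is comfortably inside the hinted handoff window ...
sudo service cassandra start
nodetool status                # wait until this node and its peers show UN again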
While I am running nodetool decommission, I want to use 100% of my network, so I set "nodetool setstreamthroughput 0" (unthrottled). At the beginning, since the node being decommissioned streams to multiple nodes, it can send data at around 900 Mbps. Later, as the number of nodes it is still transferring to decreases, it only sends at around 300 Mbps.
I see that the node sends one SSTable at a time to each node, and I want to increase the parallelism. nodetool says there is one connection per host. How can I increase this setting, i.e. have "multiple connections per host" while streaming?
Most likely Cassandra 3.0 will not be able to utilize 100% of your network regardless of how you set it up. Even with multiple threads, you push up against a point where the allocation rate of objects generated by streaming exceeds what the JVM can clean up, and then your GC pauses will only let you hit 100% for short periods. It is kind of moot, though, as you cannot configure it to use more threads.
In Cassandra 4.0 you will be able to achieve this: http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html
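For reference, the throughput cap mentioned in the question can be inspected and changed at runtime with nodetool (standard commands, shown here only for completeness):
nodetool getstreamthroughput    # current outbound streaming cap in Mb/s
nodetool setstreamthroughput 0  # 0 removes the cap entirely
nodetool netstats               # watch the active streams while the decommission runs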
I am running a 29-node cluster spread over 4 DCs in EC2, using C* 3.11.1 on Ubuntu with RF 3. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
I restart a node like this:
nodetool disablebinary && nodetool disablethrift && nodetool disablegossip && nodetool drain
sudo service cassandra restart
When I do that, I very often get timeouts and errors like this in my nodejs app:
Error: Cannot achieve consistency level LOCAL_ONE
My queries are all pretty much the same, things like: select * from history where ts > {current_time} (along with the partition key in the where clause)
The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
I've tried waiting between the steps of shutting down Cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after nodetool drain-ing the node, there are open connections to other nodes in the cluster (i.e. looking at the output of netstat) until I stop Cassandra. I don't see any errors or warnings in the logs.
One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (i.e. status 'DN'). However, checking nodetool status on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - the node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists for a long time (i.e. 15-20 minutes) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
I have not been able to reproduce this locally using ccm.
What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
I seem to have been able to resolve the issue by issuing nodetool disablegossip as the last step in shutting down. So using the following instead of my initial approach at restarting seems to work (note that only the order of drain and disablegossip has changed):
nodetool disablebinary
nodetool disablethrift
nodetool drain
nodetool disablegossip
sudo service cassandra restart
While this seems to work, I have no explanation as to why. On the mailing list, someone helpfully pointed out that the drain should take care of everything that disablegossip does, so my hypothesis is that doing the disablegossip first causes the drain to then have problems which only appear after startup.
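Independently of the ordering fix, one thing that can help is to hold off on routing traffic until the restarted node stops reporting its peers as down. A small sketch (not from the original posts; the 10-second interval is arbitrary):
# after "sudo service cassandra restart": block until JMX is reachable
# and nodetool status no longer lists any peer as DN (down)
until nodetool status > /dev/null 2>&1 && ! nodetool status | grep -q '^DN'; do
    echo "waiting for gossip to settle..."
    sleep 10
done
echo "all peers reported UN"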
So there is a fair amount of documentation on how to scale up a Cassandra cluster, but is there a good resource on how to "unscale" Cassandra and remove nodes from the cluster? Is it as simple as turning off a node, letting the cluster sync up again, and repeating?
The reason is a site that expects high spikes of traffic, climbing from the daily few thousand hits to hundreds of thousands over a few days. The site will be "ramped up" beforehand, starting up multiple instances of the web server, Cassandra, etc. After the torrent of requests subsides, the goal is to turn off the instances that are no longer used, rather than pay for servers that are just sitting around.
If you just shut the nodes down and rebalance the cluster, you risk losing data that exists only on the removed nodes and hasn't been replicated yet.
A safe cluster shrink can easily be done with nodetool. First, run:
nodetool drain
... on the node being removed, to stop accepting writes and flush the memtables, then:
nodetool decommission
... to move the node's data to other nodes. Then shut the node down, and run on some other node:
nodetool removetoken
... to remove the node from the cluster completely (in newer Cassandra versions this command is called nodetool removenode). Detailed documentation can be found here: http://wiki.apache.org/cassandra/NodeTool
From my experience, I'd recommend removing nodes one by one, not in batches. It takes more time, but it is much safer in case of network outages or hardware failures.
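Putting that answer together, a rough per-node sequence would be (the first three commands run on the node being removed; the final check can run from any live node):
nodetool drain          # stop accepting writes and flush memtables
nodetool decommission   # stream this node's ranges to the remaining replicas
nodetool netstats       # re-run to watch streaming progress until it completes
# from any remaining node, confirm the departed node has left the ring before starting on the next one
nodetool status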
When you remove nodes you may have to re-balance the cluster, moving some nodes to a new token. In a planned downscale, you need to:
1 - minimize the number of moves.
2 - if you have to move a node, minimize the amount of transferred data.
There's an article about cluster balancing that may be helpful:
Balancing Your Cassandra Cluster
Also, the beginning of this video covers add-node and remove-node operations and the best strategies to minimize cluster impact in each of these operations.
Hopefully, these 2 references will give you enough information to plan your downscale.
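If a manual re-balance is needed on a single-token (non-vnode) cluster, the command involved is nodetool move; the token below is purely illustrative:
# on the node whose token should change (single-token clusters only)
nodetool move 56713727820156410577229101238628035242
nodetool status   # verify ownership percentages after the move completes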
First, on the node that will be removed, flush the memtables to SSTables on disk:
nodetool flush
Second, run the command to leave the cluster:
nodetool decommission
This command will assign the ranges the node was responsible for to other nodes and replicate the data appropriately.
To monitor the process you can use the command:
nodetool netstats
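If you would rather not re-run it by hand, polling it with watch works too (the 30-second interval is arbitrary):
# print the streaming status every 30 seconds until the decommission finishes
watch -n 30 nodetool netstats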
I found an article on how to remove nodes from Cassandra. It was helpful for me when scaling down Cassandra. All actions are described there step by step.