Why shouldn't you run nodetool removenode? - cassandra

I am wondering why it is considered a best practice not to run nodetool removenode. What is it used for? Is there a hierarchy of commands to run instead? What kind of issues arise when running this command? Any first-hand experience/nightmare stories from using removenode? Overall, why not?

A default preference order would be:
1. Replacement node option (if a replacement is planned)
2. Decommission
3. Removenode
4. Assassinate
However - there are situations where you will still choose a lower entry over an earlier one.
If the node being removed is operational, you would normally run a decommission and let the node stream its data to the other nodes that will now hold the replicas it previously owned.
Removing a node causes the token ranges to be recalculated and moved, potentially requiring all nodes to stream data to the other nodes that now own those ranges.
If the node is not operational, you can perform a nodetool removenode - this triggers the same range movement and causes a large amount of streaming. Streaming throughput throttles are in place by default and can be adjusted to limit the impact.
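As an illustration only (the numbers are placeholders, not recommendations), the throttle can be inspected and adjusted on each node with nodetool:

    # Show the current stream throughput cap (in megabits per second)
    nodetool getstreamthroughput

    # Tighten the cap while removenode/decommission streaming is in flight
    nodetool setstreamthroughput 100

    # A value of 0 removes the throttle entirely
    nodetool setstreamthroughput 0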
You can also forcibly terminate either a decommission or a removenode by using nodetool [decommission | removenode] force - however, this means that one of the replicas of the data will not have been re-established on another node, leaving you with less resilience.
Why would you do that? For the same streaming reason: if you accept the loss of resilience for a period of time, you can roll out repairs node by node in a controlled manner. This should not be considered your default approach or a choice taken lightly - I cannot stress that enough.
The final option, when decommission / removenode is not available, is to assassinate the node - this is pretty much the same as performing a removenode followed by an immediate force. You then have to manage the repairs and clean-up in the same manner.
Outside of these three options: if your intention is to replace the node, then performing a replacement instead of a remove/add is the winner - the new node only has data streamed to it from the other replicas, and there is no further token range movement. Instructions here
If the data disks are available, it is also possible to bring the replacement up without streaming the data, instructions here
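As a rough sketch of the replacement route (the IP address is hypothetical and the exact file varies by packaging), the replacement node is pointed at the dead node before its first start:

    # On the new, empty node - add the dead node's address as a JVM option,
    # e.g. in cassandra-env.sh, before Cassandra is started for the first time
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"

    # Starting Cassandra then streams the dead node's ranges from the surviving replicas
    sudo service cassandra start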

The Datastax Documentation goes over the use case for nodetool removenode in depth.
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsRemoveNode.html
The gist of why it can be bad is this:
Warning: This command triggers cluster streaming. In large environments, the additional streaming activity causes more pending gossip tasks in the output of nodetool tpstats. Nodes can start to appear offline and may need to be restarted to clear up the backlog of pending gossip tasks.
According to the document, this is when it should be used:
When the node is down and nodetool decommission cannot be used, use nodetool removenode. Run this command only on nodes that are down.
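For reference, a minimal sketch of that workflow (the Host ID below is a made-up example):

    # Confirm the node really is down (DN) and note its Host ID
    nodetool status

    # From any live node, remove the dead node by Host ID
    nodetool removenode 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c

    # Monitor progress; only force completion if you accept the reduced resilience
    nodetool removenode status
    nodetool removenode force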

Related

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply stop Cassandra on the existing nodes and then do a nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node is responsible for is increased with the new nodes? E.g.: the 3.0.16 nodes equally split the keyspace with 128 tokens each, but the new 3.0.24 nodes will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue with the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data integrity of the cluster, however. In short, each new node was adding ITSELF to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase the load on the cluster. The simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime, and won't need to wait for bootstrap/decommission.
Another possibility, if you can't stop the running nodes, is to add all new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new data center, and then decommission the whole old data center without streaming the data. In this scenario you stream the data only once. Also, this plays better if the new nodes will have a different num_tokens value, since it's not recommended to have different num_tokens on the nodes of the same DC.
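A minimal sketch of that datacenter-swap approach, assuming a keyspace named my_ks and datacenters named dc_old and dc_new (all hypothetical names):

    # 1. Add the new DC to the keyspace's replication settings
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 2, 'dc_new': 2};"

    # 2. On each node in the new DC, stream existing data from the old DC
    nodetool rebuild -- dc_old

    # 3. After switching the application to dc_new, drop dc_old from replication
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_new': 2};"

    # 4. Decommission each old node; with no ranges left for this keyspace it has nothing useful to stream
    nodetool decommission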
P.S. Usually it's not recommended to make cluster topology changes when you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center and then switching over to it. But I would recommend not changing it if at all possible.
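If you are standing up new nodes anyway, the value is just a line in cassandra.yaml, set before the node's very first start (16 here only mirrors the recommendation above):

    # cassandra.yaml on a brand-new node, before its first start;
    # it cannot simply be changed on a node that already owns data
    num_tokens: 16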
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

Cassandra repairs on TWCS

We have a 13-node Cassandra cluster (version 3.10) with RF 2 (replication factor) and read/write consistency of ONE.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set up with TWCS with read repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create SSTables that overlap across time windows, and therefore the tombstones won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster it belongs to. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons, some of which include: networking issues, hardware overload (I/O, CPU, etc. - which can cause dropped mutations), Cassandra/DSE being unavailable for whatever reason, etc.
If none of your nodes have been "off-line" for at least a few hours (whether it be DSE or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you counts since the last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically it's some sort of load pressure causing it.
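A quick, hedged way to eyeball both of those (paths are the usual package defaults and may differ on your install):

    # Dropped message counters (per message type) appear at the end of tpstats,
    # and reset on each Cassandra restart
    nodetool tpstats | grep -A 20 -i dropped

    # The logs keep a longer history of dropped MUTATION messages
    grep -i mutation /var/log/cassandra/system.log /var/log/cassandra/debug.log | grep -i drop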
I have scheduled 1-second vmstat output that spools to a log continuously, with log rotation, so I can go back and check a few things if our nodes start misbehaving. It could help.
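Something as simple as the following (file path is just an example; pair it with logrotate) is enough for that kind of retrospective look:

    # One-second vmstat samples with timestamps, appended to a log for later review
    nohup vmstat -t 1 >> /var/log/vmstat.log 2>&1 &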
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level ONE can sometimes create problems for many reasons: data may not replicate across the cluster properly due to dropped mutations, cluster/node overload, high CPU, high I/O, or network problems, in which case you can suffer data inconsistency. Read repair handles this problem some of the time, if it is enabled. You can go with manual repair to ensure consistency of the cluster, but in your case you may get some zombie data back too.
I think, to avoid this kind of issue, you should consider a CL of at least QUORUM for writes, or you should run manual repair within gc_grace_seconds (default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra runs repairs in the background on chunks of data. For more details you can refer to the links below:
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html
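As a hedged sketch (the keyspace name is an example), a rolling manual repair of each node's primary ranges, run on one node at a time within gc_grace_seconds, looks like this:

    # Repair only this node's primary ranges, so work isn't duplicated across replicas
    nodetool repair -pr my_ks

    # On 2.2+/3.x, plain 'nodetool repair' runs incremental repair by default;
    # add -full if you explicitly want a full repair instead
    nodetool repair -full -pr my_ks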

why lost some data after nodetool cleanup in cassandra

We added a new node to the datacenter and then ran nodetool cleanup, following "Add new node to existing cluster in cassandra". But after the cleanup completed, we noticed that we lost some data.
What could be the reason?
Yes, it's important to understand that nodetool cleanup is a potentially destructive tool. Your cluster needs to be in a fully-repaired state (from regular, successful runs of nodetool repair prior).
When you add a new node to the cluster, the token ranges that each node is responsible for are adjusted, and lowered per node. This leaves data on the original nodes that they are no longer responsible for. And that is by design.
The idea was that if for whatever reason the node-add process failed and you had to leave your cluster at its original size, the data would still be there. But if you can't guarantee that your cluster was in a fully-repaired state in the first place and cleanup was run, it's possible that not all replicas had made it to their proper nodes. But, like nodetool getendpoints, the bootstrap process would have assumed that they had.
That's why it's important to ensure that you have been regularly running nodetool repair on your cluster before running nodetool cleanup.
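In other words, a cautious sequence (run node by node; the keyspace arguments are optional) looks roughly like:

    # Before (and routinely): make sure every replica actually holds its data
    nodetool repair -pr

    # After the new node has fully joined: reclaim space on the existing nodes,
    # one node at a time, since cleanup rewrites SSTables and costs I/O
    nodetool cleanup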
nodetool cleanup frees partition keys no longer belonging to a node, so after adding a node and transferring its portion of the data, that "portion" no longer belongs to the old node, and running cleanup will free some space on that node.
If you see that the old node now uses less storage, that is OK - there wasn't any data loss.
On the other hand, if you really can't find some data, it can be due to data corruption or deleted data (with tombstones). What do you mean by data loss anyway?

Clarifications about nodetool repair -pr

From the documentation:
Using the nodetool repair -pr (–partitioner-range) option repairs only the primary range for that node, the other replicas for that range still have to perform the Merkle tree calculation, causing a validation compaction. Because all the replicas are compacting at the same time, all the nodes may be slow to respond for that portion of the data.
There is probably never a time where I can accept all nodes to be slow for a certain portion of the data. But I wonder: Why does it do that (or is there maybe just a mixup with the "-par" option in the documentation?!), when nodetool repair seems to be smarter:
By default, the repair command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots. For example, if you have RF=3 and A, B and C represents three replicas, this command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots (A<->B, A<->C, B<->C) instead of repairing A, B, and C all at once. This allows the dynamic snitch to maintain performance for your application via the other replicas, because at least one replica in the snapshot is not undergoing repair.
However, the datastax blog addresses this issue:
This first phase can be intensive on disk io, however. You can mitigate this to some degree with compaction throttling (since this phase is what we call a validation compaction.) Sometimes that isn’t enough though, and some people try to mitigate this further by using the -pr (–partitioner-range) option to nodetool repair, which repairs only the primary range for that node. Unfortunately, the other replicas for that range will still have to perform the Merkle tree calculation, causing a validation compaction. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data. Fortunately, there is way around this by using the -snapshot option.
That could be nice, but actually, there is no -snapshot option for nodetool repair (see the manpage, or the documentation) (has this option been removed?!)
So overall,
It seems I cannot use nodetool repair -pr, because I always need to keep the system responsive enough to read/write with consistency ONE without significant delay. (Note: We have only one data center.) Or am I missing/misunderstanding something?
Why is nodetool repair smart, keeping one node responsive, while nodetool repair -pr makes all nodes slow for a portion of data?
Where is the -snapshot option: Has it been removed, never implemented, or does it now maybe automatically work like that, also when using nodetool repair -pr?
The blog below addresses these issues:
http://www.datastax.com/dev/blog/repair-in-cassandra
A simple nodetool repair will not only kick off a repair on the node itself but also on all the nodes that hold replicas of its ranges. While this is OK, it is very expensive and typically not an operation you'll carry out on a busy production system during peak times.
Consequently, nodetool repair -pr will carry out a repair of only the primary ranges on that node. You will need to run this on every node of the cluster, as the blog says. Customers with large production systems will typically do this in a rolling fashion across their cluster.
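A rolling run might look like the following sketch (host names and keyspace are placeholders, and in practice you would wait for each run to finish cleanly before moving on):

    # Repair each node's primary ranges, one node at a time
    for host in node1 node2 node3; do
        ssh "$host" nodetool repair -pr my_keyspace
    done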
On another note, DataStax OpsCenter offers the Repair Service, which runs smaller sub-range repairs all the time; so although you're always repairing, it's going on in the background at a lower resource level.
As for the snapshots: running a regular repair will invoke a snapshot as you stated, and you can also invoke a snapshot yourself using nodetool snapshot.
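For example (the tag and keyspace names are illustrative):

    # Take an explicitly named snapshot of a keyspace
    nodetool snapshot -t pre_repair my_keyspace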
Hope this helps!

Is it ok to set all cassandra nodes as seeds?

I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (granted, in the case of node removal, most time will be spent draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is: is this OK, and what are the downsides? Is there a hidden requirement in Cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be better. (Intuitively, there must be a downside to setting all nodes to be seeds, since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes, and nodes that come back up after going down, to automatically migrate the right data; bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option, -Dcassandra.auto_bootstrap=false, was added in Cassandra 2.1: you start Cassandra with the option to put auto_bootstrap=false into effect temporarily, until the node goes down. When the node comes back up, the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster - no need to go back and forth configuring the yaml on each node.
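That start-up option is just a JVM flag passed for a single start, for instance (the exact file depends on your packaging):

    # In cassandra-env.sh (or on the command line) for this start only;
    # the default auto_bootstrap=true applies again on the next restart
    JVM_OPTS="$JVM_OPTS -Dcassandra.auto_bootstrap=false"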
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.
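A cassandra.yaml sketch of that recommendation (IPs are placeholders): a short seed list with one or two nodes per data center, configured identically on every node:

    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          # one or two seeds per DC, identical list on every node in the cluster
          - seeds: "10.0.1.10,10.0.2.10"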
