How to restart a seed node after its process crashes? - cassandra

Are there any differences between replacing a dead node and restarting a dead node, especially for seed nodes? Actually, I'm a little bit confused about how to restart a dead seed node.
When the process of a seed node crashes, should I restart it without making any changes to cassandra.yaml? Or, as when replacing a seed node, should I remove its IP address from the seeds list (cassandra.yaml) on each node?
The documentation is not clear about that. It only covers how to replace a dead node with another machine.
Thank you.

If you are simply restarting a dead seed node, then you shouldn't need to alter your cassandra.yaml file before the restart. As long as you have addressed whatever caused the node to die, and your node has not been down longer than gc_grace_seconds (see note below), then restarting shouldn't be an issue.
The concerns noted in the documentation you have linked center around replacing dead seed nodes. The problem with replacing seed nodes is that the new node will not bootstrap into the cluster if it is configured as a seed. In that case, a different node in the cluster should be promoted to be a seed node.
Note: the About Deletes section of the documentation warns about bringing a node back that has been down a long time. Specifically, longer than the value set for gc_grace_seconds (or the shortest value set, if you have changed it on any individual tables).
...if a node is down longer than the grace period, the node can miss the delete because the tombstone disappears after gc_grace_seconds. Cassandra always attempts to replay missed updates when the node comes back up again. After a failure, it is a best practice to run node repair to repair inconsistencies across all of the replicas when bringing a node back into the cluster. If the node doesn't come back within gc_grace_seconds, remove the node, wipe it, and bootstrap it again.
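For the simple-restart case above, a minimal sketch of the sequence might look like this (assuming a package install managed by systemd; the service name may differ on your system):

    # Start Cassandra again on the crashed seed node (no cassandra.yaml changes)
    sudo systemctl start cassandra

    # Confirm it rejoins the ring and reports UN (Up/Normal)
    nodetool status

    # Repair the node's primary ranges to catch any writes it missed while down
    nodetool repair -pr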

Related

How do I bring back a Cassandra 2.0 node that's been down for a long time

We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to hardware issue with one of the drives.
We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair but from my research so far it seems like that would only be good if the node was down for less than gc_grace_seconds which is 10 days.
Seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster?
Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?
So here's what I would do:
If you haven't already, run a removenode on the "dead" node's host ID.
Fire up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified. It defaults to true unless explicitly set otherwise.
It should join right back in, and re-stream its data.
You can monitor its progress by running nodetool netstats | grep Already, which returns a status for each node that is streaming, specifying completion progress in terms of the number of files streamed vs. total files.
The advantage of doing it this way is that the node will not attempt to serve requests until bootstrapping is completed.
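Put together, a rough sketch of that sequence (the host ID is a placeholder you would read from nodetool status on a live node):

    # On any live node: find the dead node's Host ID, then remove it
    nodetool status
    nodetool removenode <host-id-of-dead-node>

    # On the old node, before starting it, check cassandra.yaml:
    #   - its own IP must NOT be in the seeds list
    #   - auto_bootstrap must be true or simply absent (it defaults to true)
    sudo systemctl start cassandra

    # Watch streaming progress while it bootstraps back in
    nodetool netstats | grep Already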
If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.
You have already mentioned that you are aware the node has to be removed if it is down for more than gc_grace_seconds:
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?
So that is exactly the answer: you cannot safely bring that node back if it has been down for more than gc_grace_seconds. It needs to be removed to prevent deleted data from reappearing.
https://stackoverflow.com/a/69098765/429476
From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html
Erick Ramirez answered May 12 2020 at 1:19 PM, edited Dec 03 2021 at 4:49 AM (accepted answer):
If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node then running a repair on it.
If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!
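A rough outline of that "replace with itself" procedure, assuming default package-install paths under /var/lib/cassandra (adjust to your layout; <node-ip> is a placeholder for the node's own address):

    # Stop Cassandra and wipe the old state
    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/*

    # Tell the node to replace its own old entry in the ring
    # (add to cassandra-env.sh; replace_address_first_boot is the safer
    # variant where available, since it only applies on the first start)
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<node-ip>"

    # Start it and watch it re-stream its replicas
    sudo systemctl start cassandra
    nodetool netstats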

Avoid zombie data in cassandra

Recently I faced an issue in a customer setup with a 3-node cluster, where one node went down and came online only after 12 days. The default gc_grace_seconds for most of the tables has been set to 1 day in our scenario, and there are a lot of tables.
When this down node came up, stale data from this node got replicated to the other nodes, leading to zombie data on all three nodes.
A solution that I could think of was to clean the node before making it join the cluster and then run a repair which could prevent the occurrence of zombie data.
Could there be any other possible solution to avoid this issue where I don't need to clean the node?
You should never bring a node back online if it has been down for longer than the shortest gc_grace_seconds.
This is a challenge in environments where GC grace is set to a very low value. In these situations, the procedure is to completely rebuild the node as if it was never part of the cluster:
Completely wipe all contents of data/, commitlog/ and saved_caches/.
Remove the node's IP from its seeds list if it is listed as a seed node.
Replace the node with itself using the replace_address flag.
Cheers!
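To check how close you are to the shortest GC grace in the first place, you can list gc_grace_seconds for every table; the query below assumes Cassandra 3.0+, where table metadata lives in system_schema (older versions keep it in system.schema_columnfamilies):

    # The smallest value returned here is the window you have for bringing a node back
    cqlsh -e "SELECT keyspace_name, table_name, gc_grace_seconds FROM system_schema.tables;"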

Why shouldn't you run nodetool removenode?

I am wondering why it is a best practice to not run nodetool removenode. What is it used for? Is there a hierarchy of commands to run instead? What kind of issues arise when running said command? Any first hand experience/nightmare stories of using removenode? Overarching why not?
A default preference order would be:
Replacement node option (if replacement is planned)
Decommission
RemoveNode
Assassinate
However - there are situations where you will still choose a lower entry over an earlier one.
If the node being removed is operational, then you would normally run a decommission and allow the node to stream the data from itself to other nodes which will now be holding one of the replicas that was previously on the node being removed.
Removing a node will cause the token ranges to be recalculated and move, potentially requiring all nodes to start streaming data to other nodes who now own that range.
If the node is not operational, you can perform a nodetool removenode - this will trigger the same range movement and cause a large amount of streaming. There are streaming throughput throttles that are in place by default and can be adjusted to limit this impact.
You can also forcibly terminate either a decommission or a removenode by using nodetool [decommission | removenode] force - however, this means that one of the replicas of the data has not been re-established to another node, leaving you with less resilience.
Why would you do that? For the same streaming reason: if you accept the loss of resilience for a period of time, you can then roll out the repair node by node in a controlled manner. This option should not be considered your 'default approach' or a choice taken lightly; I cannot stress that enough or make it bold enough.
The final option, when decommission / removenode is not available, is to assassinate the node. This is pretty much the same as performing a removenode followed by an immediate force. You then have to manage the repairs and clean-up in the same manner.
Outside of these 3 options, the best option, if your intention is to replace the node, is to perform a replacement instead of a remove / add. This only requires that the new node has data streamed to it from the other replicas, and there is no further token ring range movement. Instructions here
If the data disks are available, it is also possible to bring the replacement up without streaming the data, instructions here
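As a command-level summary of the options above (the host ID and throughput value are placeholders; check your version's documentation for exact units and availability, e.g. nodetool assassinate exists from Cassandra 2.2 on):

    # Node still up: let it stream its own data away
    nodetool decommission              # run ON the node being removed

    # Node down: have the surviving replicas re-stream its ranges
    nodetool removenode <host-id>      # run on any live node
    nodetool removenode status         # check progress
    nodetool removenode force          # last resort: skips re-replication

    # Throttle the resulting streaming (see the stream throughput settings in cassandra.yaml)
    nodetool setstreamthroughput 200

    # Truly last resort when decommission/removenode are unavailable
    nodetool assassinate <ip-address>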
The Datastax Documentation goes over the use case for nodetool removenode in depth.
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsRemoveNode.html
The gist of why it can be bad is this:
Warning: This command triggers cluster streaming. In large environments, the additional streaming activity causes more pending gossip tasks in the output of nodetool tpstats. Nodes can start to appear offline and may need to be restarted to clear up the backlog of pending gossip tasks.
According to the document, this is when it should be used:
When the node is down and nodetool decommission cannot be used, use nodetool removenode. Run this command only on nodes that are down.

Replacing a seed node without removing it from seed list

I have a Cassandra (version 2.2.8) cluster of 6 nodes in total, where 3 nodes are seed nodes. One of the seed nodes recently went down. I need to replace that dead seed node. My cluster is set up in a way where it cannot survive the loss of more than 1 node. I read this documentation to replace the dead seed node.
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html
As per the documentation, I am scared to remove the dead seed node from the seed list and do a rolling restart. If for any reason any node doesn't start, I'll lose the data.
How should I approach this scenario? Is it ok to not remove the dead seed node from the seed list until the new node is fully up and running? I already have two working seed nodes present in the seed list. Please advise.
In short: yes, it is okay to wait before removing the seed node.
Explanation: the seed node configuration does two things:
When adding new nodes, the new node will read the seed configuration to get a first contact point into the Cassandra cluster. After the node has joined the cluster, it will save information about all Cassandra nodes in its system.peers table. For all future starts it will use this information to connect to the cluster, not the seed node configuration.
Cassandra also uses seeds as a way to improve gossip. Basically, seed nodes are more likely to receive gossip messages than normal nodes. This improves the speed at which nodes receive updates about other nodes, such as the status.
Losing a seed node will, in your case, only impact point 2. Since you have two more seed nodes, I don't see this as a big issue. I would still do a rolling restart on all nodes as soon as you have updated your seed configuration.
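If you want to confirm that the surviving nodes already know each other independently of the seed list, you can inspect the locally stored peer information mentioned in point 1:

    # Each node records its peers here after joining, so it does not need
    # the seed list again on later restarts
    cqlsh -e "SELECT peer, host_id, data_center, rack FROM system.peers;"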

Is it ok to set all cassandra nodes as seeds?

I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (granted, in the case of node removal, most time will be spent draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is: is this ok, and what are the downsides? Is there a hidden requirement in Cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be better. (Intuitively, there must be a downside to setting all nodes to be seeds, since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes and nodes that come up after going down to automatically migrate the right data. Bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option -Dcassandra.auto_bootstrap=false was added to Cassandra 2.1: You start Cassandra with the option to put auto_bootstrap=false into effect temporarily until the node goes down. When the node comes back up the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster--no need to go back and forth configuring the yaml on each node.
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.
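For reference, both knobs discussed above are JVM system properties that can be set in cassandra-env.sh or passed on the command line (values shown are illustrative, not recommendations):

    # Skip bootstrapping when standing up a brand-new, empty cluster (Cassandra 2.1+)
    JVM_OPTS="$JVM_OPTS -Dcassandra.auto_bootstrap=false"

    # Shorten the gossip settling delay (RING_DELAY, default 30000 ms); use with care
    JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=10000"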
