Avoid zombie data in Cassandra

Recently I faced an issue in a customer setup with a 3-node cluster, where one node went down and came back online only after 12 days. In our scenario, gc_grace_seconds for most of the tables is set to 1 day, and there are a lot of tables.
When this down node came up, stale data from this node got replicated to the other nodes leading to zombie data in all the three nodes.
A solution that I could think of was to clean the node before making it join the cluster and then run a repair which could prevent the occurrence of zombie data.
Is there any other solution to avoid this issue that doesn't require cleaning the node?

You should never bring a node back online if it has been down for longer than the shortest gc_grace_seconds.
This is a challenge in environments where GC grace is set to a very low value. In these situations, the procedure is to completely rebuild the node as if it was never part of the cluster:
Completely wipe all contents of data/, commitlog/ and saved_caches/.
Remove the node's IP from its seeds list if it is listed as a seed node.
Replace the node with itself using the replace_address flag.
Cheers!
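As a rough illustration of the rule above, the following sketch compares a node's downtime with the shortest gc_grace_seconds across tables. The function name, the returned strings and the sample values are hypothetical and only serve the example; it is not part of any Cassandra tooling.

```python
from datetime import timedelta
from typing import Sequence


def rejoin_strategy(downtime: timedelta, gc_grace_seconds: Sequence[int]) -> str:
    """Pick a recovery path for a node that was down for `downtime`.

    `gc_grace_seconds` holds the setting of every table in the cluster.
    The shortest value is the limit, because tombstones of that table
    are the first that may be purged on the other replicas.
    """
    if not gc_grace_seconds:
        raise ValueError("need at least one gc_grace_seconds value")
    shortest = timedelta(seconds=min(gc_grace_seconds))
    if downtime < shortest:
        return "restart and repair"
    return "wipe and replace"


# Scenario from the question: 12 days down, most tables at 1 day (86400 s).
# The second value (10 days) is an assumed setting for the remaining tables.
print(rejoin_strategy(timedelta(days=12), [86400, 864000]))
```

With the values from the question this prints "wipe and replace", since 12 days exceeds the 1-day grace period.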

Related

How do I bring back a Cassandra 2.0 node that's been down for a long time

We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to a hardware issue with one of the drives.
We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair but from my research so far it seems like that would only be good if the node was down for less than gc_grace_seconds which is 10 days.
Seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster?
Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?
So here's what I would do:
If you haven't already, run a removenode on the "dead" node's host ID.
Fire up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified. It defaults to true unless explicitly set otherwise.
It should join right back in, and re-stream its data.
You can monitor its progress by running nodetool netstats | grep Already, which returns a status for each node that is streaming, showing progress as the number of files streamed vs. total files.
The advantage of doing it this way is that the node will not attempt to serve requests until bootstrapping is completed.
If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.
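If you want to turn that netstats output into a percentage, here is a small sketch. The line format in the regular expression and the sample text are assumptions based on typical 2.x/3.x output, so check them against what your version actually prints; streaming_progress is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Assumed line shape (may differ between Cassandra versions), e.g.:
#   Receiving 120 files, 5000 bytes total. Already received 30 files, 1250 bytes total
PROGRESS = re.compile(
    r"(Receiving|Sending) (\d+) files, (\d+) bytes total\. "
    r"Already (?:received|sent) (\d+) files, (\d+) bytes total"
)


def streaming_progress(netstats_output: str) -> List[Tuple[str, int, int]]:
    """Return (direction, files_done, files_total) for each progress line."""
    progress = []
    for line in netstats_output.splitlines():
        match = PROGRESS.search(line)
        if match:
            direction, total_files, _, done_files, _ = match.groups()
            progress.append((direction, int(done_files), int(total_files)))
    return progress


# Made-up sample output, for illustration only.
sample = (
    "Mode: JOINING\n"
    "    /10.0.0.2\n"
    "        Receiving 120 files, 5000 bytes total. "
    "Already received 30 files, 1250 bytes total\n"
)
for direction, done, total in streaming_progress(sample):
    percent = 100 * done // total if total else 100
    print(f"{direction}: {done}/{total} files ({percent}%)")
```

In practice you would feed it the captured output of nodetool netstats instead of the sample string.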
You have already mentioned that you are aware that the node has to be removed if it has been down for more than gc_grace_seconds.
"What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?"
That is exactly the answer: you cannot safely bring that node back if it has been down for more than gc_grace_seconds. It needs to be removed to prevent deleted data from reappearing.
https://stackoverflow.com/a/69098765/429476
From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html
Erick Ramirez answered May 12 2020 at 1:19 PM, edited Dec 03 2021 at 4:49 AM (accepted answer)
If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node then running a repair on it.
If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!
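The wipe-and-replace procedure can be written down as a dry-run checklist. This is only a sketch: self_replacement_plan is a hypothetical helper, the IP and the /var/lib/cassandra base directory are assumed values, and nothing is deleted or started. The replace_address flag is passed to the JVM as the cassandra.replace_address system property.

```python
from pathlib import PurePosixPath
from typing import List


def self_replacement_plan(node_ip: str, cassandra_home: str) -> List[str]:
    """List the steps for replacing a node "with itself" (nothing is executed)."""
    home = PurePosixPath(cassandra_home)
    steps = [f"stop Cassandra on {node_ip}"]
    # The three directories named in the answer.
    for name in ("data", "commitlog", "saved_caches"):
        steps.append(f"delete all contents of {home / name}")
    # The node specifies its own IP as the address to replace.
    steps.append(f"start Cassandra with -Dcassandra.replace_address={node_ip}")
    return steps


# Hypothetical node address and an assumed data location.
for step in self_replacement_plan("10.0.0.5", "/var/lib/cassandra"):
    print(step)
```

Your directories may live elsewhere; check data_file_directories, commitlog_directory and saved_caches_directory in cassandra.yaml before wiping anything.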

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I wanted to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply stop Cassandra on the existing nodes and then run nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue w/ the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data integrity of the cluster, however. In short, each new node was adding ITSELF to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
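A quick guard against that automation mistake could look like the sketch below. It assumes the seeds value is the plain comma-separated string of addresses found in cassandra.yaml (without ports); is_own_seed and the addresses are hypothetical.

```python
def is_own_seed(listen_address: str, seeds: str) -> bool:
    """True if the node's own address appears in the comma-separated seeds string."""
    seed_list = [s.strip() for s in seeds.split(",") if s.strip()]
    return listen_address in seed_list


# Hypothetical values, as they might appear in cassandra.yaml.
seeds = "10.0.0.1, 10.0.0.2,10.0.0.9"
print(is_own_seed("10.0.0.9", seeds))  # True: this node would skip bootstrap
print(is_own_seed("10.0.0.7", seeds))  # False: this node would bootstrap normally
```

Running a check like this in the provisioning script before the first start would have caught the nodes that later refused to bootstrap.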
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap and decommission will take quite a long time, especially if you have a lot of data: you will stream all data twice, and this will increase the load on the cluster. A simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap and decommission.
Another possibility, if you can't stop a running node, is to add all the new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new datacenter, and then decommission the whole old datacenter without streaming the data. In this scenario you will stream the data only once. It also works better if the new nodes have a different num_tokens value, since it's not recommended to have different num_tokens on nodes of the same DC.
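To make the new-datacenter route more concrete, here is a sketch that only builds the statements involved. The keyspace name, the datacenter names and the replication factor are hypothetical, and the final ALTER that drops the old DC from replication is my assumption about how the old datacenter is emptied before decommissioning; the answer itself does not spell that step out.

```python
from typing import List


def new_dc_migration_steps(keyspace: str, old_dc: str, new_dc: str, rf: int) -> List[str]:
    """Build the CQL and nodetool commands for migrating via a new datacenter."""
    # 1. Replicate the keyspace to both datacenters.
    add_new_dc = (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{old_dc}': {rf}, '{new_dc}': {rf}}};"
    )
    # 2. Run on every node of the new DC to stream existing data from the old DC.
    rebuild = f"nodetool rebuild -- {old_dc}"
    # 3. Assumed step: after switching the application over, stop replicating to the old DC.
    drop_old_dc = (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{new_dc}': {rf}}};"
    )
    return [add_new_dc, rebuild, drop_old_dc]


for step in new_dc_migration_steps("my_ks", "dc_old", "dc_new", 3):
    print(step)
```

Remember that the ALTER statements have to be applied to every keyspace that should live in the new datacenter, not just one.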
P.S. It's usually not recommended to make changes to the cluster topology while you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center, and then switching over to it. But I would recommend not changing that if at all possible.
"In short, each new node was adding ITSELF to the list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

What to do if node repair wasn't run within GCGraceSeconds?

I don't believe any of my nodes have been down for an extended period of time, so I believe all of my deletes should have been replicated to all of them. However, I keep seeing recommendations to run node repair within GCGraceSeconds as normal maintenance. I don't believe node repair has ever been run on my cluster (I inherited it a few months ago). Do I have anything to worry about? Will I have zombie data if I run node repair even if I haven't had any nodes down for an extended time?
My main question is - what can I do to get out of this state so I can start routinely running nodetool repair?
Cassandra has no 'normal' deletes like relational databases do. When you delete something, Cassandra just adds a record, called a 'tombstone', that marks the data as deleted. Even if all of your tombstones are properly replicated, they still live in your files and can affect performance; and if a tombstone is purged before it has reached every replica, some deleted records can become 'alive' again.
In general, you need to run 'nodetool repair' on every node of your cluster regularly.
You can check details in the documentation.
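To see how a missed tombstone turns into zombie data, here is a toy model. It is not how Cassandra is implemented: replicas are plain dicts, timestamps are small integers, and merge and purge_tombstones are hypothetical stand-ins for repair and compaction.

```python
TOMBSTONE = object()  # marker standing in for a delete record


def merge(replicas, key):
    """Toy repair: the newest timestamp wins and is copied to every replica."""
    cells = [r[key] for r in replicas if key in r]
    if not cells:
        return
    winner = max(cells, key=lambda cell: cell[0])  # cell = (timestamp, value)
    for r in replicas:
        r[key] = winner


def purge_tombstones(replica, now, gc_grace):
    """Toy compaction: drop tombstones that are older than gc_grace."""
    expired = [
        k for k, (ts, value) in replica.items()
        if value is TOMBSTONE and now - ts > gc_grace
    ]
    for k in expired:
        del replica[k]


def is_live(replica, key):
    return key in replica and replica[key][1] is not TOMBSTONE


# Three replicas all hold row "k", written at t=1.
a, b, c = ({"k": (1, "v")} for _ in range(3))
# The delete at t=2 reaches a and b only, because c is down.
a["k"] = b["k"] = (2, TOMBSTONE)
# Time passes beyond gc_grace (10 here); a and b purge the tombstone.
for replica in (a, b):
    purge_tombstones(replica, now=20, gc_grace=10)
# c comes back and a repair runs: its stale copy is the only one left.
merge([a, b, c], "k")
print([is_live(r, "k") for r in (a, b, c)])  # [True, True, True]
```

Had the repair run before the tombstones were purged, the tombstone (t=2) would have beaten the stale value (t=1) and the row would have stayed deleted on all three replicas. That is why repair needs to run within the grace period.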

How to restart a seed node after its process crashes?

Are there any differences between replacing a dead node and restarting a dead node, especially for seed nodes? Actually, I'm a little confused about how to restart a dead seed node.
When the process of a seed node crashes, should I restart it without making any changes to cassandra.yaml? Or, as when replacing a seed node, should I remove its IP address from the seeds list (cassandra.yaml) on each node?
The documentation is not clear about that. It only covers how to replace a dead node with another machine.
Thank you
If you are simply restarting a dead seed node, then you shouldn't need to alter your cassandra.yaml file before the restart. As long as you have addressed whatever caused the node to die, and your node has not been down longer than gc_grace_seconds (see note below), then restarting shouldn't be an issue.
The concerns noted in the documentation you have linked center around replacing dead seed nodes. The problem with replacing seed nodes is that the new node will not bootstrap into the cluster if it is configured as a seed. In that case, a different node in the cluster should be promoted to be a seed node.
Note: the About Deletes section of the documentation warns about bringing a node back that has been down a long time. Specifically, longer than the value set for gc_grace_seconds (or the shortest value set, if you have changed it on any individual tables).
"...if a node is down longer than the grace period, the node can miss the delete because the tombstone disappears after gc_grace_seconds. Cassandra always attempts to replay missed updates when the node comes back up again. After a failure, it is a best practice to run node repair to repair inconsistencies across all of the replicas when bringing a node back into the cluster. If the node doesn't come back within gc_grace_seconds, remove the node, wipe it, and bootstrap it again."

Cleaning and rejoining same node in cassandra cluster

We have a Cassandra 0.8.2 cluster of 24 nodes with replication factor 2. One of the nodes is quite slow, and most of the SSTables on this node are corrupt. (We are not able to run compaction or even scrub.)
So is it possible to clean the data, cache and commitlog directories for this node and restart with bootstrap=true? Will that get all the data streamed back to this node?
If it is possible, is there anything that could cause issues? What care should be taken to avoid any danger?
As long as you have your replication factor set to 2, you should not have a problem cleaning up and restarting the node. But it will take some time for the data to stream back; mine took around 4 hours. A good way to follow this visually is to install OpsCenter from DataStax. It's a great tool. There is no danger. Let us know if you were able to succeed!
Also, it is advisable to upgrade to Cassandra 1.0. It is much faster, and you will instantly notice the difference.
