Cleaning and rejoining same node in cassandra cluster - cassandra

We have Cassandra-0.8.2 cluster of 24 nodes and replication factor 2 . One of the node is quite slow and most of sstables on this node is corrupt.(We are not able to run compaction and not even scrub)
So is it possible to clean the data,cache and commitlog directories for this node and restart with bootstrap=true? Will it help to get all the data stream back to this node?
If it is possible , is there anything that could create issue?What care should be taken to avoid any danger?

As long as you have your replication factor set to 2. you should not have a problem to clean up and restart the machine node. But it would take some time for the data to flow through, Mine took around 4 hours. A Best way to analyse this visually is to install Opscenter from the DataStax. Its a great tool. There is no danger. Let us know if you were able to succeed!
Also it is advisible to upgrade to Cassandra 1.0. it is much faster! & you will instantly notice the difference.

Related

How do I bring back a Cassandra 2.0 node that's been down for a long time

We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to hardware issue with one of the drives.
We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair but from my research so far it seems like that would only be good if the node was down for less than gc_grace_seconds which is 10 days.
Seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster?
Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?
So here's what I would do:
If you haven't already, run a removenode on the "dead" node's host ID.
Fire-up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified. It defaults to true unless explicitly set otherwise.
It should join right back in, and re-stream its data.
You can monitor it's progress by running nodetool netstats | grep Already, which returns a status by each node streaming, specifying completion progress in terms of # of files streamed vs. total files.
The advantage of doing it this way, is that the node will not attempt to serve requests until bootstrapping is completed.
If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.
You have mentioned already that you are aware that node has to be removed if it is down for more than gc_grace_seconds
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?
So the answer is that only. You cannot safely bring that node back if it is down more than gc_grace_seconds. It needs to be removed to prevent possible deleted data from appearing back.
https://stackoverflow.com/a/69098765/429476
From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html
Erick Ramirez answered • May 12 2020 at 1:19 PM | Erick Ramirez edited • Dec 03 2021 at 4:49 AM BEST ANSWERACCEPTED ANSWER
#cache_drive If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node then running a repair on it.
If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!

Avoid zombie data in cassandra

Recently I faced an issue in a customer setup with a 3 node cluster, where one node went down and came online only after 12 days. The default gc_grace_seconds for most of the table has been set to 1 day in our scenario and there are a lot of tables.
When this down node came up, stale data from this node got replicated to the other nodes leading to zombie data in all the three nodes.
A solution that I could think of was to clean the node before making it join the cluster and then run a repair which could prevent the occurrence of zombie data.
Could there be any other possible solution to avoid this issue where I don't need to clean the node.
You should never bring a node back online if it has been down for longer than the shortest gc_grace_seconds.
This is a challenge in environments where GC grace is set to a very low value. In these situations, the procedure is to completely rebuild the node as if it was never part of the cluster:
Completely wipe all contents of data/, commitlog/ and saved_caches/.
Remove the node's IP from its seeds list if it is listed as a seed node.
Replace the node with itself using the replace_address flag.
Cheers!

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I wanted to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I can not simply stop Cassandra on the existing nodes and then do an nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do i risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the apache slack, it appears as though there's no issue w/ the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data-integrity of the cluster, however. In short, each new node was adding ITSSELF to list list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase load onto cluster. The simpler solution would be to replace old nodes by copying their data onto new nodes that have the same configuration as old nodes, but with different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer, when it's done correctly you will have minimal downtime, and won't need to wait for bootstrap decommission.
Another possibility if you can't stop running node is to add all new nodes as a new datacenter, adjust replication factor to add it, use nodetool rebuild to force copying of the data to new DC, switch application to new data center, and then decommission the whole data center without streaming the data. In this scenario you will stream data only once. Also, it will play better if new nodes will have different number of num_tokens - it's not recommended to have different num_tokens on the nodes of the same DC.
P.S. usually it's not recommended to do changes in cluster topology when you have nodes of different versions, but maybe it could be ok for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason, is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center, and then switching over to it. But I would recommend not changing that if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

How to speedup the bootstrap of single node

I have a single node Cassandra installation on my development machine (and very little experience with Cassandra). I always had very few data in the node and I experienced no problems. I inserted about 9,000 elements in a table today to experiment with a real world use case. When I start up the node the boot time is extremely long now. I get this in system.log
Replaying /var/lib/cassandra/commitlog/CommitLog-3-1388134836280.log
...
Log replay complete, 9274 replayed mutations
That took 13 minutes and is hardly bearable. I wonder if there is a way to store data in such a way that can be read at once without replaying the log. After all 9,000 elements are nothing and there must be a quicker way to boot. I googled for hints and searched into Cassandra's documentation but I didn't find anything. It's obvious that I'm not looking for the right things, would anybody be so kind to point me to the right documents? Thanks.
There are a few things that might help. The most obvious thing you can do is flush the commit log before you shutdown Cassandra. This is a good idea to do in production too. Before I stop a Cassandra node in production I'll run the following commands:
nodetool disablethrift
nodetool disablegossip
nodetool drain
The first two commands gracefully shut down connections to clients connected to this node and then to other nodes in the ring. The drain command flushes memtables to disk (sstables). This should minimize what needs to be replayed on startup.
There are other factors that can make startup take a long time. Cassandra opens all the SSTables on disk at startup. So the more column families and SSTables you have on disk the longer it will take before a node is able to start serving clients. There was some work done in the 1.2 release to speed this up (so if you are not on 1.2 yet you should consider upgrading). Reducing the number of SSTables would probably improve your start time.
Since you mentioned this was a development machine I'll also give you my dev environment observations. On my development machine I do a lot of creating and dropping column families and key spaces. This can cause some of the system CFs to grow significantly and eventually cause a noticeable slowdown. The easiest way to handle this is to have a script that can quickly bootstrap a new database and blow away all the old data in /var/lib/cassandra.

Bring back a dead datacenter: repair or rebuild

I had Cassandra cluster running across two data centers, for some reason, one data center was taken down for a while, and now I'm planning to bring it back. I'm thinking about two approaches:
One is to start up all Cassandra nodes of this data center and run "nodetool repair " on each node one by one. But It looks like 'repair' takes long time. I had an experience to repair 6GB data on a node before, it took me 5 hours on one node (3 nodes cluster). I have much more data on the cluster now, can't image how long it will take.
So I'm thinking if I can run re-build instead of repair. I can delete all old data on this data center and re-build it as adding a new data center. But not sure if it works and how performance would be.
Any idea on it? Any suggestion would be appreciated. Thanks in advance.
If the data center was down for more than 10 days then rebuild is the only option. This has to do with the tombstones. I am not 100% sure how that works across different data centers, but if you have a server that was down for more than 10 days, any data that has been deleted in the live server has been tombstoned and kept there for 10 days, then removed completely. If all suddenly your downed server wakes up from the sleep, with all deleted data not tombstoned, then it will be repopulated back to the ring via read repair or regular repair operation.
Another thing to consider, is how much data has changed/deleted since the datacenter went down. If a lot, then it obviously it is less work to rebuild. If not then maybe repair will be faster.
You can create another datacenter, add nodes to it with auto_bootstrap: false and then run nodetool rebuild <live dc name>
Good luck!

Resources