Can I upgrade a Cassandra cluster swapping in new nodes running the updated version? - cassandra

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I wanted to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I can not simply stop Cassandra on the existing nodes and then do an nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do i risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the apache slack, it appears as though there's no issue w/ the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data-integrity of the cluster, however. In short, each new node was adding ITSSELF to list list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.

Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase load onto cluster. The simpler solution would be to replace old nodes by copying their data onto new nodes that have the same configuration as old nodes, but with different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer, when it's done correctly you will have minimal downtime, and won't need to wait for bootstrap decommission.
Another possibility if you can't stop running node is to add all new nodes as a new datacenter, adjust replication factor to add it, use nodetool rebuild to force copying of the data to new DC, switch application to new data center, and then decommission the whole data center without streaming the data. In this scenario you will stream data only once. Also, it will play better if new nodes will have different number of num_tokens - it's not recommended to have different num_tokens on the nodes of the same DC.
P.S. usually it's not recommended to do changes in cluster topology when you have nodes of different versions, but maybe it could be ok for 3.0.16 -> 3.0.24.

To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason, is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center, and then switching over to it. But I would recommend not changing that if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

Related

Cassandra adding new Datacenter with even token distribution

We have a 1 DC cluster running Cassandra 3.11. The DC has 8 nodes total with 16 tokens per node and 3 seed nodes. We use Murmur3Partitioner.
In order to ensure better data distribution for the upcoming cluster in another DC, we want to use the tokens allocation approach where you manually specify initial_token for seed nodes and use allocate_tokens_for_keyspace for non seed nodes.
The problem is that our current datacenter cluster is not well balanced, since we built the cluster without a tokens allocation approach. So currently this means that the tokens are not well distributed. I can't figure out how to calculate initial_token for the new seed nodes in the new Datacenter. I probably cannot consider the token range of the new cluster as independent and calculate the initial token range as I would for a fresh cluster. At this point I am very unsure how to proceed. Any help will be appreciated, thanks.
Currently, I am trying to make a concept of migration and have come to the part where I do not know what to do and the documentation is not helpful.
There are scripts available to calculate the initial_token value, for example, you could use the one here to quickly calculate these values:
https://www.geroba.com/cassandra/cassandra-token-calculator/
You do have the ability to set allocate_tokens_for_keyspace and point it to a keyspace with a replication factor you plan to use for user-created keyspaces in the cluster, if you're adding a new DC, then you probably already have such a keyspace, and this should help you get better distribution. Remember to set this before bootstrapping nodes to the new DC.
Another option would be to avoid using vnodes entirely and go with single token architecture by setting num_tokens to 1. This gives you the ability to bootstrap nodes to the new DC, load/stream data and then monitor the distribution and make changes as needed using 'nodetool move':
https://cassandra.apache.org/doc/3.11/cassandra/tools/nodetool/move.html
This method would require you to monitor the distribution and make changes to the token assignments as needed, and you'd want to follow-up the move command with 'nodetool repair' and 'nodetool cleanup' on all nodes, but it gives you the ability to rectify uneven distribution quickly without bootstrapping new nodes. You would still want to use the same method for calculating the initial_token values with single-token architecture and set that before bootstrap.
I suspect either method could work well for you, but wanted to give you a second option.

Cassandra: How to find node with matching token for restoring to newer cluster?

I want to restore data from an existing cluster to newer cluster. I want to do so using the method, that of, copying the snapshot SSTables from old cluster to keyspaces of newer cluster, as explained in http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html.
The same document says, " ... the snapshot must be copied to the correct node with matching tokens". What does it really mean by "node with matching tokens"?
My current cluster is of 5 nodes and for each node num_tokens: 256. I am gonna create another cluster with same no of nodes and num_tokens and same schema. Do I need to follow the ring order while copying SSTables to newer cluster? How do I find matching target node for a given source node?
I tried command "nodetool ring" to check if I can use token values to match. But this command gives all the tokens for each host. How can I get the single token value (which determines the position of the node in the ring)? If I can get it, then I can find the matching nodes as well.
With vnodes its really hard to copy the sstables over correctly because its not just one assigned token that you have to reassign, but 256. To do what your asking you need to do some additional steps described http://datascale.io/cloning-cassandra-clusters-fast-way/. Basically reassign the 256 tokens of each node to a new node in other cluster so the ring is the same. The article you listed describes loading it on the same cluster which is a lot simpler because you dont have to worry about different topologies. Worth noting that even in that scenario, if a new node was added or a node was removed since the snapshot it will not work.
Safest bet will be to use sstableloader, it will walk through the sstable and distribute the data in the appropriate node. It will also open up possibility of making changes without worrying if everything is correct. Also it ensures everything is on the correct nodes so no worries about human errors. Each node in the original cluster can just run sstableloader on each sstable to the new cluster and you will parallelize the work pretty well.
I would strongly recommend you use this opportunity to decrease the number of vnodes to 32. The 256 default is excessive and absolutely horrible for rebuilds, solr indexes, spark, and most of all it ruins repairs. Especially if you use incremental repairs (default), the additional ranges will cause much more anticompactions and load. If you use sstableloader on each sstable it will just work. Increasing your streaming throughput in the cassandra.yaml will potentially speed this up a bit as well.
If by chance your using OpsCenter this backup and restore to new cluster is automated as well.

Is it ok to set all cassandra nodes as seeds?

I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (Granted, in the case of node removal, most time will be spend draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is is this ok? and what are the downsides of this? Is there a hidden requirement in cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be better. (Intuitively, there must be a downside to setting all nodes to be seeds since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes and nodes that come up after going down to automatically migrate the right data. Bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option -Dcassandra.auto_bootstrap=false was added to Cassandra 2.1: You start Cassandra with the option to put auto_bootstrap=false into effect temporarily until the node goes down. When the node comes back up the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster--no need to go back and forth configuring the yaml on each node.
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.

DynamicSnitch Reads from empty new datacenter

When adding a new datacenter the dynamicSnitch causes us to read data from the new dc when the data is not there yet.
We have a cassandra (1.0.11) cluster running on 3 datacenters and we want to add a forth datacenter. The cluster is configured with PropertyFileSnitch and DynamicSnitch enabled with 0.0 badness factor. The relevant keyspaces replication factor are DC1:2, DC2:2, DC3:2. Our plan was to add the new datacenter to the ring, add it to the schema and run a rolling repair -pr on all the nodes so the new nodes will get all of their needed data.
Once we started the process we noticed that the new datacenter recieves read calls from the other data centers because it has a lower load and the DynamicSnitch decides it will be better to read from it. The problem is that the data center still doesn't have the data and returns no results.
We tried removing the DynamicSnitch entirely but once we did that every time a single server got a bit of load we experience extreme performance degredation.
Have anyone encountered this issue ?
Is there a way to directly influence the score of a specific data center so it won't be picked by the DynamicSnitch ?
Are there any better ways to add a datacenter in cassandra 1.0.11 ? Have anyone written a snitch that handles these issues ?
Thanks,
Izik.
You could bootstrap the nodes instead of adding to the ring without bootstrap and then repairing. The former ensures that no reads will be routed to it until it has all the data it needs. (That is why Cassandra defaults to auto_bootstrap: true and in fact disabling it is a sufficiently bad idea that we removed it from the example cassandra.yaml.)
The problem with this, and the reason that the documentation recommends adding all the nodes first without bootstrap, is that if you have N replicas configured for DC4, Cassassandra will try to replicate the entire dataset for that keyspace to the first N nodes you add, which can be problematic!
So here are the options I see:
If your dataset is small enough, go ahead and use the bootstrap plan
Increase ConsistencyLevel on your reads so that they will always touch a replica that does have the data, as well as one that does not
Upgrade to 1.2 and use ConsistencyLevel.LOCAL_ONE on your reads which will force it to never make cross-DC requests

What would be the exact procedure to add new nodes to a Cassandra cluster so that the cluster remains balanced?

I've read the relevant documentation I could find, but I still have doubts.
What I read
From http://wiki.apache.org/cassandra/Operations#Moving_nodes
If you add nodes to your cluster your ring will be unbalanced and only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using nodetool move command.
and from http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity-to-an-existing-cluster
If you need to increase capacity by a non-uniform number of nodes, you must recalculate tokens for the entire cluster, and then use nodetool move to assign the new tokens to the existing nodes. After all nodes are restarted with their new token assignments, run a nodetool cleanup to remove unused keys on all nodes
But I'm not clear on the order of these things.
Could you explain how to do it in the following scenario?
I'm using cassandra 1.1.9, so no virtual nodes are in use.
I have a cluster ring with 5 nodes, and each owns 20%
Their tokens are
0
34028236692093846346337460743176821145
68056473384187692692674921486353642291
102084710076281539039012382229530463436
136112946768375385385349842972707284582
I want to add 2 additional nodes.
What steps do I have to follow? I know I should install and configure cassandra, use the original 5 as seeds, and calculate their new tokens, but in what order should I move the data using nodetool move? Is it one at a time?
What happens with the data when I move the first one? Is it available at all times?
Should I start the two new nodes before moving the original 5 to their new tokens?
A step by step guide would be ideal.
Please note that I need to do it pre version 1.2
The new tokens should be
0
24305883351495604533098186245126300818
48611766702991209066196372490252601636
72917650054486813599294558735378902454
97223533405982418132392744980505203272
121529416757478022665490931225631504090
145835300108973627198589117470757804908
calculated using 2^127/7 * {0-7}.
What steps do I have to follow?
in what order should I move the data using nodetool move?
You should
Bootstrap in one node at 48611766702991209066196372490252601636
Bootstrap the other node at 121529416757478022665490931225631504090
Move 34028236692093846346337460743176821145 to 24305883351495604533098186245126300818
Move 68056473384187692692674921486353642291 to 72917650054486813599294558735378902454
Move 102084710076281539039012382229530463436 to 97223533405982418132392744980505203272
Move 136112946768375385385349842972707284582 to 145835300108973627198589117470757804908
(I tried to minimise the amount of data transferred - might not be optimal but is close enough to not make much difference given the inbalance of data you probably have already.)
Is it one at a time?
You should bootstrap one node and once and move one token at once. This avoids placing excess load on the cluster while streaming data.
What happens with the data when I move the first one? Is it available at all times?
Data is fully available during the move. The node participates in reads and writes for the old and new range so you can read and write during the move.
Should I start the two new nodes before moving the original 5 to their new tokens?
Always better to have more nodes in the cluster - if you moved first, you'd have some nodes with twice as much data as the others.
From Cassandra 1.2, keeping a cluster balanced when adding nodes is very easy, because of the new vnodes (multiple seeds per node) feature. Cassandra now automatically balances the cluster for you. If you upgrade from an earlier version you will have to activate the vnode feature yourself

Resources