We have a single-DC cluster running Cassandra 3.11. The DC has 8 nodes in total, with 16 tokens per node and 3 seed nodes. We use Murmur3Partitioner.
In order to ensure better data distribution for the upcoming cluster in another DC, we want to use the token allocation approach where you manually specify initial_token for the seed nodes and use allocate_tokens_for_keyspace for the non-seed nodes.
The problem is that our current datacenter is not well balanced, since we built the cluster without the token allocation approach, so its tokens are not well distributed. I can't figure out how to calculate initial_token for the new seed nodes in the new datacenter. I probably cannot treat the token range of the new DC as independent and calculate the initial tokens as I would for a fresh cluster. At this point I am very unsure how to proceed. Any help will be appreciated, thanks.
Currently I am trying to put together a migration plan, and I've reached the point where I don't know what to do and the documentation isn't helpful.
There are scripts available to calculate the initial_token values; for example, you could use the one here to quickly calculate them:
https://www.geroba.com/cassandra/cassandra-token-calculator/
You do have the ability to set allocate_tokens_for_keyspace and point it to a keyspace with the replication factor you plan to use for user-created keyspaces in the cluster. If you're adding a new DC, you probably already have such a keyspace, and this should help you get better distribution. Remember to set it before bootstrapping the nodes in the new DC.
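As a minimal sketch (the keyspace name my_ks is illustrative; use one that has the replication factor you intend for the new DC), the relevant cassandra.yaml settings on a non-seed node might look like this:

    # cassandra.yaml on a non-seed node joining the new DC
    num_tokens: 16
    allocate_tokens_for_keyspace: my_ks
    # seed nodes instead pin their tokens explicitly before first start, e.g.:
    # initial_token: <token_1>,<token_2>,...,<token_16>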
Another option would be to avoid vnodes entirely and go with a single-token architecture by setting num_tokens to 1. This gives you the ability to bootstrap nodes into the new DC, load/stream data, and then monitor the distribution and make changes as needed using 'nodetool move':
https://cassandra.apache.org/doc/3.11/cassandra/tools/nodetool/move.html
This method would require you to monitor the distribution and adjust the token assignments as needed, and you'd want to follow up each move with 'nodetool repair' and 'nodetool cleanup' on all nodes, but it gives you the ability to rectify uneven distribution quickly without bootstrapping new nodes. You would still want to calculate the initial_token values the same way with the single-token architecture, and set them before bootstrap.
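A hedged sketch of one such rebalancing pass (the token value is a placeholder):

    # on the node whose token you want to change:
    nodetool move <new_token>
    # then, on every node, re-sync replicas and drop data that moved away:
    nodetool repair -full
    nodetool cleanup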
I suspect either method could work well for you, but wanted to give you a second option.
I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply stop Cassandra on the existing nodes and then do a nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? e.g., the 3.0.16 nodes equally split the keyspace over 128 tokens, but the new 3.0.24 nodes will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue with the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data integrity of the cluster, however. In short, each new node was adding ITSELF to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not in a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small, ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster, which normally has 2 nodes in each of two distinct racks/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all the data twice, and this will increase the load on the cluster. The simpler solution is to replace the old nodes by copying their data onto new nodes that have the same configuration as the old ones, but with a different IP and with 3.0.24 (don't start those nodes yet!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime, and you won't need to wait for bootstrap or decommission.
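A rough sketch of one such swap, with purely illustrative hostnames and paths (the linked answer has the authoritative steps):

    # on the old node: flush memtables and stop cleanly
    nodetool drain && sudo service cassandra stop
    # copy the data directory to the prepared (not yet started) 3.0.24 node
    rsync -a /var/lib/cassandra/data/ new-node:/var/lib/cassandra/data/
    # on the new node: carry over the old node's tokens via initial_token in
    # cassandra.yaml, set auto_bootstrap: false, then start Cassandra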
Another possibility, if you can't stop the running nodes, is to add all the new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new datacenter, and then decommission the whole old datacenter without streaming the data. In this scenario you will stream the data only once. This approach also plays better if the new nodes need a different num_tokens value - it's not recommended to have different num_tokens on nodes within the same DC.
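A hedged outline of those steps (the keyspace and DC names are illustrative):

    # 1. in cqlsh, add the new DC to each keyspace's replication:
    #    ALTER KEYSPACE my_ks WITH replication =
    #      {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
    # 2. on each new node in DC2, stream the data from the old DC:
    nodetool rebuild -- DC1
    # 3. after switching clients to DC2, drop DC1 from the replication
    #    settings and run nodetool decommission on each old node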
P.S. It's usually not recommended to make changes to the cluster topology when you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center and then switching over to it. But I would recommend not changing it if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.
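For reference, a minimal sketch of a sane seed configuration on a joining node (the addresses are illustrative) - it lists the existing seeds only, never the node's own IP:

    # cassandra.yaml on the bootstrapping node
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1,10.0.0.2"   # existing seed nodes, not this node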
I understand that you don't have to rebalance vnodes, but when do we really use them in production scenarios? Do they function the same way as a physical single-token node? If so, then why use single-token nodes at all? Do vnodes help if I have a large amount of data and a large cluster (say 300 nodes)?
The main benefit of using vnodes is more evenly distributed data being streamed when bootstrapping a new node. Why? Well, when adding a new node, it will request the data in its token range. Optimally, the data it requests would be spread out evenly across all nodes, reducing the workload on each of the nodes sending data to the bootstrapping node (and also speeding up the bootstrap process).
Once you have a high number of physical nodes, like your example of 300, it would seem this benefit would be reduced (assuming no hot-spotting or data partitioning issues). I'm not aware of any actual guidelines on the cluster sizes at which to use or not use vnodes other than what is in the documentation. Yes, vnodes are seen in production.
More information can be found here:
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/config/configVnodes.html
In addition to Chris' excellent answer, I'll make an addition. When you have a large cluster with vnodes, it is helpful to let Cassandra manage the token ranges. Without vnodes, you would end up having to size and re-specify the token range for each (existing and) new node yourself. With vnodes, Cassandra handles that for you.
Compare the difference in the steps listed in the documentation:
Adding a node without vnodes: http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsAddRplSingleTokenNodes.html
vs.
Adding with vnodes: http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html
I am planning to create an application that will use just one Cassandra table. The replication factor will probably be 2 or 3. I might start initially with 2 Cassandra servers and then keep adding servers as needed. But I am not sure if I need to pre-plan anything so that the table stays distributed uniformly as I add more servers. Are there any best practices or things I need to be aware of? I read about tokens, http://www.datastax.com/docs/1.1/initialize/token_generation , but I am not sure what I need to do.
I suppose the keys have to be distributed uniformly in the cluster, so:
how will that happen, i.e. when I add the 2nd server and the 1st one already has 1 million keys?
do I need to pre-plan the keyspace or tables?
I can suggest two things.
First, when designing your schema, pick a good partition key (the 1st column in the primary key); there's a schema sketch after this list. You need to ensure a couple of things:
There are enough values such that you can distribute it to an arbitrary amount of nodes. For example, sex would be a bad partition key, because you only have two values and therefore can only distribute it to two nodes.
The distribution across different partition key values is more or less uniform. For example, country might not be best, because you will most likely have most of your rows in just a few unique countries.
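For instance, a hypothetical schema sketch (the table and column names are made up) where the partition key has high cardinality and a roughly uniform distribution:

    -- user_id is the partition key: many unique values, spread evenly
    CREATE TABLE users_by_id (
        user_id    uuid,
        country    text,
        created_at timestamp,
        PRIMARY KEY (user_id, created_at)
    );
    -- by contrast, PRIMARY KEY (country, created_at) would concentrate most
    -- rows in a handful of large partitions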
Secondly, to ease deployment of new nodes later consider setting up your cluster to use virtual nodes (vnodes). If you do that you will be able to skip a few steps when expanding your cluster.
To configure virtual nodes, set num_tokens in cassandra.yaml to more than 1. This will decide how many virtual nodes your node will have. A recommended value is 256.
Later, when you add new nodes, you need to make sure auto_bootstrap is true in cassandra.yaml for your new nodes (it is by default). Then you configure the network parameters as usual to match your cluster, and finally start your node. It should automatically bootstrap and start streaming the appropriate data. After everything has settled down, you can run cleanup (nodetool cleanup) on your other nodes to make sure they purge redundant data that they're no longer responsible for.
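A minimal sketch of the relevant settings and the follow-up command (values illustrative):

    # cassandra.yaml on the new node
    num_tokens: 256
    auto_bootstrap: true    # the default; the node streams its data on first start
    # once the node has joined, on each pre-existing node run:
    #   nodetool cleanup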
For more detailed documentation, please see http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html
I've read the relevant documentation I could find, but I still have doubts.
What I read
From http://wiki.apache.org/cassandra/Operations#Moving_nodes
If you add nodes to your cluster your ring will be unbalanced, and the only way to get perfect balance is to compute new tokens for every node and assign them to each node manually using the nodetool move command.
and from http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity-to-an-existing-cluster
If you need to increase capacity by a non-uniform number of nodes, you must recalculate tokens for the entire cluster, and then use nodetool move to assign the new tokens to the existing nodes. After all nodes are restarted with their new token assignments, run a nodetool cleanup to remove unused keys on all nodes
But I'm not clear on the order of these things.
Could you explain how to do it in the following scenario?
I'm using cassandra 1.1.9, so no virtual nodes are in use.
I have a cluster ring with 5 nodes, and each owns 20%
Their tokens are
0
34028236692093846346337460743176821145
68056473384187692692674921486353642291
102084710076281539039012382229530463436
136112946768375385385349842972707284582
I want to add 2 additional nodes.
What steps do I have to follow? I know I should install and configure cassandra, use the original 5 as seeds, and calculate their new tokens, but in what order should I move the data using nodetool move? Is it one at a time?
What happens with the data when I move the first one? Is it available at all times?
Should I start the two new nodes before moving the original 5 to their new tokens?
A step by step guide would be ideal.
Please note that I need to do this on a pre-1.2 version.
The new tokens should be
0
24305883351495604533098186245126300818
48611766702991209066196372490252601636
72917650054486813599294558735378902454
97223533405982418132392744980505203272
121529416757478022665490931225631504090
145835300108973627198589117470757804908
calculated using 2^127/7 * {0-6}.
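For reference, a quick sketch that recomputes these seven tokens:

    # i * (2**127 // 7) for i = 0..6, one token per node
    python3 -c 'd = 2**127 // 7; [print(i * d) for i in range(7)]'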
What steps do I have to follow?
in what order should I move the data using nodetool move?
You should
Bootstrap in one node at 48611766702991209066196372490252601636
Bootstrap the other node at 121529416757478022665490931225631504090
Move 34028236692093846346337460743176821145 to 24305883351495604533098186245126300818
Move 68056473384187692692674921486353642291 to 72917650054486813599294558735378902454
Move 102084710076281539039012382229530463436 to 97223533405982418132392744980505203272
Move 136112946768375385385349842972707284582 to 145835300108973627198589117470757804908
(I tried to minimise the amount of data transferred - it might not be optimal, but it is close enough not to make much difference given the imbalance of data you probably have already.)
Is it one at a time?
You should bootstrap one node at a time and move one token at a time. This avoids placing excess load on the cluster while streaming data.
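A sketch of the full sequence (run each step to completion before starting the next; each new node's initial_token goes into its cassandra.yaml before its first start):

    # new node 1: initial_token: 48611766702991209066196372490252601636
    # new node 2: initial_token: 121529416757478022665490931225631504090
    # then, on one existing node at a time:
    nodetool move 24305883351495604533098186245126300818    # node currently at 34028...
    nodetool move 72917650054486813599294558735378902454    # node currently at 68056...
    nodetool move 97223533405982418132392744980505203272    # node currently at 102084...
    nodetool move 145835300108973627198589117470757804908   # node currently at 136112...
    # finish with nodetool cleanup on every node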
What happens with the data when I move the first one? Is it available at all times?
Data is fully available during the move. The node participates in reads and writes for the old and new range so you can read and write during the move.
Should I start the two new nodes before moving the original 5 to their new tokens?
Always better to have more nodes in the cluster - if you moved first, you'd have some nodes with twice as much data as the others.
From Cassandra 1.2, keeping a cluster balanced when adding nodes is very easy, thanks to the new vnodes (multiple tokens per node) feature. Cassandra now automatically balances the cluster for you. If you upgrade from an earlier version, you will have to activate the vnodes feature yourself.
I'm a new Cassandra user. I know that there is an initial_token configuration and how to generate its values.
The question is: if I have an existing cluster with x nodes and I want to add one or more additional nodes, should I reconfigure all the nodes with new tokens (according to newly generated values)?
Or is there more efficient way to manage this?
If you're looking for what the best practices are for handling such tasks, take a look at this section of the Cassandra 1.0 docs dedicated to token strategy.
Shortened version of your options, from the documentation:
Add capacity by doubling the cluster size -- [..] nodes can keep their existing token assignments, and new nodes are assigned tokens that bisect (or trisect) the existing token ranges.
Recalculate new tokens for all nodes and move nodes -- [..] you will have to recalculate tokens for the entire cluster. Existing nodes will have to have their new tokens assigned using nodetool move.
Add one node at a time and leave initial_token empty -- [..] splits the token range of the heaviest loaded node and places the new node into the ring at that position. [..] not result in a perfectly balanced ring, but it will alleviate hot spots.
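To illustrate the first option, here is a quick sketch of how the new tokens bisect the existing ranges when doubling a 3-node ring (the existing tokens are the standard even 3-node RandomPartitioner assignment):

    # doubling 3 nodes to 6: each new token is the midpoint of an existing range
    python3 -c '
    ring = 2**127
    old = [i * ring // 3 for i in range(3)]
    new = [(old[i] + old[(i + 1) % 3] + (ring if i == 2 else 0)) // 2 for i in range(3)]
    print("existing:", old)
    print("new nodes:", new)
    '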
If you were seeking a management solution, Priam (from Netflix) might be worth looking at. It's open source and Apache-licensed, but requires some amount of configuration and is probably only worth investing time in for larger clusters.