Best way to add multiple nodes to an existing Cassandra cluster

We have a 12-node cluster across 2 datacenters (6 nodes per DC) with RF=3 in each DC.
We are planning to increase cluster capacity by adding 3 nodes in each DC (6 nodes in total).
What is the best way to add multiple nodes at once (or nearly at once, perhaps 2 minutes apart)?
auto_bootstrap: false - Set auto_bootstrap: false on all new nodes (since this starts the nodes more quickly), start all of them, and then run 'nodetool rebuild' to stream data to the new nodes from the existing nodes.
If I go this way, where do read requests go right after the new nodes start? At that point the new nodes have token ranges assigned to them, but no data has been streamed to them yet. Will this cause read request failures, CL issues, or any other problems?
OR
auto_bootstrap: true - Set auto_bootstrap: true and start one node at a time, waiting until streaming finishes before starting the next node (this might take a while, as we have a lot of data, roughly 600 GB+ on each node).
If I go this way, I have to wait until the whole streaming process is done on one node before adding the next one.
Kindly suggest the best way to add multiple nodes all at once.
PS: We are using C* 2.0.3.
Thanks in advance.

Since both options depend on streaming data over the network, the answer largely depends on how distributed your cluster is and where your data currently lives.
If you have a single-DC cluster and latency is minimal between all nodes, then bringing up a new node with auto_bootstrap: true should be fine for you. This is also the preferred method if at least one copy of your data has already been replicated to your local datacenter (the one the new node is joining).
On the other hand, for multiple DCs I have found more success with setting auto_bootstrap: false and using nodetool rebuild. The reason is that nodetool rebuild allows you to specify a data center as the source of the data. This gives you the control to contain streaming to a specific DC (and, more importantly, to keep it away from other DCs). Similarly, if you are building a new data center and your data has not yet been fully replicated to it, then you will need to use nodetool rebuild to stream data from a different DC.
How would read requests be handled?
In both of these scenarios, the token ranges are computed for your new node when it joins the cluster, regardless of whether or not the data is actually there. So if a read request were sent to the new node at CL ONE, it should be routed to a node containing another replica (assuming RF>1). If you queried at CL QUORUM (with RF=3), it should find the other two. That is, of course, assuming that the nodes picking up the slack are not so overwhelmed by their streaming activity that they cannot also serve requests. This is a big reason why the "2 minute rule" exists.
The bottom line is that your queries do have a higher chance of failing before the new node is fully streamed. Your chances of query success increase with the size of your cluster (more nodes = more scalability, and each bears that much less responsibility for streaming). Basically, if you are going from 3 nodes to 4 nodes, you might get failures. If you are going from 30 nodes to 31, your app probably won't notice a thing.
Will the new node try to pull data from nodes in the other data centers too?
Only if your query isn't using a LOCAL consistency level.

I'm not sure this was ever answered:
If I go this way, where do read requests go right after the new nodes start? At that point the new nodes have token ranges assigned to them, but no data has been streamed to them yet. Will this cause read request failures, CL issues, or any other problems?
And the answer is yes. The new node will join the cluster and receive its token assignments, but because of auto_bootstrap: false it will not receive any streamed data. Thus, it will be a member of the cluster but will not have any old data. New writes will be received and processed, but data that existed before the node joined will not be available on this node.
With that said, at the right consistency levels your new node will still take part in background and foreground read repair, so it shouldn't respond any differently to requests. However, I would not go this route. With 2 DCs, I would divert traffic to DCA, add all of the new nodes to DCB with auto_bootstrap: false, and then rebuild those nodes from DCA. The rebuild needs to be from DCA because the token ranges have changed in DCB and, with auto_bootstrap: false, the data may no longer exist where DCB expects it. You could also run repair, which should resolve any discrepancies as well. Lastly, after all of the nodes have been rebuilt, run cleanup on all of the nodes in DCB.
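The order of operations in that last paragraph can be sketched as a small dry-run script that only builds the command strings and runs nothing. It assumes nodetool is on the path and that the data center is literally named DCA; the host names are placeholders, not taken from the question.

```python
def rebuild_plan(new_nodes, all_target_dc_nodes, source_dc):
    """Return nodetool commands, in order, for the rebuild-from-another-DC approach.

    new_nodes: hosts already started with auto_bootstrap: false (placeholder names).
    all_target_dc_nodes: every host in the data center being expanded.
    source_dc: name of the data center to stream from.

    Diverting client traffic away from the target DC happens before any of
    this and is not a nodetool step, so it is not modelled here.
    """
    commands = []
    # Step 1: stream data onto each new node from the source data center only.
    for host in new_nodes:
        commands.append(f"nodetool -h {host} rebuild {source_dc}")
    # Step 2: once every rebuild has finished, drop data the nodes no longer own.
    for host in all_target_dc_nodes:
        commands.append(f"nodetool -h {host} cleanup")
    return commands


if __name__ == "__main__":
    old = ["dcb-node1", "dcb-node2"]  # placeholder host names
    new = ["dcb-node7", "dcb-node8"]
    for cmd in rebuild_plan(new, old + new, "DCA"):
        print(cmd)
```

Printing the plan first makes it easy to review the order before running anything by hand; each rebuild should be allowed to finish before cleanup starts.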

Related

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I wanted to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply run nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start on each existing node.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What if the number of tokens each node is responsible for were increased on the new nodes? E.g., the 3.0.16 nodes equally split the keyspace over 128 tokens, but the new 3.0.24 nodes will split everything across 256 tokens.
EDIT: After some back and forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue with the procedure. There were some other issues, caused by other bits of automation, that did threaten the data integrity of the cluster, however. In short, each new node was adding ITSELF to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap and decommission will take quite a long time, especially if you have a lot of data: you will stream all data twice, which increases the load on the cluster. A simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old ones, but with a different IP and with 3.0.24 installed (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap and decommission.
Another possibility, if you can't stop a running node, is to add all of the new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new data center, and then decommission the whole old data center without streaming the data. In this scenario you will stream data only once. It also works better if the new nodes will have a different num_tokens value, since it's not recommended to have different num_tokens on the nodes of the same DC.
P.S. It's usually not recommended to change the cluster topology while you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What if the number of tokens each node is responsible for were increased on the new nodes? E.g., the 3.0.16 nodes equally split the keyspace over 128 tokens, but the new 3.0.24 nodes will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center and then switching over to it. But I would recommend not changing that if at all possible.
"In short, each new node was adding ITSELF to the list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

How to Manage Node Failure with Cassandra Replication Factor 1?

I have a three node Cassandra (DSE) cluster where I don't care about data loss so I've set my RF to 1. I was wondering how Cassandra would respond to read/write requests if a node goes down (I have CL=ALL in my requests right now).
Ideally, I'd like these requests to succeed if the data exists - just on the remaining available nodes till I replace the dead node. This keyspace is essentially a really huge cache; I can replace any of the data in the event of a loss.
(Disclaimer: I'm a ScyllaDB employee)
Assuming your partition key was unique enough, when using RF=1 each of your 3 nodes contains 1/3 of your data. BTW, in this case CL=ONE and CL=ALL are basically the same, as there's only 1 replica of your data and no high availability (HA).
Requests for "existing" data on the 2 up nodes will succeed. Still, while one of the 3 nodes is down, 1/3 of your client requests (for existing data) will not succeed, as basically 1/3 of your data is unavailable until the down node comes back up. Note that nodetool repair is irrelevant when using RF=1, so I guess restoring from a snapshot (if you have one available) is the only option.
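While the node is down, once you remove it from the ring (for a node that is already down this is nodetool removenode rather than nodetool decommission, which has to run on the live node being removed), the token ranges will be redistributed between the 2 up nodes, but that will apply only to new writes and reads.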
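The "1/3 of requests fail" point can be illustrated with a toy model. This is a simplified sketch, not how Cassandra or Scylla actually place data: it assigns each key to exactly one node by hashing (RF=1) and counts how many keys stay readable while one node is down. The node and key names are made up.

```python
import hashlib


def owner(key, nodes):
    """Pick the single replica for a key by hashing it onto a toy ring (RF=1)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]


def readable_fraction(keys, nodes, down):
    """Fraction of keys whose only replica is still up."""
    readable = sum(1 for key in keys if owner(key, nodes) not in down)
    return readable / len(keys)


nodes = ["node1", "node2", "node3"]           # hypothetical node names
keys = [f"key-{i}" for i in range(30000)]     # hypothetical partition keys

# With one of three nodes down, roughly two thirds of the keys remain readable.
print(readable_fraction(keys, nodes, down={"node2"}))
```

Because there is only one replica per key, the consistency level does not change this picture: ONE and ALL both need that single node to be up.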
You can read more about the ring architecture here:
http://docs.scylladb.com/architecture/ringarchitecture/

Why can Cassandra survive the loss of no nodes without data loss with replication factor 2?

Hi, I was trying out different configurations using the site
https://www.ecyrd.com/cassandracalculator/
But I could not understand the result shown for the following configuration:
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference I have seen the question Cassandra loss of a node
But it still does not help me understand why write level 1 with replication factor 2 would mean my Cassandra cluster cannot survive the loss of any node without data loss.
A write request goes to all replica nodes, and even if only 1 responds, it is a success. So assuming 1 node is down, all write requests will go to the other replica node and return success. It will be eventually consistent.
Can someone help me understand with an example?
I guess what the calculator is working with is the worst case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is that there is no guarantee that the data is actually present on two nodes right after your write was acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as it has been committed to one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to reduce the latency perceived by the client). If at that moment, right after acknowledging the write but before attempting to contact the second node, the coordinator node goes down and cannot be brought back, then you have lost that write and the data with it.
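This worst-case reasoning can be written down as a short sketch. It is an illustration of the argument above, not the calculator's actual code, and the function and key names are made up for the example.

```python
def worst_case(replication_factor, write_level, read_level):
    """Worst-case guarantees at the moment a write is acknowledged."""
    # Only `write_level` replicas are guaranteed to hold the data at ack time;
    # the remaining replicas may not have received it yet.
    guaranteed_copies = write_level
    return {
        # Nodes that can be lost for good without losing an acknowledged write.
        "survivable_loss_without_data_loss": guaranteed_copies - 1,
        # A read is guaranteed to see the latest write only if the set of
        # replicas read and the set of replicas written must overlap.
        "reads_always_consistent": read_level + write_level > replication_factor,
    }


print(worst_case(replication_factor=2, write_level=1, read_level=1))
print(worst_case(replication_factor=2, write_level=2, read_level=1))
```

With replication factor 2 and write level 1, the first call reports 0 survivable losses, which matches the calculator's "loss of no nodes". Raising the write level to 2 gives 1, at the price of writes failing whenever one of the two replicas is down.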
When you read or write data, Cassandra computes the hash token for the data and routes it to the responsible nodes. A 3-node cluster with a replication factor of 2 means each piece of data is stored on 2 nodes. So if the 2 nodes responsible for some token A are both down, and that token is not owned by node 3, then even though one node is still up you will get a TokenRangeOfflineException.
The point is that we need the replicas (for the token), not just any nodes. Also see the similar question answered here.
This is the case because the write level is 1. If your application writes to 1 node only (and waits for the data to become eventually consistent, which takes non-zero time), then the data can be lost if that one server is itself lost before the sync happens.

DynamicSnitch Reads from empty new datacenter

When adding a new datacenter, the DynamicSnitch causes us to read data from the new DC when the data is not there yet.
We have a Cassandra (1.0.11) cluster running on 3 datacenters and we want to add a fourth datacenter. The cluster is configured with PropertyFileSnitch and DynamicSnitch enabled with a 0.0 badness factor. The relevant keyspaces' replication factors are DC1:2, DC2:2, DC3:2. Our plan was to add the new datacenter to the ring, add it to the schema, and run a rolling repair -pr on all the nodes so the new nodes would get all of the data they need.
Once we started the process, we noticed that the new datacenter receives read calls from the other datacenters because it has a lower load and the DynamicSnitch decides it is better to read from it. The problem is that the new datacenter still doesn't have the data and returns no results.
We tried removing the DynamicSnitch entirely, but once we did that, every time a single server got a bit of load we experienced extreme performance degradation.
Has anyone encountered this issue?
Is there a way to directly influence the score of a specific datacenter so it won't be picked by the DynamicSnitch?
Are there any better ways to add a datacenter in Cassandra 1.0.11? Has anyone written a snitch that handles these issues?
Thanks,
Izik.
You could bootstrap the nodes instead of adding to the ring without bootstrap and then repairing. The former ensures that no reads will be routed to it until it has all the data it needs. (That is why Cassandra defaults to auto_bootstrap: true and in fact disabling it is a sufficiently bad idea that we removed it from the example cassandra.yaml.)
The problem with this, and the reason that the documentation recommends adding all the nodes first without bootstrap, is that if you have N replicas configured for DC4, Cassandra will try to replicate the entire dataset for that keyspace to the first N nodes you add, which can be problematic!
So here are the options I see:
If your dataset is small enough, go ahead and use the bootstrap plan
Increase ConsistencyLevel on your reads so that they will always touch a replica that does have the data, as well as one that does not
Upgrade to 1.2 and use ConsistencyLevel.LOCAL_ONE on your reads which will force it to never make cross-DC requests

What would be the exact procedure to add new nodes to a Cassandra cluster so that the cluster remains balanced?

I've read the relevant documentation I could find, but I still have doubts.
What I read
From http://wiki.apache.org/cassandra/Operations#Moving_nodes
If you add nodes to your cluster your ring will be unbalanced and only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using nodetool move command.
and from http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity-to-an-existing-cluster
If you need to increase capacity by a non-uniform number of nodes, you must recalculate tokens for the entire cluster, and then use nodetool move to assign the new tokens to the existing nodes. After all nodes are restarted with their new token assignments, run a nodetool cleanup to remove unused keys on all nodes
But I'm not clear on the order of these things.
Could you explain how to do it in the following scenario?
I'm using cassandra 1.1.9, so no virtual nodes are in use.
I have a cluster ring with 5 nodes, and each owns 20%
Their tokens are
0
34028236692093846346337460743176821145
68056473384187692692674921486353642291
102084710076281539039012382229530463436
136112946768375385385349842972707284582
I want to add 2 additional nodes.
What steps do I have to follow? I know I should install and configure cassandra, use the original 5 as seeds, and calculate their new tokens, but in what order should I move the data using nodetool move? Is it one at a time?
What happens with the data when I move the first one? Is it available at all times?
Should I start the two new nodes before moving the original 5 to their new tokens?
A step by step guide would be ideal.
Please note that I need to do it pre version 1.2
The new tokens should be
0
24305883351495604533098186245126300818
48611766702991209066196372490252601636
72917650054486813599294558735378902454
97223533405982418132392744980505203272
121529416757478022665490931225631504090
145835300108973627198589117470757804908
calculated using 2^127/7 * {0-6}.
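The token list above can be reproduced with a few lines of Python. This is a minimal sketch assuming the 2^127 token space the question uses and integer division for the step; the function name is made up for the example.

```python
RING_SIZE = 2 ** 127  # token space used in the question


def evenly_spaced_tokens(node_count):
    """Return node_count initial tokens spaced evenly around the ring."""
    step = RING_SIZE // node_count  # integer division, fraction dropped
    return [i * step for i in range(node_count)]


for token in evenly_spaced_tokens(7):
    print(token)
```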
What steps do I have to follow?
in what order should I move the data using nodetool move?
You should
Bootstrap in one node at 48611766702991209066196372490252601636
Bootstrap the other node at 121529416757478022665490931225631504090
Move 34028236692093846346337460743176821145 to 24305883351495604533098186245126300818
Move 68056473384187692692674921486353642291 to 72917650054486813599294558735378902454
Move 102084710076281539039012382229530463436 to 97223533405982418132392744980505203272
Move 136112946768375385385349842972707284582 to 145835300108973627198589117470757804908
(I tried to minimise the amount of data transferred - might not be optimal but is close enough to not make much difference given the imbalance of data you probably have already.)
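One way to arrive at a plan like this is a small brute-force search. The sketch below is an assumption about how such a plan could be derived, not the method actually used for the list above: it treats the summed token distance of all moves as a rough proxy for data transferred, keeps the existing nodes in ring order, and ignores wraparound at the end of the ring.

```python
from itertools import combinations

OLD_TOKENS = [
    0,
    34028236692093846346337460743176821145,
    68056473384187692692674921486353642291,
    102084710076281539039012382229530463436,
    136112946768375385385349842972707284582,
]
NEW_TOKENS = [
    0,
    24305883351495604533098186245126300818,
    48611766702991209066196372490252601636,
    72917650054486813599294558735378902454,
    97223533405982418132392744980505203272,
    121529416757478022665490931225631504090,
    145835300108973627198589117470757804908,
]


def plan_moves(old_tokens, new_tokens):
    """Choose target tokens for existing nodes so the summed move distance is smallest.

    Returns (bootstrap_tokens, moves), where moves is a list of (old, new) pairs.
    """
    old_sorted = sorted(old_tokens)
    new_sorted = sorted(new_tokens)
    best_cost, best_choice = None, None
    # combinations() keeps the sorted order, so ring order is preserved.
    for chosen in combinations(new_sorted, len(old_sorted)):
        cost = sum(abs(o - n) for o, n in zip(old_sorted, chosen))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, chosen
    moves = [(o, n) for o, n in zip(old_sorted, best_choice) if o != n]
    bootstraps = [t for t in new_sorted if t not in best_choice]
    return bootstraps, moves


bootstraps, moves = plan_moves(OLD_TOKENS, NEW_TOKENS)
for token in bootstraps:
    print("bootstrap at", token)
for old, new in moves:
    print("move", old, "to", new)
```

For the tokens in this question the search picks the same two bootstrap tokens and the same four moves as the list above. With only 21 combinations to check, brute force is fine here; a much larger ring would need something smarter.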
Is it one at a time?
You should bootstrap one node at a time and move one token at a time. This avoids placing excess load on the cluster while streaming data.
What happens with the data when I move the first one? Is it available at all times?
Data is fully available during the move. The node participates in reads and writes for the old and new range so you can read and write during the move.
Should I start the two new nodes before moving the original 5 to their new tokens?
Always better to have more nodes in the cluster - if you moved first, you'd have some nodes with twice as much data as the others.
From Cassandra 1.2, keeping a cluster balanced when adding nodes is very easy because of the new vnodes feature (multiple tokens per node). Cassandra now automatically balances the cluster for you. If you upgrade from an earlier version, you will have to activate the vnode feature yourself.
