What will happen if I reduce RF from 5 to 3 then decommission 2 of 5 nodes? - cassandra

I have a Scylla cluster with 5 nodes, and my keyspace's replication factor is also 5. Now I want to remove 2 nodes from the cluster.
I plan to change the replication factor from 5 to 3 and then decommission 2 nodes. Can I perform this operation without a maintenance window?
A safe and smooth online modification is what I'm hoping for ... 😄

Reducing the replica count doesn't require downtime. Decommissioning nodes doesn't require downtime either, as long as there are sufficient replicas left in the ring and sufficient capacity left to handle the application traffic.
The important things to watch are that the cluster is repaired regularly (to make sure the replicas are consistent) and that the nodes are not dropping mutations, because dropped mutations are an indication that nodes are overloaded.
If there isn't sufficient capacity BEFORE the nodes are decommissioned, the situation will only get worse with fewer nodes: the capacity of the cluster will also drop, so you risk disrupting the normal operation of your application or, worse, causing a service outage. Cheers!
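A minimal sketch of that sequence, assuming the keyspace is named my_ks and is defined with NetworkTopologyStrategy in a single datacenter named DC1 (all placeholder names; match them to your own schema and to what nodetool status reports):
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
nodetool cleanup my_ks    # on every node, drops the copies a node no longer owns after the RF change
nodetool decommission     # on each of the two departing nodes, one at a time, waiting for each to finish
Regular nodetool repair before and after the change is what keeps the three surviving replicas consistent.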

Related

Apache Cassandra decommission second DC and join nodes into first DC as brand new nodes?

My Cassandra cluster consists of 2 DCs, each DC has 5 nodes, and the replication factor per DC is 3. Both DCs are hosted on the same Docker orchestrator. This is a legacy setup, probably left over from the last major system migration years ago. At the moment I don't see any advantage to having 2 DCs with the same replication factor of 3; this way the same data is written 6 times. The cluster is at least 80% write-heavy; reads are more or less limited.
Cassandra is struggling under load at peak times, so I would like to have 1 DC with 10 nodes (instead of 2 DCs x 5 nodes) to be able to balance across 10 nodes instead of just 5. This would also bring down the data size per node. With the same amount of RAM and CPU dedicated to Cassandra, I would gain performance and free up storage space ;-)
So idea is to decommission DC2 and bring all 5 nodes from it to DC1 as brand new nodes.
Steps are known:
alter keyspaces to be limited to DC1 only.
no clients to be writing/reading to/from DC2 - DCAwarePolicy with LOCAL_*
I wonder about the next step - it says I need to start decommissioning DC2 node by node. Is this mandatory, or could I somehow just take those nodes down? The goal is not to decommission some, but all nodes in a DC. If I decommission, say, node5, its data would be transferred to the remaining 4 nodes, and so on. At some point I would be left with 3 nodes and replication factor 3, so I wouldn't be able to decommission any further. What's more, I guess there would be no free space left on those node volumes, and I am not willing to extend them any further.
So my questions are:
is there a way to alter keyspace to DC1 only, then just to bring all DC2 nodes down, erase volumes and add them one by one to DC1, expanding DC1? Basically to decommission all DC2 nodes at once?
Is there a way for even quicker move of those 5 DC2 nodes to DC1 (at the end they contain same data as 5 nodes in DC1)? Like just join them to DC1 with all data they contain?
What is the advantage of having 2 DCs in a single cluster, instead of having a single DC with double the nodes? Or it strongly depends on the usage and the way services write and read data from Cassandra?
Appreciate your replies, thanks.
Cheers,
OvivO
is there a way to alter keyspace to DC1 only, then just to bring all DC2 nodes down, erase volumes and add them one by one to DC1, expanding DC1? Basically to decommission all DC2 nodes at once?
Yes, you can adjust the keyspace definition to just replicate within DC1. Since you're basically removing a DC, you could shut them all down, and run a nodetool removenode for each. In theory, that would remove the nodes from gossip and (if they're down) not attempt to move data around. Then yes, add each node back to DC1, one at a time. Once you're done, run a repair, followed by a nodetool cleanup on each node.
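A rough sketch of those steps, with my_ks, DC1, and the host ID as placeholders:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
nodetool removenode <host-id-of-a-down-DC2-node>   # once the DC2 nodes are shut down; host IDs come from nodetool status
# after wiping those machines and bootstrapping them into DC1 one at a time:
nodetool repair -pr my_ks   # on every node
nodetool cleanup            # on every node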
Is there a way for even quicker move of those 5 DC2 nodes to DC1 (at the end they contain same data as 5 nodes in DC1)? Like just join them to DC1 with all data they contain?
No. Token range assignment is DC-dependent. If they moved to a new DC, their range assignments would change, and the nodes would very likely be responsible for different ranges of data.
What is the advantage of having 2 DCs in a single cluster, instead of having a single DC with double the nodes?
Geographic awareness. If you have a mobile app and users on both the West Coast and East Coast, you don't want your East Coast users making a call for data all the way to the West Coast. You want that data call to happen as locally as possible. So, you'd build up a DC on each coast, and let Cassandra keep them in-sync.
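For illustration, a two-DC keyspace would look something like this (the keyspace and DC names below are made up; the DC names must match what your snitch reports):
cqlsh -e "CREATE KEYSPACE app_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_east': 3, 'dc_west': 3};"
Clients then pin themselves to the nearest DC (e.g. a DC-aware load balancing policy with LOCAL_QUORUM or LOCAL_ONE), and Cassandra keeps the two coasts in sync in the background.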

Can a Cassandra cluster have an even number of nodes?

Currently running a 3-node cluster with a replication factor of 3 on the keyspaces. I need to add more nodes to the cluster as the size of each node is approaching 2TB.
Can I add just 1 more node to the cluster and have a 4-node cluster, or does the cluster always need to have an odd number of nodes? I'm using a consistency level of ONE currently for both reads and writes.
You can have as many nodes in the cluster as you want, particularly if you are not using the racks feature in Cassandra (all nodes are in the same logical C* rack).
If you are using C* racks, our recommendation is to have an equal number of nodes in each rack so the load distribution is balanced across the racks in each DC.
For example, if your app keyspaces have a replication factor of 3 and you have 3 racks then the number of nodes in the DC should be in multiples of the replication factor -- 3, 6, 9, 12 and so on. This would allow you to configure the same number of nodes in each rack.
This isn't a hard requirement but is best practice so nodes have an equal amount of load and data on them. Cheers!
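For reference, and assuming you use GossipingPropertyFileSnitch (not stated in the question), the rack assignment is just a per-node setting in conf/cassandra-rackdc.properties; a sketch with placeholder names (use rack2 and rack3 on the other nodes so the counts stay equal):
dc=DC1
rack=rack1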
You can have an even number of nodes in a Cassandra cluster, so you can add another node. If you are using vnodes it will be easier; otherwise a lot of work needs to be done to rebalance the cluster.
One more thing: reading and writing with consistency level ONE reduces consistency. If that suits your use case it is fine, but the general recommendation is to use QUORUM on production systems.
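For context, QUORUM needs floor(RF / 2) + 1 replicas to acknowledge each request, so with RF = 3:
quorum = floor(3 / 2) + 1 = 2
which means reads and writes still succeed with one replica down, and a QUORUM read always overlaps with a QUORUM write (2 + 2 > 3), unlike ONE.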

Cassandra cluster scaling down

I have a 3-node Cassandra cluster in the AWS cloud which is running perfectly.
The traffic is low and I want to scale it down to two nodes or a single node due to economic constraints.
What would be the right practice here? Can I pause the other 2 nodes?
Is some data loss expected?
If the Cassandra nodes are available and you decommission them "gracefully", no data loss occurs. The reason is that when you decommission a node, its tokens and data are redistributed to the remaining nodes (so the process takes some time). If you "hard force" a node out (or it becomes unavailable for any reason) and your RF is not configured for redundancy (e.g. it is set to 1), you will lose data. So remove the node gracefully with nodetool decommission (not sure how that's done in AWS), and when you're done, make sure your RF settings per keyspace are correct (i.e. don't have RF > number of nodes, and keep it > 1 if you want redundancy).
-Jim
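A minimal sketch of that graceful path (my_ks and the replication settings are placeholders; match them to your actual keyspaces and strategy):
cqlsh -e "DESCRIBE KEYSPACE my_ks;"   # check the current replication settings first
nodetool decommission                 # on the node being removed; it streams its data to the remaining nodes
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"   # keep RF <= node count and > 1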

Do I need to decrease my replication factor if replication > number of nodes? (planning to decommission a node)

We currently have a DC with 3 nodes and a replication factor of 3. I am planning to decommission a node. Do I need to decrease my replication factor to 2, or will decommissioning the node by itself adjust the data among the two remaining nodes while keeping a replication factor of 3?
Decommissioning a node will not necessarily break your Cassandra cluster, but it will make a few things stop working.
A few things that will happen if you decommission the node but don't adjust the replication factor:
First, nothing about your replication factor will be changed just because you decommission a node. To do otherwise would cause chaos.
Queries (both reads and writes) that use ConsistencyLevel.ALL will fail, because they will not be able to get 3 machines to participate.
Queries with ConsistencyLevel.QUORUM will be less available, because BOTH remaining machines will need to respond to meet quorum.
Because you have 3 machines and an RF of 3, every machine has a complete copy of the data. Decommission the node, update your replication factor, and then run nodetool repair on the remaining two nodes. After you do that, you should be good to go.
My 2 cents: I would suggest you first change your replication factor to 2, run a repair on all nodes, and then issue nodetool decommission on the node you want to remove. There will be data moving around, but done in this order nothing should stop working.
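That suggested order, as a sketch (my_ks is a placeholder keyspace name; adjust the replication block to match your keyspace, e.g. NetworkTopologyStrategy with your DC name):
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"
nodetool repair my_ks     # on every node, so both remaining replicas are complete
nodetool decommission     # on the node being removed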

How to force Cassandra not to use the same node for replication in a schema with vnodes

Installing Cassandra on a single node to run some tests, we noticed that we were using an RF of 3 and everything was working correctly.
This is of course because that node has 256 vnodes (by default), so the same data can be replicated on the same node in different vnodes.
This is worrying, because if one node were to fail, you'd lose all your data even though you thought the data was replicated on different nodes.
How can I be sure that in a standard installation (a ring with several nodes) the same data will not be replicated on the same "physical" node? Is there a setting to prevent Cassandra from using the same node for replicating data?
Replication strategy is schema dependent. You probably used the SimpleStrategy with RF=3 in your schema. That means that each piece of data will be placed on the node determined by the partition key, and successive replicas will be placed on the successive nodes. In your case, the successive node is the same physical node, hence you get 3 copies of your data there.
Increasing the number of nodes solves your problem. In general, your data will be placed on different physical nodes when your replication factor RF is less than or equal to your number of nodes N.
The other solution is to switch replication strategy and use NetworkTopologyStrategy, usually used in multi-datacenter clusters, where you can specify how many replicas you want in each data center. This strategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
Look at the DataStax documentation for more information.
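As a sketch of that switch (my_ks is a placeholder, and 'DC1' must match the datacenter name shown by nodetool status):
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
nodetool repair my_ks   # on every node afterwards, so the newly chosen replicas receive their data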
Without vnodes each physical node owns a single token range. With vnodes each physical node will own multiple, non-consecutive token ranges (aka a vnode), and furthermore vnodes are randomly assigned to physical nodes.
Which means that even when data gets replicated on the vnodes right next to the primary replica's node (i.e. when using SimpleStrategy) the replicas will - with high probability but not guaranteed - be on different physical nodes.
This random assignment can be seen in the output of nodetool ring.
More info can be found here.
Cassandra stores the replicas of a keyspace on different nodes; it would be nonsensical to keep multiple replicas of the same data on a single node. If the replication factor exceeds the number of nodes, then the number of nodes is effectively your replication factor.
But why is this not an error? Because it allows you to provision more nodes later.
As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor first and then add the desired number of nodes afterwards.

Resources