Why are Cassandra nodes responsible for only (N-1) ranges?

So I was reading the Cassandra paper. In section 5.2 (Replication), the following is mentioned:
All nodes on joining the cluster contact the leader who tells
them for what ranges they are replicas for and leader makes
a concerted effort to maintain the invariant that no node
is responsible for more than N-1 ranges in the ring.
Why is no node responsible for more than N-1 ranges?

If I'm not mistaken, you're referring to Facebook's paper on Cassandra from 2009.
In that paper, there were references to Zookeeper, which was used in Facebook's in-house deployment of Cassandra and isn't relevant to the publicly available distributions of Apache Cassandra. Specifically, all Cassandra nodes are equal -- there is no concept of leader/followers, no primary/secondary, no master/worker.
In a single-node cluster, the one and only node is responsible for all N ranges in the ring. Extending that to a three-node cluster with a replication factor of 3, all nodes in the cluster have 100% ownership of the data and so are responsible for all N ranges in the ring.
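If you want to check this yourself, per-node effective ownership can be inspected with nodetool (my_ks is a placeholder keyspace name here):

    nodetool status my_ks

In the three-node, RF=3 case, every node reports 100% effective ownership. Cheers!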

Related

Can a Cassandra cluster have an even number of nodes?

Currently running a 3-node cluster with a replication factor of 3 on the keyspaces. We need to add more nodes as the data size on each node approaches 2 TB.
Can I add just one more node and have a 4-node cluster, or does the cluster always need an odd number of nodes? We are currently using a consistency level of ONE for both reads and writes.
You can have as many nodes in the cluster as you want, particularly if you are not using the racks feature in Cassandra (all nodes are in the same logical C* rack).
If you are using C* racks, our recommendation is to have an equal number of nodes in each rack so the load distribution is balanced across the racks in each DC.
For example, if your app keyspaces have a replication factor of 3 and you have 3 racks then the number of nodes in the DC should be in multiples of the replication factor -- 3, 6, 9, 12 and so on. This would allow you to configure the same number of nodes in each rack.
This isn't a hard requirement but is best practice so nodes have an equal amount of load and data on them. Cheers!
You can have an even number of nodes in a Cassandra cluster, so yes, you can add another node. If you are using vnodes it will be easier; otherwise a lot of work needs to be done to balance the cluster.
One more thing: reading and writing at consistency level ONE weakens consistency. If that suits your use case it is fine, but the general recommendation is to use QUORUM on production systems.
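For reference, the session consistency level in cqlsh can be inspected and changed with the CONSISTENCY command (in application code, this is set on the driver instead):

    CONSISTENCY;
    CONSISTENCY QUORUM;

With QUORUM on both reads and writes at RF=3, each read overlaps with at least one replica that saw the latest write.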

How to force Cassandra not to use the same node for replication in a schema with vnodes

While installing Cassandra on a single node to run some tests, we noticed that we were using an RF of 3 and everything was working correctly.
This is of course because that node has 256 vnodes (by default), so the same data can be replicated on the same node in different vnodes.
This is worrying, because if that one node were to fail, you'd lose all your data even though you thought the data was replicated on different nodes.
How can I be sure that in a standard installation (a ring with several nodes) the same data will not be replicated on the same "physical" node? Is there a setting to prevent Cassandra from using the same node to replicate data?
The replication strategy is schema dependent. You probably used SimpleStrategy with RF=3 in your schema. That means each piece of data is placed on the node determined by the partition key, and successive replicas are placed on successive nodes in the ring. In your case, the successive node is the same physical node, hence you get 3 copies of your data there.
Increasing the number of nodes solves your problem. In general, your data will be placed on different physical nodes when your replication factor RF is less than or equal to your number of nodes N.
The other solution is to switch the replication strategy to NetworkTopologyStrategy, usually used in multi-datacenter clusters, where you can specify how many replicas you want in each data center. This strategy "places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues."
Look at the DataStax documentation for more information.
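As a sketch of the two options (my_ks is a placeholder keyspace name and DC1 a placeholder data center name):

    -- SimpleStrategy: replicas go on consecutive ring positions
    CREATE KEYSPACE my_ks
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- NetworkTopologyStrategy: replica placement is rack- and DC-aware
    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};

Note that NetworkTopologyStrategy only sees the racks your snitch reports, so it is only as good as your snitch configuration.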
Without vnodes, each physical node owns a single token range. With vnodes, each physical node owns multiple non-consecutive token ranges (each such range is a vnode), and furthermore vnodes are randomly assigned to physical nodes.
This means that even when data gets replicated on the vnodes right next to the primary replica's vnode (i.e. when using SimpleStrategy), the replicas will, with high probability but not guaranteed, be on different physical nodes.
This random assignment can be seen in the output of nodetool ring.
More info can be found here.
Cassandra stores a keyspace's replicas on different nodes. It would be nonsensical to have multiple replicas of the same data on one node. If the replication factor exceeds the number of nodes, then the number of nodes is effectively your replication factor.
But, why is this not an error? Well, this allows for provisioning more nodes later.
As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
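A sketch of that sequence in CQL (my_ks is a placeholder keyspace name):

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- then, once the extra nodes have joined, run on each node:
    -- nodetool repair my_ks

The repair step matters because ALTER KEYSPACE only changes metadata; existing data is not copied onto the new replicas until repair runs.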

Does Cassandra support a percentage replication?

I know Cassandra has a replication factor for the keyspace, but I was wondering if Cassandra has the ability to specify a replication percentage. Instead of specifying the number of nodes you want the data replicated to, you could specify a percentage of the number of nodes you have.
This is not necessarily something I need; I'm curious whether Cassandra has this functionality or could support it.
No, Cassandra uses only the replication factor, expressed as a number of nodes -- a percentage does not necessarily make sense. What if I choose 35% and I have a 4-node cluster? Each node owns 25% of the data, so the replication can be either 1 node (25%) or 2 nodes (50%), but neither matches my specification.

Does Adding a Rack in Cassandra necessitate a cluster rebalance? (non-vnode)

If you use rack designations and add new racks to a current cluster, do you have to rebalance everything afterwards?
For example, we currently have 2 racks. Then we'll double the node count by adding two new racks. Will Cassandra have to rebalance the replicas? The primary tokens will be balanced because the new nodes will have the correct tokens, but the replicas seem like they would be interspersed incorrectly.
If we know we'll be adding racks in the future, but we cannot afford to rebalance the cluster, should we just avoid racks completely in the first place?
Cassandra version is 1.2
OK, after working with Cassandra for longer: the short answer is yes. Because of how Cassandra places data, if the nodes are not perfectly balanced across racks and in the right order, you'd either have to rebalance or physically move the hardware to balance the racks. It seems better to balance the cluster manually, by having the tokens alternate between racks, than to use Cassandra's internal rack designation. With vnodes, however, I believe this becomes less of an issue.
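To illustrate the manual approach (node labels are made up; this assumes RandomPartitioner and a four-node cluster), consecutive tokens alternate between the physical racks while Cassandra itself is told everything is one logical rack:

    # cassandra.yaml, one line per node; tokens are the even four-way
    # split of the RandomPartitioner range (0 .. 2**127)
    initial_token: 0                                         # node A, physical rack 1
    initial_token: 42535295865117307932921825928971026432    # node B, physical rack 2
    initial_token: 85070591730234615865843651857942052864    # node C, physical rack 1
    initial_token: 127605887595351923798765477786913079296   # node D, physical rack 2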

How to migrate single-token cluster to a new vnodes cluster without downtime?

We have a Cassandra cluster with a single token per node, 22 nodes in total, and an average load of 500 GB per node. It uses SimpleStrategy for the main keyspace and SimpleSnitch.
We need to migrate all data to a new datacenter and shut down the old one without downtime. The new cluster has 28 nodes. I want to have vnodes on it.
I'm thinking of the following process:
Migrate the old cluster to vnodes
Setup the new cluster with vnodes
Add nodes from the new cluster to the old one and wait until it balances everything
Switch clients to the new cluster
Decommission nodes from the old cluster one by one
But there are a lot of technical details. First of all, should I shuffle the old cluster after the vnodes migration? Then, what is the best way to switch to NetworkTopologyStrategy and to GossipingPropertyFileSnitch? I want to switch to NetworkTopologyStrategy because the new cluster has 2 different racks with separate power/network switches.
should I shuffle the old cluster after vnodes migration?
You don't need to. If you go from one token per node to 256 (the default), each node will split its range into 256 adjacent, equally sized ranges. This doesn't affect where data lives. But it means that when you bootstrap a new node in the new DC, the cluster will remain balanced throughout the process.
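Concretely, the per-node change for that migration lives in cassandra.yaml (the old token value shown is a placeholder): enable vnodes and comment out the single token, then restart.

    # cassandra.yaml
    num_tokens: 256
    # initial_token: <old single token>

After the restart, the node splits its existing range into 256 contiguous pieces, which is why no data moves.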
what is the best way to switch to NetworkTopologyStrategy and to GossipingPropertyFileSnitch?
The difficulty is that switching replication strategies is in general not safe, since data would need to be moved around the cluster. NetworkTopologyStrategy (NTS) will place data on different nodes if you tell it nodes are in different racks. For this reason, you should move to NTS before adding the new nodes.
Here is a method to do this, after you have upgraded the old cluster to vnodes (your step 1 above):
1a. List all existing nodes as being in DC0 in the properties file. List the new nodes as being in DC1 and their correct racks.
1b. Change the replication strategy to NTS with options DC0:3 (or whatever your current replication factor is) and DC1:0.
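In CQL, step 1b looks like the following (the keyspace name is a placeholder; leaving a data center out entirely is equivalent to giving it 0 replicas):

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC0': 3, 'DC1': 0};

Because every existing node is in DC0 with the same factor as before, replica placement stays effectively the same and no data needs to move.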
Then to add the new nodes, follow the process here: http://www.datastax.com/docs/1.2/operations/add_replace_nodes#adding-a-data-center-to-a-cluster. Remember to set the number of tokens to 256 since it will be 1 by default.
In step 5, you should set the replication factor for DC0 to 0, i.e. change the replication options to DC0:0, DC1:3. Now those nodes aren't being used, so decommission won't stream any data, but you should still decommission rather than power them off so they are removed from the ring.
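The corresponding CQL for that step (again with a placeholder keyspace name):

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC0': 0, 'DC1': 3};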
Note one risk: writes made at a low consistency level to the old nodes could get lost. To guard against this, you could write at CL.LOCAL_QUORUM after you switch to the new DC. There is still a small window where writes could get lost (between steps 3 and 4). If it is important, you can run repair before decommissioning the old nodes to guarantee no losses, or write at a high consistency level.
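A minimal sketch of that final safeguard, run against each old node in turn before it leaves (keyspace name is a placeholder):

    nodetool repair my_ks
    nodetool decommission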
If you are trying to migrate to a new cluster with vnodes, wouldn't you need to change the partitioner? The docs say that it isn't a good idea to migrate data between different partitioners.
