I was wondering which configuration will be best suited for even distribution of data among nodes.
5 nodes with 3 racks (rack1: node1, node4; rack2: node2, node5; rack3: node3)
Replication factor 3 and reads/writes at QUORUM
In this case I am wondering whether node3, which is the only node in rack3, will have more data than the other nodes, since the replication strategy places replicas on nodes in different racks.
6 nodes with 3 racks (rack1: node1, node4; rack2: node2, node5; rack3: node3, node6)
Replication factor 3 and reads/writes at QUORUM
In this case, data will be distributed evenly among all nodes.
I want to know whether my understanding is correct.
In the case of 5 nodes across 3 racks, yes, one node will be under greater load/stress.
It's a good idea to scale the cluster in multiples of the rack count to keep the data balanced across nodes. For example, in a 3 rack cluster you should add 3 nodes each time you expand the cluster.
If you choose to use multiple racks the ideal rack count should be ≥ your chosen replication factor. This allows Cassandra to store each replica in a separate rack.
In the case of a rack outage, the other replicas would still be available.
For example, with RF=3 and 3 racks and queries at QUORUM, you can sustain the failure of a single rack. Whereas, with RF=3 and 2 racks at QUORUM, there is no guarantee that 2 replicas will still be available in the case of a rack failure.
Racks are for informing Cassandra about fault domains. If you're running in your own data center then, as the name implies, racks should be assigned based on the physical rack each node is located in. If you're running in the cloud, the best option is to map racks to AWS availability zones (or whatever is equivalent for your provider).
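For example, if you are using GossipingPropertyFileSnitch (an assumption here; Ec2Snitch does this mapping automatically), each node's cassandra-rackdc.properties could map its availability zone to a rack roughly like this (values are illustrative):

    # conf/cassandra-rackdc.properties on a node in us-east-1a
    dc=us-east-1
    rack=us-east-1a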
Yes, you should use 6 nodes to make sure you have an equal number of nodes in each rack; having an equal number of nodes in each rack is a basic requirement when going with multiple racks.
But do you really need multiple racks? They make scaling more difficult, because every time you scale up you need to take care of the alternating node order and the data distribution.
In Cassandra, multiple racks provide continuous data availability for the cluster in disaster situations, and Cassandra recommends the same for production clusters. Both of your options are fine; however, you should go with an odd number of nodes in the cluster.
Related
Suppose I have a Cassandra cluster on 3 racks, let's say on 3 AZs in AWS, and a keyspace with RF=3. With this configuration, one replica will be present in each AZ.
What happens to the data when I add 2 more nodes to this cluster on 2 new racks, AZ4 and AZ5? How do the replicas get redistributed?
What happens to new data being added to this 5 rack cluster? How are replicas maintained?
When I tried it on AWS EC2, I saw an equal percentage of ownership on all racks when running ./nodetool status.
When the number of racks equals the RF, as you currently have (3 and 3), Cassandra walks the ring and puts a replica in each rack it finds, so it works out nicely and each rack gets a full copy of the data.
When you add two additional racks, replica placement will still walk the racks and try to divide up the ownership so it is somewhat equal, as you mentioned. In this scenario, no rack would contain a full replica, but ownership will be split among the racks.
If you intend to divide racks into AZs, then to get the most benefit, it's best to maintain the RF of 3 with 3 racks as you have it. If additional nodes are needed to scale up the current cluster, then add three nodes, one to each rack to keep them balanced. If you want additional copies of the data stored in separate AZs, add a second DC and maintain a rack/RF ratio that stores full replicas in the additional racks.
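If you do go the second-DC route, the keyspace change itself is small; a hedged sketch, assuming placeholder keyspace and data center names:

    -- 'dc1' keeps 3 replicas (one per rack/AZ); 'dc2' holds a full extra copy
    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};

After a change like this you would typically run nodetool rebuild on the new DC's nodes (pointing it at the existing DC) so the extra replicas actually get streamed.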
My Cassandra cluster consists of 2 DCs, each DC has 5 nodes, and the replication factor per DC is 3. Both DCs are hosted on the same Docker orchestrator. This is a legacy setup, probably done during the last major system migration years ago. At the moment I don't see any advantage in having 2 DCs with the same replication factor of 3; this way the same data is written 6 times. The cluster is at least 80% write heavy; reads are more or less limited.
Cassandra is struggling under load at peak times, so I would like to have 1 DC with 10 nodes (instead of 2 DCs x 5 nodes) to be able to balance across 10 nodes instead of just 5. This way I will also bring down the data size per node. With the same amount of RAM and CPU dedicated to Cassandra, I would gain performance and free up storage space ;-)
So the idea is to decommission DC2 and bring all 5 of its nodes into DC1 as brand new nodes.
Steps are known:
alter keyspaces to be limited to DC1 only (see the sketch right after these steps).
ensure no clients write to or read from DC2 - DCAwarePolicy with LOCAL_* consistency levels.
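A minimal CQL sketch of that first step, with a placeholder keyspace name (repeat for every application keyspace; depending on your setup, keyspaces such as system_auth may need the same change):

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};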
I wonder about the next step - it says I need to start decommissioning DC2 node by node. Is this mandatory, or could I somehow just take those nodes down? The goal is not to decommission some of the nodes in the DC, but all of them. If I decommission, say, node5, its data would be transferred to the remaining 4 nodes, and so on. At some point I would be left with 3 nodes and replication factor 3, so I wouldn't be able to decommission any further. What's more, I guess there would be no free space left on those node volumes, and I am not willing to extend them any further.
So my questions are:
Is there a way to alter the keyspace to DC1 only, then just bring all DC2 nodes down, erase their volumes and add them one by one to DC1, expanding DC1? Basically, to decommission all DC2 nodes at once?
Is there an even quicker way to move those 5 DC2 nodes to DC1 (in the end they contain the same data as the 5 nodes in DC1)? Like just joining them to DC1 with all the data they contain?
What is the advantage of having 2 DCs in a single cluster, instead of having a single DC with double the nodes? Or it strongly depends on the usage and the way services write and read data from Cassandra?
Appreciate your replies, thanks.
Cheers,
OvivO
Is there a way to alter the keyspace to DC1 only, then just bring all DC2 nodes down, erase their volumes and add them one by one to DC1, expanding DC1? Basically, to decommission all DC2 nodes at once?
Yes, you can adjust the keyspace definition to just replicate within DC1. Since you're basically removing a DC, you could shut them all down, and run a nodetool removenode for each. In theory, that would remove the nodes from gossip and (if they're down) not attempt to move data around. Then yes, add each node back to DC1, one at a time. Once you're done, run a repair, followed by a nodetool cleanup on each node.
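The node-removal part of that, sketched with placeholder host IDs (taken from nodetool status):

    # after the keyspace change, shut down the DC2 nodes, then remove each one from gossip
    nodetool removenode <dc2-node-host-id>
    # wipe each former DC2 node's data directory, reconfigure it for DC1 and start it, one at a time
    # once every node has rejoined in DC1:
    nodetool repair     # run on each node
    nodetool cleanup    # run on each node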
Is there an even quicker way to move those 5 DC2 nodes to DC1 (in the end they contain the same data as the 5 nodes in DC1)? Like just joining them to DC1 with all the data they contain?
No. Token range assignment is DC dependent. If the nodes moved to a new DC, their range assignments would change, and they would very likely be responsible for different ranges of data.
What is the advantage of having 2 DCs in a single cluster, instead of having a single DC with double the nodes?
Geographic awareness. If you have a mobile app and users on both the West Coast and East Coast, you don't want your East Coast users making a call for data all the way to the West Coast. You want that data call to happen as locally as possible. So, you'd build up a DC on each coast, and let Cassandra keep them in-sync.
Currently running a 3 node cluster with replication factor 3 on the keyspaces. Need to add more nodes to the cluster as the size of each node is approaching 2TB.
Can I add just 1 more node to the cluster and have a 4 node cluster or does the cluster always need to have odd number of nodes? Using a consistency level of ONE currently for both read and write.
You can have as many nodes in the cluster as you want, particularly if you are not using the racks feature in Cassandra (all nodes are in the same logical C* rack).
If you are using C* racks, our recommendation is to have an equal number of nodes in each rack so the load distribution is balanced across the racks in each DC.
For example, if your app keyspaces have a replication factor of 3 and you have 3 racks then the number of nodes in the DC should be in multiples of the replication factor -- 3, 6, 9, 12 and so on. This would allow you to configure the same number of nodes in each rack.
This isn't a hard requirement but is best practice so nodes have an equal amount of load and data on them. Cheers!
You can have an even number of nodes in a Cassandra cluster, so you can add another node to the cluster. If you are using vnodes it will be easier; otherwise a lot of work needs to be done to rebalance the cluster.
One more thing: reading and writing with consistency level ONE reduces consistency. If that suits your use case then it is fine, but the general recommendation is to use QUORUM on production systems.
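For a quick experiment, the session consistency can be raised in cqlsh like this (a minimal illustration; your application driver has its own equivalent setting):

    CONSISTENCY QUORUM;
    -- with RF=3, reads and writes in this session now need 2 of the 3 replicas to respond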
I am working on creating a Cassandra cluster.
Our system is write heavy, and we are planning to use 3 seed nodes and a total of 10 Cassandra nodes (including the 3 seed nodes).
We are using replication factor of 3 and consistency level QUORUM.
Is there any consideration for an odd/even number of Cassandra nodes based on the replication factor / number of seed nodes?
The number of seed nodes is unrelated to the replication factor. The seeds are used when a new node joins the cluster. New nodes consult the seeds to get their initial configuration and learn the addresses of the other nodes. You need 2-3 seeds to provide redundancy, that's all.
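For illustration, the seed list is just an entry in each node's cassandra.yaml (the addresses below are placeholders):

    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "10.0.1.10,10.0.2.10,10.0.3.10"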
The replication factor indicates how many nodes have copies of the data, as you probably know. RF=3 means three nodes have copies of the data. Consistency level QUORUM means that 2 nodes need to reply to the coordinator (because 2 is a quorum of 3). This has nothing to do with the number of nodes in the cluster, as long as you have at least 3 nodes for RF=3! Even/odd doesn't matter, and the number of seeds doesn't matter.
Installing Cassandra on a single node to run some tests, we noticed that we were using an RF of 3 and everything was working correctly.
This is of course because that node has 256 vnodes (by default) so the same data can be replicated in the same node in different vnodes.
This is worrying because if one node were to fail, you'd lose all your data even though you thought the data was replicated in different nodes.
How can I be sure that in a standard installation (with a ring with several nodes) the same data will not be replicated in the same "physical" node? Is there a setting to avoid Cassandra from using the same node for replicating data?
Replication strategy is schema dependent. You probably used the SimpleStrategy with RF=3 in your schema. That means that each piece of data will be placed on the node determined by the partition key, and successive replicas will be placed on the successive nodes. In your case, the successive node is the same physical node, hence you get 3 copies of your data there.
Increasing the number of nodes solves your problem. In general, your data will be placed in different physical nodes when your replication factor RF is less than/equal to your number of nodes N.
The other solution is to switch replication strategy and use the NetworkTopologyStrategy, usually used in multi-datacenter clusters, where you can specify how many replicas you want in each data center. This strategy "places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues."
Look at DataStax documentation for more information.
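For example, a keyspace using this strategy could be defined like this (the keyspace name, DC name and replica count are placeholders for illustration; the DC name must match what nodetool status reports):

    CREATE KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};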
Without vnodes, each physical node owns a single token range. With vnodes, each physical node owns multiple, non-consecutive token ranges (the vnodes), and furthermore vnodes are randomly assigned to physical nodes.
This means that even when data gets replicated on the vnodes right next to the primary replica's node (i.e. when using SimpleStrategy), the replicas will, with high probability but not guaranteed, be on different physical nodes.
This random assignment can be seen in the output of nodetool ring.
More info can be found here.
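If you want to check which physical nodes actually hold the replicas of a given partition, nodetool can show you (the keyspace, table and key below are placeholders):

    nodetool getendpoints my_ks my_table some_partition_key
    # prints the IP address of every node holding a replica of that partition key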
Cassandra stores the replicas of a keyspace on different nodes; it would be nonsensical to keep multiple replicas of the same data on the same node. If the replication factor exceeds the number of nodes, then the number of nodes is your effective replication factor.
But, why is this not an error? Well, this allows for provisioning more nodes later.
As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
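A minimal sketch of that, with a placeholder keyspace name: declare the higher replication factor now, then repair once the extra nodes have joined.

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
    -- once the additional nodes are in the ring, run `nodetool repair my_ks` on each node
    -- so that the new replicas are actually populated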