I am new to Cassandra and at work I have a 4-node cluster.
nodetool gossipinfo tells me that there is one datacentre, 2 racks and 2 nodes in each rack. The replication factor is defined as 2. nodetool ring tells me that each node has 50% ownership. There are 2 seed nodes in our config; each rack has 1 seed node.
Does this mean that for each rack, there is one seed node and its replica node? If that is the case, then why is the data size not the same for the seed node and its replica node?
What happens if one node goes down? Will it have any impact on the data availability of the cluster?
Seeds
Seed nodes are only special in that new nodes joining the cluster contact the seed nodes to learn about the other nodes and the topology of the ring. But in Cassandra, all nodes are equal, i.e. there is no master or slave, no primary or secondary node. Because of this, you can elect any (or all) nodes as seeds.
Since seeds only relate to gossip information, they have nothing to do with replicated data.
Size
In relation to data size, nodes will never be exactly the same, since partition/row sizes are never the same. If you look at the nodetool cfstats output, you will see that there is a big range between the minimum and maximum partition sizes.
Availability
If reads are done with consistency level CL=ONE, then when a node is down the other replica will continue to serve requests. But if reads are done with a higher consistency level, they will fail, since 2 nodes need to be available: CL=LOCAL_QUORUM requires floor(RF/2) + 1 replicas to respond.
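To put numbers on that for this cluster: with RF=2, LOCAL_QUORUM = floor(2/2) + 1 = 2, so both replicas must respond and a single node being down fails quorum reads/writes for the ranges it holds; with RF=3 it would be floor(3/2) + 1 = 2, so one replica could be down.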
EDIT: Response to:
Shouldn't each node own 25%?
Ownership
In Cassandra, data is not "distributed" across ALL nodes in ALL DCs. In fact, each DC holds its own copy of the data, with the number of replicas per DC set by the replication factor.
To illustrate, consider the following keyspace definition:
CREATE KEYSPACE "myKS"
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'DC1' : 2,
'DC2' : 2};
Based on this definition, the myKS keyspace has 2 replicas in DC1 and 2 replicas in DC2. Since each of your data centres has only 2 nodes, this effectively means that each DC is a copy of the other.
Following from that, since the tokens are split between 2 nodes, each node owns half of the data, which is 50%. So in DC1 each node owns 50%, and in DC2 (which is a copy of DC1) each node also owns 50%.
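To make that concrete (ignoring vnodes for simplicity): with Murmur3 tokens running from -2^63 to 2^63 - 1, two nodes split that range into two halves, so each node is the primary owner of 50% of the tokens, and with RF=2 each node additionally holds the second replica of the other node's half.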
Related
My Cassandra cluster consists of 2 DCs; each DC has 5 nodes and the replication factor per DC is 3. Both DCs are hosted on the same Docker orchestrator. This is a legacy setup, probably done during the last major system migration years ago. At the moment I don't see any advantage in having 2 DCs with the same replication factor of 3; this way the same data is written 6 times. The cluster is at least 80% write heavy; reads are more or less limited.
Cassandra load is struggling at peak times, so I would like to have 1 DC with 10 nodes (instead of 2 DCs x 5 nodes) to be able to balance across 10 nodes instead of just 5. This way I will also bring down the data size per node. With the same amount of RAM and CPU dedicated to Cassandra, I would gain performance and free storage space ;-)
So the idea is to decommission DC2 and bring all 5 of its nodes into DC1 as brand new nodes.
The steps are known:
alter keyspaces to be limited to DC1 only (roughly as sketched below);
ensure no clients write/read to/from DC2 - DCAwarePolicy with LOCAL_* consistency.
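For reference, the keyspace change I have in mind is roughly this (keyspace name and RF are just an example):
ALTER KEYSPACE my_ks
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3};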
I wonder about the next step - it says I need to start decommissioning DC2 node by node. Is this mandatory, or could I somehow just take those nodes down? The goal is not to decommission some nodes, but all nodes in a DC. If I decommission, say, node5, its data would be transferred to the remaining 4 nodes, and so on. At some point I would be left with 3 nodes and replication factor 3, so I wouldn't be able to decommission any further. What's more, I guess there would be no free space left on those node volumes, and I am not willing to extend them any further.
So my questions are:
Is there a way to alter the keyspace to DC1 only, then just bring all DC2 nodes down, erase their volumes and add them one by one to DC1, expanding DC1? Basically, to decommission all DC2 nodes at once?
Is there a way to move those 5 DC2 nodes to DC1 even more quickly (in the end they contain the same data as the 5 nodes in DC1)? Like just joining them to DC1 with all the data they contain?
What is the advantage of having 2 DCs in a single cluster, instead of having a single DC with double the nodes? Or does it strongly depend on the usage and the way services write and read data from Cassandra?
Appreciate your replies, thanks.
Cheers,
OvivO
Is there a way to alter the keyspace to DC1 only, then just bring all DC2 nodes down, erase their volumes and add them one by one to DC1, expanding DC1? Basically, to decommission all DC2 nodes at once?
Yes, you can adjust the keyspace definition to just replicate within DC1. Since you're basically removing a DC, you could shut them all down, and run a nodetool removenode for each. In theory, that would remove the nodes from gossip and (if they're down) not attempt to move data around. Then yes, add each node back to DC1, one at a time. Once you're done, run a repair, followed by a nodetool cleanup on each node.
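As a rough sketch of the node-removal part (host IDs are placeholders taken from nodetool status; run the repair and cleanup on every remaining node):
# after altering the keyspace and shutting down the DC2 nodes, from a live DC1 node:
nodetool removenode <host-id-of-dc2-node>   # repeat for each downed DC2 node
# re-add each former DC2 node to DC1 as a fresh node (data directories wiped), then on each node:
nodetool repair -full
nodetool cleanup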
Is there a way to move those 5 DC2 nodes to DC1 even more quickly (in the end they contain the same data as the 5 nodes in DC1)? Like just joining them to DC1 with all the data they contain?
No. Token range assignment is DC dependent. If they moved to a new DC, their range assignments would change, and the nodes would very likely be responsible for different ranges of data.
What is the advantage of having 2 DCs in a single cluster, instead of having a single DC with double the nodes?
Geographic awareness. If you have a mobile app and users on both the West Coast and East Coast, you don't want your East Coast users making a call for data all the way to the West Coast. You want that data call to happen as locally as possible. So, you'd build up a DC on each coast, and let Cassandra keep them in-sync.
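As a sketch only (the keyspace and DC names below are made up; DC names have to match what your snitch reports):
CREATE KEYSPACE user_data
  WITH REPLICATION = {
    'class' : 'NetworkTopologyStrategy',
    'us_east' : 3,
    'us_west' : 3};
Clients on each coast then connect with a DC-aware load-balancing policy and LOCAL_QUORUM, so reads and writes stay local while Cassandra replicates across the DCs in the background.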
For testing purposes I want to break the consistency of data in my test Cassandra cluster, which consists of two datacenters.
I assumed that if I used a consistency level of LOCAL_QUORUM or LOCAL_ONE I would achieve this. Let's say I have a Cassandra node node11 belonging to DC1:
cqlsh node11
CONSISTENCY LOCAL_QUORUM;
INSERT INTO test.test (...) VALUES (...) ;
But in fact, the data appears on all nodes. I can read it from node22, belonging to DC2, even with a LOCAL_* consistency level. I've double-checked: nodetool shows me the two datacenters, and node11 certainly belongs to DC1, while node22 belongs to DC2.
My keyspace test is configured as follows:
CREATE KEYSPACE "test"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2};
and I have two nodes in each DC respectively.
My questions:
It seems to me that I misunderstand the idea of these consistency levels. In fact they do not prevent data from being written to the other DC, but just require the data to appear at least in the current datacenter. Is that a correct understanding?
More essentially: is there any way to perform such a trick and achieve such "broken" consistency, where different data is stored in the two datacenters within one cluster?
(At the moment I think the only way to achieve that is to break the ring and not allow nodes from one DC to know anything about nodes from the other DC, but I don't like this solution.)
LOCAL_QUORUM: this consistency level requires a quorum of acknowledgements from the local DC, but the data is still sent to all replicas defined by the keyspace.
Even at low consistency levels, the write is still sent to all replicas for the written key, even replicas in other data centers. The consistency level just determines how many replicas are required to respond that they received the write.
https://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
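One way to observe this (a sketch; the columns and values are hypothetical, since the original INSERT is elided): turn on tracing in cqlsh and write at LOCAL_ONE; the trace should show the mutation being sent to replicas in the remote DC as well, even though only one local acknowledgement was required.
CONSISTENCY LOCAL_ONE;
TRACING ON;
INSERT INTO test.test (id, value) VALUES (1, 'x');   -- hypothetical columns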
I don't think there is a proper way to do that.
This suggestion is for a test scenario only, to break data consistency between the 2 DCs (I haven't tried it, but based on my understanding it should work); a rough cqlsh sketch follows the steps below.
Write data to one DC (say DC1) with LOCAL_* consistency.
Before writing, take the DC2 nodes down, so DC1 will store hints for them while they are down.
Let the max_hint_window_in_ms window (3 hours by default; you can reduce it) pass, so that the DC1 coordinator drops the hints.
Start the DC2 nodes and query with LOCAL_* consistency; the data written to DC1 won't be present in DC2.
You can repeat these steps and insert data in DC2 with different values while keeping DC1 down, so the same rows will have different values in DC1 and DC2.
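A rough cqlsh sketch of the write/read part (keyspace, table and columns are hypothetical; stopping/starting nodes and the hint window are handled outside cqlsh):
-- on a DC1 node, while all DC2 nodes are down:
CONSISTENCY LOCAL_QUORUM;
INSERT INTO test.kv (id, value) VALUES (1, 'written-only-in-DC1');
-- later, after max_hint_window_in_ms has passed and DC2 is back up, on a DC2 node:
CONSISTENCY LOCAL_ONE;
SELECT * FROM test.kv WHERE id = 1;   -- returns nothing until a repair runs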
I set up my keyspace like the following.
CREATE KEYSPACE name_of_keyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3};
If I want to follow the rule of this keyspace, do I need to have 3 or 4 nodes in dc1?
The reason I'm confused is that there seem to be two different types of nodes: one is the coordinator node, and the other is a general node that can be chosen when a node fails.
Should I count the coordinator node as one of the general nodes and create only 3 nodes in dc1, or create 4 nodes to make this work?
In Cassandra, all nodes can act as a coordinator. For any given request, the node the client connected to acts as the coordinator.
An RF of 3 with 4 nodes is fine for a DC, but the fourth node is not needed unless there is a capacity target you are trying to reach with it. In one of my clusters we have 18 nodes for capacity with an RF of 3. That's generally how you scale Cassandra.
The coordinator node is chosen at query time. All nodes have the same capabilities.
When you run a cluster with RF 3 and run a query against a partition, you:
need only one replica up if you read/write with consistency level ONE;
need two replicas up if you read/write with QUORUM or TWO;
need three replicas up if you read/write with ALL or THREE.
Note that reads/writes are issued to all replicas that hold (or should hold) the data, but the coordinator only waits for the configured consistency level.
Check this page for more information about consistency levels.
So, you can run a 3-node cluster with RF 3, and depending on what CL you read/write with, you can survive 0, 1, or 2 nodes being down.
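For example, in cqlsh you can set the level per session and see the trade-off directly (keyspace and table are hypothetical):
-- needs only 1 of the 3 replicas up:
CONSISTENCY ONE;
SELECT * FROM ks.tbl WHERE id = 1;
-- needs 2 of the 3 replicas up:
CONSISTENCY QUORUM;
-- needs all 3 replicas up:
CONSISTENCY ALL;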
I am working on creating a Cassandra cluster.
Our system is write heavy, and we are planning to use 3 seed nodes and a total of 10 Cassandra nodes (including the 3 seed nodes).
We are using a replication factor of 3 and consistency level QUORUM.
Is there any consideration of an odd/even number of Cassandra nodes based on the replication factor / number of seed nodes?
The number of seed nodes is unrelated to the replication factor. The seeds are used when a new node joins the cluster. New nodes consult the seeds to get their initial configuration and learn the addresses of the other nodes. You need 2-3 seeds to provide redundancy, that's all.
The replication factor indicates how many nodes have copies of the data, as you probably know. RF=3 means three nodes have copies of the data. Consistency level QUORUM means that 2 nodes need to reply to the coordinator (because 2 is a quorum of 3). This has nothing to do with the number of nodes in the cluster, as long as you have at least 3 nodes for RF=3! Even/odd doesn't matter, and the number of seeds doesn't matter.
While installing Cassandra on a single node to run some tests, we noticed that we were using an RF of 3 and everything was working correctly.
This is of course because that node has 256 vnodes (by default), so the same data can be replicated on the same node in different vnodes.
This is worrying, because if that one node were to fail, you'd lose all your data even though you thought the data was replicated on different nodes.
How can I be sure that in a standard installation (with a ring of several nodes) the same data will not be replicated on the same "physical" node? Is there a setting to prevent Cassandra from using the same node for replicating data?
Replication strategy is schema dependent. You probably used the SimpleStrategy with RF=3 in your schema. That means that each piece of data will be placed on the node determined by the partition key, and successive replicas will be placed on the successive nodes. In your case, the successive node is the same physical node, hence you get 3 copies of your data there.
Increasing the number of nodes solves your problem. In general, your data will be placed on different physical nodes when your replication factor RF is less than or equal to your number of nodes N.
The other solution is to switch the replication strategy and use NetworkTopologyStrategy, usually used in multi-datacenter clusters, where you can specify how many replicas you want in each data center. This strategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
Look at DataStax documentation for more information.
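If you do switch strategies, a minimal sketch (keyspace and data centre name are placeholders; the DC name must match what your snitch reports, and a repair is needed afterwards):
ALTER KEYSPACE my_ks
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3};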
Without vnodes, each physical node owns a single token range. With vnodes, each physical node owns multiple, non-consecutive token ranges (each such range is a vnode), and furthermore vnodes are randomly assigned to physical nodes.
This means that even when data gets replicated onto the vnodes right next to the primary replica's vnode (i.e. when using SimpleStrategy), the replicas will - with high probability, but not guaranteed - end up on different physical nodes.
This random assignment can be seen in the output of nodetool ring.
More info can be found here.
Cassandra stores replicas of the same data on different nodes; it would be nonsensical to place multiple replicas on the same node. If the replication factor exceeds the number of nodes, then the number of nodes effectively becomes your replication factor.
But, why is this not an error? Well, this allows for provisioning more nodes later.
As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
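If you do raise the replication factor later, a minimal sketch (keyspace name and strategy are placeholders) is an ALTER KEYSPACE followed by a repair so the new replicas are actually populated:
ALTER KEYSPACE my_ks
  WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 3};
-- then run nodetool repair my_ks on each node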