Cassandra Replication Factor

Let's say I have two data centers (DC1, DC2) in a single Cassandra cluster.
DC1 - 4 nodes.
DC2 - 4 nodes.
Initially I have set the replication factor for all the keyspaces to {DC1: 2, DC2: 2} (NetworkTopologyStrategy).
But after some time, let's say I alter the keyspaces and change the replication factor to {DC2: 2} for all of them, removing DC1 so that it has no replication factor.
So now what will happen? Will DC1 get any data written into it in the future?
Will all the token ranges be assigned to only DC2?

If you exclude DC1, it won't get data written for that keyspace, nor will data be read from DC1. Before switching off DC1, make sure that you perform nodetool repair on the servers in DC2, to make sure that you have all data synchronized.
When you change the RF for a specific keyspace, the drivers and Cassandra itself recalculate the token range assignments, taking into account the information about data centers.
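As a minimal sketch of that sequence (using a hypothetical keyspace name my_ks; adjust to your own keyspaces), the change could look like this:
-- First, on each DC2 node, synchronize the data:
--   nodetool repair -full my_ks
-- Then drop DC1 from the replication settings:
ALTER KEYSPACE my_ks
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC2' : 2};
-- From this point on, no new writes for my_ks land on DC1,
-- and all token ranges for my_ks are served by DC2 only.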

Related

What does Cassandra system_auth replication factor 2 mean?

As I read and understood from the official Cassandra documentation and from other posts here, system_auth is configured with a replication factor of 1 by default.
But I would like to understand how system_auth replication works if I configure the value as system_auth replication = 2.
Which two nodes will maintain the replicas?
There will be two copies of the system_auth keyspace spread across ALL of your nodes. That way, if one goes down, the data is still available on another node. Different entries to system_auth may be stored on different nodes, but there will always be two copies.
If your replication factor = the number of nodes, then each node will hold all the system_auth data. If your replication factor > the number of nodes, you gain nothing, since all nodes already have a full copy of the data, so there is no extra safety here. If your replication factor < the number of nodes, no node will hold a complete copy of the data, but each node will hold a portion of it.
Here, system_auth replication = 2 means the data in system_auth will be replicated on 2 nodes of the cluster (2 copies of the data in total). If one node goes down, you are still able to log in and authenticate against the cluster.
You may increase the replication factor as well.
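As a hedged example of doing that (SimpleStrategy shown for a single-DC cluster; in a multi-DC setup use NetworkTopologyStrategy with your own DC names):
ALTER KEYSPACE system_auth
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 2};
-- Then run "nodetool repair system_auth" on each node so the existing
-- credentials are actually copied to the new replica.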

Insert rows only in one datacenter in cassandra cluster

For some test purposes I want to break a consistency of data in my test cassandra cluster, consisting of two datacenters.
I assumed that if I use a consistency level equal to LOCAL_QUORUM or LOCAL_ONE, I will achieve this. Let us say I have a Cassandra node node11 belonging to DC1:
cqlsh node11
CONSISTENCY LOCAL_QUORUM;
INSERT INTO test.test (...) VALUES (...) ;
But in fact, the data appears on all nodes. I can read it from node22, belonging to DC2, even with a LOCAL_* consistency level. I've double-checked: nodetool shows me the two datacenters, and node11 certainly belongs to DC1, while node22 belongs to DC2.
My keyspace test is configured as follows:
CREATE KEYSPACE "test"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2};
and I have two nodes in each DC respectively.
My questions:
It seems to me that I have wrongly understood the idea of these consistency levels. In fact they do not prevent data from being written to the other DCs, but only require the data to appear in at least the current datacenter. Is that a correct understanding?
More essentially: is there any way to perform such a trick and achieve such a "broken" consistency, where I have different data stored in the two datacenters within one cluster?
(At the moment I think that the only way to achieve that is to break the ring and not allow nodes from one DC to know anything about nodes from the other DC, but I don't like this solution.)
LOCAL_QUORUM: this consistency level requires a quorum of acknowledgements from the local DC, but the data is still sent to all the replica nodes defined for the keyspace.
Even at low consistency levels, the write is still sent to all
replicas for the written key, even replicas in other data centers. The
consistency level just determines how many replicas are required to
respond that they received the write.
https://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
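To see this in practice with both DCs up, a read from DC2 right after the write from node11 already returns the row (a quick sketch against the same test.test table):
cqlsh node22
CONSISTENCY LOCAL_ONE;
SELECT * FROM test.test;
-- the row inserted from node11 comes back, because the write was shipped to the
-- DC2 replicas as well; LOCAL_QUORUM only controlled how many DC1 replicas had
-- to acknowledge before the INSERT was reported as successful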
I don't think there is a proper way to do that.
This suggestion is for a test scenario only, to break data consistency between the 2 DCs (I haven't tried it, but based on my understanding it should work; see the sketch after these steps):
Before writing, keep the nodes in DC2 down; then write data in DC1 with LOCAL_* consistency, so the DC1 coordinator stores hints for the down DC2 nodes.
Let max_hint_window_in_ms (3 hours by default, and you can reduce it) pass, so that the DC1 coordinator deletes all the hints.
Start the DC2 nodes and query with LOCAL_* consistency; the data written in DC1 won't be present in DC2.
You can repeat these steps and insert data in DC2 with different values while keeping DC1 down, so the same rows will have different values in DC1 and DC2.
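A minimal cqlsh sketch of those steps, using a made-up table test.split_check just for the experiment (how you stop and start the DC2 nodes depends on your setup):
-- While both DCs are still up, create a throwaway table:
CREATE TABLE IF NOT EXISTS test.split_check (id int PRIMARY KEY, val text);
-- Stop all DC2 nodes, then on a DC1 node:
CONSISTENCY LOCAL_QUORUM;
INSERT INTO test.split_check (id, val) VALUES (1, 'written-in-DC1');
-- Wait longer than max_hint_window_in_ms, start the DC2 nodes, then on a DC2 node:
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM test.split_check WHERE id = 1;
-- per the steps above, no row should come back here, while the same SELECT
-- on a DC1 node still returns it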

Read consistency LOCAL_QUORUM implication in cassandra

Thanks for your answer, Nikita. Also, one more clarification. Assume I use LOCAL_QUORUM for read consistency in my multi-DC cluster with three DCs - DC1, DC2, DC3 - with three nodes in each DC and a replication factor of 3. During a read, let us assume the request first lands on a node in DC1. That node has failed, so a second node in DC1 is contacted, and so on, until all nodes in DC1 have failed. Will the cluster then connect to either DC2 or DC3 to satisfy LOCAL_QUORUM, i.e. look for acknowledgements from two consistent reads from one of those DCs (either DC2 or DC3)? I am not expecting one read from DC2 and another from DC3. What I mean to ask is: if the cluster falls back on DC2 after all DC1 nodes fail, will it start evaluating LOCAL_QUORUM from the perspective of DC2, and if yes, will the cluster call it a successful read?
A CQL query won't hit other data centers if LOCAL_QUORUM can't be satisfied in the local data center. However, drivers do implement such a feature via DCAwareRoundRobinPolicy, as you mentioned, but it seems that it's not recommended. This article can also be helpful for choosing a proper consistency level.

Replication factor in Cassandra

I am a newbie to Cassandra.
What exactly does the replication factor in Cassandra mean?
For example,
I have a 3-node cluster (node1, node2, node3), and if I create a keyspace with replication factor 1 and insert data through node1, can I read the data from the other 2 nodes?
Or will it store the data only on node1? Is the data available on the other 2 nodes for read/write operations?
The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row, held on a single node. You should still be able to read/write that data through the other two nodes (the node you contact acts as coordinator and forwards the request to the replica), provided the ports and firewalls between the nodes allow it.
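A small sketch of that setup (the keyspace, table and column names are made up for illustration):
-- from cqlsh on node1
CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
CREATE TABLE demo.users (id int PRIMARY KEY, name text);
INSERT INTO demo.users (id, name) VALUES (1, 'alice');
-- from cqlsh on node2 or node3: the single replica of this row lives on just one
-- node, but whichever node you connect to acts as coordinator and fetches it
SELECT * FROM demo.users WHERE id = 1;
-- caveat: with replication factor 1, if the node owning a row is down,
-- reads and writes for that row fail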

Cassandra Cluster 1.1.10

I am new to Cassandra and at work I have a 4 node cluster.
nodetool gossipinfo tells me that there is one datacentre, 2 racks and 2 nodes in each rack. The replication factor is defined as 2. nodetool ring tells me that each node has 50% ownership. There are 2 seed nodes in our config; each rack has 1 seed node.
Does this mean that for each rack, there is one seed node and its replicated node? If that is the case, then why is the data size not the same for a seed node and its replicated node?
What happens if one node goes down? Will it have any impact on the data availability of the cluster?
Seeds
Seed nodes are only special in the way that new nodes joining the cluster contact the seed nodes to find out about other nodes and the topology of the ring. But in Cassandra, all nodes are the same, i.e. there is no master or slave, no primary or secondary node. Because of this, you can elect any (or all) nodes as seeds.
Since seeds only relate to gossip information, they do not have anything to do with replicated data.
Size
In relation to data size, each node will never hold exactly the same amount, since partition/row sizes are never the same. If you look at the nodetool cfstats output, you will see that there is a big range between the minimum and maximum sizes.
Availability
If the reads are done with consistency level CL=ONE, then if a node is down the other replica will continue to serve requests. But if reads are done with a higher consistency level, then reads will fail, since they need 2 nodes to be available: CL=LOCAL_QUORUM requires (RF/2 + 1) nodes (integer division) to respond.
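As a quick worked example for this cluster's RF = 2: LOCAL_QUORUM needs (RF/2 + 1) = (2/2 + 1) = 2 replicas, i.e. both of them (the table name below is a placeholder):
CONSISTENCY ONE;
SELECT * FROM myks.mytable WHERE id = 1;   -- still succeeds with one replica down
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM myks.mytable WHERE id = 1;   -- needs both replicas, so it fails while one of them is down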
EDIT: Response to:
Shouldn't each node own 25%?
Ownership
In Cassandra, data is not "distributed" across ALL nodes in ALL DCs. In fact, a DC is a copy of another DC depending on the replication factor.
To illustrate, consider the following keyspace definition:
CREATE KEYSPACE "myKS"
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'DC1' : 2,
'DC2' : 2};
Based on this definition, it means that the myKS keyspace has 2 replicas in DC1 and 2 replicas in DC2. Since each of your data centres only have 2 nodes, this effectively means that each DC is a copy of each other.
Following from that, since the tokens are split between 2 nodes, each node owns half of the data which is 50%. So in DC1, each node owns 50% and in DC2 (which is a copy of DC1) each node also owns 50%.
