Synchronizing keyspaces in a new Cassandra datacenter

I have a question about a potential scenario and want to know if our assumption is correct (using Cassandra 3.x with DSE 5.x).
We've learned from the docs that in order to add a new (and fresh) datacenter to a cluster, we need to temporarily set the replication factor like so:
{'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 0 }
Where DC1 is the currently running datacenter and DC2 is the one we are adding.
This test helped us understand the impact of streaming data from an existing live ring to a brand new one.
Now to our hypothetical scenario: we want to take a keyspace that was initially replicated to only one DC and start replicating it to other currently running DCs as well.
When creating the keyspace:
CREATE KEYSPACE Foo WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US' : 2, 'EU' : 0};
Then, when business requirements change:
ALTER KEYSPACE Foo WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US' : 2, 'EU' : 2};
Is it considered safer to define all new keyspaces in an application with every other DC set to 0, so that the value can be raised at some point? And would changing that replication factor be enough to trigger the streaming of the keyspace to the other datacenters, or do we also need to run nodetool rebuild?

The accepted practice is simply not to define a replication factor for a DC that you don't want a particular keyspace to replicate to. I don't think anything bad would happen if you did it your way, but not defining it is the safer way to go.
would changing that replication factor be enough to trigger the streaming of the keyspace to the other datacenters - or do we also need to run nodetool rebuild?
Altering the replication factor on the keyspace will cause all future writes to that keyspace to also go to the new datacenter. However, for the existing data to replicate to the new datacenter, you will have to run nodetool repair or nodetool rebuild.
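As a rough sketch of the whole sequence, using the hypothetical keyspace Foo from above (US is the existing DC, EU the new one):
-- Run once from any node, after the EU nodes have joined the cluster:
ALTER KEYSPACE Foo WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US' : 2, 'EU' : 2};
-- Then, from the shell on each node in EU, stream the pre-existing data:
-- nodetool rebuild -- US
Without the rebuild step, only writes issued after the ALTER would reach EU.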

Related

Removed DC2 from replication but still able to query data from DC2 nodes

I am trying to remove DC2 from my Cassandra cluster. To do this, I started by altering the replication factor for DC2 from 2 to 0. When I insert a row on a node in DC1, I can still read that row when querying from DC2 nodes.
Why is this happening?
I'm assuming that you're querying the data with cqlsh. By default it uses a consistency level of ONE, so it will query any replica; in your case, they all happen to be in DC1.
If you query with a DC-local consistency level instead, you will probably get the result (or lack thereof) that you're expecting.
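For example, a quick check in cqlsh on a DC2 node (the keyspace, table, and key are hypothetical):
CONSISTENCY LOCAL_ONE;
SELECT * FROM some_ks.some_table WHERE id = 1;
-- With no replicas left in DC2, this now fails with an "unavailable"
-- error instead of silently fetching the row from DC1.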
As a side note, although setting replication to 0 is technically valid, it is more customary to simply remove a DC completely from replication so you end up with:
ALTER KEYSPACE some_ks WITH REPLICATION = { \
'class' : 'NetworkTopologyStrategy', \
'DC1' : 3
}

Use of replication_factor in Cassandra

I have this code:
CREATE KEYSPACE "KeySpace Name"
WITH replication = {'class': 'Strategy name', 'replication_factor': 'No. of replicas'};
What is the use of 'replication_factor' in the above Cassandra query?
The replication factor is an integer: it determines how many copies of your data are stored across the cluster. You usually replicate data to achieve high availability, which of course comes at the cost of extra storage. A replication factor (RF) of 3 is by far the most common value; it works as long as you have at least 3 nodes in your cluster.
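For instance, a keyspace whose data is stored three times (the keyspace name is just an example):
CREATE KEYSPACE sensor_data WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- Every row written to sensor_data lives on 3 different nodes, so reads
-- at consistency level ONE still succeed if up to 2 of them are down.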

Lost data after running nodetool decommission

I have a 3-node cluster with 1 seed and nodes in different zones, all running in GCE with GoogleCloudSnitch.
I wanted to change the hardware on each node, so I started by adding a new seed in a different region, which joined the cluster perfectly. Then I ran "nodetool decommission" on each node in turn; once it finished and the node was down, "nodetool status" stated it was no longer in the cluster, and I removed it. I did this for all nodes, and lastly for the "extra" seed in the different region, just to remove it and get back to a 3-node cluster.
We lost data! What can possibly be the problem? I saw a command, "nodetool rebuild", which I ran and actually got some data back. "nodetool cleanup" didn't help either. Should I have run "nodetool flush" prior to "decommission"?
At the time of running "decommission", most keyspaces had:
{'class' : 'NetworkTopologyStrategy', 'europe-west1' : 2}
Should I have first altered the keyspaces to include the new region/datacenter, which would be 'europe-west3' : 1 since only one node exists in that datacenter? I also noted that some keyspaces in the cluster mistakenly had:
{'class' : 'SimpleStrategy', 'replication_factor' : 1}
Could this have caused the loss of data? It seems that it was in the SimpleStrategy keyspaces that the data was lost.
(Disclaimer: I'm a ScyllaDB employee)
Did you first add new nodes to replace the ones you were decommissioning, and configure the keyspace replication strategy accordingly? (You only mentioned the new seed node in your description; you did not mention whether you did this for the other nodes.)
Your data loss can very well be a result of the following:
1) Not altering the keyspaces to include the new region/zone with the proper replication strategy and replication factor (a fix along these lines is sketched below).
2) Keyspaces that were configured with SimpleStrategy (which is not network-topology-aware) and a replication factor of 1. This means the data was stored on only one node, so once that node was decommissioned, you basically lost the data.
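Before decommissioning anything, those keyspaces could have been re-pointed at the new topology, roughly like this (some_ks is a placeholder keyspace name):
ALTER KEYSPACE some_ks WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'europe-west1' : 2, 'europe-west3' : 1};
-- Then, from the shell on each node in the new datacenter:
-- nodetool rebuild -- europe-west1
-- so the existing data is streamed over before any old node is removed.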
Did you by any chance take snapshots and store them outside your cluster? If you did, you could try to restore them.
I would highly recommend reviewing these procedures to better understand the proper way to perform what you intended:
http://docs.scylladb.com/procedures/add_dc_to_exist_dc/
http://docs.scylladb.com/procedures/replace_running_node/

Cassandra one-way replication

Does Cassandra support one-direction replication? Say I have 2 DCs, DC1 and DC2. Real-time data is written only in DC1 and replicated asynchronously to DC2. Is there a way to ensure that if I write to the same data in DC2, it does not get replicated back to DC1?
There is no concept of one-way replication. If a DC's replication factor is 2, Cassandra will replicate the data to two nodes in that DC. With DC1 and DC2 you have to use NetworkTopologyStrategy and define a replication factor for each DC; the configured snitch then determines which nodes in each datacenter store the replicas.
This is controlled per keyspace when you create it.
Let's say you want keyspace1 replicated to one datacenter and keyspace2 to both:
This will replicate your data to one datacenter:
CREATE KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };
And this on both datacenters :
CREATE KEYSPACE keyspace2
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1, 'datacenter2' : 1};
There is no concept of one-way replication. You have a few options:
1) Use DC-local consistency levels (LOCAL_*) for writes to DC2 so the app doesn't block on replicating back to DC1 (see the sketch below).
2) Keep the DCs in separate rings and bulk load asynchronously with sstableloader.
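To illustrate option 1 in cqlsh (the keyspace and table are hypothetical):
CONSISTENCY LOCAL_QUORUM;
INSERT INTO ks.events (id, payload) VALUES (1, 'x');
-- The coordinator acknowledges as soon as a quorum of replicas in the
-- local DC respond; replication to the remote DC still happens in the
-- background, it just isn't waited on.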

Cassandra different replication factor across cluster

Is it possible to have different replication settings on different nodes of the same cluster?
(All DCs have same keyspace/tables, but different replication settings)
We would like to have DC1 and DC2 collecting sensor data on different geographical locations, and sending these to a DC3. So DC3 contains all data from DC1 + DC2.
However, DC1 and DC2 should not contain each other's data (only data which was written by local clients).
Can this be achieved in Cassandra by having different keyspace replication settings on the DCs?
On DC1: 'DC1':1, 'DC3':1
On DC2: 'DC2':1, 'DC3':1
On DC3: 'DC3':1
You can't really do this with NetworkTopologyStrategy. Depending on how much effort you want to put into this you could implement your own replication strategy. I don't think this is very common, but Cassandra does allow it and it likely wouldn't be too difficult to implement what you want (take a look at NTS's implementation as an example).
If you don't want to implement your own strategy I would recommend creating 2 keyspaces with the following configuration:
CREATE KEYSPACE keyspace1
  WITH replication = {
    'class' : 'NetworkTopologyStrategy',
    'DC1' : 1,
    'DC3' : 1
  };
CREATE KEYSPACE keyspace2
  WITH replication = {
    'class' : 'NetworkTopologyStrategy',
    'DC2' : 1,
    'DC3' : 1
  };
Then, depending on the location of your client, you would write to one keyspace or the other.
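A sketch of how clients would use this, with a hypothetical readings table created in both keyspaces:
CREATE TABLE keyspace1.readings (
    sensor_id int,
    ts timestamp,
    value double,
    PRIMARY KEY (sensor_id, ts)
);
-- A client located in DC1 writes here; replicas land only in DC1 and DC3:
INSERT INTO keyspace1.readings (sensor_id, ts, value) VALUES (42, toTimestamp(now()), 21.5);
-- Clients in DC2 write to keyspace2.readings instead, so DC3 ends up with
-- all the data while DC1 and DC2 never hold each other's rows.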
