Re-replication on region failover in YugabyteDB - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
I'd like to use the Multi-DC deployment features, and I found a case where the number of replicas can't be maintained if I modify the placement information and all nodes in one zone fail.
There are three zones and I created two nodes in each zone. I tested with YugabyteDB 2.11.1.0.
I modified the placement info to control the minimum number of replicas in each region:
yb-admin \
-master_addresses ${MASTER_IPADDRS} \
modify_placement_info \
cloud1.region-a.region-a-1,cloud1.region-b.region-b-1,cloud1.region-c.region-c-1 3
What happens if all nodes in region-a-1 fail?
kill -kill ${YB-Tserver processes in region-a-1}
My expectation is that new replicas will be created in region-b-1 or region-c-1.
Should they be created?
The replication factor is 3, but each tablet has only 2 replicas in the YugabyteDB cluster.
Is the above behavior expected, or am I missing something? It seems that the maximum number of replicas in each region is effectively limited to 1.

The modify_placement_info command specifies the minimum (and, when the minimums add up to the replication factor, exact) number of replicas of each tablet to keep in each placement block. It is used to configure how replicas are spread across multi-region deployments. In your case, with replication_factor=3, you want 1 replica in each region, so even if you lose a single region you maintain availability for reads and writes. But if you lose a particular region, new replicas will not be created in the remaining regions; you need to bring the failed region back up.
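If it helps, the same placement can be written with explicit per-block minimums (the :N suffix is the optional count syntax also used in the RF=5 example below); zone names are taken from your command:
yb-admin \
-master_addresses ${MASTER_IPADDRS} \
modify_placement_info \
cloud1.region-a.region-a-1:1,cloud1.region-b.region-b-1:1,cloud1.region-c.region-c-1:1 3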
The other option would be to have replication_factor=5, and then you can keep 2 regions with 2 replicas each and another region with 1 replica like below:
$ ./bin/yb-admin \
-master_addresses $MASTER_RPC_ADDRS \
modify_placement_info \
aws.us-west.us-west-2a:2,aws.us-west.us-west-2b:2,aws.us-west.us-west-2c:1 5
This will place a minimum of:
2 replicas in aws.us-west.us-west-2a
2 replicas in aws.us-west.us-west-2b
1 replica in aws.us-west.us-west-2c
See the doc page of the command for more information.

Related

Cassandra: 2 required but only 1 alive & 3 replica were required but only 2 acknowledged the write

I got two errors when writing data into Cassandra and want to know the difference between them:
3 replica were required but only 2 acknowledged the write
2 required but only 1 alive
Consistency Level is LOCAL_QUORUM.
From my observations, when I got the first exception the data was written to one of the nodes; with the second exception I do not see the data on any node.
Is my observation correct? Please help me understand this.
It's a bit difficult to provide a clear answer without knowing the cluster topology and the keyspace replication factor. The full error messages + full stack trace are also relevant to know.
For LOCAL_QUORUM consistency to require 3 replicas to respond, the keyspace replication factor in the local DC must be 4 or 5 -- the quorum of 4 or 5 is 3.
In the second instance, LOCAL_QUORUM requires 2 replicas when the local replication factor is either 2 or 3. And yes, the quorum of 2 replicas is still 2, meaning your app cannot tolerate an outage if either of those nodes goes down. For this reason, we recommend a minimum of 3 nodes (and a replication factor of 3) in each DC for production clusters. Cheers!
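To make the arithmetic explicit, the quorum is derived from the replication factor of the local DC, not from the total node count:
quorum = floor(RF / 2) + 1
RF = 2 -> quorum = 2
RF = 3 -> quorum = 2
RF = 4 -> quorum = 3
RF = 5 -> quorum = 3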

Add nodes to existing Cassandra cluster

We currently have a 2 node Cassandra cluster. We want to add 4 more nodes to the cluster, using the rack feature. The future topology will be:
node-01 (Rack1)
node-02 (Rack1)
node-03 (Rack2)
node-04 (Rack2)
node-05 (Rack3)
node-06 (Rack3)
We want to use different racks, but the same DC.
But for now we use SimpleStrategy and replication factor is 1 for all keyspaces. My plan to switch from a 2 to a 6 node cluster is shown below:
Change endpoint_snitch to GossipingPropertyFileSnitch.
Alter the keyspaces to NetworkTopologyStrategy ... with replication 'datacenter1': '3'.
According to the docs, when we add a new DC to an existing cluster, we must alter system keyspaces, too. But in our case, we change only the snitch and keyspace strategy, not the Datacenter. Or should I change the system keyspaces strategy and replication factor too, in the case of adding more nodes and changing the snitch?
First, I would change the endpoint_snitch to GossipingPropertyFileSnitch on one node and restart it. You need to make sure that approach works, first. Typically, you cannot (easily) change the logical datacenter or rack names on a running cluster. Technically you're not doing that here, but SimpleStrategy may do some things under the hood to abstract away datacenter/rack awareness, so it's a good idea to test it.
If it works, make the change and restart the other node, as well. If it doesn't work, you may need to add 6 new nodes (instead of 4) and decommission the existing 2 nodes.
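As a rough sketch of the per-node change (file locations vary by install; the dc and rack values below are assumptions based on your target topology and the DC name from your plan):
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (on node-01, for example)
dc=datacenter1
rack=Rack1
Restart the node afterwards and check nodetool status to confirm it still reports the expected datacenter and rack.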
Or should I change the system keyspaces strategy and replication factor too?
Yes, you should set the same keyspace replication definition on the following keyspaces: system_auth, system_traces, and system_distributed.
Consider this situation: If one of your 2 nodes crashes, you won't be able to log in as the users assigned to that node via the system_auth table. So it is very important to ensure that system_auth is replicated appropriately.
I wrote a post on this some time ago (updated in 2018): Replication Factor to use for system_auth
Also, I recommend the same approach on system_traces and system_distributed, as future node adds/replacements/repairs may fail if valid token ranges for those keyspaces cannot be located. Basically, using the same approach on them prevents potential problems in the future.
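For example (the DC name is taken from your plan; use whatever nodetool status reports for your cluster):
ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};
ALTER KEYSPACE system_distributed WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};
Follow that with a nodetool repair of those keyspaces on each node so the new replicas actually receive data.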
Edit 20200527:
Do I need to run nodetool cleanup on the old cluster nodes after the snitch and keyspace topology changes? According to the docs "yes," but only on the old nodes?
You will need to run it on every node, except for the very last one added. The last node is the only node guaranteed to only have data which match its token range assignments.
"Why?" you may ask. Consider the total percentage ownership as the cluster incrementally grows from 2 nodes to 6. If you bump the RF from 1 to 2 (run a repair), and then from 2 to 3 and add the first node, you have a cluster with 3 nodes and 3 replicas. Each node then has 100% data ownership.
That ownership percentage gradually decreases as each node is added, down to 50% when the 6th and final node is added. But even though all nodes will have ownership of 50% of the token ranges:
The first 3 nodes will still actually have 100% of the data set, which is an extra 50% of data beyond what they should own.
The fourth node will still have an extra 25% (3/4 minus 1/2).
The fifth node will still have an extra 10% (3/5 minus 1/2).
Therefore, the sixth and final node is the only one which will not contain any more data than it is responsible for.
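The cleanup itself is a single command per node, run one node at a time (the keyspace argument is optional; the name here is only illustrative):
$ nodetool cleanup
or, restricted to a single keyspace:
$ nodetool cleanup my_keyspace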

Adding nodes to a dispersed gluster cluster

I was reading the gluster documentation and am having difficulty figuring out exactly how my cluster ought to be configured.
Suppose I decided to set up a dispersed distributed cluster with 3 bricks and redundancy = 1.
If I did this do I have to add bricks in groups of 3, or can I add 1 or 2 bricks if desired?
If I add 3 bricks to the cluster, does the redundancy number change? I looked at this thread: https://lists.gluster.org/pipermail/gluster-users/2018-July/034491.html and it says that the redundancy number is constant throughout the life of the cluster, which I find odd. If I start out tiny with 3 nodes and then hit the jackpot and want to seriously ramp up to 60 nodes, a redundancy number of 1 is probably not appropriate, whereas it is appropriate for 3 nodes. With this in mind, if the redundancy number is constant (per the thread quoted), how does one scale a gluster cluster up by an order of magnitude?
Yes, you need to add bricks in groups of 3.
When you add more nodes (in multiples of 3) to expand the volume, you are increasing the distribute count and thereby the volume capacity. The redundancy number applies to each disperse 'sub-volume' of the cluster; it is not something like 1 node of redundancy for every 60 nodes. So, with one brick per node, a 60-node volume scales from 1 x (2+1) to 20 x (2+1), and each of those 20 disperse sub-volumes has a redundancy of 1.
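To illustrate (hostnames and brick paths are made up), the initial 1 x (2+1) volume and the first expansion would look roughly like this:
$ gluster volume create myvol disperse 3 redundancy 1 \
    server1:/bricks/b1 server2:/bricks/b1 server3:/bricks/b1
$ gluster volume add-brick myvol \
    server4:/bricks/b1 server5:/bricks/b1 server6:/bricks/b1
$ gluster volume rebalance myvol start
Each add-brick of 3 more bricks adds another 2+1 disperse sub-volume; the redundancy of each sub-volume stays at 1.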

Alter Keyspace on cassandra 3.11 production cluster to switch to NetworkTopologyStrategy

I have a cassandra 3.11 production cluster with 15 nodes. Each node has ~500GB total with replication factor 3. Unfortunately the cluster is setup with Replication 'SimpleStrategy'. I am switching it to 'NetworkTopologyStrategy'. I am looking to understand the caveats of doing so on a production cluster. What should I expect?
Switching from SimpleStrategy to NetworkTopologyStrategy in a single data center configuration is very simple. The only caveat I would warn about is to make sure you spell the data center name correctly. Failure to do so will cause operations to fail.
One way to ensure that you use the right data center, is to query it from system.local.
cassdba@cqlsh> SELECT data_center FROM system.local;
data_center
-------------
west_dc
(1 rows)
Then adjust your keyspace to replicate to that DC:
ALTER KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy',
'west_dc': '3'};
Now for multiple data centers, you'll want to make sure that you specify your new data center names correctly, AND that you run a repair (on all nodes) when you're done. This is because SimpleStrategy treats all nodes as a single data center, regardless of their actual DC definition. So you could have 2 replicas in one DC, and only 1 in another.
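For example, assuming a second (hypothetical) DC named east_dc alongside west_dc:
ALTER KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy',
'west_dc': '3', 'east_dc': '3'};
Then run a full repair of the keyspace on each node:
$ nodetool repair -full stackoverflow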
I have changed RFs for keyspaces on-the-fly several times. Usually, there are no issues. But it's a good idea to run nodetool describecluster when you're done, just to make sure all nodes have schema agreement.
Pro-tip: For future googlers, there is NO BENEFIT to creating keyspaces using SimpleStrategy. All it does is put you in a position where you have to fix it later. In fact, I would argue that SimpleStrategy should NEVER BE USED.
So when will the data movement commence? In my case, since I have specific rack IDs now, I expect my replicas to switch nodes upon this ALTER KEYSPACE action.
This alone will not cause any adjustments of token range responsibility. If you already have a RF of 3 and so does your new DC definition, you won't need to run a repair, so nothing will stream.
I have a 15-node cluster which is divided into 5 racks, so each rack has 3 nodes belonging to it. Since I previously had replication factor 3 and SimpleStrategy, more than 1 replica could have belonged to the same rack, whereas NetworkTopologyStrategy guarantees that no two replicas will belong to the same rack. So shouldn't this cause data to move?
In that case, if you run a repair, your secondary or tertiary replicas may find a new home. But your primaries will stay the same.
So are you saying that nothing changes until I run a repair?
Correct.

Hazelcast - PartitionGroup + Multiple Backups

Assuming 4 nodes split across 2 data centers (DC1-1, DC1-2, DC2-1, DC2-2).
Using partition groups and the default backup count of 1, the documentation and other questions/articles are pretty clear about how data is distributed, assuming evenly distributed data: 25% per node as primary, and all the primary data in DC1-1/DC1-2 is backed up on either DC2-1 or DC2-2, and vice versa.
It is not clear what the expected behavior is in the same situation if we increase the backup count to 2. Assume entry #1 currently has its primary on DC1-1. Would both backup copies be forced onto the two DC2 nodes? Is there a way to ensure there is one backup in each partition group (i.e. primary on DC1-1, a backup on DC1-2, and a backup on either DC2-1 or DC2-2)?
Thanks
First of all, we do not recommend splitting a single cluster over multiple data centers. There are possible exceptions, but keep in mind that latency between data centers is important, since the data is partitioned across them.
To your question:
If you have just two partition groups defined, there is no way to create more than one backup. Think of a normal cluster as having one partition group per node; with pG partition groups you can have at most pG-1 backups. If you change the configuration to 2 partition groups, that means you can only have one backup.
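For reference, this is roughly what the grouping in your scenario looks like in the XML configuration (the interface patterns and map name are placeholders); with only the two member-groups below, a backup-count of 2 cannot be satisfied across distinct groups:
<hazelcast>
  <partition-group enabled="true" group-type="CUSTOM">
    <member-group>
      <!-- DC1-1 and DC1-2 -->
      <interface>10.10.1.*</interface>
    </member-group>
    <member-group>
      <!-- DC2-1 and DC2-2 -->
      <interface>10.10.2.*</interface>
    </member-group>
  </partition-group>
  <map name="default">
    <!-- only one backup can be placed in a group other than the owner's -->
    <backup-count>1</backup-count>
  </map>
</hazelcast>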
