Changing the replication factor in Cassandra

We have a 3-node Cassandra cluster, and we created a keyspace with a replication factor of 1. After we altered the keyspace's replication factor to 2 (and ran a nodetool repair for that keyspace), I see that it exchanges Merkle trees with all three nodes.
My question is: why would it involve all 3 nodes, and not only 2?

When triggering a token redistribution, all nodes must be contacted. This is necessary because all nodes must share an even amount of token range responsibility. To accomplish that, the token range responsibility will change per node, especially with such a small cluster.
Case in point: you had a 3-node cluster with an RF of 1. That means each node was responsible for 33.33% of the total token range (-2^63 to 2^63-1).
By increasing the RF from 1 to 2 while keeping the number of nodes constant, you are effectively doubling the amount of data your cluster will store. Therefore, each node is now responsible for 66.67% of the data.
If you were to further increase your RF to 3, then each node would be responsible for 100% of your data, effectively storing all of your data on each node.

The reason it's talking to all of the other nodes is that some rows will go from node 1 to node 2, others from node 1 to node 3, some from node 2 to node 1, and so on. Each row potentially gets redistributed, and the set of possible sources and destinations includes every node in the data center. Each row's placement is recalculated to determine where it belongs. Does that make sense?
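For reference, a minimal sketch of the change described in the question (the keyspace name my_ks is a placeholder; substitute your own):

    -- in cqlsh: bump the replication factor from 1 to 2 (single DC, SimpleStrategy)
    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

    # then, on each node, stream the newly required replicas into place
    nodetool repair my_ks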

Add nodes to existing Cassandra cluster

We currently have a 2 node Cassandra cluster. We want to add 4 more nodes to the cluster, using the rack feature. The future topology will be:
node-01 (Rack1)
node-02 (Rack1)
node-03 (Rack2)
node-04 (Rack2)
node-05 (Rack3)
node-06 (Rack3)
We want to use different racks, but the same DC.
But for now we use SimpleStrategy and a replication factor of 1 for all keyspaces. My plan to move from a 2-node to a 6-node cluster is shown below:
Change the endpoint_snitch to GossipingPropertyFileSnitch.
Alter the keyspaces to NetworkTopologyStrategy with replication 'datacenter1': '3'.
According to the docs, when we add a new DC to an existing cluster we must alter the system keyspaces too. But in our case we are changing only the snitch and keyspace strategy, not the datacenter. Should I change the system keyspaces' strategy and replication factor as well when adding more nodes and changing the snitch?
First, I would change the endpoint_snitch to GossipingPropertyFileSnitch on one node and restart it. You need to make sure that approach works first. Typically, you cannot (easily) change the logical datacenter or rack names on a running cluster. Technically you're not doing that here, but SimpleStrategy may do some things under the hood to abstract away datacenter/rack awareness, so it's a good idea to test it.
If it works, make the change and restart the other node, as well. If it doesn't work, you may need to add 6 new nodes (instead of 4) and decommission the existing 2 nodes.
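A sketch of what that change looks like on each node (the datacenter/rack names below are examples; use the names you intend to keep, since they are hard to change later):

    # cassandra.yaml
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties (e.g. on node-01)
    dc=datacenter1
    rack=Rack1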
Or should I change the system keyspaces strategy and replication factor too?
Yes, you should set the same keyspace replication definition on the following keyspaces: system_auth, system_traces, and system_distributed.
Consider this situation: if one of your 2 nodes crashes, you won't be able to log in as any user whose authentication data is stored only on that node in the system_auth keyspace. So it is very important to ensure that system_auth is replicated appropriately.
I wrote a post on this some time ago (updated in 2018): Replication Factor to use for system_auth
Also, I recommend the same approach on system_traces and system_distributed, as future node adds/replacements/repairs may fail if valid token ranges for those keyspaces cannot be located. Basically, using the same approach on them prevents potential problems in the future.
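A minimal sketch of those ALTER statements, assuming your DC ends up named datacenter1 (it must match whatever name the snitch reports) and an application keyspace hypothetically called my_ks:

    ALTER KEYSPACE system_auth
      WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};
    ALTER KEYSPACE system_traces
      WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};
    ALTER KEYSPACE system_distributed
      WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};
    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};

Run a repair on each altered keyspace afterwards so the additional replicas are actually populated.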
Edit 20200527:
Do I need to run nodetool cleanup on the old cluster nodes after the snitch and keyspace topology changes? According to the docs, "yes," but only on the old nodes?
You will need to run it on every node except the very last one added. The last node is the only node guaranteed to have only data that matches its token range assignments.
"Why?" you may ask. Consider the total percentage ownership as the cluster incrementally grows from 2 nodes to 6. If you bump the RF from 1 to 2 (run a repair), and then from 2 to 3 and add the first node, you have a cluster with 3 nodes and 3 replicas. Each node then has 100% data ownership.
That ownership percentage gradually decreases as each node is added, down to 50% when the 6th and final node is added. But even though all nodes will have ownership of 50% of the token ranges:
The first 3 nodes will still actually have 100% of the data set, an extra 50% beyond what they should own.
The fourth node will still have an extra 25% (3/4 minus 1/2 or 50%).
The fifth node will still have an extra 10% (3/5 minus 1/2).
Therefore, the sixth and final node is the only one which will not contain any more data than it is responsible for.
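In practice that just means running the following on each of the older nodes once the final node has finished joining (the keyspace argument is optional; my_ks is a placeholder):

    # removes data this node no longer owns
    nodetool cleanup
    # or limit it to a single keyspace
    nodetool cleanup my_ks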

Cassandra vnodes replicas

Setting up the context:
Cassandra currently implements vnodes: 256 per node by default, which is tunable via num_tokens in the cassandra.yaml file.
Vnodes, as I understand them, are token ranges/hash ranges, e.g. (x...y], where y is the token number of the vnode. Each physical node in Cassandra is assigned 256 random tokens, and each of those tokens is the boundary value of a hash/token range. The assigned tokens are within the range -2^63 to 2^63-1 (the range of hash values which the Murmur3 partitioner may generate). So far so good.
Question:
1. Is a token range (vnode) a fixed range? Once set, is this token range copied to other Cassandra nodes to satisfy the replication factor, as if a token range (vnode) were a fundamental chunk of data (tokens) that moves around together? Only when a new node bootstraps into the cluster might this token range (vnode) break apart and be assigned to another node.
Building on that proposition (assuming it is true):
Then a vnode must only contain tokens which belong to a given keyspace.
Because each keyspace (a container of column families/tables) has a defined replication strategy and replication factor, and it is highly likely that the replication factors of the keyspaces in a Cassandra cluster will vary.
Consider an example: the "system_schema" keyspace has an RF of 1, whereas I created a keyspace "test_ks" with RF 3. Say a row of the system_schema keyspace has token number 2 and a row of my test_ks has token number 5.
These 2 tokens can't be placed in the same token range (vnode). If a vnode is a contiguous chunk of token ranges, say tokens 2 and 5 both belong to the vnode with token number 10; then vnode 10 has to be placed on 3 different physical nodes to satisfy RF=3 for test_ks, but we would be unnecessarily placing token 2 on 3 different nodes even though its RF is supposed to be 1.
Is the proposition correct that a vnode is dedicated to only a given keyspace?
Which boils down to: out of the 256 tokens on a physical node, 20 vnodes (say) currently belong to the "system" keyspace and 80 vnodes (say) belong to test_ks.
Again building on the above proposition, this means that each node should have information about which vnodes in the cluster currently belong to which keyspace.
That way, when a new write comes in for a keyspace, the coordinator node would locate all vnodes in the cluster for that keyspace and assign the new row a token number which falls within the token ranges of that keyspace. That being the case, can I find out how many vnodes currently belong to a keyspace in the entire cluster, or on a given node?
Please do correct me if I'm wrong.
I have been following the below blogs and videos to get an understanding of this concept:
https://www.scribd.com/document/253239514/Virtual-Nodes-Strategies-for-Apache-Cassandra
https://www.youtube.com/watch?v=GddZ3pXiDys&t=11s
Thanks in advance
There is no fixed token range; the tokens are just generated randomly. This is one of the reasons that vnodes were implemented: the idea being that with more tokens, the resulting token ranges are more likely to be evenly distributed across nodes.
Token generation was recently improved in 3.0, allowing Cassandra to place new tokens a little more intelligently (see CASSANDRA-7032). You can also manually configure tokens (see initial_token), although it can become tricky to keep things balanced when it comes time to expand the cluster unless you plan on doubling the number of nodes.
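As a sketch, the relevant cassandra.yaml settings look something like this (3.0+; the keyspace name is a placeholder and must already exist with its final replication settings before new nodes bootstrap):

    # cassandra.yaml (3.0+)
    num_tokens: 256
    # optional: have the CASSANDRA-7032 allocator choose tokens that balance
    # ownership for the replication settings of one keyspace (placeholder name)
    allocate_tokens_for_keyspace: test_ks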
The total number of tokens in a cluster is the number of nodes in the cluster multiplied by the number of vnodes per node.
In regards to placement of replicas, the first copy of a partition is placed in the node that owns that partition's token. The additional n copies are placed sequentially on the next n nodes in the ring that are in the same data centre. There is no relationship between tokens and keyspaces.
When a new write comes into a coordinator node, the coordinator node determines which node owns the partition by hashing the partition key. Note that for better performance this can actually be done by the driver instead if you use TokenAwarePolicy. The coordinator sends the write to the node that owns the partition, and if the data needs to be replicated the coordinator node also writes the replicas to the next two nodes sequentially in the token-space.
For example, suppose that we have 3 nodes which each have one token: node1: 10, node2: 20 & node3: 30. If we write a record whose partition key hashes to 22, to a keyspace with RF3, then the first copy goes to node2, the second goes to node3 and the third goes to node1. Note that each replica is equally valid - there is nothing special about the "first" replica other than that it happens to be stored on the "first" replica node.
Vnodes do not change this process, they just split up each node's token ranges by allowing each node to have more than one token. For example, if our cluster now has 2 vnodes for each node, it might instead look like this: node1: 10, 25, node2: 20, 3 & node3: 30, 21. Now our write that hashed to 22 goes to node3 (because it owns the range from 21-24), and the copies go to node1 and node2.
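If you want to see how tokens and replicas are actually laid out (and confirm that tokens are not tied to any keyspace), you can inspect the ring with standard nodetool commands; for example (test_ks is the keyspace from the question, and the exact output format varies by version):

    # every token in the cluster and the node that owns it
    nodetool ring
    # token ranges and their replica endpoints for one keyspace's replication settings
    nodetool describering test_ks
    # per-node token counts and effective ownership
    nodetool status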

Cassandra cluster works with 2 nodes?

I have 2 nodes with a replication factor of 1; it means I will have 1 copy of the data set on each node.
Based on the above description, when I use Murmur3Partitioner:
Will data be shared among the nodes? Like 50% of the data on node 1 and 50% on node 2?
When I send a read request to node 1, will it internally connect with node 2 for consistency?
My intention is to have a replica so that both nodes can serve requests independently, without inter-node communication.
First of all, please try to ask only one question per post.
I have 2 nodes with a replication factor of 1; it means I will have 1 copy of the data set on each node.
Incorrect. An RF of 1 indicates that your entire cluster will have 1 copy of the data.
Will data be shared among the nodes? Like 50% of the data on node 1 and 50% on node 2?
That is what it will try to do. Do note that it probably won't be exact. It'll probably be something like 49/51-ish.
When I send a read request to node 1, will it internally connect with node 2 for consistency?
With an RF of 1, no, it will not. Based on the hashed token value of your partition key, the request will be directed only to the node which contains the data.
As an example, with an RF of 2 on 2 nodes, it would depend on the consistency level set for your operation. Reading at ONE will always read only one replica. Reading at QUORUM will always read from 2 replicas with 2 nodes (after all, a QUORUM of 2 is 2). Reading at ALL will require a response from all replicas, and will initiate a read repair if they do not agree.
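For example, in cqlsh you can set the consistency level per session and observe the difference (the keyspace and table names below are placeholders):

    -- serve the read from the single replica that owns the partition
    CONSISTENCY ONE;
    SELECT * FROM my_ks.my_table WHERE id = 1;
    -- with RF=2 and 2 nodes, quorum = 2, so both replicas must respond
    CONSISTENCY QUORUM;
    SELECT * FROM my_ks.my_table WHERE id = 1;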
Important to note: you cannot force your driver to connect to a specific Cassandra node. You may provide one endpoint, but it will find the other via gossip and use it as it needs to.

Significance of Vnodes in Cassandra

From the url: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2, they say:
"If instead we have randomized vnodes spread throughout the entire cluster, we still need to transfer the same amount of data, but now it’s in a greater number of much smaller ranges distributed on all machines in the cluster. This allows us to rebuild the node faster than our single token per node scheme."
It seems the above sentence conveys that when we replace a dead node with a new node with the same num_tokens (say num_tokens: 4), the replacement node contains the same token values as the dead node had before releasing those token values.
But vnodes generate random token values for every node, so how is it possible to replace a node with the same vnode token values?
The URL's explanation of replacing a dead node with a new node using vnodes seems confusing. It would be nice if someone could clarify how vnodes are used to replace a dead node with the exact same token ranges.
Thanks in advance.
First, the vnode parameter num_tokens should be set to a small number, the current recommendation from DataStax is eight (8). The original default was 256, which experience found to be too high.
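In cassandra.yaml that is simply (a sketch; it must be set before a node first bootstraps, as it cannot simply be changed on a node that has already joined):

    # cassandra.yaml
    num_tokens: 8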
With traditional token ranges, you only have as many ranges as nodes. But using vnodes, the number of token ranges is virtualized and much larger. You cannot mix vnodes and single-token ranges in the same data center (ring).
Node Failure With Token Ranges:
In the DataStax example of a 6-node cluster using single token ranges, the data for ranges C, D, and E resides on just three nodes:
Range C is owned by node 3 and replicated on nodes 4 and 5
Range D is owned by node 4 and replicated on nodes 5 and 6
Range E is owned by node 5 and replicated on nodes 6 and 1
In this example, when node 5 fails, ranges C, D, and E must be rebuilt by streaming from only three of the remaining five nodes: 1, 3, and 4. Node 2 doesn't have any of node 5's data, and node 6 only holds data that is already being streamed by nodes 1 and 4. Thus nodes 2 and 6 are idle during the rebuild.
Node Failure With Vnodes:
However, when using vnodes the token ranges are split up into smaller ranges and randomized across the entire cluster of 6 nodes. With smaller ranges, a portion of node 5's data is replicated to every one of the other nodes.
When rebuilding node 5, data can now be streamed from all 5 of the available nodes in the cluster.
The primary advantages of vnodes are:
manually rebalancing a cluster is no longer required when adding or removing nodes
rebuilding can stream data from all available nodes

Cassandra token for three replicas

I'm trying to build two 3-node Cassandra clusters in separate data centers. I want to have NetworkTopologyStrategy replication between them, with a replication factor of 3 in each. Thus, I want each node in each data center to hold the same records.
Question: what should my token assignment look like for each node (since I'm not actually partitioning, just replicating)?
Thank you!
If you're using Cassandra 1.2, use virtual nodes with automatic token assignment.
If you're using 1.1 or earlier, use evenly distributed tokens for one DC:
0
56713727820156410577229101238628035242
113427455640312821154458202477256070484
(0, 1, and 2 times 2^127/3)
For the other DC, you can choose anything as long as it is also evenly distributed. Offsetting by 1 works:
1
56713727820156410577229101238628035243
113427455640312821154458202477256070485
Although for now the tokens don't matter since all nodes hold the same data, if you want to scale in the future it will help to have them already balanced.
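On 1.1 and earlier, those values go into each node's cassandra.yaml via initial_token, one value per node; a sketch for the first DC using the tokens listed above:

    # node 1
    initial_token: 0
    # node 2
    initial_token: 56713727820156410577229101238628035242
    # node 3
    initial_token: 113427455640312821154458202477256070484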
