How Cassandra token rings work - cassandra

I am in doubt about how cassandra partition data accross the cluster.
Accorind to nodetool status my cassandra have 2048 tokens in 2 datacenters. I assume that it divides the info in 1024 identical token ranges in each DC right?
Anyway, those 1024 tokens is for all CFs or each CF has 1024 token ranges?

There are 16 nodes in your cluster
128 token ranges * 16 nodes nodes = 2048 token ranges
Each datacenter has their own ring of tokens depending on the number of nodes in each datacenter. For example, a datacenter with 10 nodes would have their token ring divided into 128*10 segments.
Each column family will use all of these ranges to distribute data within the datacenter.
For example, a datacenter with 3 token ranges would mean that each individual write will be placed either on token range 1,2 or 3. Another column family would follow this same standard.

Related

Cassandra Token Distribution

I am going through cassandra tutorials and come across this picture that represents multinode cassandra cluster -
Isnt total number of tokens ( in the above 256 ) should be distributed across all three nodes around 85 tokens each ?
No, the num_tokens parameter specifies how many tokens ranges each node will handle. From cassandra.yaml description:
This defines the number of tokens randomly assigned to this node on the ring. The more tokens, relative to other nodes, the larger the proportion of data that this node will store. You probably want all nodes to have the same number of tokens assuming they have equal hardware capability.
Otherwise, what would happen if you have cluster with more than 256 nodes? ;-)

Avoid Cassandra full table scan cross DC

I have Cassandra cluster nodes distributed across 2 data centers. 6 nodes in each data center, a total of 12 nodes
My keyspace definition:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'} AND durable_writes = true;
In each node, I have 64 tokens.
I am using Cassandra driver to connect to the cluster and using default load balancing policy DCAwareRoundRobinPolicy and using only dc1 nodes as contact points. So, I assume dc2 nodes will have HostDistance IGNORED and my application won't connect to them.
Note: For all my reads and writes I use the same configuration
My use case is to do a full table scan. But, I can not use Spark. So, instead, I an achieving this by getting all token range using metadata.getTokenRanges() and querying these token ranges in multiple threads.
Everything works fine. But, metadata.getTokenRanges() return 768 tokens(64*12). Which means it's giving me token range across all 12 nodes.
Since I have to run through all token ranges. Even with multiple threads, the process is very slow.
Is there any way I can get token ranges of only one data center. I even tried to get token ranges using metadata.getTokenRanges("my_keyspace", host from dc1).
I do get less number of tokens(517), but when I use this list, I get fewer data.
How can I get token ranges of only 1 DC?
Edit: I checked read/write latency in both the clusters. I do not see any operations being performed on dc2, whereas I can see a clear spike on my dc1 data center.
This is even more puzzling for me now. If dc2 is never queried how I am getting 64*12 +1 token ranges? and why not 64*6+1?
Your replication_factor is 3+3=6. So you may have 6 times the actual data. 3 copies in dc1 and 3 copies in dc2.
You have 64 vnodes per node so 64*12 =768 vnodes.
So, if you want to do a complete table scan then you might have to query all the token ranges i.e 768. What you are missing is that, because of the replication all that token ranges' data will reside within dc1. So you can get all the data from dc1 itself.
If you are using DCAwareRoundRobinPolicy and set .withLocalDc() with dc1 and consistency level LOCAL_* then you are reading only from dc1. dc1 will have all the data because the replication_factor of dc1 is 3.

Cassandra vnodes replicas

Setting up the context:
Cassandra currently implements vnodes. 256 by default which is tweakable in the cassandra.yaml file
Vnodes as I understand are token-ranges/hash-ranges. Eg. (x...y], where y is the token number of the vnode. Each physical node in Cassandra is assigned random 256 tokens, and each of those tokens are the boundary value of a hash/token range. The tokens assigned are within the range of 2^-63 to 2^63-1 (the range of hash numbers which murmur3 has partitioner may generate). So far so good.
Question:
1. Is it that a token range(vnode) is a fixed range. Once set, this token range will be copied to other Cassandra nodes to satisfy the replication factor like a token range(vnode) being a fundamental chunk of data(tokens) which goes around together. Only in case of bootstrap of a new node in the cluster, this token range(vnode) might break apart to be assigned to other node.
Riding on the last proposition, (say the last proposition is true).
Then a vnode must only contain tokens which belong a given keyspace.
Because each keyspace(container of column family/tables) has a defined replication strategy and replication factor. And it is highly likely that replication factor of keyspaces in a Cassandra cluster will vary.
Consider an example. "system_schema" keyspace has a RF of 1 whereas I created a keyspace "test_ks" with RF 3. If a row of system_schema keyspace has a token number 2(say) and a row of my test_ks has token number 5(say).
these 2 tokens can't be placed in the same token range(vnode). If a vnode is consistent chunk of token ranges, say token 2 and 5 belong to vnode with token number 10. so vnode 10 has to be placed on 3 different physical nodes to satisfy the RF =3 for test_ks, but we are unnecessary placing token 2 on 3 different nodes whose RF is supposed to be 1.
Is this proposition correct that, a vnode is only dedicated to a given keyspace?
which boils down to out of 256 tokens on a physical node... 20(say) vnodes currently belong to "system" keyspace, 80 vnodes(say) belong to test_ks.
Again riding on the above proposition, this means that each node should have the info of keyspace-wise vnodes currently available in the cluster.
That way when a new write comes in for a Keyspace the co-ordinator node would locate all vnodes in the cluster for that keyspace and assign the new row a token number which falls within the token range of those keyspaces. That being the case can I know how many vnodes currently belong to a keyspace in the entire cluster/ or on a given node.
Please do correct me if I'm wrong.
I have been following the below blogs and videos to get an understanding of this concept:
https://www.scribd.com/document/253239514/Virtual-Nodes-Strategies-for-Apache-Cassandra
https://www.youtube.com/watch?v=GddZ3pXiDys&t=11s
Thanks in advance
There is no fixed token-range, the tokens are just generated randomly. This is one of the reasons that vnodes were implemented - the idea being that if there are more tokens it is more likely that the resulting token-ranges will be more evenly distributed across nodes.
Token generation was recently improved in 3.0, allowing Cassandra to place new tokens a little more intelligently (see CASSANDRA-7032). You can also manually configure tokens (see initial_token), although it can become tricky to keep things balanced when it comes time to expand the cluster unless you plan on doubling the number of nodes.
The total number of tokens in a cluster is the number of nodes in the cluster multiplied by the number of vnodes per node.
In regards to placement of replicas, the first copy of a partition is placed in the node that owns that partition's token. The additional n copies are placed sequentially on the next n nodes in the ring that are in the same data centre. There is no relationship between tokens and keyspaces.
When a new write comes into a coordinator node, the coordinator node determines which node owns the partition by hashing the partition key. Note that for better performance this can actually be done by the driver instead if you use TokenAwarePolicy. The coordinator sends the write to the node that owns the partition, and if the data needs to be replicated the coordinator node also writes the replicas to the next two nodes sequentially in the token-space.
For example, suppose that we have 3 nodes which each have one token: node1: 10, node2: 20 & node3: 30. If we write a record whose partition key hashes to 22, to a keyspace with RF3, then the first copy goes to node2, the second goes to node3 and the third goes to node1. Note that each replica is equally valid - there is nothing special about the "first" replica other than that it happens to be stored on the "first" replica node.
Vnodes do not change this process, they just split up each node's token ranges by allowing each node to have more than one token. For example, if our cluster now has 2 vnodes for each node, it might instead look like this: node1: 10, 25, node2: 20, 3 & node3: 30, 21. Now our write that hashed to 22 goes to node3 (because it owns the range from 21-24), and the copies go to node1 and node2.

What means that Cassandra vnode is non-contiguous?

I am reading datastax tutorial
It says
Within a cluster, virtual nodes are randomly selected and non-contiguous
In this context, what does node being non-contiguous represent?
Looking at the cluster, you can see that the nodes 1, 2, 3 hold, all of them, token A.
The allocation has been done with the following algorithm:
token A to node 1
token B to node 2
token C to node 3
...
token A to node 2
token B to node 3
token C to node 4
...
token A to node 3
token B to node 4
token C to node 5
...
So, partitions are allocated consecutively on neighboring nodes on the cluster. Or you can say in a contiguous manner. So if you lose nodes 1, 2, 3 you will completely lose data in token A.
When using vnodes, the token ranges are allocated randomly within the cluster. The ranges are not allocated on a consecutive manner, either we are talking about nodes or partition range perspective. In that sense you are sure that neighboring partition ranges do not reside on neighboring nodes. This is what non-contiguous manner means.

Token Range Offset Cassandra

We are planning to add a new datacenter to our cluster. We currently have initial_token mentioned in the YAML file for each node. When we add the new datacenter, could we have the same set of token ranges mentioned for the nodes in new datacenter. What impact would it have if we did not offset the token ranges in new datacenter.
DC1 : Node-1 : 0
Node-2 : 25
Node-3 : 50
Node-4 : 75
DC2 : Node-1 : 0
Node-2 : 25
Node-3 : 50
Node-4 : 75
No two nodes can share the same token even if they are in different datacenters. You should try to offset your nodes in a different DC by some value with regards to their counterparts (maybe 100 or so) to accommodate replacing nodes. Usually when you replace a node you fire up a new one with the token of the one you are replacing+1.
This is in older C* 1.1 documentation, but the strategy is explained here:
When adding nodes to a cluster, you must avoid token collisions. You can do this by offsetting the token values, which allows room for the new nodes.
The following graphic shows an example using an offset of +100:
If you would like to consider vnodes for your cluster, here is a good article for vnodes and heterogenous hardware

Resources