Cassandra vnodes and replicas

Setting up the context:
Cassandra currently implements vnodes: 256 per node by default, which is tweakable via num_tokens in the cassandra.yaml file.
As I understand it, vnodes are token ranges/hash ranges, e.g. (x...y], where y is the token number of the vnode. Each physical node in Cassandra is assigned 256 random tokens, and each of those tokens is the upper boundary of a token/hash range. The tokens assigned are within the range -2^63 to 2^63-1 (the range of hash values the Murmur3Partitioner may generate). So far so good.
Question:
1. Is a token range (vnode) a fixed range? Once set, is this token range copied to other Cassandra nodes to satisfy the replication factor, i.e. is a token range (vnode) a fundamental chunk of data (tokens) that moves around together, and only when a new node bootstraps into the cluster might that token range (vnode) be split apart and assigned to other nodes?
Building on that proposition (say the last proposition is true):
Then a vnode must only contain tokens that belong to a given keyspace.
Because each keyspace (a container of column families/tables) has a defined replication strategy and replication factor, and it is highly likely that the replication factors of keyspaces in a Cassandra cluster will vary.
Consider an example: the "system_schema" keyspace has an RF of 1, whereas I created a keyspace "test_ks" with RF 3. Say a row of the system_schema keyspace has token number 2 and a row of my test_ks has token number 5.
These two tokens can't be placed in the same token range (vnode). If a vnode is a consistent chunk of token ranges and, say, tokens 2 and 5 both belong to the vnode with token number 10, then vnode 10 has to be placed on 3 different physical nodes to satisfy RF = 3 for test_ks, but we would then be unnecessarily placing token 2 on 3 different nodes even though its RF is supposed to be 1.
Is the proposition correct that a vnode is dedicated to a single keyspace?
Which boils down to: out of 256 tokens on a physical node, 20 vnodes (say) currently belong to the "system" keyspace and 80 vnodes (say) belong to test_ks.
Again building on the above proposition, this means each node should keep track of which vnodes in the cluster currently belong to which keyspace.
That way, when a new write comes in for a keyspace, the coordinator node would locate all vnodes in the cluster for that keyspace and assign the new row a token number that falls within one of those token ranges. If that is the case, can I find out how many vnodes currently belong to a keyspace in the entire cluster, or on a given node?
Please do correct me if I'm wrong.
I have been following the below blogs and videos to get an understanding of this concept:
https://www.scribd.com/document/253239514/Virtual-Nodes-Strategies-for-Apache-Cassandra
https://www.youtube.com/watch?v=GddZ3pXiDys&t=11s
Thanks in advance

There is no fixed token range; the tokens are just generated randomly. This is one of the reasons vnodes were implemented: the idea is that with more tokens, the resulting token ranges are more likely to be evenly distributed across nodes.
Token generation was recently improved in 3.0, allowing Cassandra to place new tokens a little more intelligently (see CASSANDRA-7032). You can also manually configure tokens (see initial_token), although it can become tricky to keep things balanced when it comes time to expand the cluster unless you plan on doubling the number of nodes.
The total number of tokens in a cluster is the number of nodes in the cluster multiplied by the number of vnodes per node.
In regards to placement of replicas, the first copy of a partition is placed in the node that owns that partition's token. The additional n copies are placed sequentially on the next n nodes in the ring that are in the same data centre. There is no relationship between tokens and keyspaces.
When a new write comes into a coordinator node, the coordinator node determines which node owns the partition by hashing the partition key. Note that for better performance this can actually be done by the driver instead if you use TokenAwarePolicy. The coordinator sends the write to the node that owns the partition, and if the data needs to be replicated the coordinator node also writes the replicas to the next two nodes sequentially in the token-space.
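As a side note on TokenAwarePolicy, here is a minimal sketch using the DataStax Python driver (the contact point is a placeholder, and depending on your driver version this may be configured through an execution profile instead):

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# With token awareness the driver hashes the partition key itself and sends the
# request to a node that is already a replica, so no extra coordinator hop is needed.
cluster = Cluster(
    ["127.0.0.1"],  # placeholder contact point
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect()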
For example, suppose that we have 3 nodes which each have one token: node1: 10, node2: 20 & node3: 30. Each node owns the range from the previous token (exclusive) up to its own token (inclusive). If we write a record whose partition key hashes to 22, to a keyspace with RF 3, then the first copy goes to node3 (which owns the range (20, 30]), the second goes to node1 and the third goes to node2. Note that each replica is equally valid - there is nothing special about the "first" replica other than that it happens to be stored on the "first" replica node.
Vnodes do not change this process, they just split up each node's token ranges by allowing each node to have more than one token. For example, if our cluster now has 2 vnodes for each node, it might instead look like this: node1: 10, 25; node2: 20, 3; node3: 30, 21. Now our write that hashed to 22 falls in the range (21, 25] and goes to node1 (which owns token 25), and the copies go to node3 and node2.
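To make the ring walk concrete, here is a toy Python sketch of SimpleStrategy-style placement for the vnode example above (an illustration, not Cassandra's actual code):

from bisect import bisect_left

RF = 3
# (token, node) pairs for the 2-vnodes-per-node example above
ring = sorted([(10, "node1"), (25, "node1"),
               (20, "node2"), (3, "node2"),
               (30, "node3"), (21, "node3")])

def replicas(partition_hash, ring, rf=RF):
    tokens = [t for t, _ in ring]
    # Owner of the range (previous token, token]: first token >= the hash, wrapping around.
    idx = bisect_left(tokens, partition_hash) % len(ring)
    result = []
    while len(result) < rf:
        node = ring[idx % len(ring)][1]
        if node not in result:   # skip further vnodes of a node already chosen
            result.append(node)
        idx += 1
    return result

print(replicas(22, ring))  # -> ['node1', 'node3', 'node2'], matching the example above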

Related

Cassandra Token Distribution

I am going through Cassandra tutorials and came across this picture that represents a multi-node Cassandra cluster -
Shouldn't the total number of tokens (256 in the picture above) be distributed across all three nodes, at around 85 tokens each?
No, the num_tokens parameter specifies how many token ranges each node will handle. From the cassandra.yaml description:
This defines the number of tokens randomly assigned to this node on the ring. The more tokens, relative to other nodes, the larger the proportion of data that this node will store. You probably want all nodes to have the same number of tokens assuming they have equal hardware capability.
Otherwise, what would happen if you had a cluster with more than 256 nodes? ;-)
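A toy way to see what num_tokens controls (node names and counts here are made up for illustration):

# Each node's share of the data is roughly proportional to its num_tokens setting,
# so with equal settings the three nodes each own about a third of the ring.
num_tokens = {"node1": 256, "node2": 256, "node3": 256}
total = sum(num_tokens.values())
for node, n in num_tokens.items():
    print(f"{node}: {n} tokens -> ~{n / total:.1%} of the data")
# Doubling one node's num_tokens would (roughly) double its share.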

The replication factor in Cassandra and changing it later

We have a 3-node Cassandra cluster, and we created a keyspace with a replication factor of 1. After we altered the keyspace's replication factor to 2 (and performed a nodetool repair for that keyspace), I see it shares a Merkle tree with all three nodes.
My question is: why would it share with 3 nodes, and not only with 2 nodes?
When a change like this triggers data redistribution, all nodes must be contacted. This is necessary because responsibility for token ranges has to be rebalanced across nodes, especially with such a small cluster.
Case in point: you had a 3-node cluster with an RF of 1. That means each node was responsible for 33.33% of the total token range (-2^63 to 2^63-1).
By increasing the RF from 1 to 2, while keeping the number of nodes constant, you effectively double the amount of data your cluster will store. Therefore, each node is now responsible for 66.67% of the data.
If you were to further increase your RF to 3, then each node would be responsible for 100% of your data, effectively storing all of your data on each node.
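The arithmetic in this answer can be sketched in a few lines (a simplification that ignores imbalance between nodes):

def per_node_share(rf: int, nodes: int) -> float:
    # Each node holds roughly RF / N of the full data set, capped at 100%.
    return min(rf / nodes, 1.0)

for rf in (1, 2, 3):
    print(f"RF={rf}, 3 nodes -> each node holds ~{per_node_share(rf, 3):.2%}")
# RF=1 -> ~33.33%, RF=2 -> ~66.67%, RF=3 -> 100.00%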
The reason it's talking to all of the other nodes is that some rows will go from node 1 to node 2, others from node 1 to node 3, some from node 2 to node 1, and so on. Each row potentially gets redistributed, and the ranges involved span all of the nodes in the data center. Each row gets recalculated to determine where it belongs. Does that make sense?

The replication factor in Cassandra when creating a keyspace

When creating a new keyspace in Cassandra, we need to give a number for the replication factor.
Ex:
Does the number that we give as the replication factor determine the number of nodes that are initially created to store the replicated data?
Can anybody give a clear explanation of what the replication factor does?
It will not create the number of nodes specified. It is simply the number of copies of the data. For instance, if your cluster has 5 nodes and the RF is 3, a write will be replicated (written) to 3 different nodes, depending on the token range it falls into. As for SimpleStrategy, it is an implementation that does not take racks or DCs into consideration when replicating.
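For illustration, a minimal sketch of creating such a keyspace from the DataStax Python driver (keyspace name, contact point and RF are placeholders):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

# RF is a property of the keyspace, not a node count: with 5 nodes and RF 3,
# each partition is stored on 3 of the 5 nodes.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS test_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")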
The explanation @Praneeth Gudumasu gave for replication_factor is correct. The number of nodes in a Cassandra cluster is not something you "give"; you can actually connect as many nodes as you wish: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddNodeToCluster.html
Each time you connect a new node, it is assigned a token range as per Cassandra's architecture. If you don't know how many nodes you need for your application, I suggest running a performance test with a data size approaching what you would insert in your real application, then executing some queries (concurrently) and seeing how many nodes give you a reasonable response time for your queries.

Significance of Vnodes in Cassandra

From the url: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2, they say:
"If instead we have randomized vnodes spread throughout the entire cluster, we still need to transfer the same amount of data, but now it’s in a greater number of much smaller ranges distributed on all machines in the cluster. This allows us to rebuild the node faster than our single token per node scheme."
It seems the above sentence conveys that when we replace a dead node with a new node with the same num_tokens (say num_tokens: 4), the replacement node holds the same token values that the dead node had before releasing them.
But vnodes generate random token values for every node, so how is it possible to replace a node with the same vnode token values?
The URL seems confusing in explaining how a dead node is replaced with a new node under vnodes. It would be nice if someone could clarify how vnodes are used when replacing a dead node with the exact same token ranges.
Thanks in advance.
First, the vnode parameter num_tokens should be set to a small number; the current recommendation from DataStax is eight (8). The original default was 256, which experience found to be too high.
With traditional token assignment, you only have as many ranges as nodes. But using vnodes, the number of token ranges is virtualized and much larger. You cannot mix vnode and single-token nodes in the same data center (ring).
Node Failure With Token Ranges:
In this DataStax example above with token ranges, data for ranges C, D and E reside on just three nodes:
Range C is owned by node 3 and replicated on nodes 4 and 5
Range D is owned by node 4 and replicated on nodes 5 and 6
Range E is owned by node 5 and replicated on nodes 6 and 1
In this example, when node 5 fails, ranges C, D, and E are reloaded and streamed from only three of the remaining five nodes: 1, 3 and 4. Node 2 doesn't have any of node 5's data, and node 6 has the same data that is being streamed by node 1. Thus nodes 2 and 6 are idle during the rebuild.
Node Failure With Vnodes:
However, when using vnodes the token ranges are split up into smaller ranges and randomized across the entire cluster of 6 nodes. With smaller ranges, a portion of node 5's data is replicated to every one of the other nodes.
When rebuilding node 5, data can now be streamed from all 5 of the available nodes in the cluster.
The primary advantages of vnodes are:
rebalancing a cluster manually is no longer required when adding or removing nodes
rebuilding can stream data from all available nodes
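To make the second point concrete, here is a toy simulation (assumptions: 6 nodes, RF 3, SimpleStrategy-style placement, random tokens; the exact counts are illustrative, not taken from the DataStax post):

import random

RF = 3
NODES = [f"node{i}" for i in range(1, 7)]

def build_ring(tokens_per_node, seed=42):
    # Assign each node `tokens_per_node` random tokens on the Murmur3 range.
    random.seed(seed)
    return sorted((random.randint(-2**63, 2**63 - 1), node)
                  for node in NODES for _ in range(tokens_per_node))

def replicas_at(ring, idx, rf=RF):
    # Owner of the range ending at ring[idx], plus the next rf-1 distinct nodes clockwise.
    out = []
    while len(out) < rf:
        node = ring[idx % len(ring)][1]
        if node not in out:
            out.append(node)
        idx += 1
    return out

for tokens in (1, 16):
    ring = build_ring(tokens)
    ranges = [i for i in range(len(ring)) if "node5" in replicas_at(ring, i)]
    peers = {n for i in ranges for n in replicas_at(ring, i) if n != "node5"}
    print(f"{tokens:>2} token(s)/node: node5 must restore {len(ranges)} range(s), "
          f"copies held by {len(peers)} surviving peer(s)")

With a single token per node there are only a handful of large ranges to stream, so only a few peers do the work; with vnodes the same data is split into many small ranges held by every surviving node, so all of them can stream in parallel.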

How does cassandra find the node that contains the data?

I've read quite a few articles and a lot of question/answers on SO about Cassandra but I still can't figure out how Cassandra decides which node(s) to go to when it's reading the data.
First, some assumptions about an imaginary cluster:
Replication Strategy = simple
Using Random Partitioner
Cluster of 10 nodes
Replication Factor of 5
Here's my understanding of how writes work based on various Datastax articles and other blog posts I've read:
Client sends the data to a random node
The "random" node is decided based on the MD5 hash of the primary key.
Data is written to the commit_log and memtable and then propagated 4 times (with RF = 5).
The 4 next nodes in the ring are then selected and data is persisted in them.
So far, so good.
Now the question is, when the client sends a read request (say with CL = 3) to the cluster, how does Cassandra know which nodes (5 out of 10 as the worst case scenario) it needs to contact to get this data? Surely it's not going to all 10 nodes as that would be inefficient.
Am I correct in assuming that Cassandra will again, do an MD5 hash of the primary key (of the request) and choose the node according to that and then walks the ring?
Also, how does the network topology case work? If I have multiple data centers, how does Cassandra know which nodes in each DC/rack contain the data? From what I understand, only the first node is obvious (since the hash of the primary key maps explicitly to that node).
Sorry if the question is not very clear and please add a comment if you need more details about my question.
Many thanks,
Client sends the data to a random node
It might seem that way, but there is actually a non-random way that your driver picks a node to talk to. This node is called a "coordinator node" and is typically chosen based on having the least (closest) "network distance." Client requests can really be sent to any node, and at first they will be sent to the nodes which your driver knows about. But once it connects and understands the topology of your cluster, it may change to a "closer" coordinator.
The nodes in your cluster exchange topology information with each other using the Gossip Protocol. The gossiper runs every second, and ensures that all nodes are kept current with data from whichever Snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to.
In this way, the coordinator node also has data about which nodes are responsible for each token range. You can see this information by running a nodetool ring from the command line. Although if you are using vnodes, that will be trickier to ascertain, as data on all 256 (default) virtual nodes will quickly flash by on the screen.
So let's say that I have a table that I'm using to keep track of ship crew members by their first name, and let's assume that I want to look-up Malcolm Reynolds. Running this query:
SELECT token(firstname),firstname, id, lastname
FROM usersbyfirstname WHERE firstname='Mal';
...returns this row:
token(firstname) | firstname | id | lastname
----------------------+-----------+----+-----------
4016264465811926804 | Mal | 2 | Reynolds
By running a nodetool ring I can see which node is responsible for this token:
192.168.1.22 rack1 Up Normal 348.31 KB 3976595151390728557
192.168.1.22 rack1 Up Normal 348.31 KB 4142666302960897745
Or even easier, I can use nodetool getendpoints to see this data:
$ nodetool getendpoints stackoverflow usersbyfirstname Mal
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
192.168.1.22
For more information, check out some of the items linked above, or try running nodetool gossipinfo.
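The same lookup that nodetool getendpoints performs can also be done from the DataStax Python driver through its cluster metadata. A sketch, assuming the driver's Metadata.get_replicas helper and that a single text partition key serializes to its UTF-8 bytes:

from cassandra.cluster import Cluster

cluster = Cluster(["192.168.1.22"])  # contact point from the example; adjust as needed
session = cluster.connect()

# For a single text partition key, the routing key is just the UTF-8 bytes of the
# value, so this mirrors: nodetool getendpoints stackoverflow usersbyfirstname Mal
for host in cluster.metadata.get_replicas("stackoverflow", "Mal".encode("utf-8")):
    print(host.address)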
Cassandra uses consistent hashing to map each partition key to a token value. Each node owns ranges of token values as its primary range, so that every possible hash value will map to one node. Extra replicas are then kept in a systematic way (such as the next node in the ring) and stored in the nodes as their secondary range.
Every node in the cluster knows the topology of the entire cluster, such as which nodes are in which data center, where they are in the ring, and which token ranges each node owns. The nodes obtain and maintain this information using the gossip protocol.
When a read request comes in, the node contacted becomes the coordinator for the read. It will calculate which nodes have replicas for the requested partition, and then pick the required number of nodes to meet the consistency level. It will then send requests to those nodes, wait for their responses, and merge the results based on the column timestamps before sending the result back to the client.
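As an illustration of the consistency-level part, here is a minimal read at CL THREE using the DataStax Python driver (the contact point is a placeholder; keyspace and table names are taken from the example earlier in this thread):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # placeholder contact point
session = cluster.connect("stackoverflow")  # keyspace from the earlier example

# The coordinator will wait for responses from 3 replicas before answering.
stmt = SimpleStatement(
    "SELECT firstname, lastname FROM usersbyfirstname WHERE firstname = %s",
    consistency_level=ConsistencyLevel.THREE,
)
print(session.execute(stmt, ["Mal"]).one())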
Cassandra will locate any data based on a partition key that is mapped to a token value by the partitioner. Tokens are part of a finite token ring value range where each part of the ring is owned by a node in the cluster. The node owning the range of a certain token is said to be the primary for that token. Replicas will be selected by the data replication strategy. Basically this works by going clockwise in the token ring, starting from the primary, and stopping depending on the number of required replicas.
What's important to realize is that each node in the cluster is able to identify the nodes responsible for a certain key based on the logic described above. Whenever a value is written to the cluster, the node accepting the request (the coordinator node) will know right away the nodes that need to execute the write.
In the case of multiple data centers, all keys map to the exact same token in every DC, as determined by the partitioner. Cassandra will write to each DC and to each DC's replicas.
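For multi-DC keyspaces, the per-DC replica counts are set with NetworkTopologyStrategy. A sketch (the contact point and the DC names "DC1" and "DC2" are placeholders and must match what your snitch reports):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

# Three replicas in "DC1" and two in "DC2"; each DC places its own replicas
# by walking its part of the ring.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS multi_dc_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2}
""")

Each data centre then satisfies its own replica count independently, which is why the write is sent to replicas in every DC.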
