Is there any significance to the value of the 'number of tokens' attribute in Cassandra's YAML file if all nodes have the same value in their respective YAML files? Is it only the relative value that makes a difference? For example, is there any difference whatsoever between the two scenarios below (assume a cluster of n nodes)?
Case 1: Number of tokens is set to 256 on each of the n nodes.
Case 2: Number of tokens is set to x, where x differs from 256, on each of the n nodes.
This value affects how evenly data is distributed between nodes: the larger the value, the more uniform the distribution. That comes at the cost of additional overhead, because Cassandra has to maintain all of these virtual nodes. The result depends on both the number of virtual nodes and the replication factor - for example, with RF=3 and 8 vnodes, the amount of data per node may vary by ~10%. The recommendation also differs between Cassandra versions - for 3.x, the recommendation is between 8 and 32. You can find more information in this document.
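To get a feel for the trade-off described above, here is a minimal simulation sketch. It assumes purely random token assignment and only looks at primary ownership of the ring (no replication), so the numbers are illustrative and not what a real cluster would report. The function name ownership_spread is made up for this example.

```python
import random

RING_SIZE = 2**64  # size of the Murmur3Partitioner token space


def ownership_spread(num_nodes, num_tokens, seed=0):
    """Return (largest share / smallest share) of the ring owned per node."""
    rng = random.Random(seed)
    # Every node picks num_tokens random positions on the ring.
    ring = sorted(
        (rng.randrange(-2**63, 2**63), node)
        for node in range(num_nodes)
        for _ in range(num_tokens)
    )
    owned = [0] * num_nodes
    for i, (token, node) in enumerate(ring):
        prev_token = ring[i - 1][0]  # i == 0 wraps around to the last token
        # A token owns the range (previous token, token].
        owned[node] += (token - prev_token) % RING_SIZE
    return max(owned) / min(owned)


for num_tokens in (1, 8, 256):
    print(num_tokens, round(ownership_spread(6, num_tokens), 2))
```

A ratio close to 1.0 means the nodes own nearly equal parts of the ring; the ratio should shrink as num_tokens grows.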
P.S. If you're using DSE, you may also tweak the allocate_tokens_for_local_replication_factor for better allocation of data.
allocate_tokens_for_local_replication_factor is specific to DSE
Apache Cassandra has the parameter allocate_tokens_for_keyspace
In the Dynamo paper, the authors introduced 3 different partitioning strategies:
It seems DynamoDB has evolved from strategy 1 to strategy 3. I have a few questions related to strategy 3:
Refer to:
Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery.
How is it managed at a low level? One node can have a few partitions assigned to it. Is each partition handled separately inside the storage engine? For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace? This seems to introduce complexity to the system and to be hard to debug if the compaction processes go wild.
It seems the partitioning granularity is fixed beforehand. Is there any way to further partitioning after the initial stage? For example, if a-c is one partition, later on prefix b is hot and becomes a noisy neighbor to prefix a and c, is there a way to isolate b to another node? How do we handle this situation in DynamoDB?
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Trying to answer each question in turn:
One node can have a few partitions assigned to it.
Each node will have 1 or more token ranges assigned during bootstrapping - depending on the partitioner, the token space is -2^63 to 2^63-1 for Murmur3Partitioner or 0 to 2^127-1 for RandomPartitioner.
Each token here can contain a partition (but might not), so while you are thinking of it as the node owning partitions, strictly speaking it is owning token ranges.
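As a rough illustration of "owning token ranges", the sketch below maps a partition key to a token and then looks up which node owns the range that token falls into. The ring contents, the node names and the toy_token hash are all assumptions for this example; in particular, toy_token is a stand-in and not Cassandra's exact token computation.

```python
import bisect
import hashlib

# Hypothetical ring: (token, node) pairs in a toy 0 .. 2**127 - 1 token space.
RING = sorted([
    (10 * 2**120, "node-a"),
    (50 * 2**120, "node-b"),
    (90 * 2**120, "node-c"),
])
TOKENS = [token for token, _ in RING]


def toy_token(partition_key: str) -> int:
    """Stand-in hash (MD5 reduced to 127 bits), not Cassandra's real function."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 2**127


def owner(token: int) -> str:
    """A node with token t owns the range (previous token, t]."""
    index = bisect.bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]  # past the last token wraps to the first


for key in ("user:1", "user:2", "user:3"):
    print(key, owner(toy_token(key)))
```

Many partition keys hash into the same range, and a range may hold no partitions at all, which is why the node is said to own ranges and not partitions.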
Is each partition handled separately inside the storage engine?
This question doesn't really follow - an SSTable can contain 1 or more partitions, and a partition can be contained in 1 or more SSTables, i.e. a partition can span SSTables.
For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace?
No, there is one memtable per database table, and memtables are flushed to create the SSTables. How the multiple SSTables are compacted is determined by the compaction strategy setting; the strategies behave quite differently, each with its own advantages and disadvantages depending on the usage scenario. One size does not fit all. Again, each SSTable can contain multiple partitions, and a partition can appear in more than 1 SSTable.
This seems to introduce complexity to the system and to be hard to debug if the compaction processes go wild.
Compaction itself is not a trivial topic, but since the initial premise is not correct, this extra complexity does not arise.
It seems the partitioning granularity is fixed beforehand. Is there any way to further partitioning after the initial stage?
Writing specifically about Cassandra - every time you add or remove a node the token ranges that belong to each node can and will alter. So it is not entirely 'static', but it is not easy to change or manipulate either.
For example, if a-c is one partition, later on prefix b is hot and becomes a noisy neighbor to prefix a and c, is there a way to isolate b to another node?
Again - specific to Cassandra, in theory yes - you calculate the hash value of the partition key, and use initial_token values on a node to give it a very narrow range. In practice no - this is a data model design issue: the data is partitioned in a way that has created a hot spot.
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Using num_tokens, which creates vnodes, in effect divides the consistent hash ring up more times, so 10 nodes with num_tokens = 16 means that the overall token range is divided into 160 slices, with each node having 16 of them as its partition ranges. Each node will of course also hold replicas of other nodes' ranges, based on replication factor and rack assignments. If you only had RF=1, then each node would only be storing data for the range(s) it is assigned.
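The slice arithmetic can be checked with a small sketch. It assumes random, distinct tokens in the Murmur3 token space; assign_tokens is a hypothetical helper written for this example, not a Cassandra API.

```python
import random
from collections import Counter


def assign_tokens(num_nodes, num_tokens, seed=42):
    """Give each node num_tokens distinct random tokens; return {token: node}."""
    rng = random.Random(seed)
    ring = {}
    for node in range(num_nodes):
        assigned = 0
        while assigned < num_tokens:
            token = rng.randrange(-2**63, 2**63)
            if token not in ring:  # tokens must be unique on the ring
                ring[token] = node
                assigned += 1
    return ring


ring = assign_tokens(num_nodes=10, num_tokens=16)
ranges_per_node = Counter(ring.values())

print(len(ring))                     # 160 slices in total
print(set(ranges_per_node.values()))  # {16}: every node holds 16 of them
```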
initial_token is the setting that controls the node's token value(s) when it is bootstrapped - you can calculate the value and set it manually, or you can let Cassandra calculate it for you. Changing that setting after bootstrap has no effect.
I am going through Cassandra tutorials and came across this picture that represents a multi-node Cassandra cluster -
Shouldn't the total number of tokens (256 in the picture above) be distributed across all three nodes, at around 85 tokens each?
No, the num_tokens parameter specifies how many token ranges each node will handle, not a total that is shared across the cluster. From the cassandra.yaml description:
This defines the number of tokens randomly assigned to this node on the ring. The more tokens, relative to other nodes, the larger the proportion of data that this node will store. You probably want all nodes to have the same number of tokens assuming they have equal hardware capability.
Otherwise, what would happen if you had a cluster with more than 256 nodes? ;-)
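The "more tokens, relative to other nodes" part of that description can be sketched as follows. This assumes random token assignment and counts only primary ownership; the node names and the ring_shares helper are made up for the example.

```python
import random


def ring_shares(tokens_per_node, seed=7):
    """tokens_per_node: {node: num_tokens}. Returns {node: fraction of ring owned}."""
    rng = random.Random(seed)
    ring = sorted(
        (rng.randrange(-2**63, 2**63), node)
        for node, count in tokens_per_node.items()
        for _ in range(count)
    )
    owned = dict.fromkeys(tokens_per_node, 0)
    for i, (token, node) in enumerate(ring):
        # Each token owns the range (previous token, token]; index -1 wraps.
        owned[node] += (token - ring[i - 1][0]) % 2**64
    total = sum(owned.values())
    return {node: size / total for node, size in owned.items()}


# Hypothetical cluster: one node is configured with twice as many tokens.
shares = ring_shares({"small-1": 256, "small-2": 256, "big": 512})
for node, share in shares.items():
    print(node, round(share, 3))
```

The node with 512 tokens should end up with roughly half of the ring, and the two nodes with 256 tokens with roughly a quarter each.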
Although it is asked many times and answered many times, I did not find a good answer anyway.
Neither in forums nor in cassandra docs.
How do virtual nodes work?
Suppose a node having 256 virtual nodes.
And docs say they are distributed randomly.
(Leaving aside how that "randomly" is done... I have another, more urgent question.)
Is it right that every "physical" Cassandra node is actually responsible for several distinct locations in the ring (256 locations)? Does that mean the "physical" node is sort of "spread" over the whole circle?
How does re-balancing work in that case, if I add a new node?
The ring will get an additional 256 virtual nodes.
How will those additional virtual nodes divide the data with the old nodes?
Will they, basically, appear as additional "bicycle spokes" randomly spread through the whole ring?
A lot of info on the internet, but nobody makes a clear explanation...
Vnodes break up the available range of tokens into smaller ranges; the number of ranges per node is defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large number for num_tokens, the random distribution makes hot spots less likely. Statistical analysis showed that 256 vnodes was the point at which clusters of any size consistently had a good token range balance. Hence, a num_tokens default of 256 was recommended by the community to prevent hot spots in a cluster.
Ans 1: Yes, each node owns a set of token ranges, and num_tokens determines how many. If you have set 256, which is the default, each node gets 256 token ranges.
Ans 2: Yes, when you add or remove nodes, the tokens are redistributed across the cluster based on the vnode configuration.
For more details, see https://docs.datastax.com/en/ddac/doc/datastax_enterprise/dbArch/archDataDistributeVnodesUsing.html
LetsNoSQL's answer is correct. See also https://stackoverflow.com/a/37982696/5209009. I'll only add a few more comments:
Yes, the "physical" node is spread on the token range.
As explained in the link, any new node will take 256 new token ranges, dividing some of the existing ones. There is no other rebalancing, it relies on randomness to achieve some rebalancing, that's why it's using a relatively large (256) number of tokens per node.
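Here is a small sketch of what "dividing some of the existing ones" means for ownership. It assumes random tokens and counts only primary ownership of the ring; add_node and shares are hypothetical helpers for this example.

```python
import random

RING_SIZE = 2**64


def add_node(ring, node, num_tokens, rng):
    """Insert num_tokens new random tokens for node into ring ({token: node})."""
    added = 0
    while added < num_tokens:
        token = rng.randrange(-2**63, 2**63)
        if token not in ring:
            ring[token] = node
            added += 1


def shares(ring):
    """Fraction of the ring owned by each node; a token owns (previous, token]."""
    tokens = sorted(ring)
    owned = {}
    for i, token in enumerate(tokens):
        span = (token - tokens[i - 1]) % RING_SIZE
        owned[ring[token]] = owned.get(ring[token], 0) + span
    return {node: size / RING_SIZE for node, size in owned.items()}


rng = random.Random(1)
ring = {}
for name in ("n1", "n2", "n3"):
    add_node(ring, name, 256, rng)
before = shares(ring)

add_node(ring, "n4", 256, rng)  # the new node splits existing ranges
after = shares(ring)

for name in sorted(after):
    print(name, round(before.get(name, 0.0), 3), "->", round(after[name], 3))
```

Every new token cuts one existing range in two, so each old node can only give up ownership to the new node; nothing is shuffled between the old nodes.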
It's worth mentioning that there is another option. You can run vnodes with a smaller number of tokens per node (4-8) with a token allocation algorithm. Any new tokens will not be allocated randomly, a greedy algorithm will be used so that the new tokens will create a distribution that optimises the load on a given keyspace. It will simply divide in half the token ranges containing most of the data. Since it's not random it can work with a smaller number of tokens (4-8). It's not really relevant for small clusters, but for 100+ nodes it can be.
See https://www.datastax.com/blog/2016/01/new-token-allocation-algorithm-cassandra-30 and https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html.
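To illustrate the "divide the largest ranges in half" idea, here is a deliberately simplified sketch. It is not Cassandra's actual allocation algorithm (which optimises replicated ownership for a keyspace); it only bisects the widest range on the ring, and the function names are made up for this example.

```python
RING_SIZE = 2**64  # Murmur3 token space is -2**63 .. 2**63 - 1


def largest_range(tokens):
    """Return (size, start) of the widest gap between consecutive ring tokens."""
    ordered = sorted(tokens)
    return max(
        # A lone token owns the whole ring, hence the "or RING_SIZE".
        ((token - ordered[i - 1]) % RING_SIZE or RING_SIZE, ordered[i - 1])
        for i, token in enumerate(ordered)
    )


def allocate_greedy(tokens, num_new):
    """Pick num_new tokens, each one bisecting the currently widest range."""
    tokens = set(tokens)
    added = []
    for _ in range(num_new):
        size, start = largest_range(tokens)
        new_token = start + size // 2
        if new_token >= 2**63:  # wrap back into the signed 64-bit token space
            new_token -= RING_SIZE
        tokens.add(new_token)
        added.append(new_token)
    return added


existing = [-2**63, 0, 2**62]
print(allocate_greedy(existing, 1))  # [-4611686018427387904], i.e. -2**62
```

In this example the widest range runs from -2**63 to 0, so the first new token lands in its middle and all four ranges end up the same size.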
When creating a new keyspace in Cassandra, we need to give a number for the replication factor.
Ex:
Does the number that we give as the replication factor determine the number of nodes that are initially created to store the replicated data?
Can anybody give a clear explanation of what the replication factor does?
It will not create the number of nodes specified. It just means the number of copies of the data. For instance, if your cluster has 5 nodes and the replication factor is 3, each write will be replicated (written) to 3 different nodes, chosen according to the token range it falls into. As for SimpleStrategy, it is an implementation that does not take racks or DCs into consideration when replicating.
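A minimal sketch of that placement, assuming one token per node and a toy token space of 0 to 99: the first replica goes to the node that owns the token, and the remaining replicas go to the next distinct nodes clockwise on the ring, ignoring racks and DCs. The ring and node names are invented for this example.

```python
import bisect


def simple_strategy_replicas(ring, token, replication_factor):
    """ring: sorted list of (token, node). Walk clockwise, collect distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token)  # owner of (previous token, t]
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


# Hypothetical 5-node ring.
ring = [(0, "n1"), (20, "n2"), (40, "n3"), (60, "n4"), (80, "n5")]

print(simple_strategy_replicas(ring, 35, 3))  # ['n3', 'n4', 'n5']
print(simple_strategy_replicas(ring, 85, 3))  # ['n1', 'n2', 'n3']
```

The cluster still has 5 nodes either way; the replication factor only decides how many of them hold a copy of each row.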
The explanation Praneeth Gudumasu gave for replication_factor is correct. The number of nodes in a Cassandra cluster is not something you "give"; you can connect as many nodes as you wish: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddNodeToCluster.html
Each time you connect a new node, it is assigned a token range as per Cassandra's architecture. If you don't know how many nodes you need for your application, I suggest running a performance test with a data size approaching what you would insert in your real application, then executing some queries (concurrently) to see how many nodes you need to get a reasonable response time.
I'm trying to build two 3-node Cassandra clusters in separate data centers. I want to have NetworkTopologyStrategy replication between them, with a replication factor of 3 in each. Thus, I want each node in each data center to have the same records.
Question: what should my token assignment look like for each node (since I'm not actually partitioning, just replicating)?
Thank you!
If you're using Cassandra 1.2, use virtual nodes with automatic assignment.
If you're using 1.1 or earlier, use evenly distributed tokens for one DC:
0
56713727820156410577229101238628035242
113427455640312821154458202477256070484
(0, 1 and 2 times 2**127/3)
For the other DC, you can choose anything as long as it is also evenly distributed. Offsetting by 1 works:
1
56713727820156410577229101238628035243
113427455640312821154458202477256070485
Although for now the tokens don't matter since all nodes hold the same data, if you want to scale in the future it will help to have them already balanced.
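The token values above can be reproduced with a few lines of Python. This is just a sketch of the arithmetic, assuming RandomPartitioner's token space of 0 to 2**127 - 1; evenly_spaced_tokens is a hypothetical helper name.

```python
def evenly_spaced_tokens(num_nodes, offset=0):
    """initial_token values spaced 2**127 / num_nodes apart, shifted by offset."""
    step = 2**127 // num_nodes
    return [i * step + offset for i in range(num_nodes)]


print("DC1:")
for token in evenly_spaced_tokens(3):
    print(token)

print("DC2:")
for token in evenly_spaced_tokens(3, offset=1):
    print(token)
```

The offset only has to keep tokens unique across the two DCs, so any small value works.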