Partitioning for cassandra nodes, using Murmur3Partitioner - cassandra

In my project, I use cassandra 2.0, and have 3 database servers.
2 of 3 servers has 2 TB of hard drive, the last has just 200 GB. So, I want the 2 servers response for higher load than the last one.
Cassandra: I use Murmur3Partitioner to partition the data.
My question is: how can I calculate the initial_token for each cassandra instance?
Thanks for your help :)

If you are using a somewhat recent version of Cassandra (2.x) then you can configure the number of tokens a node should hold relative to other nodes in the cluster. There is no need to specify token range boundaries via the initial_token any more. Instead you give a node a "weight" through the num_tokens parameter. As the capacity of your smaller node is roughly 1/10th of the big ones, adjust the weight of that node accordingly. The default weight is 256. So you could start with a weight of 25 for the smaller node and try and see whether it works OK that way.

Murmur3Partitioner : Uniformly distribute the data across the clusters based on the MurmurHash hash value.
Murmur3Partitioner uses a maximum possible range of hash values from -263 to +263-1. Here is the formula to calculate tokens:
python -c 'print [str(((264 / number_of_tokens) * i) - 263) for i in range(number_of_tokens)]'
For example, to generate tokens for 10 nodes:
python -c 'print [str(((264 / 10) * i) - 263) for i in range(10)]'

Related

Adding new DC to a cluster that uses initial_token/allocate_tokens_for_keyspace allocation scheme

We have a 3 DC cluster running Cassandra 3.10. Each DC has 24 nodes total with 8 tokens per node and 3 seed nodes. We use Murmur3Partitioner.
In order to ensure better data distribution, the cluster was created using tokens allocation approach where you manually specify initial_token for seed nodes and use allocate_tokens_for_keyspace for non seed nodes.
Now I need to add another DC to the cluster using the same tokens allocation approach, but I can't figure out how to calculate initial_token for the new seed nodes. My naive approach was to copy token values from one of the existing DCs only to discover that initial token values must be unique across the whole cluster.
So, now I'm kinda lost on how to proceed. Any help will be appreciated, thanks.
Correct, token range assignments must be unique.
In the days before VNodes and automatic token range calculations, we used to have to calculate token ranges manually. For multiple data centers, we would use the same token ranges offset by 1.
Example: If you use num_tokens: 4 and have 6 nodes in your DC, your nodes' token ranges might look something like this:
node1: 8454757700450209793, -5380300354831952895, -768614336404564991, 3843071682022821889
node2: -9223372036854775807, -4611686018427387903, 1, 4611686018427387905
node3: -8454757700450211199, -3843071682022823935, 768614336404563969, 5380300354831951873
node4: -7686143364045646591, -3074457345618258943, 1537228672809127937, 6148914691236515841
node5: -6917529027641081855, -2305843009213693951, 2305843009213693953, 6917529027641081857
node6: -6148914691236517375, -1537228672809129983, 3074457345618257921, 7686143364045645825
If you added a second DC, then the ranges for the new 6 nodes in the other DC would look like this:
node1: 8454757700450209794, -5380300354831952894, -768614336404564990, 3843071682022821890
node2: -9223372036854775806, -4611686018427387902, 2, 4611686018427387906
node3: -8454757700450211198, -3843071682022823934, 768614336404563970, 5380300354831951874
node4: -7686143364045646590, -3074457345618258942, 1537228672809127938, 6148914691236515842
node5: -6917529027641081854, -2305843009213693950, 2305843009213693954, 6917529027641081858
node6: -6148914691236517374, -1537228672809129982, 3074457345618257922, 7686143364045645826
If you'd have to do this for another DC, then you would offset their starting token ranges by 2.

Nodetool load and own stats

We are running 2 nodes in a cluster - replication factor 1.
After writing a burst of data, we see the following via node tool status.
Node 1 - load 22G (owns 48.2)
Node 2 - load 17G (owns 51.8)
As the payload size per record is exactly equal - what could lead to a node showing higher load despite lower ownership?
Nodetool status uses the Owns column to indicate the effective percentage of the token range owned by the nodes. While GB is Size of your records
Dont see anything wrong here. Your data is almost evenly distributed around your two nodes which is exactly what you want for perfekt performance.

Cassandra Vnodes and token Ranges

I know that Vnodes form many token ranges for each node by setting num_tokens in cassandra.yaml file.
say for example (a), i have 6 nodes, each node i have set num_token=256. How many virtual nodes are formed among these 6 nodes that is, how many virtual nodes or sub token ranges contained in each physical node.
According to my understanding, when every node has assigned num_token as 256, then it means that all the 6 nodes contain 256 vnodes each. Is this statement true? if not then, how vnodes form the range of tokens (obviously random) in each node. It would be really convenient if someone can explain me with the example mentioned as (a).
what is the Ring of Vnodes signify in this url:=> http://docs.datastax.com/en/cassandra/3.x/cassandra/images/arc_vnodes_compare.png (taken from: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2 )
Every partition key in Cassandra is converted to a numerical token value using the MurMur3 hash function. The token range is between -2^63 to +2^63 -1
num_token defines how many token ranges are assigned to a node. this is the same as the signed java long. Each node calculates 256 (num_tokens) random values in the token range and informs other nodes what they are, thus when a node needs to coordinate a request for a specific token it knows which nodes are responsible for it, according to the Replication Factor and DC/rack placement.
A better description for this feature would be "automatic token range assignment for better streaming capabilities", calling it "virtual" is a bit confusing.
In your case you have 6 nodes, each set with 256 token ranges so you have 6*256 token ranges and each psychical node contains 256 token ranges.
For example consider 2 nodes with num_tokens set to 4 and token range 0 to 100.
Node 1 calculates tokens 17, 35, 77, 92
Node 2 calculates tokens 4, 25, 68, 85
The ring shows the distribution of token ranges in this case
Node 2 is responsible for token ranges 4-17, 25-35, 68-77, 85-92 and node 1 for the rest.

Write Request metric

I'm currently using 1-node cluster with DataStax Opscenter 5.2.1 (Cassandra 2.2.3) installed on Windows.
There is not too much data is sent to the cluster, and here is the graph (last 20 minutes) of write requests that I can see in Opscenter. The graph looks normal and expected for me:
write_requests(20min)
However, when I've switched the data range to last 1 hour, as turns out there were much more write requests (according to cluste(max) line):
write_requests(1h)
I'm confused, could someone clarify what cluster(max) means in my case? Why these values are so big in comparison with cluster(total) or cluster(min)?
The first graph (20 minute) uses an average. The 1h graph will have 3 lines - min per sample, average, and max per sample.
What you're likely seeing is that something (perhaps opscenter itself) is doing a flood of writes, about 700/second for a few seconds, and on the 20 minute graph it gets averaged out, but with the min/max lines, you'll see the outliers.

Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test(
ID uuid,
Time timestamp,
Value double,
Date timestamp,
PRIMARY KEY ((ID,Date), Time)
) WITH COMPACT STORAGE;
I have parsed this information in a custom yaml file and used parameters n=10000, threads=100 and the rest are default options (cl=one, mode=native cql3, etc). The Cassandra cluster is a 3 node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
partitions: fixed(100)
select: fixed(1)/2
batchtype: UNLOGGED
columnspecs:
-name: Time
size: fixed(1000)
-name: ID
size: uniform(1..100)
-name: Date
size: uniform(1..10)
-name: Value
size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
The number of row-keys/partitions is 10000(i.e n), within which 100 partitions are taken at a time (which means 100 *1000 = 100000 key-value pairs) out of which 50000 key-value pairs are processed at a time. (This is because of select: fixed(1)/2 ~ 50%)
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run Total_ops Op_rate Partition_rate Row_Rate Time
1 56 19 1885 943246 3.0
2 46 46 4648 2325498 1.0
3 27 30 2982 1489870 0.9
4 59 19 1932 966034 3.1
5 100 17 1730 865182 5.8
Now what I need to understand are as follows:
Which among these metrics is the throughput i.e, No. of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it’s the Row_rate, can I safely conclude here that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why is it that the Total_ops vary so drastically in every run ? Has the number of threads got anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL Rows that you have inserted into your database. For your table a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of Partitions C* had to construct. A Partition is the data-structure which holds and orders data in Cassandra, data with the same partition key ends up located on the same node. This Partition rate is equal to the number of unique values in the Partition Key that were inserted in the time window. For your table this would be unique values for (ID,Date)
Op Rate is the number of actually CQL operations that had to be done. From your settings it is running unlogged Batches to insert the data. Each insert contains approximately 100 Partitions (Unique combinations of ID and Date) which is why OP Rate * 100 ~= Partition Rate
Total OP should include all operations, read and write. So if you have any read operations those would also be included.
I would suggest changing your batch size to match your workload, or keep it at 1 depending on your actual database usage. This should provide a more realistic scenario. Also it's important to run much longer than just 100 total operations to really get a sense of your system's capabilities. Some of the biggest difficulties come when the size of the dataset increases beyond the amount of RAM in the machine.

Resources