I'm new to Cassandra, and I'm stuck at one point.
Consider I have a 5 node cluster with an RF=1 (for simplicity)
Token Ranges
==============
N1 : 1-100
N2 : 101-200
N3 : 201-300
N4 : 301-400
N5 : 401-500
I have a keyspace with 10 partition keys:
ID (PartitionKey) | Name
------------------------
1 Joe
2 Sarah
3 Eric
4 Lisa
5 Kate
6 Agnus
7 Lily
8 Angela
9 Rodger
10 Chris
10 partition keys ==> implies ==> 10 hash values
partitionkey ==> token generated
=================================
1 289 (goes on N3)
2 56 (goes on N1)
3 78 (goes on N1)
4 499 (goes on N5)
5 376 (goes on N4)
6 276 (goes on N3)
7 2 (goes on N1)
8 34 (goes on N1)
9 190 (goes on N2)
10 68 (goes on N1)
If this is the case, then:
N1 has the partition keys : 2,3,7,8,10
N2 has the partition keys : 9
N3 has the partition keys : 1,6
N4 has the partition keys : 5
N5 has the partition keys : 4
So we see that N1 is heavily loaded compared to the other nodes (as per my understanding).
Please help me understand how data is evenly distributed in Cassandra, w.r.t Partitioners and consistent hashing.
There is some truth to what you're posting here, mainly because data distribution via hashing is tough with smaller numbers. But let's add one assumption... Let's say we use vNodes, with num_tokens: 4* set in the cassandra.yaml.
So with this new assumption, token range distribution likely looks more like this:
Token Ranges
==============
N1 : 1-25, 126-150, 251-275, 376-400
N2 : 26-50, 151-175, 276-300, 401-425
N3 : 51-75, 176-200, 301-325, 426-450
N4 : 76-100, 201-225, 326-350, 451-475
N5 : 101-125, 226-250, 351-375, 476-500
Given this distribution, your keys are now placed like this:
N1 has the partition keys : 5, 7
N2 has the partition keys : 1, 6, 8
N3 has the partition keys : 2, 9, 10
N4 has the partition keys : 3
N5 has the partition keys : 4
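To make that placement concrete, here is a minimal Python sketch of the lookup. It uses the toy 1-500 token space and the illustrative vNode ranges above (not Cassandra's real Murmur3 token space) and simply finds which sub-range each token falls into:

# Toy model of vNode range lookup; the ranges and tokens are the
# illustrative values from this example, not real Murmur3 output.
vnode_ranges = {
    "N1": [(1, 25), (126, 150), (251, 275), (376, 400)],
    "N2": [(26, 50), (151, 175), (276, 300), (401, 425)],
    "N3": [(51, 75), (176, 200), (301, 325), (426, 450)],
    "N4": [(76, 100), (201, 225), (326, 350), (451, 475)],
    "N5": [(101, 125), (226, 250), (351, 375), (476, 500)],
}

# partition key -> token, as listed in the question
tokens = {1: 289, 2: 56, 3: 78, 4: 499, 5: 376,
          6: 276, 7: 2, 8: 34, 9: 190, 10: 68}

def owner(token):
    """Return the node whose vNode sub-range contains the token."""
    for node, ranges in vnode_ranges.items():
        if any(lo <= token <= hi for lo, hi in ranges):
            return node

placement = {}
for key, token in sorted(tokens.items()):
    placement.setdefault(owner(token), []).append(key)

print(placement)
# {'N2': [1, 6, 8], 'N3': [2, 9, 10], 'N4': [3], 'N5': [4], 'N1': [5, 7]}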
Now factor in that there is a random component to the range allocation algorithm, and the actual distribution could be even better.
As with all data sets, the numbers get better as the amount of data increases. I'm sure that you'd see better distribution with 1000 partition keys vs. 10.
Also, as the size of your data set increases, data distribution will benefit from new nodes being added with allocate_tokens_for_keyspace set. This allows the token allocation algorithm to make smart (less random) decisions about token range assignment based on your keyspace's replication factor.
*Note: Using vNodes with num_tokens: 4 is considered by many Cassandra experts to be an optimal production setting. With the new algorithm, the default of 256 tokens is quite high.
Selecting the partition key is very important in having even distribution of data among all the nodes. The partition key is supposed to be something that has very high cardinality.
For example, in a 10-node cluster, selecting the state of a specific country as the partition key may not be ideal, since there's a very high chance of creating hotspots, especially when the number of records itself may not be even across states. Choosing something like zip code would be better, and something like customer name or order number would be better still.
You can explore having a composite partition key if it helps your use case.
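As a rough, hypothetical illustration of the cardinality point (this just buckets an MD5 of the key across 10 imaginary nodes with made-up data; it is not Cassandra's Murmur3 partitioner):

import hashlib
import random

NUM_NODES = 10

def node_for(key):
    """Toy stand-in for the partitioner: hash the partition key, pick a node."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_NODES

random.seed(0)
# 100,000 fake order rows; 'state' is low cardinality (50 skewed values),
# 'order_id' is unique per row.
rows = [{"order_id": i,
         "state": "state_%d" % min(int(random.expovariate(0.1)), 49)}
        for i in range(100_000)]

def load_per_node(key_fn):
    counts = [0] * NUM_NODES
    for row in rows:
        counts[node_for(key_fn(row))] += 1
    return counts

print(load_per_node(lambda r: r["state"]))     # skewed: some nodes hold far more rows
print(load_per_node(lambda r: r["order_id"]))  # roughly 10,000 rows on every node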
In Cassandra, data is distributed based on the partition key and a hashing algorithm. There are several other parameters to configure for data distribution and replication, such as the replication factor, replication strategy, snitch, etc. Below is the standard recommended documentation:
https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/architecture/archDataDistributeAbout.html
[Question posted by a user on YugabyteDB Community Slack]
I have one question regarding the number of tablets for a table.
I am using YSQL API, my cluster is having 3 nodes with an RF of 3 and each node has 16 cores.
I haven't specified the number of shards per table using the SPLIT INTO N TABLETS syntax, so I guess the number of tablets will be decided by the number of cores a node has; based on the documentation it will be 8 shards per table per node.
In this case, the total shards for a table should be 24 (8 x 3).
We have RF=3 as well, so will that mean the total shards after replication will be 72? (24 x 3)
I am confused here, as I have seen only 24 shards in the tserver tablets UI, where it's mentioned that 8 of the 24 shards are leaders. Seeing this, it seems the 24 shards include the replicated ones as well.
Please correct my understanding here.
I am using 2.12, latest stable.
More questions related to the same topic, if ysql_num_shards_per_tserver=8, then:
If we create a cluster with 4 nodes with RF 3, then will the total tablets/shards be 8 x 4 = 32 (without peers), and 32 x 3 = 96 (including peers)?
Also, suppose we add one more node to an existing 3-node cluster. After the node addition, will 8 new tablets/shards be created for the new node, and then the tablets/shards rebalanced? Or are no new tablets/shards created, with just a rebalancing of the existing ones?
num_shards_per_tserver is 8
This is for one table; on the left it shows 24 shards, 8 of which are leaders and the rest followers.
In this case, the total shards for a table should be 24 (8 x 3).
No, what is called a shard in ysql_num_shards_per_tserver is the number of tablets. Each tablet/shard has 3 tablet peers (one leader and two followers). So, with ysql_num_shards_per_tserver=8, it is expected that you see 8 leaders on a server, plus 16 followers, which are tablet peers of tablets whose leaders are on the two other servers.
Your screenshots are from one tserver endpoint (:9000/tables). The master endpoint (localhost:7000/table?id=000030af000030008000000000004200) will show you all the tablets and their peers.
The master shows the tablets (you have 24) and where their leader and followers are (1 leader, 2 followers), so 24 * 3 = 72 tablet peers in total. The tserver shows only the tablet peers within that server.
If we create a cluster with 4 nodes with RF 3, then will the total tablets/shards be 8 x 4 = 32 (without peers), and 32 x 3 = 96 (including peers)?
Yes, it will be 96 total.
Also, suppose we add one more node to an existing 3-node cluster. After the node addition, will 8 new tablets/shards be created for the new node?
No new tablets will be created.
Or are no new tablets/shards created, with just a rebalancing of the existing ones?
Yes, only rebalancing. Note that you can use auto-splitting to get new tablets created based on thresholds (tablet size above a limit and tablet count per node below a limit).
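For reference, a small sketch of the arithmetic discussed in this thread (assuming ysql_num_shards_per_tserver=8 and RF=3; the function and variable names are just illustrative):

def tablet_counts(num_tservers, shards_per_tserver=8, rf=3):
    """Tablets are created per tserver at table creation; each tablet then
    has `rf` tablet peers (1 leader + rf-1 followers) spread across nodes."""
    tablets = shards_per_tserver * num_tservers    # logical shards of the table
    peers = tablets * rf                           # physical copies cluster-wide
    peers_per_tserver = peers // num_tservers      # what one tserver UI lists
    leaders_per_tserver = tablets // num_tservers  # leaders hosted per tserver
    return tablets, peers, peers_per_tserver, leaders_per_tserver

print(tablet_counts(3))  # (24, 72, 24, 8) -> matches the 24 peers / 8 leaders seen
print(tablet_counts(4))  # (32, 96, 24, 8)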
I have a dataset of points of interest on the maps like the following:
ID latitude longitude
1 48.860294 2.338629
2 48.858093 2.294694
3 48.8581965 2.2937403
4 48.8529717 2.3477134
...
The goal is to find those clusters of points that are very close to each other (distance less than 100m).
So the output I expect for this dataset would be:
(2, 3)
The point 2 and 3 are very close to each other with a distance less than 100m, while the others are far away so they should be ignored.
Since the dataset is huge with all the points of interest in the world, I need to do it with Spark with some parallel processing.
What approach should I take for this case?
I actually solved this problem using the following 2 approaches:
DBSCAN algorithm implemented as Spark job with partitioning
https://github.com/irvingc/dbscan-on-spark
GeoSpark with spatial distance join
https://github.com/DataSystemsLab/GeoSpark
Both of them are based on Spark, so they work well with large-scale data.
However, I found that dbscan-on-spark consumes a lot of memory, so I ended up using GeoSpark with the distance join.
I would love to do a cross join here; however, that probably won't work since your data is huge.
One approach is to partition the data region-wise. That means you can transform the input data like this:
ID latitude longitude latitude_int longitude_int group_unique_id
1 48.860294 2.338629 48 2 48_2
2 48.858093 2.294694 48 2 48_2
3 48.8581965 2.2937403 48 2 48_2
4 48.8529717 2.3477134 48 2 48_2
The assumption here is that if the integer portion of the lat/long differs, the points are more than 100 m apart.
Now you can partition the data w.r.t group_unique_id and then do a cross join per partition.
This will probably reduce the workload.
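Below is a minimal PySpark sketch of that idea (column names and the 100 m threshold are illustrative, not a definitive implementation). It buckets points by the integer degree of lat/long, self-joins within each bucket, and keeps pairs whose haversine distance is under 100 m, consistent with the assumption above that points in different buckets are more than 100 m apart:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nearby-points").getOrCreate()

points = spark.createDataFrame(
    [(1, 48.860294, 2.338629),
     (2, 48.858093, 2.294694),
     (3, 48.8581965, 2.2937403),
     (4, 48.8529717, 2.3477134)],
    ["id", "lat", "lon"],
)

# Bucket by the integer degree so the join only happens within a bucket.
points = points.withColumn("lat_i", F.floor("lat")).withColumn("lon_i", F.floor("lon"))

a, b = points.alias("a"), points.alias("b")
pairs = (
    a.join(b, (F.col("a.lat_i") == F.col("b.lat_i"))
             & (F.col("a.lon_i") == F.col("b.lon_i"))
             & (F.col("a.id") < F.col("b.id")))
     .withColumn("dlat", F.radians(F.col("b.lat") - F.col("a.lat")))
     .withColumn("dlon", F.radians(F.col("b.lon") - F.col("a.lon")))
     # Haversine formula; Earth radius ~6,371,000 m gives the distance in metres.
     .withColumn("h", F.pow(F.sin(F.col("dlat") / 2), 2)
                      + F.cos(F.radians(F.col("a.lat"))) * F.cos(F.radians(F.col("b.lat")))
                      * F.pow(F.sin(F.col("dlon") / 2), 2))
     .withColumn("dist_m", 2 * 6371000 * F.asin(F.sqrt("h")))
     .filter(F.col("dist_m") < 100)
     .select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b"), "dist_m")
)

pairs.show()  # expected: only the (2, 3) pair, at roughly 70 m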
I know that Vnodes form many token ranges for each node by setting num_tokens in cassandra.yaml file.
Say, for example (a), I have 6 nodes and on each node I have set num_tokens=256. How many virtual nodes are formed among these 6 nodes; that is, how many virtual nodes or sub token ranges are contained in each physical node?
According to my understanding, when every node has num_tokens set to 256, it means that all 6 nodes contain 256 vnodes each. Is this statement true? If not, how do vnodes form the range of tokens (obviously random) on each node? It would be really convenient if someone could explain this to me with the example mentioned as (a).
What does the ring of vnodes signify in this URL: http://docs.datastax.com/en/cassandra/3.x/cassandra/images/arc_vnodes_compare.png (taken from: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2)?
Every partition key in Cassandra is converted to a numerical token value using the Murmur3 hash function. The token range is from -2^63 to +2^63 - 1 (the range of a signed Java long).
num_tokens defines how many token ranges are assigned to a node. Each node calculates 256 (num_tokens) random values in the token range and informs the other nodes what they are, so when a node needs to coordinate a request for a specific token it knows which nodes are responsible for it, according to the replication factor and DC/rack placement.
A better description for this feature would be "automatic token range assignment for better streaming capabilities", calling it "virtual" is a bit confusing.
In your case you have 6 nodes, each set with 256 token ranges, so you have 6 * 256 token ranges in total, and each physical node contains 256 token ranges.
For example consider 2 nodes with num_tokens set to 4 and token range 0 to 100.
Node 1 calculates tokens 17, 35, 77, 92
Node 2 calculates tokens 4, 25, 68, 85
The ring shows the distribution of token ranges in this case
Node 2 is responsible for token ranges 4-17, 25-35, 68-77, 85-92 and node 1 for the rest.
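A tiny Python sketch of how those interleaved sub-ranges fall out of the randomly calculated tokens (following the convention used in this example, where a node covers the range starting at each of its tokens up to the next token on the ring):

# Tokens each node calculated (num_tokens=4, toy token space 0-100).
node_tokens = {"Node 1": [17, 35, 77, 92], "Node 2": [4, 25, 68, 85]}

# Build the ring: every token, tagged with the node that calculated it.
ring = sorted((t, node) for node, tokens in node_tokens.items() for t in tokens)

# Each sub-range runs from one token to the next (wrapping around at the end).
for i, (token, node) in enumerate(ring):
    next_token = ring[(i + 1) % len(ring)][0]
    print(f"{node} covers {token}-{next_token}")
# Node 2 covers 4-17, 25-35, 68-77, 85-92; Node 1 covers the rest.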
I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test(
    ID uuid,
    Time timestamp,
    Value double,
    Date timestamp,
    PRIMARY KEY ((ID, Date), Time)
) WITH COMPACT STORAGE;
I have parsed this information in a custom yaml file and used parameters n=10000, threads=100 and the rest are default options (cl=one, mode=native cql3, etc). The Cassandra cluster is a 3 node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspec:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
The number of row-keys/partitions is 10000 (i.e. n), within which 100 partitions are taken at a time (which means 100 * 1000 = 100000 key-value pairs), out of which 50000 key-value pairs are processed at a time. (This is because of select: fixed(1)/2 ~ 50%)
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run Total_ops Op_rate Partition_rate Row_Rate Time
1 56 19 1885 943246 3.0
2 46 46 4648 2325498 1.0
3 27 30 2982 1489870 0.9
4 59 19 1932 966034 3.1
5 100 17 1730 865182 5.8
Now what I need to understand are as follows:
Which among these metrics is the throughput i.e, No. of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it’s the Row_rate, can I safely conclude here that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why is it that the Total_ops vary so drastically in every run ? Has the number of threads got anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL Rows that you have inserted into your database. For your table a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of Partitions C* had to construct. A Partition is the data structure which holds and orders data in Cassandra; data with the same partition key ends up located on the same node. The Partition Rate is equal to the number of unique values in the Partition Key that were inserted in the time window. For your table this would be unique values for (ID, Date).
Op Rate is the number of actual CQL operations that had to be done. From your settings it is running unlogged batches to insert the data. Each insert contains approximately 100 Partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate.
Total OP should include all operations, read and write. So if you have any read operations those would also be included.
I would suggest changing your batch size to match your workload, or keep it at 1 depending on your actual database usage. This should provide a more realistic scenario. Also it's important to run much longer than just 100 total operations to really get a sense of your system's capabilities. Some of the biggest difficulties come when the size of the dataset increases beyond the amount of RAM in the machine.
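To tie the reported metrics together, here is a small sketch of the arithmetic implied by the profile above (100 partitions per unlogged batch, 1000 clustering rows per partition, and select: fixed(1)/2 visiting roughly half of each partition per op, per the quoted output message):

partitions_per_op = 100    # insert: partitions: fixed(100)
rows_per_partition = 1000  # columnspec: Time -> size: fixed(1000)
visit_fraction = 0.5       # insert: select: fixed(1)/2

def expected_rates(op_rate):
    """Derive the partition and row rates the tool should report for a given op rate."""
    partition_rate = op_rate * partitions_per_op
    row_rate = partition_rate * rows_per_partition * visit_fraction
    return partition_rate, row_rate

# Run 1 reported op_rate=19, partition_rate=1885, row_rate=943246:
print(expected_rates(19))  # (1900, 950000.0) -- close to the reported numbers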
In my project, I use cassandra 2.0, and have 3 database servers.
2 of the 3 servers have 2 TB of hard drive; the last has just 200 GB. So I want the 2 bigger servers to take a higher load than the last one.
Cassandra: I use Murmur3Partitioner to partition the data.
My question is: how can I calculate the initial_token for each cassandra instance?
Thanks for your help :)
If you are using a somewhat recent version of Cassandra (2.x) then you can configure the number of tokens a node should hold relative to other nodes in the cluster. There is no need to specify token range boundaries via the initial_token any more. Instead you give a node a "weight" through the num_tokens parameter. As the capacity of your smaller node is roughly 1/10th of the big ones, adjust the weight of that node accordingly. The default weight is 256. So you could start with a weight of 25 for the smaller node and try and see whether it works OK that way.
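As a rough back-of-the-envelope check (a sketch, assuming ownership is roughly proportional to num_tokens, and using 25 vs. the default 256 as hypothetical weights):

# Hypothetical num_tokens weights: the two 2 TB nodes keep the default 256,
# the 200 GB node is scaled down to 25.
weights = {"big-1": 256, "big-2": 256, "small": 25}
total = sum(weights.values())

for node, num_tokens in weights.items():
    print(f"{node}: ~{100 * num_tokens / total:.1f}% of the data")
# big-1: ~47.7%, big-2: ~47.7%, small: ~4.7%  (about 1/10th of a big node's share)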
Murmur3Partitioner: uniformly distributes data across the cluster based on the MurmurHash hash value.
Murmur3Partitioner uses a maximum possible range of hash values from -2^63 to +2^63 - 1. Here is the formula to calculate tokens:
python -c 'print [str(((2**64 / number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)]'
For example, to generate tokens for 10 nodes:
python -c 'print [str(((2**64 / 10) * i) - 2**63) for i in range(10)]'
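Note that those one-liners use Python 2 print syntax and integer division; a Python 3 sketch of the same formula would be:

# Evenly spaced Murmur3 initial tokens for a cluster of a given size (Python 3).
def initial_tokens(number_of_nodes):
    return [str((2**64 // number_of_nodes) * i - 2**63) for i in range(number_of_nodes)]

print(initial_tokens(10))
# ['-9223372036854775808', '-7378697629483820647', '-5534023222112865486', ...]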