how to achieve even distribution of partitions keys in Cassandra - cassandra

We modeled our data in cassandra table with partition key, lets say "pk". We have a total of 100 unique values for pk and our cluster size is 160. We are using random partitioner. When we add data to Cassandra (with replication factor of 3) for all 100 partitions, I noticed that those 100 partitions are not distributed evenly. One node has as many as 7 partitions and lot of nodes only has 1 or no partition. Given that we are using random partitioner, I expected the distribution to be reasonably even. Because 7 partitions are in the same node, thats creating a hot partition for us. Is there a better way to distribute partitions evenly?
Any input is appreciated.
Thanks

I suspect the problem is the low cardinality of your partition key. With only 100 possible values, it's not unexpected that several values end up hashing to the same nodes.
If you have 160 nodes, then only having 100 possible values for your partition key will mean you aren't using all 160 nodes effectively. An even distribution of data comes from inserting a lot of data with a high cardinality partition key.
So I'd suggest you figure out a way to increase the cardinality of your partition key. One way to do this is to use a compound partition key by including some part of your clustering columns or data fields into your partition key.
You might also consider switching to the Murmur3Partitioner, which generally gives better performance and is the current default partitioner on the newest releases. But you'd still need to address the low cardinality problem.

Related

Cassandra partition technique

From my understanding Apache Cassandra partitions each row in a table into a separate partition located in separate nodes. In that case, if we consider a table having millions of records or rows, Cassandra will partition the records to millions of Nodes.
My doubt is "What if adequate nodes are not available to store each record in case of a table with millions of records which is continuously growing?"
Your understanding is wrong. The three main keywords used in your question are partition, rows and node. Now consider how are they defined
Node represents the Cassandra process running on a virtaul machine/baremetal/cloud.
Partition represents a logical entity which helps Cassandra cluster to know on which node requested data resides. Primary key should be unique.
Row represent a record contained within a partition. A partition can contain millions of rows.
Based on your partition key your Cassandra cluster will identify on which node the data will reside. If you have three nodes, then Cassandra will take hash of your partition key and based on that value node will be identified where data will be written. So as you scale, hash numbers will be redistributed (along with them partitions will be distributed).
So even if you millions of records, they can reside in single node if your Cluster has one node and if you multiple nodes, your data will be distributed almost equally among nodes.

Cassandra Partition vs NoSql Partition

I've understood difference b/w Cassandra Partition key, Composite key, Clustering key. But not finding enough information to understand how partition is handled in cassandra.
In cassandra, range of partition keys are stored on a node like a partition/shard. Is my understanding is correct or not..?
Is each partition key has different file(at the system level) in DB..? If so, won't the reads be slower..?
If each partition key is not having different file in DB. How it's handled..?
Data is stored in Cassandra in wide rows called partitions. Each row has a partition key used for identifying that partition. For distributing the data across the cluster, Cassandra is using partitioners which are basically computing hashes of the partition key and the data is distributed across the cluster based on these values. The default partitioner in Cassandra is Murmur3Partitioner.
At OS level, the data is stored in sstables files. A partition can be spread across many sstables. That's why you also need compaction, which is the process of consolidating those sstables, so your partitions won't be spread across a lot of sstables. Reducing the number of sstables a partitions is spread across, will also improve read time. It's worth noting that sstables are immutable.
I suggest reading this, especially "How Cassandra reads and writes data".

Cassandra - Parameter responsible for number of partitions

After going through multiple websites, partition key in cassandra is responsible for identifying the node in the cluster where it stores data. But I don't understand on what parameter number of partitions are created(like keyspace responsible for Replication Factor) in cassandra..! or it creates partitions based on murmur3 without being able to specifying partitions explicitly
Thanks in Advance
Cassandra by default uses partitioner based on Murmur3 hash that generates values in range from -2^63 to 2^63-1. Each node in cluster is responsible for particular range of hash values, and data with partition key hashed to values in this range go to that node(s). I recommend to read documentation about Cassandra/DSE architecture - it will make things easier to understand.

How Cassandra Handle a select query?

I'm working on designing Cassandra column family.
I met with a situation of higher GC while SELECTing, after loading a higher density of data. That is, amount of data in a partition increased. Also for low density data, it works fine.
I want to know how Cassandra does the SELECT query (with both partition and cluster key specified)?
Is the whole set of data in a partition is loaded into memory while we execute SELECT?
Will large number of partition keys affect performance?
Cassandra does not load the entire partition into memory, but it does load IndexInfo objects which help Cassandra find the relevant CQL rows within the partition. These are short lived java objects which can create quite a bit of heap pressure (GC pauses) - this is a design issue that will be addressed in CASSANDRA-9754 (known as Birch, a b-tree implementation of the index data structure).
Until cassandra-4.0 is released, you should target 100MB for your max partition size, and break larger partitions into smaller pieces.

Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very huge?

Consider I have a PairedRDD of,say 10 partitions. But the keys are not evenly distributed, i.e, all the 9 partitions having data belongs to a single key say a and rest of the keys say b,c are there in last partition only.This is represented by the below figure:
Now if I do a groupByKey on this rdd, from my understanding all data for same key will eventually go to different partitions or no data for the same key will not be in multiple partitions. Please correct me if I am wrong.
If that is the case then there can be a chance that the partition for key a can be of size that may not fit in a worker's RAM. In that case what spark will do ? My assumption is like it will spill the data to worker's disk.
Is that correct?
Or how spark handle such situations
Does spark keep all elements (...) for a particular key in a single partition after groupByKey
Yes, it does. This is a whole point of the shuffle.
the partition for key a can be of size that may not fit in a worker's RAM. In that case what spark will do
Size of a particular partition is not the biggest issue here. Partitions are represented using lazy Iterators and can easily store data which exceeds amount of available memory. The main problem is non-lazy local data structure generated in the process of grouping.
All values for the particular key are stored in memory as a CompactBuffer so a single large group can result in OOM. Even if each record separately fits in memory you may still encounter serious GC issues.
In general:
It is safe, although not optimal performance wise, to repartition data where amount of data assigned to a partition exceeds amount of available memory.
It is not safe to use PairRDDFunctions.groupByKey in the same situation.
Note: You shouldn't extrapolate this to different implementations of groupByKey though. In particular both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.

Resources