Recommended number of partitions in Cassandra

Although Cassandra's token range (-2^63 to +2^63-1) allows for an effectively unlimited number of partitions, is there a recommended maximum number of partitions beyond which performance might suffer?

After about 1 billion partitions per node, full (non-incremental) repairs begin to have pretty serious issues with overstreaming, particularly with smaller partitions, since the validation compactions run slower.
Ideally I would frame the recommendation in terms of partition size rather than count. At somewhere around 100 MB per partition you get more efficient compactions without too much of the expensive overhead of the partition index on reads. I wouldn't be too strict about it, though, as it depends heavily on many factors. Focus on modeling for your queries first, then fine-tune if the resulting model ends up with partitions that are too large or too many that are too small (hundreds of millions or more sub-1 KB partitions, or any partition in the multi-GB range -- per node, not total).
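To make the ~100 MB guideline concrete, here is a hedged back-of-the-envelope calculation in Scala (a sketch, not a rule; the 800-byte average row size is purely an assumption you would replace with a measured value):

object PartitionSizing extends App {
  // Rough sizing sketch: how many rows fit in a ~100 MB partition, given an assumed row size.
  val avgRowBytes       = 800L                                   // assumption: measured average serialized row size
  val targetPartitionMB = 100L                                   // the ~100 MB guideline mentioned above
  val rowsPerPartition  = (targetPartitionMB * 1024 * 1024) / avgRowBytes

  println(s"Aim for roughly $rowsPerPartition rows per partition") // ~131,000 rows with these numbers
}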

Related

Wide partitions and large number of partitions in Cassandra

We are currently facing wide partitions for certain customers; we have multiple GB of data in those partitions. We tried remodelling the data, but the partitions always seem to end up skewed. We were trying to use bucketing logic to minimise the partition size, but either a large number of partitions is generated for low-resource-consuming users, or wide partitions are generated for high resource consumers.
A large number of partitions leads to heap memory bloat, while wide partitions lead to slower reads.
How can I solve a situation like this?
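One way to picture the bucketing logic described above is a composite partition key such as (customer_id, bucket), where the bucket is derived from time plus a small hash. This is only a hedged sketch; the names, schema shape and bucket count are hypothetical and would need tuning per customer:

import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object PartitionBucketing {
  private val dayFormat = DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC)

  // bucketsPerDay caps how wide a single day's partition can grow for a heavy customer;
  // light customers can use bucketsPerDay = 1 so their data doesn't explode into tiny partitions.
  def bucketFor(customerId: String, eventTime: Instant, bucketsPerDay: Int): String = {
    val day       = dayFormat.format(eventTime)
    val hash      = (customerId + day).hashCode & Int.MaxValue   // non-negative hash
    val subBucket = hash % bucketsPerDay
    s"$day-$subBucket"
  }
}

// Example: PartitionBucketing.bucketFor("cust-42", Instant.now(), 4) might yield "20240101-3"

Choosing bucketsPerDay per customer (larger for heavy customers, 1 for light ones) is what keeps both failure modes in check.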

Cassandra read latency increases while writing

I have a Cassandra cluster whose read latency increases during writes. The writes mostly happen via Spark jobs during the night, in huge bursts. Writes use LOCAL_QUORUM and reads use LOCAL_ONE. Is there a way to reduce read latency while the writes are happening?
Cassandra cluster config:
10-node Cassandra cluster (5 in DC1, 5 in DC2)
CPU: 8 cores
Memory: 32 GB
Grafana metrics (screenshots not included here)
I can give some advice:
Use LCS compaction strategy.
Prefer a round-robin load-balancing policy for reads (see the driver sketch after this list).
Choose partition_key wisely so that requests are not bombarded on a single partition.
Partition size also plays a role. Cassandra recommends keeping partitions small. However, I have tested partitions of 10,000 rows each, with each row about 800 bytes, and they worked better than partitions of 3,000 rows (or even 1 row). Very tiny partitions tend to increase CPU usage when the stored data is large in terms of row count. However, very large partitions should be avoided as well.
The replication factor should be chosen strategically, and the write consistency level should be decided with the replication of all keyspaces in mind.
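To illustrate the load-balancing point, here is a hedged sketch using the DataStax Java driver 3.x from Scala: a token-aware wrapper around a DC-aware round-robin policy keeps reads on local replicas and spreads them evenly. The DC name and contact point are assumptions:

import com.datastax.driver.core.Cluster
import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

object ReadSideClient extends App {
  // Route each request to a replica that owns the partition, round-robin within the local DC.
  val loadBalancing = new TokenAwarePolicy(
    DCAwareRoundRobinPolicy.builder().withLocalDc("DC1").build()  // "DC1" is an assumed DC name
  )

  val cluster = Cluster.builder()
    .addContactPoint("10.0.0.1")             // hypothetical contact point
    .withLoadBalancingPolicy(loadBalancing)
    .build()

  val session = cluster.connect()
  // ... issue LOCAL_ONE reads here ...
  session.close()
  cluster.close()
}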

drawbacks to large spark partition sizes

I have read that too many small partitions hurt performance because of overhead, e.g. sending a very large number of tasks to executors.
What are the downsides of using maximally large partitions, e.g. why do I see recommendations in the 100s of MB range?
I can see a few potential issues:
If you lose a partition, there's a large amount of work to recompute. With many smaller partitions you may lose partitions more often, but you will have less variance in your runtime.
If one of your few tasks on large partitions takes longer to compute than the others, it would leave other cores under-utilized; with smaller partitions, the work can be better distributed across the cluster.
Do these issues make sense, and are there others? Thanks!
These two potential issues are correct.
For better cluster usage, one should define partitions large enough to fill an HDFS block (128/256 MB in general) but avoid exceeding it, for a better distribution that allows horizontal scaling for performance (maximizing CPU usage).
As for the first point, you cannot assume that the variance in runtime will be lower just because you have a larger number of smaller partitions. Say one of the nodes crashes, forcing recomputation of its RDD partitions; you now have one less node to process the data, so your runtime will increase irrespective of the number of partitions.
Regarding "If one of your few tasks on large partitions takes longer to compute than the others": this happens when you have skewed data. Increasing the number of partitions can help, but simply increasing it isn't always sufficient.
The maximum partition size should not be greater than 128 MB, which is the default block size in HDFS. But you should not have very small partitions either, as they add the overhead of scheduling many tasks and maintaining a lot of metadata. As in any multithreaded application, increasing parallelism doesn't always increase performance; in the end it comes down to finding the optimal value at which you get maximum performance.
By having large partition sizes you will have:
Less concurrency,
Increased memory pressure for transformations that involve a shuffle,
More susceptibility to data skew.
Please refer here to find the optimal number of partitions.
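As a hedged illustration of the sizing logic above (the input size and path are assumptions, and sc is the usual SparkContext), one might derive the partition count from the data volume and the ~128 MB block size:

// Derive a partition count from the input size and the ~128 MB HDFS block size.
val targetPartitionBytes = 128L * 1024 * 1024
val inputBytes           = 64L * 1024 * 1024 * 1024            // assumption: ~64 GB of input
val numPartitions        = math.max(
  (inputBytes / targetPartitionBytes).toInt,                   // ~512 partitions of ~128 MB each
  sc.defaultParallelism                                        // never drop below one task per core
)

val events        = sc.textFile("hdfs:///data/events/*")       // hypothetical input path
val repartitioned = events.repartition(numPartitions)          // also helps spread skewed data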

What is a performant partitioning strategy for key-agnostic mapping?

First of all, I'm working with PySpark on Glue and I'm reading several very large CSV files. Those CSVs are bzip2-compressed and several GB in size once inflated.
At this stage of processing I'm only performing a simple map over all rows. No joins, group bys, filtering. Just a map.
Let's say I am working on 10 nodes. Generally speaking, would it be preferable to have a rather high number of partitions or a rather low number?
I would guess that, independent of the number of cores available across those nodes, that number should be pretty high, to make sure that every executor is busy at all times and has small chunks of data available.
So, let's say there are 20 cores across those 10 nodes, and let's for a second assume the partitions were key-based; then something larger than 40 would likely not be a good idea. But in the key-agnostic mapping case I'd tend towards something like 1000 partitions or more.
Does that make sense? I'm especially interested in the thought process here.
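A hedged sketch of that thought process (the paths, the oversubscription factor of 4, and the map itself are all assumptions, and the snippet uses plain Spark/Scala rather than the Glue API):

import org.apache.spark.sql.SparkSession

object MapOnlyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-only").getOrCreate()
    import spark.implicits._

    // Oversubscribe the available cores so executors stay busy even when tasks finish unevenly;
    // the factor of 4 is an assumption, not a rule.
    val totalCores    = spark.sparkContext.defaultParallelism   // e.g. 20 across the 10 nodes
    val numPartitions = totalCores * 4

    val rows   = spark.read.option("header", "true").csv("s3://my-bucket/input/*.csv.bz2") // hypothetical path
    val mapped = rows.repartition(numPartitions).map(_.mkString("|"))                      // the "simple map"

    mapped.write.text("s3://my-bucket/output/")                                            // hypothetical path
    spark.stop()
  }
}

Whether ~80 partitions (4 x 20 cores) or ~1000 works better mostly depends on how evenly sized the chunks end up; the oversubscription factor is the knob to experiment with.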

Settle the right number of partition on RDD

I read some comments saying that a good number of partitions for an RDD is 2-3 times the number of cores. I have 8 nodes, each with two 12-core processors, so I have 192 cores. I set the number of partitions to between 384 and 576, but it doesn't seem to work efficiently; I tried 8 partitions with the same result. Maybe I have to set other parameters so that my job works better on the cluster rather than on my machine. I should add that the file I analyse has 150k lines.
val data = sc.textFile("/img.csv",384)
The primary effect comes from specifying either too few partitions or far too many partitions.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Between the two, the first is far more impactful on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000; if you have on the order of tens of thousands of partitions, then Spark gets very slow.
Now, considering your case, you are getting the same results from 8 and from 384-576 partitions. Generally, the rule of thumb says:
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
That is, since tasks are processed by CPU cores, you should set the number of partitions to the total number of cores in the cluster minus 1 (reserved for the driver's Application Master). Each core then processes one partition at a time.
So 191 partitions may improve performance. Otherwise, the impact of setting too few or too many partitions is explained at the beginning.
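Applied to the cluster in the question, that rule of thumb would look something like this (a hedged sketch; sc is the same SparkContext as in the question):

// 8 worker nodes x (2 processors x 12 cores) = 192 cores, leaving one for the driver / AM.
val workerNodes        = 8
val coresPerWorkerNode = 2 * 12
val numPartitions      = workerNodes * coresPerWorkerNode - 1   // 191

val data = sc.textFile("/img.csv", numPartitions)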
Hope this will help!!!

Resources