How to design a large-scale VoltDB cluster with dozens of nodes and hundreds of partitions? - voltdb

Suppose I have 32 physical servers, each with a 32-core CPU and 128 GB of memory. I want to build a VoltDB cluster from all 32 servers with K-Safety=2 and 32 partitions on each server, so we would get a VoltDB cluster with 256 available partitions to store data.
That looks like too many partitions to split tables across, especially when some tables don't have many records. But there will be too many copies of a table if we choose to replicate it.
If we start with a much smaller cluster of a couple of servers, the worry is that the cluster will have to scale out soon as the business grows. Actually, I don't know how VoltDB will reorganize data when a cluster expands horizontally to more nodes.
Do you have any comments? Appreciated.

It may be more optimal to set the sitesperhost to less than 32, so that some % of cores are free to run threads for subsystems like export or database replication, or to handle non-VoltDB processes. Typically somewhere from 8 - 24 is the optimal number.
VoltDB creates the logical partitions based on the sitesperhost, the number of hosts, and the kfactor. If you need to scale out later, you can add additional nodes to the cluster which will increase the number of partitions, and VoltDB will gradually and automatically rebalance data from existing partitions to the new ones. You must add multiple servers together if you have kfactor > 0. For kfactor=2, you would add servers in sets of 3 so that they provide their own redundancy for the new partitions.
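As a rough sketch of that arithmetic: assuming the usual rule that every unique partition is stored kfactor + 1 times, the number of unique partitions is the total number of sites divided by kfactor + 1. The function name below is illustrative and not part of any VoltDB API, and the divisibility check is an assumption about how a cluster has to be laid out.

```python
def unique_partitions(hosts: int, sites_per_host: int, k_factor: int) -> int:
    """Estimate unique partitions: total sites / copies kept of each partition."""
    total_sites = hosts * sites_per_host
    copies = k_factor + 1  # each partition is stored K + 1 times
    if total_sites % copies != 0:
        # Assumption: the sites must divide evenly among the copies.
        raise ValueError(
            f"{total_sites} sites cannot be split evenly into {copies} copies"
        )
    return total_sites // copies


# 30 hosts with sitesperhost=24 and kfactor=2
print(unique_partitions(30, 24, 2))  # 240

# Adding one set of 3 servers (kfactor=2) contributes this many new partitions
print(unique_partitions(3, 24, 2))   # 24
```

Under the same assumption, 32 hosts with sitesperhost=32 and kfactor=2 gives 1024 sites, which does not divide evenly by 3, so the sketch raises an error for that layout.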
Your data is distributed across the logical partitions based on a hash of the partition key value of a record, or the corresponding input parameter for routing the execution of a procedure to a partition. In this way, the client application code does not need to be aware of the # of partitions. It doesn't matter so much which partition each record goes to, but you can assume that any records that share the same partition key value will be located in the same partition.
If you choose partition keys well, they should be columns with high cardinality, such as ID columns. This will evenly distribute the data and procedure execution work across the partitions.
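To illustrate why cardinality matters, here is a minimal sketch of hash-based routing. It uses MD5 purely as a stand-in hash; it is not VoltDB's actual hashing scheme, and `partition_for` and `rows_per_partition` are hypothetical helper names.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key value to a partition using a stand-in hash (MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def rows_per_partition(keys, partition_count: int) -> Counter:
    """Count how many rows land on each partition."""
    return Counter(partition_for(k, partition_count) for k in keys)


PARTITIONS = 24

# High cardinality: 100,000 distinct customer IDs
high = rows_per_partition((f"customer-{i}" for i in range(100_000)), PARTITIONS)

# Low cardinality: the same 100,000 rows keyed by one of three country codes
countries = ["US", "DE", "JP"]
low = rows_per_partition((countries[i % 3] for i in range(100_000)), PARTITIONS)

print("partitions used with ID keys:     ", len(high))
print("partitions used with country keys:", len(low))
```

The same key value always maps to the same partition, which is what allows a procedure to be routed to a single partition. With only three distinct key values, at most three partitions can ever hold data, however many partitions the cluster has.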
Typically a VoltDB cluster is sized based on the RAM requirements, rather than the need for performance, since the performance is very high on even a very small cluster.
You can contact VoltDB at info@voltdb.com or ask more questions at http://chat.voltdb.com if you'd like to get help with an evaluation or discuss cluster sizing and planning with an expert.
Disclaimer: I work for VoltDB.

Related

How to determine the number of executors to read a delta table?

I have a delta table which is partitioned by multiple keys, one of which is a date truncated to the hour, with no minute details (for example, Fri, 15 Jul 2022 07).
Now, with data continuously arriving via batch and streaming ingestion workflows, what would be the best strategy to estimate the number of executors needed to read all the data from the delta table?
One very naive way would be to just let Spark autoscale, but we may still need to tune shuffle partitions etc. Looking for hints or best practices around this. Thanks!
If you want to "read all the data from delta table" it does not really matter whether this table is partitioned or not since the query reads all the data and hence loads the whole table.
This is the worst possible query - the dreaded full scan. If it's inevitable, just know that this is the kind of query where Spark SQL shines, utilising the full power of a Spark cluster. You've been warned :)
Executors are simply JVM processes that are given CPU cores and memory. You're probably more interested in the number of CPU cores for all the tasks to load the delta table.
I'd start this calculation with the number of files for a given version of the delta table. Files are of different sizes and (I might be wrong here) they are usually chunked (I don't want to use the overloaded term partitioned here, but that's what springs to mind) into splits of at most spark.sql.files.maxPartitionBytes, which is 128 MB by default.
The number of splits for all the files of a given version of the delta table would be the number of tasks. That would give you the number of CPU cores and hence their "containers", i.e. Spark executors (to evenly saturate available physical resources for the best performance).
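A back-of-the-envelope version of that calculation could look like the sketch below. It is deliberately simplified: it cuts each file into fixed-size splits and ignores that Spark may pack several small files into one task, so treat the result as a rough upper bound. The helper names are hypothetical, and the split size is a parameter (the sketch defaults to 128 MiB, the default of spark.sql.files.maxPartitionBytes).

```python
import math

MIB = 1024 * 1024


def estimate_read_tasks(file_sizes_bytes, max_split_bytes=128 * MIB) -> int:
    """Rough task count: every file contributes at least one split."""
    return sum(
        max(1, math.ceil(size / max_split_bytes)) for size in file_sizes_bytes
    )


def executors_for_one_wave(tasks: int, cores_per_executor: int) -> int:
    """Executors needed to run all tasks at once, one task per core."""
    return math.ceil(tasks / cores_per_executor)


# Illustrative file sizes for one version of the table: 1 GiB, 300 MiB, 64 MiB
files = [1024 * MIB, 300 * MIB, 64 * MIB]

tasks = estimate_read_tasks(files)
print(tasks)                             # 12  (8 + 3 + 1 splits)
print(executors_for_one_wave(tasks, 4))  # 3
```

In practice you rarely need every task to run in a single wave; halving the executor count roughly doubles the number of waves instead.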

How do you efficiently bucket/partition on a shared cluster that autoscales?

Edit: Using Spark with Databricks
As far as I understand, effective partitioning should be based on the number of executors available, ideally partitions % executors = 0
But if you work on a shared Spark cluster that autoscales according to activity, and in which people may be keeping some executors busy with their own work, is it possible to efficiently partition and bucket in this way?
Say I notice there are 8 executors active on the cluster, so I make 8 partitions or buckets to distribute the workload more easily. While that's happening, Alice and Jane log on and start running big queries, so the cluster upscales to, say, 12 executors.
Now I'm no longer efficiently partitioned. Or what if the cluster doesn't upscale, but Alice and Jane take up some executors? Now my partitions will be skewed, right?
Or... will Spark recognise that I have 8 partitions, and upscale as needed to match that if enough aren't immediately available?
The rule partitions % executors = 0 applies to efficient processing, so that you don't end up with fewer partitions than executors at some point in time. In reality, things are more complicated: partitions can be small and then be automatically coalesced when Adaptive Query Execution (AQE) kicks in, combining multiple small partitions into bigger logical partitions, etc. This is one of the "optimizations" on Spark 3.x: set shuffle partitions to some big number and let AQE optimize it, instead of ending up with partitions that are too big.
Yes, on a shared cluster some of the resources could be consumed by other users, but that will just allocate fewer cores to your processing; it will not skew your partitions. Skew primarily refers to partitions of different sizes, and this should also be handled by AQE, which is enabled on DBR 7.3+.
Overall: yes, on shared clusters some resources will be taken by other users, but otherwise it's better to rely on the automatic optimization improvements in Spark 3.x. Previous versions required a lot of manual tuning that isn't needed in newer versions.
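To see why an exact multiple matters less once there are many partitions, here is a small sketch that counts task "waves". It assumes every task takes the same time and that each available core is one task slot; both are simplifications, and the function names are made up for illustration.

```python
import math


def task_waves(partitions: int, slots: int) -> int:
    """Number of rounds needed to run one task per partition."""
    return math.ceil(partitions / slots)


def slot_utilization(partitions: int, slots: int) -> float:
    """Fraction of slot-time spent doing work across all waves."""
    return partitions / (task_waves(partitions, slots) * slots)


for partitions, slots in [(8, 8), (8, 12), (8, 6), (200, 12), (200, 6)]:
    print(
        f"{partitions:>3} partitions on {slots:>2} slots: "
        f"{task_waves(partitions, slots):>2} waves, "
        f"{slot_utilization(partitions, slots):.0%} busy"
    )
```

With 8 partitions, going from 8 slots to 12 or to 6 drops utilization to about two thirds. With 200 partitions the same changes barely matter, which is the idea behind setting a large number of shuffle partitions and letting AQE coalesce them.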

What are Shuffled Partitions?

What is spark.sql.shuffle.partitions in a more technical sense? I have seen answers like the one here, which says: "configures the number of partitions that are used when shuffling data for joins or aggregations."
What does that actually mean? How does shuffling work from node to node differently when this number is higher or lower?
Thanks!
Partitions define where data resides in your cluster. A single partition can contain many rows, but all of them will be processed together in a single task on one node.
Looking at edge cases: if we repartition our data into a single partition, then even if you have 100 executors, it will be processed by only one.
On the other hand, if you have a single executor but multiple partitions, they will all (obviously) be processed on the same machine.
Shuffles happen when one executor needs data from another. A basic example is a groupBy aggregation, as we need all related rows to calculate the result. Irrespective of how many partitions we had before the groupBy, after it Spark will split the results into spark.sql.shuffle.partitions partitions.
Quoting from "Spark: The Definitive Guide" by Bill Chambers and Matei Zaharia:
A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.
So, to sum up, if you set this number lower than your cluster's capacity to run tasks, you won't be able to use all of its resources. On the other hand, since each task runs on a single partition, having thousands of small partitions would (I expect) add some overhead.
spark.sql.shuffle.partitions is the parameter which determines how many blocks your shuffle will be performed in.
Say you had 40 GB of data and had spark.sql.shuffle.partitions set to 400; then your data will be shuffled in blocks of 40 GB / 400, i.e. roughly 100 MB each (assuming your data is evenly distributed).
By changing spark.sql.shuffle.partitions you change the size of the blocks being shuffled and the number of blocks for each shuffle stage.
As Daniel says, a rule of thumb is to never have spark.sql.shuffle.partitions set lower than the number of cores for a job.
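The arithmetic can be sketched as follows, assuming evenly distributed data. The 128 MB target block size in the second helper is only an illustrative choice, not a Spark setting, and both function names are hypothetical.

```python
import math


def shuffle_block_mb(total_gb: float, shuffle_partitions: int) -> float:
    """Average shuffle block size in MB for evenly distributed data."""
    return total_gb * 1024 / shuffle_partitions


def suggest_shuffle_partitions(total_gb: float, target_mb: float, cores: int) -> int:
    """Partitions needed to hit a target block size, never fewer than the cores."""
    by_size = math.ceil(total_gb * 1024 / target_mb)
    return max(by_size, cores)


print(shuffle_block_mb(40, 400))  # 102.4
print(shuffle_block_mb(40, 200))  # 204.8

print(suggest_shuffle_partitions(40, target_mb=128, cores=192))  # 320
print(suggest_shuffle_partitions(1, target_mb=128, cores=192))   # 192
```

The last call shows the rule of thumb in action: a small dataset would need only 8 partitions by size, so the core count becomes the floor.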

Cassandra cluster - Store equal data among the nodes

In a Cassandra cluster, how can we ensure that all nodes hold almost equal amounts of data, instead of one node having much more data and another having very little?
If this scenario occurs, what are the best practices?
Thanks
It is OK to expect a slight variation of 5-10%. The most common causes are that the distribution of your partitions may not be truly random (more partitions on some nodes) and that there may be a large variation in partition size (the smallest partition is a few kilobytes but the largest is 2 GB).
There are also 2 other possible scenarios to consider.
SINGLE-TOKEN CLUSTER
If the tokens are not correctly calculated, some nodes may have a larger token range compared to others. Use the token generation tool to get a list of tokens that is correctly distributed around the ring.
If the cluster is deployed with DataStax Enterprise, the easiest way is to rebalance your cluster with OpsCenter.
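For reference, evenly spaced tokens can also be computed by hand. The sketch below assumes the cluster uses Murmur3Partitioner, whose token range runs from -2^63 to 2^63 - 1; the function name is hypothetical, and other partitioners use a different range.

```python
from typing import List


def murmur3_initial_tokens(node_count: int) -> List[int]:
    """Evenly spaced tokens over the Murmur3Partitioner range."""
    ring_size = 2**64
    return [(ring_size // node_count) * i - 2**63 for i in range(node_count)]


# One token per node for a 4-node, single-token cluster
for node, token in enumerate(murmur3_initial_tokens(4)):
    print(f"node {node}: initial_token = {token}")
```

For 4 nodes this gives -9223372036854775808, -4611686018427387904, 0 and 4611686018427387904, so each node owns a quarter of the ring.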
VNODES CLUSTER
Confirm that you have allocated the same number of tokens on every node with the num_tokens directive in cassandra.yaml.
Unless you are using ByteOrderedPartitioner for your cluster, that should not happen. See the DataStax documentation here for more information about the available partitioners and why it should not (normally) happen.

Setting the right number of partitions on an RDD

I read some comments saying that a good number of partitions for an RDD is 2-3 times the number of cores. I have 8 nodes, each with two 12-core processors, so I have 192 cores. I set the number of partitions between 384 and 576, but it doesn't seem to work efficiently; I tried 8 partitions, with the same result. Maybe I have to set other parameters for my job to work better on the cluster than on my machine. I should add that the file I analyse has 150k lines.
val data = sc.textFile("/img.csv",384)
The main performance effects come from specifying too few partitions or far too many.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Of the two, the first has far more impact on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
Now, considering your case: you are getting the same results with 8 and with 384-576 partitions. Generally, the rule of thumb says:
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
The reasoning is that tasks are processed by CPU cores, so we should set the number of partitions to the total number of cores in the cluster minus 1 (reserved for the Application Master or driver). That way, each core processes one partition at a time.
That means that using 191 partitions may improve the performance. Otherwise, the impact of setting too few or too many partitions is explained at the beginning.
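Plugging the numbers from the question into both rules of thumb gives the following (a plain arithmetic sketch; the function names are only for illustration):

```python
def total_cores(nodes: int, processors_per_node: int, cores_per_processor: int) -> int:
    """Total CPU cores in the cluster."""
    return nodes * processors_per_node * cores_per_processor


def partitions_cores_minus_one(cores: int) -> int:
    """Rule of thumb from this answer: one partition per core, minus one core."""
    return cores - 1


def partitions_two_to_three_times(cores: int) -> tuple:
    """Rule of thumb from the question: 2-3 partitions per core."""
    return (2 * cores, 3 * cores)


cores = total_cores(nodes=8, processors_per_node=2, cores_per_processor=12)
print(cores)                                 # 192
print(partitions_cores_minus_one(cores))     # 191
print(partitions_two_to_three_times(cores))  # (384, 576)
```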
Hope this will help!!!

Resources