Setting the right number of partitions for an RDD - apache-spark

I have read comments saying that a good number of partitions for an RDD is 2-3 times the number of cores. I have 8 nodes, each with two 12-core processors, so I have 192 cores. I set the number of partitions to between 384 and 576, but it doesn't seem to work efficiently; I tried 8 partitions, same result. Maybe I have to set other parameters so that my job works better on the cluster than on my machine. I should add that the file I analyse is about 150k lines.
val data = sc.textFile("/img.csv", 384)

The primary effect comes from specifying too few partitions or far too many partitions.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Between the two, the first is far more impactful on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
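For reference, a quick sketch of how you could inspect and adjust the partition count on the data RDD from the question (the counts here are illustrative):

println(data.getNumPartitions)     // how many partitions Spark actually created

val fewer = data.coalesce(48)      // shrink the count without a full shuffle (illustrative value)
val more  = data.repartition(384)  // full shuffle to an explicit partition count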
Now, considering your case: you are getting the same results with 8 and with 384-576 partitions. The general rule of thumb says:
NoOfPartitions = (NumberOfWorkerNodes * NoOfCoresPerWorkerNode) - 1
That is, since tasks are processed by CPU cores, you should set the number of partitions to the total number of cores in the cluster, minus 1 (reserved for the Application Master / driver). That way each core processes one partition at a time.
So for your cluster, 191 partitions may improve performance. Otherwise, the impact of setting fewer or more partitions is explained at the beginning.
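A minimal sketch of that rule with the numbers from the question (8 nodes with 24 cores each, reusing the data RDD defined above):

val numWorkerNodes = 8
val coresPerWorkerNode = 24
val noOfPartitions = numWorkerNodes * coresPerWorkerNode - 1   // 191

val tuned = data.repartition(noOfPartitions)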
Hope this will help!!!

Related

How do you efficiently bucket/partition on a shared cluster that autoscales?

Edit: Using Spark with Databricks
As far as I understand, effective partitioning should be based on the number of executors available, ideally partitions % executors = 0
But if you work on a shared Spark cluster that autoscales according to activity, and in which people may be keeping some executors busy with their own work, is it possible to efficiently partition and bucket in this way?
Say I notice there are 8 executors active on the cluster, so I make 8 partitions or buckets to distribute the workload more easily. While that's happening, Alice and Jane log on and start running big queries, so the cluster upscales to, say, 12 executors.
Now I'm no longer efficiently partitioned. Or what if the cluster doesn't upscale, but Alice and Jane take up some executors? Now my partitions will be skewed, right?
Or... will Spark recognise that I have 8 partitions, and upscale as needed to match that if enough aren't immediately available?
The rule partitions % executors = 0 is about efficient processing, so that you don't end up with fewer partitions than executors at some point in time. In reality, things are more complicated: partitions could be small and then automatically coalesced when Adaptive Query Execution (AQE) kicks in, combining multiple small partitions into bigger logical partitions, etc. One of the recommended "optimizations" on Spark 3.x is to set shuffle partitions to some big number and let AQE optimize it, instead of ending up with partitions that are too big.
Yes, on a shared cluster some resources could be consumed by other users, but that just leaves fewer cores for your processing; it doesn't skew your partitions. Skewed partitions are primarily about partitions of different sizes, and that should also be handled by AQE, which is enabled on DBR 7.3+.
Overall: yes, on shared clusters some resources will be taken by other users, but beyond that it's better to rely on the Spark 3.x improvements in automatic optimization. Previous versions required a lot of manual tuning that isn't needed in newer versions.
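For reference, a minimal sketch of the Spark 3.x / AQE settings mentioned above (the specific values are illustrative, not recommendations):

spark.conf.set("spark.sql.adaptive.enabled", "true")                      // on by default in recent releases
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   // merge small shuffle partitions at runtime
spark.conf.set("spark.sql.shuffle.partitions", "2000")                    // deliberately high starting point
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m") // target size AQE coalesces towards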

Decide the number of partitions in Spark (running on YARN) based on executors, cores and memory

How do you decide the number of partitions in Spark (running on YARN) based on executors, cores and memory?
As I am new to Spark, I don't have much hands-on experience with real scenarios.
I know there are many things to consider when deciding the partition count, but a detailed explanation of a typical production scenario would still be very helpful.
Thanks in advance
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
The number of partitions is recommended to be 2-4 times the number of cores.
So if you have 7 executors with 5 cores each, you can repartition to between 7 * 5 * 2 = 70 and 7 * 5 * 4 = 140 partitions.
https://spark.apache.org/docs/latest/rdd-programming-guide.html
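A rough sketch with those numbers (7 executors with 5 cores each, taken from the example above; data stands for any RDD or DataFrame):

val numExecutors = 7
val coresPerExecutor = 5
val minPartitions = numExecutors * coresPerExecutor * 2   // 70
val maxPartitions = numExecutors * coresPerExecutor * 4   // 140

val repartitioned = data.repartition(minPartitions)       // pick a value in the 70-140 range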
IMO, with Spark 3.0 and AWS EMR 2.4.x with adaptive query execution, you're often better off letting Spark handle it. If you do want to hand-tune it, the answer can often be complicated. One good option is to have 2 or 4 times the number of CPUs available. While this is useful for most data sizes, it becomes problematic with very large and very small datasets. In those cases it's useful to aim for ~128 MB per partition.
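If you do target ~128 MB per partition, a hedged sketch of the arithmetic (the input size is an assumed example; in practice you would take it from your storage metadata, and df stands for any DataFrame):

val inputSizeBytes = 50L * 1024 * 1024 * 1024     // assume ~50 GB of input
val targetPartitionBytes = 128L * 1024 * 1024     // ~128 MB per partition
val numPartitions = math.max(1, (inputSizeBytes / targetPartitionBytes).toInt)   // 400
val resized = df.repartition(numPartitions)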

What is a performant partitioning strategy for key-agnostic mapping?

First of all, I'm working with PySpark on Glue and I'm reading several very large CSV files. Those CSVs are bzip2-compressed and several GB each once inflated.
At this stage of processing I'm only performing a simple map over all rows. No joins, group bys, filtering. Just a map.
Let's say I am working on 10 nodes. Generally speaking, would it be preferable to have a rather high number of partitions or a rather low number?
I would guess that, independent of the cores available on those nodes, that number should be pretty high to make sure that every executor is busy at all times, with small chunks of data available.
So, let's say there are 20 cores on those 10 nodes, and let's for a second assume there are key-based partitions; then something larger than 40 would likely not be a good idea. But in the key-agnostic mapping case I'd tend towards something like 1000 partitions or more.
Does that make sense? I'm especially interested in the thought process here.
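For concreteness, a minimal Scala sketch of the kind of job being discussed (the question itself uses PySpark on Glue; the path, partition count and the map itself are placeholders):

val raw = sc.textFile("s3://some-bucket/input/")     // hypothetical location of the bzip2 CSVs
val spread = raw.repartition(1000)                   // deliberately high partition count, as discussed
val mapped = spread.map(line => line.toUpperCase)    // stands in for the real row-wise map
mapped.saveAsTextFile("s3://some-bucket/output/")    // hypothetical output path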

What are Shuffled Partitions?

What is spark.sql.shuffle.partitions in a more technical sense? I have seen answers like here which say: "configures the number of partitions that are used when shuffling data for joins or aggregations."
What does that actually mean? How does shuffling work from node to node differently when this number is higher or lower?
Thanks!
Partitions define where data resides in your cluster. A single partition can contain many rows, but all of them will be processed together in a single task on one node.
Looking at edge cases: if we repartition our data into a single partition, even if you have 100 executors, it will be processed by only one.
On the other hand, if you have a single executor but multiple partitions, they will (obviously) all be processed on the same machine.
Shuffles happen when one executor needs data from another - a basic example is the groupBy aggregation operation, since we need all related rows to calculate the result. Irrespective of how many partitions we had before the groupBy, after it Spark will split the results into spark.sql.shuffle.partitions partitions.
Quoting after "Spark - the definitive guide" by Bill Chambers and Matei Zaharia:
A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.
So, to sum up: if you set this number lower than your cluster's capacity to run tasks, you won't be able to use all of its resources. On the other hand, since each task runs on a single partition, having thousands of small partitions would (I expect) add some overhead.
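A small sketch of that behaviour (the column names and values are made up, and it assumes a spark-shell style session with spark in scope; AQE is disabled so the configured value is not coalesced away):

import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "50")

val df = spark.range(1000000).withColumn("key", col("id") % 100)
val grouped = df.groupBy("key").count()

println(df.rdd.getNumPartitions)        // however many partitions the input had
println(grouped.rdd.getNumPartitions)   // 50 - the configured shuffle partition count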
spark.sql.shuffle.partitions is the parameter which determines how many blocks your shuffle will be performed in.
Say you had 40 GB of data and spark.sql.shuffle.partitions set to 400; then your data will be shuffled in blocks of 40 GB / 400 = 100 MB (assuming your data is evenly distributed).
By changing the spark.sql.shuffle.partitions you change the size of blocks being shuffled and the number of blocks for each shuffle stage.
As Daniel says, a rule of thumb is to never set spark.sql.shuffle.partitions lower than the number of cores for a job.
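A hedged sketch of that sizing arithmetic (the data size and core count are assumptions you would replace with your own measurements):

val totalShuffleBytes = 40L * 1024 * 1024 * 1024    // the 40 GB from the example
val targetBlockBytes = 100L * 1024 * 1024           // aim for roughly 100 MB per shuffle block
val totalCores = 192                                // assumed total cores in the cluster
val shufflePartitions = math.max(totalCores, (totalShuffleBytes / targetBlockBytes).toInt)
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)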

Spark: understanding partitioning - cores

I'd like to understand partitioning in Spark.
I am running spark in local mode on windows 10.
My laptop has 2 physical cores and 4 logical cores.
1/ Terminology: to me, a core in Spark = a thread. So a core in Spark is different from a physical core, right? A Spark core is associated with a task, right?
If so, since you need a thread per partition, if my Spark SQL DataFrame has 4 partitions, it needs 4 threads, right?
2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?
3/ Setting the number of partitions : how to choose the number of partitions of my dataframe, so that further transformations and actions run as fast as possible?
-Should it have 4 partitions since my laptop has 4 logical cores?
-Is the number of partitions related to physical cores or logical cores?
-In the Spark documentation, it's written that you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be 4 or 6?
(I know that number of partitions will not have much effect on local mode, but this is just to understand)
There's no such thing as a "Spark core". If you are referring to options like --executor-cores, then yes, that refers to how many tasks each executor will run concurrently.
You can set the number of concurrent tasks to whatever you want, but more than the number of logical cores you have probably won't give any advantage.
The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use just below a multiple of your total cores; for example, if you have 16 cores, numbers like 47, 79, 127 (just under a multiple of 16) are good to use. The reason is that you want all cores working (with as little time as possible spent with resources idle, waiting for others to finish), but you leave a little slack to allow for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it goes faster on a second try).
Picking the number is a bit of trial and error, though. Take advantage of the Spark UI to monitor how your tasks are running. Having few tasks with many records each means you should probably increase the number of partitions; on the other hand, many partitions with only a few records each is also bad, and you should try to reduce the partitioning in these cases.
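A small sketch of that heuristic (the core count and multiplier are assumptions, and df stands for any DataFrame or RDD):

val totalCores = 16
val multiplier = 3                               // somewhere in the usual 2-4x range
val numPartitions = totalCores * multiplier - 1  // 47: just under a multiple of 16
val tuned = df.repartition(numPartitions)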
