Difference between shuffle partition and repartition - apache-spark

I am a newbie in Spark and I am trying to understand the shuffle partitions setting and the repartition function, but I still don't understand how they are different. Do both reduce the number of partitions?
Thank you

The biggest difference between shuffle partitions and repartition is when each one is defined.
spark.sql.shuffle.partitions is a configuration property which, according to the documentation:
Configures the number of partitions to use when shuffling data for joins or aggregations.
That means every time you run a join or any type of aggregation that requires a shuffle, Spark shuffles the data into the number of partitions given by this configuration, whose default value is 200. So if you join two datasets with a shuffle join, the shuffle will produce 200 partitions.
The repartition(numPartitions, *cols) function is something you call explicitly in your code, where you define how many partitions the DataFrame should have: by number, by partition columns, or both. It is commonly used just before writing output. The example in the documentation shows this well.
So in general, shuffle partitions apply to joins and aggregations during execution. Repartition is typically used to control the number of output files, based on a number or on partition columns.
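To make the "by number" versus "by partition column" distinction concrete, here is a toy model in plain Python. It is not Spark code and does not use Spark's actual hash function; the helper names by_number and by_column and the sample rows are made up for illustration. It only mimics the idea that repartition(n) spreads rows evenly, while repartition(n, col) keeps rows with the same column value in the same partition.

```python
def by_number(rows, n):
    """Round-robin rows over n partitions (the idea behind repartition(n))."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def by_column(rows, n, col):
    """Hash rows on one column (the idea behind repartition(n, col))."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # Python's built-in hash stands in for Spark's partitioner here.
        parts[hash(row[col]) % n].append(row)
    return parts


# Hypothetical sample data.
rows = [{"country": c, "amount": a}
        for a, c in enumerate(["US", "DE", "US", "FR", "DE", "US"])]

even = by_number(rows, 3)
keyed = by_column(rows, 3, "country")

print([len(p) for p in even])  # [2, 2, 2]
for i, p in enumerate(keyed):
    # Each country shows up in exactly one partition.
    print(i, sorted({r["country"] for r in p}))
```

Partitioning by column can leave partitions uneven or empty, which is the price for keeping equal keys together.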

Related

What is the difference between spark.shuffle.partition and spark.repartition in spark?

What I understand is:
When we repartition a dataframe with value n, the data stays in those n partitions until we hit a shuffle stage or another repartition or coalesce with a different value.
The shuffle setting only comes into play when we hit a shuffle stage, and the data then stays in those partitions until we hit a coalesce or repartition.
Am I right?
If yes, can anyone point out a striking difference?
TL;DR: repartition is invoked when the developer needs it, whereas a shuffle happens when an operation logically demands it.
I assume you're talking about the config property spark.sql.shuffle.partitions and the method .repartition.
Data distribution is an important aspect of any distributed environment: it governs parallelism, and an uneven distribution can hurt performance. However, repartitioning is itself a costly operation, as it involves heavy movement of data (i.e. shuffling). The .repartition method is used to explicitly repartition the data into new partitions, meaning to increase or decrease the number of partitions in the program based on your need. You can invoke it whenever you want.
As opposed to this, spark.sql.shuffle.partitions is a configuration property that governs the number of partitions created when a data movement happens as a result of operations like aggregations and joins.
Configures the number of partitions to use when shuffling data for
joins or aggregations.
When you're performing transformations other than join or aggregation, the above configuration won't have any impact on the number of partitions the new Dataframe will have.
Your confusion between the two comes from both operations involving a shuffle. While that is true, the former (i.e. repartition) is an explicit operation where the user directs the framework to increase or decrease the number of partitions, which in turn causes shuffling, while in the case of joins/aggregations the shuffling is caused by the operation itself.
Basically:
Joins/aggregations cause shuffling, which causes repartitioning
repartition is requested, so shuffling has to be done
Another method, coalesce, makes the difference clearer.
For reference, coalesce is a variant of repartition that can only lower the number of partitions, and the resulting partitions are not necessarily equal in size. Because it knows the number of partitions can only decrease, it can do so with minimal shuffling (it just merges adjacent partitions until the target number is met).
Consider your dataframe has 4 partitions but has data only in 2 of them, thus you decide to reduce the number of partitions to 2. When using coalesce spark tries to achieve this without shuffling or with minimal shuffling.
df.rdd().getNumPartitions(); // Returns 4, with sizes 0, 0, 2, 4
df = df.coalesce(2); // Decrease partitions to 2
df.rdd().getNumPartitions(); // Returns 2; existing partitions were merged (e.g. sizes 2 and 4), the exact grouping is up to Spark
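So there was no shuffling involved. Compare that with the following:
df1.rdd().getNumPartitions(); // Returns 4
df2.rdd().getNumPartitions(); // Returns 8
df1.join(df2, "id").rdd().getNumPartitions(); // Returns 200 for a shuffle join ("id" is an assumed join column)
Because the join shuffles the data, the result gets the number of partitions set by spark.sql.shuffle.partitions (200 by default). A broadcast join, or adaptive query execution in Spark 3.x, can produce a different number.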
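The rules from this answer can be summarised as a small partition-count tracker. This is a plain-Python sketch of the reasoning, not Spark's API: the function name partitions_after and the operation labels are invented for illustration, and it deliberately ignores broadcast joins and adaptive query execution, both of which can change the numbers in a real job.

```python
def partitions_after(ops, start, shuffle_partitions=200):
    """Follow the partition count through a list of (operation, argument) pairs."""
    n = start
    for op, arg in ops:
        if op in ("map", "filter"):        # narrow: count is unchanged
            pass
        elif op in ("join", "groupBy"):    # shuffle caused by the operation itself
            n = shuffle_partitions
        elif op == "repartition":          # shuffle requested by the developer
            n = arg
        elif op == "coalesce":             # can only lower the count
            n = min(n, arg)
        else:
            raise ValueError(f"unknown operation: {op}")
    return n


print(partitions_after([("filter", None), ("join", None)], start=4))      # 200
print(partitions_after([("repartition", 16), ("map", None)], start=4))    # 16
print(partitions_after([("groupBy", None), ("coalesce", 30)], start=20))  # 30
print(partitions_after([("coalesce", 10)], start=4))                      # 4
```

The last call shows why coalesce cannot be used to increase parallelism: asking for more partitions than exist leaves the count where it was.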

How to distribute data into X partitions on read with Spark?

I’m trying to read data from Hive with a Spark DataFrame and distribute it into a specific, configurable number of partitions (in proportion to the number of cores). My job is pretty straightforward and does not contain any joins or aggregations. I’ve read about the spark.sql.shuffle.partitions property, but the documentation says:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Does this mean that it would be irrelevant for me to configure this property? Or is the read operation considered a shuffle? If not, what is the alternative? Repartition and coalesce seem like a bit of overkill for that.
To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.
In that case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum number of bytes packed into a single partition when reading files (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html).
The default value is 128 MB; lowering it produces more partitions and so improves parallelism.
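As a rough illustration of how that setting translates into a partition count, here is a plain-Python estimate. It is a simplified sketch based on my understanding of how Spark's file-based sources pick a split size, not Spark code: the function name estimate_read_splits and the default_parallelism value of 8 are assumptions, it uses the documented defaults for spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB), and it counts splits before Spark packs small ones together, so treat the result as an upper bound. It may also not apply when a Hive table is read through the Hive SerDe path rather than Spark's native file reader.

```python
MB = 1024 * 1024


def estimate_read_splits(file_sizes, max_partition_bytes=128 * MB,
                         open_cost_bytes=4 * MB, default_parallelism=8):
    """Return (split size in bytes, number of splits) for splittable files."""
    # Each file is padded with an "open cost" so that many tiny files count too.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    # Ceiling division per file: a file is cut into chunks of at most `split`.
    splits = sum(-(-size // split) for size in file_sizes)
    return split, splits


one_big_file = [10 * 1024 * MB]  # a single 10 GB file

print(estimate_read_splits(one_big_file)[1])                               # 80
print(estimate_read_splits(one_big_file, max_partition_bytes=32 * MB)[1])  # 320
```

Under these assumptions, lowering the setting from 128 MB to 32 MB quadruples the number of read partitions without any repartition call.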
A read is not a shuffle as such. You need to get the data in at some stage.
You can use the setting from the other answer; otherwise Spark's own algorithm decides the number of partitions on read.
You do not state whether you are using an RDD or a DataFrame. With an RDD you can set the number of partitions when reading. With a DataFrame you generally need to repartition after the read.
As you note, your point about controlling parallelism is less relevant when joining or aggregating.

How does Spark decide the partitions number of the next stage when shuffle in SparkSQL?

Of course I know the spark.sql.shuffle.partitions config,
but for example, when I set this config to 300 on a small dataset that has just 200 rows, the config has no effect and the actual partition number is just 2.
Another example: I set this config to 3000 on a dataset that has 30 billion rows, and the config has no effect either; the actual partition number is just 600.
So when we set a large partition count on a small dataset, the config seems to be ignored.
I just want to know: how does Spark decide the number of partitions of the next stage when shuffling in Spark SQL? Or how can I force this config to take effect?
My Spark SQL is just like below:
set spark.sql.shuffle.partitions=3000;
with base_data as (
    select
        device_id
    from
        table_name
    where
        dt = '20210621'
    distribute by
        rand()
)
select count(1) from base_data
In general, a narrow transformation does not change the number of partitions.
A wide transformation may change the number of partitions.
Narrow transformation: all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of map() and filter().
Wide transformation: the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey and reduceByKey.
Update after question change:
You can think of "spark.sql.shuffle.partitions" as a query hint that asks the executors to produce that number of partitions for joins or aggregations. In my view we should not play with this value unless we are very sure how many grouping keys there will be.
Otherwise it causes unnecessary shuffling of data over the network.

spark shuffle partitions with coalesce

Let's say I have a dataset with 20 partitions after reading some data. Then I do an aggregate operation on that dataset, which would make the number of partitions 200 (because of the default shuffle partitions setting). Now, without calling any action on that dataset so far, I apply coalesce on the same dataset with 30 partitions and then call some Spark action on it.
So my question is: how many partitions will be used while the aggregate operation runs? Will it be 30 partitions (because that was the number given to coalesce) or 200 shuffle partitions?
Editing to provide more clarification on my question:
I understand that the coalesce operation in itself will not shuffle unless we drastically change the number of partitions. I also understand that the final dataset will have numPartitions partitions. My question is: if I change the number of partitions before calling any action on that dataframe, will the resulting action operate on the final number of partitions we gave (in my case 30), or will it also honor the intermediate number of partitions from the aggregate operation? In short, I want to know whether the aggregation will be done with 200 partitions and then coalesce applied, or whether the aggregation will also be performed with only 30 (in my case) partitions.
Yes, your final action will operate on the partitions generated by coalesce, which in your case is 30.
As we know, there are two types of transformation: narrow and wide.
Narrow transformations don't shuffle and don't repartition, whereas wide transformations shuffle the data between nodes and generate new partitions.
coalesce (without shuffle) is a narrow transformation, so it does not create a new stage. It is folded into the stage that reads the aggregation's shuffle output, and that stage runs with the number of partitions given to coalesce.
So yes, your action is going to work on 30 partitions.
https://www.google.com/amp/s/data-flair.training/blogs/spark-rdd-operations-transformations-actions/amp/
Coalesce
Returns a new SparkDataFrame that has exactly numPartitions
partitions. This operation results in a narrow dependency, e.g. if you
go from 1000 partitions to 100 partitions, there will not be a
shuffle, instead each of the 100 new partitions will claim 10 of the
current partitions. If a larger number of partitions is requested, it
will stay at the current number of partitions.
However, if you're doing a drastic coalesce on a SparkDataFrame, e.g.
to numPartitions = 1, this may result in your computation taking place
on fewer nodes than you like (e.g. one node in the case of
numPartitions = 1). To avoid this, call repartition. This will add a
shuffle step, but means the current upstream partitions will be
executed in parallel (per whatever the current partitioning is).
https://spark.apache.org/docs/2.2.1/api/R/coalesce.html
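The "each of the 100 new partitions will claim 10 of the current partitions" behaviour quoted above can be modelled in a few lines of plain Python. This is a toy sketch, not Spark's implementation: the function name coalesce and the list-of-lists representation of partitions are assumptions made for illustration, and real Spark also takes data locality into account when grouping.

```python
def coalesce(partitions, n):
    """Merge adjacent partitions into n groups without redistributing rows."""
    current = len(partitions)
    if n >= current:  # asking for more partitions than exist is a no-op
        return partitions
    merged = []
    for i in range(n):
        start = i * current // n
        end = (i + 1) * current // n
        # A new partition simply claims a run of existing partitions.
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


parts = [[p] * 3 for p in range(1000)]  # 1000 partitions with 3 rows each
out = coalesce(parts, 100)

print(len(out), len(out[0]))       # 100 30
print(len(coalesce(parts, 5000)))  # 1000
```

Rows never leave the group of partitions they started in, which is why this is a narrow dependency rather than a shuffle.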
Coalesce: Shuffle the data into existing number of partitions.
https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4#.36o8a7b5j

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X partitions (numAPartitions), and I union it with DataFrame 'B' (dfB), which has Y partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like with respect to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells Spark to keep the DataFrame in memory (spilling to disk if needed), so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, Partitions are important for spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ) if you are going to use it multiple times. Doing so prevents Spark from recomputing it each time and can increase the performance of your application by 15% in some cases!
For example, if you are going to use the resulting RDD just once, it is safe not to persist it.
Do I have to repartition?
Since you don't care about finding the number of partitions, you can read in my memoryOverhead issue in Spark about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors, you can think of it like the worker to be the machine/node of your cluster and the executor to be a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely about Spark: when you handle big data you don't want unnecessary things sitting in memory, since that threatens the stability of your application.
A DataFrame can be stored in temporary files that spark creates for you, and is loaded in the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just adds up the number of partitions in dataframe 1 and dataframe 2. Both dataframes must have the same number of columns, in the same order, to perform the union. So no worries: even if the partition columns differ between the two dataframes, there will be at most m + n partitions.
You don't need to repartition your dataframe after the union. My suggestion is to use coalesce in place of repartition: coalesce merges small partitions together and avoids or reduces shuffling data between partitions.
If you cache/persist the dataframe after each union, you will reduce performance, and cache/persist does not break the lineage. Under memory-intensive operations, cached blocks can be evicted from memory, and recomputing them increases computation time, although only the evicted data may need to be recomputed.
Spark transformations are lazy: unionAll, coalesce and repartition are all lazy and only run when the first action is called. So try to coalesce the unionAll result at intervals, for example after every 8 unions, to reduce the number of partitions in the resulting dataframe. Use checkpoints to break the lineage and store the data if there are lots of memory-intensive operations in your solution.
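To see why coalescing at intervals helps, the partition count during a chain of unions can be tracked with a plain-Python sketch. This is not Spark code: the function name union_all, the interval of 8 (taken from the suggestion above) and the target of 16 partitions are illustrative assumptions.

```python
def union_all(partition_counts, coalesce_every=8, target=16):
    """Track the partition count while unioning DataFrames one by one."""
    total = 0
    history = []
    for i, count in enumerate(partition_counts, start=1):
        total += count                   # a union adds the partition counts
        if i % coalesce_every == 0:      # periodically shrink the result
            total = min(total, target)   # coalesce can only lower the count
        history.append(total)
    return history


# Ten hypothetical DataFrames with 4 partitions each.
print(union_all([4] * 10))  # [4, 8, 12, 16, 20, 24, 28, 16, 20, 24]

# Without the periodic coalesce the count keeps growing.
print(union_all([4] * 10, coalesce_every=1000)[-1])  # 40
```

In a long-running job that unions hundreds of small DataFrames, the uncoalesced count grows linearly, which means many tiny tasks per stage.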
