I have a simple question about a Spark transformation function.
coalesce(numPartitions) - Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
val dataRDD = sc.textFile("/user/cloudera/inputfiles/records.txt")
val filterRDD = dataRDD.filter(record => record.split(",")(0) == "USA")   // assuming comma-delimited records; compare the first field
val resizeRDD = filterRDD.coalesce(50)
val result = resizeRDD.collect
My questions are:
Is it true that coalesce(numPartitions) will remove the empty partitions from filterRDD?
Does coalesce(numPartitions) undergo shuffling or not?
The coalesce transformation is used to reduce the number of partitions. coalesce should be used when the number of output partitions is less than the number of input partitions. It can trigger a shuffle depending on the shuffle flag, which is disabled by default (i.e. false).
If the requested number of partitions is larger than the current number and you call coalesce without shuffle = true, the number of partitions remains unchanged. coalesce doesn't guarantee that empty partitions will be removed. For example, if you have 20 empty partitions and 10 partitions with data, there will still be empty partitions after you call rdd.coalesce(25). If you use coalesce with shuffle set to true, it is equivalent to the repartition method and the data will be evenly distributed across the partitions.
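A quick way to see this in spark-shell; a minimal sketch (the values and partition counts are illustrative, not taken from the question's data):
// Build an RDD with 30 partitions, then filter so most partitions end up empty.
val rdd = sc.parallelize(1 to 1000, 30)
val filtered = rdd.filter(_ < 10)
// Reducing the partition count works, but empty partitions can survive the merge.
filtered.coalesce(25).getNumPartitions                  // 25, many of them still empty
// Without shuffle = true, coalesce cannot increase the partition count.
filtered.coalesce(50).getNumPartitions                  // stays at 30
// With shuffle = true, coalesce behaves like repartition and redistributes the data.
filtered.coalesce(50, shuffle = true).getNumPartitions  // 50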
Related
Let's say I read some data into a dataset that has 20 partitions. Then I perform an aggregate operation on that dataset, which brings the number of partitions to 200 (because of the default shuffle partitions setting). Now, without having called any action on that dataset so far, I apply coalesce(30) to the same dataset and then call some Spark action on it.
So my question is: how many partitions will be in use while that dataset runs its aggregate operation? Will it be 30 partitions (because that was the number given to coalesce), or the 200 shuffle partitions?
Editing to provide more clarification on my question:
I understand that the coalesce operation by itself will not shuffle unless we drastically change the number of partitions. I also understand that the final dataset will have numPartitions partitions, but my question is: if I change the number of partitions before calling any action on that dataframe, will the resulting action operate on the final number of partitions we gave (in my case 30), or will it also honor the intermediate partition count given to the aggregate operation? In short, I mainly want to know whether the aggregation will be done with 200 partitions and then coalesce will be applied, or whether the aggregation will also be performed with only 30 (in my case) partitions.
Yes, your final action will operate on the partitions generated by coalesce, in your case 30.
As we know, there are two types of transformations: narrow and wide.
Narrow transformations don't shuffle or repartition the data, whereas wide transformations shuffle data between nodes and generate new partitions.
The aggregation is the wide transformation here: it creates a new stage whose input is the shuffle output. A coalesce without shuffle is a narrow operation, so it collapses into that post-shuffle stage, and the stage reading the aggregation output runs with 30 tasks rather than 200.
So yes, your actions are going to work on 30 partitions.
https://www.google.com/amp/s/data-flair.training/blogs/spark-rdd-operations-transformations-actions/amp/
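A minimal sketch of the scenario, assuming a DataFrame df with a column named key (both hypothetical); the per-stage task count is visible in the Spark UI:
// The groupBy introduces a shuffle that writes spark.sql.shuffle.partitions
// (200 by default) shuffle partitions; coalesce(30) is applied lazily on top.
val aggregated = df.groupBy("key").count().coalesce(30)
// Triggering an action: the stage that reads the aggregation output runs
// with 30 tasks, because the non-shuffling coalesce collapses into it.
aggregated.count()
aggregated.rdd.getNumPartitions   // 30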
Coalesce
Returns a new SparkDataFrame that has exactly numPartitions partitions. This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
However, if you're doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
https://spark.apache.org/docs/2.2.1/api/R/coalesce.html
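A short spark-shell sketch of the "drastic coalesce" case described above (the 1000-partition DataFrame is just an illustration):
// A DataFrame backed by 1000 partitions.
val wide = sc.parallelize(1 to 1000000, 1000).toDF()
// Narrow dependency: no shuffle, but all upstream work now runs in a single task.
val collapsed = wide.coalesce(1)
// Adds a shuffle step, but the upstream partitions are still computed in parallel
// before the data is moved into one partition.
val reshuffled = wide.repartition(1)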
Coalesce: moves data into the existing partitions, avoiding a full shuffle.
https://medium.com/#mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4#.36o8a7b5j
Assume I have an 8-node Spark cluster with 8 partitions (i.e. each node holds 1 partition).
Now if I try to reduce the number of partitions to 4 by using coalesce(4):
1. Will coalesce perform a shuffle?
2. If yes, then on which nodes will the newly created 4 partitions reside?
If you check the Spark API documentation for coalesce, the signature is the following:
coalesce(int numPartitions, boolean shuffle, scala.math.Ordering<T> ord)
By default the shuffle flag is false; repartition calls the same method with the shuffle flag set to true. With this info, let us answer your question.
Reducing the number of partitions from 8 to 4 does not require a shuffle: coalesce(4) builds each new partition out of existing ones (preferring partitions that are local to the same node), so the partition count does change to 4 and the data largely stays where it already is. It is only increasing the number of partitions that will not happen while the shuffle flag is false. Hope this helps.
Cheers!
Coalesce by default has the shuffle flag set to false.
If you have to increase the number of partitions, you can either use coalesce with the shuffle flag set to true (with false, the partition count remains unchanged) or use repartition.
If you are decreasing partitions, it is better to use coalesce with the flag set to false, as it avoids a full shuffle, unlike repartition where shuffling is guaranteed.
Coalesce with shuffle = false moves data from one partition to another existing partition, thereby avoiding a full shuffle and giving better performance.
Say, data from partitions 5, 6, 7, 8 will be moved to the existing partitions 1, 2, 3, 4 rather than shuffling the data of all 8 partitions.
Which node the data resides on is decided by the partitioner you are using.
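A small sketch of that merge on a local RDD (the element counts are illustrative; the exact grouping of the old partitions depends on locality):
val rdd = sc.parallelize(1 to 80, 8)       // 8 partitions, 10 elements each
val merged = rdd.coalesce(4)               // shuffle = false by default
merged.getNumPartitions                    // 4
// glom() exposes the contents of each partition: every new partition is a
// union of existing ones, so no full shuffle was needed.
merged.glom().map(_.length).collect()      // e.g. Array(20, 20, 20, 20)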
coalesce(numPartitions) - used to reduce the number of partitions without shuffling
coalesce(numPartitions, shuffle = false) - Spark won't perform any shuffling because of the shuffle = false option; it can only reduce the number of partitions
coalesce(numPartitions, shuffle = true) - Spark will perform shuffling because of the shuffle = true option; it can both reduce and increase the number of partitions
Example:
Assume an RDD with 8 partitions initially
rdd.coalesce(4) - results in 4 partitions as output
rdd.coalesce(4, false) - results in 4 partitions as output
rdd.coalesce(10, false) - results in 8 partitions as output (shuffle = false can reduce the number of partitions but cannot increase it)
rdd.coalesce(4, true) - results in 4 partitions as output
rdd.coalesce(10, true) - results in 10 partitions as output (shuffle = true can increase the number of partitions)
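The same behaviour can be checked quickly in spark-shell; a sketch using an 8-partition RDD as in the example:
val rdd = sc.parallelize(1 to 100, 8)                // 8 partitions initially
rdd.coalesce(4).getNumPartitions                     // 4
rdd.coalesce(4, shuffle = false).getNumPartitions    // 4
rdd.coalesce(10, shuffle = false).getNumPartitions   // 8  (cannot increase without shuffle)
rdd.coalesce(4, shuffle = true).getNumPartitions     // 4
rdd.coalesce(10, shuffle = true).getNumPartitions    // 10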
In many posts there is a statement, in some form or another like the one shown below, in answer to questions about shuffling or partitioning due to a JOIN, an aggregation, or whatever:
... In general whenever you do a spark sql aggregation or join which shuffles data this is the number of resulting partitions = 200.
This is set by spark.sql.shuffle.partitions. ...
So, my question is:
Do we mean that if we have set partitioning at 765 for a DF, for example,
the processing occurs against 765 partitions, but the output is coalesced / re-partitioned to the standard 200 - referring here to the word resulting?
Or does it do the processing using 200 partitions after coalescing / re-partitioning to 200 partitions before the JOIN or aggregation?
I ask as I never see a clear viewpoint.
I did the following test:
// genned a DS of some 20M short rows
df0.count
val ds1 = df0.repartition(765)
ds1.count
val ds2 = df0.repartition(765)
ds2.count
sqlContext.setConf("spark.sql.shuffle.partitions", "765")
// The above not included on 1st run, the above included on 2nd run.
ds1.rdd.partitions.size
ds2.rdd.partitions.size
val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined.rdd.partitions.size
joined.count
joined.rdd.partitions.size
On the 1st test - without setting sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and the resulting number of partitions was 200. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
On the 2nd test - with sqlContext.setConf("spark.sql.shuffle.partitions", "765") set - the processing and the resulting number of partitions was 765. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
It is a combination of both your guesses.
Assume you have a set of input data with M partitions and you set shuffle partitions to N.
When executing a join, Spark reads your input data in all M partitions and re-shuffles the data based on the key into N partitions. Imagine a trivial hash partitioner: the hash function applied to the key pretty much looks like A = hashcode(key) % N, and the data is then re-allocated to the node in charge of handling the Ath partition. Each node can be in charge of handling multiple partitions.
After shuffling, the nodes aggregate the data in the partitions they are in charge of. As no additional shuffling needs to be done here, the nodes can produce the output directly.
So in summary, your output will be coalesced to N partitions; however, it is coalesced because it is processed in N partitions, not because Spark applies one additional shuffle stage to specifically repartition your output data to N.
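A rough sketch of that hash-partitioning idea (simplified; Spark's HashPartitioner additionally keeps the modulo result non-negative, as below, and treats null keys specially; the key value is hypothetical):
// Which of the N shuffle partitions a given key is routed to,
// in the spirit of A = hashcode(key) % N from the answer above.
def targetPartition(key: Any, numShufflePartitions: Int): Int = {
  val raw = key.hashCode % numShufflePartitions
  if (raw < 0) raw + numShufflePartitions else raw   // keep the result in [0, N)
}
targetPartition("2017-01-01 00:00:01", 765)          // some partition id in [0, 765)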
spark.sql.shuffle.partitions is the parameter which decides the number of partitions when doing shuffles such as joins or aggregations, i.e. where data moves across the nodes. The other parameter, spark.default.parallelism, is calculated based on your data size and the maximum block size; in HDFS it's 128 MB. So if your job does not do any shuffle, it will use the default parallelism value, or if you are using an RDD you can set it yourself. Where shuffling happens, it will take 200.
val df = sc.parallelize(List(1, 2, 3, 4, 5), 4).toDF()
df.count()            // this will use 4 partitions
val df1 = df
df1.except(df).count  // will generate 200 partitions, across 2 stages
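If you want that shuffle to use a different partition count, change the setting before triggering the action; a minimal sketch (note that in newer Spark versions adaptive query execution may coalesce shuffle partitions further):
// Lower the shuffle partition count for this session (the default is 200).
sqlContext.setConf("spark.sql.shuffle.partitions", "50")
val df2 = sc.parallelize(List(1, 2, 3, 4, 5), 4).toDF()
df2.except(df2).rdd.getNumPartitions   // now 50 instead of 200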
I want to even out the partition size of rdds/dataframes in Spark to get rid of straggler tasks that slow my job down. I can do so using repartition(n_partition), which creates partitions of quite uniform size. However, that involves an expensive shuffle.
I know that coalesce(n_desired_partitions) is a cheaper alternative that avoids shuffling, and instead merges partitions on the same executor. However, it's not clear to me whether this function tries to create partitions of roughly uniform size, or simply merges input partitions without regard to their sizes.
For example, let's say we have an RDD of the integers in the range [1,12] in three partitions as follows: [(1,2,3,4,5,6,7,8),(9,10),(11,12)]. Let's say these are all on the same executor.
Now I call rdd.coalesce(2). Will the algorithm that powers coalesce know to merge the two small partitions (because they're smaller and we want balanced partition sizes), rather than just merging two arbitrary partitions?
Discussion of this topic elsewhere
According to this presentation (skip to 7:27), the Netflix big data team needed to implement a custom coalesce function to balance partition sizes. See also SPARK-14042.
Why this question's not a duplicate
There is a more general question about the differences between repartition and coalesce here, but nobody there explains whether the algorithm that powers coalesce tries to balance partition sizes.
So actually repartition is nothing special; its definition looks like this:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
So repartition is simply coalesce with shuffle = true, whereas when you call coalesce directly, shuffle is false by default, so it will not shuffle the data.
For example, say you have 2 cluster nodes and each has 2 partitions. If you call rdd.coalesce(2), it will merge the local partitions on each node. If you call coalesce(1), the data from the other node's partitions has to be brought together with the local ones, and since the merge just combines whatever partitions fall into each group, your resulting partition sizes need not be uniform.
OK, following the edit to your question, I also tried to do the same, as follows:
val data = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12))
data.getNumPartitions
res2: Int = 4
data.mapPartitionsWithIndex { case (a, b) => println("partitionssss" + a); b.map(y => println("dataaaaaaaaaaaa" + y)) }.count
The output of the above code shows which partition each element ends up in.
Now I coalesce the 4 partitions to 2 and run the same code on that RDD (shown below) to check how Spark distributes the data after the coalesce.
From that output you can easily see that Spark distributes the data equally across both partitions, 6 and 6 elements, even though before the coalesce the number of elements was not the same in every partition.
val coal=data.coalesce(2)
coal.getNumPartitions
res4: Int = 2
coal.mapPartitionsWithIndex { case (a, b) => println("partitionssss" + a); b.map(y => println("dataaaaaaaaaaaa" + y)) }.count
When does an RDD get its preferred location? How is the preferred location determined?
I've seen some weird behaviors in repartition and coalesce that I could not quite make sense of:
1. When coalescing from n to n-1 partitions, I see Spark just coalesce one partition into another single partition. (I think the ideal behavior would be to distribute it evenly across all n-1 nodes.)
2. When running repartition, I see Spark repartition such that one node has multiple partitions of the RDD.
Does the above behavior have something to do with preferredLocations?
Note that rdd.repartition(n) just calls rdd.coalesce(n, shuffle = true), so we're just comparing shuffle true vs false.
shuffle = false
In this mode, Spark constructs a new RDD whose partitions contain one or more partitions of the parent RDD -- if you coalesce from n partitions to n/2 partitions, then each partition consists of the elements from two semi-random partitions in the parent. This mode is appropriate when you want to reduce partitioning and the partitions are already balanced, like when you've done a filter that affects elements in each partition roughly equally. The overhead is very low. Also, note that it's impossible to increase the number of partitions with this mode.
shuffle = true
For some background, I recommend this blog post for learning a bit more about how and why we shuffle. The fundamental differences in this execution mode are:
higher overhead (all data is transmitted over network)
good for rebalancing partitions (if you perform a filter that drops out either all elements in a partition or none, then shuffle=false will produce imbalanced partitions, but shuffle=true will resolve the issue)
can increase the number of partitions
Preferred locations don't have much to do with it -- you're seeing preferred locations only in the shuffle = false mode because the locality is preserved without shuffles, but after a shuffle the original preferredLocations are irrelevant (replaced with new preferred locations about shuffle destinations).
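A sketch of the rebalancing point above, using a filter that empties most partitions (the sizes shown are indicative):
val rdd = sc.parallelize(1 to 100, 10)
val skewed = rdd.filter(_ <= 20)          // only the first two partitions keep data
// shuffle = false: cheap merge, but the imbalance is inherited.
skewed.coalesce(4, shuffle = false).glom().map(_.length).collect()  // e.g. Array(20, 0, 0, 0)
// shuffle = true (what repartition(4) does): full shuffle, roughly even sizes.
skewed.coalesce(4, shuffle = true).glom().map(_.length).collect()   // e.g. Array(5, 5, 5, 5)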