In Apache Spark, why does RDD.union not preserve the partitioner? - apache-spark

As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized. I was experimenting with the following code:
import org.apache.spark.HashPartitioner
// rdd1 is explicitly hash-partitioned; rdd2 has no partitioner
val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)
I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't: it always reverts to the default. This is counterintuitive, as we usually assume that a PairRDD should use its first element as the partition key. Is there a way to "force" Spark to merge 2 PairRDDs so that they use the same partition key?

union is a very efficient operation, because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change; there is no shuffle.
But necessarily it discards the partitioner. A partitioner is constructed for a given number of partitions. The resulting RDD has a number of partitions that is different from both rdd1 and rdd2.
After taking the union you can run partitionBy (for example with a HashPartitioner) to shuffle the data and organize it by key. A plain repartition also shuffles, but it does not place records by key and does not set a partitioner.
There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
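The two cases described above can be compared directly. This is a minimal sketch, assuming Spark 1.3 or later on the classpath and a local master; the names p, a, b, c, same, mixed and fixed are illustrative and not from the original post. It can be pasted into spark-shell or run as a Scala script.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("union-partitioner-sketch"))

val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(200 to 230).keyBy(_ % 13).partitionBy(p)
val c = sc.parallelize(200 to 230).keyBy(_ % 13) // no partitioner

// Both sides share the partitioner: partitions are merged pairwise, partitioner is kept.
val same = a.union(b)
println(s"same:  ${same.partitioner}, ${same.partitions.length} partitions")

// One side has no partitioner: partitions are concatenated, partitioner is dropped.
val mixed = a.union(c)
println(s"mixed: ${mixed.partitioner}, ${mixed.partitions.length} partitions")

// An explicit partitionBy shuffles the union and places records by key again.
val fixed = mixed.partitionBy(p)
println(s"fixed: ${fixed.partitioner}, ${fixed.partitions.length} partitions")
```

If both inputs can be partitioned with the same partitioner up front, the union keeps it and the extra partitionBy is not needed.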

This is no longer true. Iff two RDDs have exactly the same partitioner and number of partitions, the unioned RDD will also have those same partitions. This was introduced in https://github.com/apache/spark/pull/4629 and incorporated into Spark 1.3.

Related

Joining RDDs in Spark per partition to avoid shuffle

I have to perform a join between two rdds, of the form rdd1.join(rdd2).
In order to avoid shuffling, I have partitioned the two rdds based on the expected queries. Both of them have the same number of partitions, generated using the same partitioner.
The problem is now reduced to a per-partition join, i.e. I'd like to join partition i from rdd1 with partition i from rdd2 and collect the results.
How can this be achieved (in scala)?
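One possible approach, sketched under the assumption that both RDDs really were partitioned with the same partitioner: zipPartitions hands partition i of each RDD to a single function. The sample data and the names part, rightByKey and joined are made up for illustration, and the right-hand partition is held in memory, which may not suit large partitions. Note also that a plain rdd1.join(rdd2) on RDDs that share a partitioner is generally expected to use narrow dependencies and avoid the shuffle already.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("per-partition-join-sketch"))

val part = new HashPartitioner(4)
val rdd1 = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 5 -> "e")).partitionBy(part)
val rdd2 = sc.parallelize(Seq(1 -> 10.0, 2 -> 20.0, 2 -> 21.0, 4 -> 40.0)).partitionBy(part)

// Partition i of rdd1 is paired with partition i of rdd2; keys are unchanged,
// so the partitioner of rdd1 can be preserved.
val joined = rdd1.zipPartitions(rdd2, preservesPartitioning = true) { (left, right) =>
  // Index the right-hand partition by key (kept in memory for this partition only).
  val rightByKey = right.toList.groupBy(_._1)
  left.flatMap { case (k, v) =>
    rightByKey.getOrElse(k, Nil).map { case (_, w) => (k, (v, w)) }
  }
}

joined.collect().foreach(println)
```

This behaves like an inner join; keys present on only one side are dropped.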

How to preserve partitioning through dataframe operations

Is there a reliable way to predict which Spark dataframe operations will preserve partitioning and which won't?
Specifically, let's say my dataframes are all partitioned with .repartition(500,'field1','field2').
Can I expect an output with 500 partitions arranged by these same fields if I apply:
select()
filter()
groupBy() followed by agg() when grouping happens on 'field1' and 'field2' (as in the above)
join() on 'field1' and 'field2' when both dataframes are partitioned as above
Given the special way my data is prepartitioned, I'd expect no extra shuffling to take place. However, I always seem to end up with at least a few stages whose number of tasks equals spark.sql.shuffle.partitions. Is there any way to avoid that extra shuffling?
Thanks
This is a well-known issue with Spark. Even if you have repartitioned the data, Spark will shuffle it.
What is the problem?
The re-partition ensures each partition contains the data about a single column value.
Good example here:
import spark.implicits._ // needed for toDF and the $"..." column syntax
val people = List(
  (10, "blue"),
  (13, "red"),
  (15, "blue"),
  (99, "red"),
  (67, "blue")
)
val peopleDf = people.toDF("age", "color")
val colorDf = peopleDf.repartition($"color")
Partition 00091
13,red
99,red
Partition 00168
10,blue
15,blue
67,blue
However, Spark doesn't remember this information for subsequent operations. Also, no total ordering across partitions is kept in Spark, i.e. Spark knows that a single partition holds data for one column value but doesn't know which other partitions hold data for the same column. The data also needs to be sorted to ensure that no shuffle is required.
How can you solve it?
You need to use the Spark bucketing feature to ensure no shuffle in subsequent stages.
I found this wiki to be pretty detailed about the bucketing feature.
Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning.
The motivation is to optimize performance of a join query by avoiding shuffles (aka exchanges) of tables participating in the join. Bucketing results in fewer exchanges (and so stages).
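A minimal sketch of writing a bucketed table, assuming Spark 2.x or later running locally with the default (non-Hive) catalog; the table name people_bucketed and the bucket count 4 are arbitrary choices for illustration. bucketBy is only supported together with saveAsTable. Whether a later aggregation or join actually skips the exchange depends on the Spark version and settings, so the physical plan is printed rather than assumed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration; in spark-shell the existing session is reused.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("bucketing-sketch")
  .getOrCreate()
import spark.implicits._

val peopleDf = Seq((10, "blue"), (13, "red"), (15, "blue"), (99, "red"), (67, "blue"))
  .toDF("age", "color")

// Write the data bucketed and sorted by "color"; the bucketing spec is stored in the catalog.
peopleDf.write
  .bucketBy(4, "color")
  .sortBy("color")
  .mode("overwrite")
  .saveAsTable("people_bucketed")

// Read the table back and inspect the plan of an aggregation on the bucketing column.
val bucketed = spark.table("people_bucketed")
bucketed.groupBy("color").count().explain()
```

For joins, both tables would typically need to be bucketed on the join key with the same number of buckets.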

Does coalesce(numPartitions) in spark undergo shuffling or not?

I have a simple question in spark transformation function.
coalesce(numPartitions) - Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
val dataRDD = sc.textFile("/user/cloudera/inputfiles/records.txt")
// assumes comma-separated records with the country in the first field
val filterRDD = dataRDD.filter(record => record.split(",")(0) == "USA")
val resizeRDD = filterRDD.coalesce(50)
val result = resizeRDD.collect
My question is
Is it true that coalesce(numPartitions) will remove the empty partitions from filterRDD?
Does coalesce(numPartitions) undergo shuffling or not?
The coalesce transformation is used to reduce the number of partitions. coalesce should be used if the number of output partitions is less than the input. It can trigger RDD shuffling depending on the shuffle flag which is disabled by default (i.e. false).
If the requested number of partitions is larger than the current number of partitions and you call coalesce without the shuffle=true flag, then the number of partitions remains unchanged.
coalesce doesn't guarantee that the empty partitions will be removed. For example, if you have 20 empty partitions and 10 partitions with data, then there will still be empty partitions after you call rdd.coalesce(25). If you use coalesce with shuffle set to true, then this is equivalent to the repartition method and the data will be evenly distributed across the partitions.
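The partition counts described above can be checked with a small sketch, assuming a local Spark installation; the RDD contents and variable names are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))

val rdd = sc.parallelize(1 to 100, 10) // 10 partitions

// Fewer partitions, no shuffle: existing partitions are merged.
val fewer = rdd.coalesce(4)
println(fewer.getNumPartitions) // 4

// More partitions without shuffle: the request is ignored, the count stays at 10.
val moreNoShuffle = rdd.coalesce(25)
println(moreNoShuffle.getNumPartitions) // 10

// More partitions with shuffle = true: same as repartition(25).
val moreWithShuffle = rdd.coalesce(25, shuffle = true)
println(moreWithShuffle.getNumPartitions) // 25
```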

Does spark's coalesce function try to create partitions of uniform size?

I want to even out the partition size of rdds/dataframes in Spark to get rid of straggler tasks that slow my job down. I can do so using repartition(n_partition), which creates partitions of quite uniform size. However, that involves an expensive shuffle.
I know that coalesce(n_desired_partitions) is a cheaper alternative that avoids shuffling, and instead merges partitions on the same executor. However, it's not clear to me whether this function tries to create partitions of roughly uniform size, or simply merges input partitions without regard to their sizes.
For example, let's say that we have an RDD of the integers in the range [1,12] in three partitions as follows: [(1,2,3,4,5,6,7,8),(9,10),(11,12)]. Let's say these are all on the same executor.
Now I call rdd.coalesce(2). Will the algorithm that powers coalesce know to merge the two small partitions (because they're smaller and we want balanced partition sizes), rather than just merging two arbitrary partitions?
Discussion of this topic elsewhere
According to this presentation (skip to 7:27), the Netflix big data team needed to implement a custom coalesce function to balance partition sizes. See also SPARK-14042.
Why this question's not a duplicate
There is a more general question about the differences between partition and coalesce here, but nobody there explains whether the algorithm that powers coalesce tries to balance partition size.
Actually, repartition is nothing special; its definition looks like this:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
So it is simply coalesce with shuffle enabled. When you call coalesce directly, shuffle defaults to false, so it does not shuffle the data unless it is asked to.
For example, say you have 2 cluster nodes with 2 partitions each. If you call rdd.coalesce(2), it will merge the local partitions on each node. If you call coalesce(1), data has to move across the network because the other 2 partitions are on another node. So in your case it may merge the partitions that are local to a node, and if that node holds fewer partitions, the resulting partition sizes will not be uniform.
OK, following your edit to the question, I tried the same thing as follows:
val data = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12))
data.getNumPartitions
res2: Int = 4
data.mapPartitionsWithIndex{case (a,b)=>println("partitionssss"+a);b.map(y=>println("dataaaaaaaaaaaa"+y))}.count
The output of the above code will be
Now I coalesce the 4 partitions to 2 and run the same code on that RDD to check how Spark coalesces the data, so the output will be
Now you can easily see that Spark distributes the data equally across both partitions (6 and 6), even though before the coalesce the number of elements was not the same in all partitions.
val coal=data.coalesce(2)
coal.getNumPartitions
res4: Int = 2
coal.mapPartitionsWithIndex{case (a,b)=>println("partitionssss"+a);b.map(y=>println("dataaaaaaaaaaaa"+y))}.count
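A way to check this empirically for deliberately uneven partitions is to print the partition sizes before and after coalesce using glom. This is a sketch assuming a local Spark installation; the layout value and variable names are hypothetical, and the grouping chosen by coalesce may depend on the Spark version and on data locality, so no particular result is claimed here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-balance-sketch"))

// Desired contents of the three partitions, mirroring the example in the question.
val layout = Vector((1 to 8).toList, List(9, 10), List(11, 12))

// One seed element per partition, expanded into that partition's contents.
val uneven = sc.parallelize(0 until 3, 3).flatMap(i => layout(i))

// glom turns each partition into an array, so its length is the partition size.
val before = uneven.glom().map(_.length).collect()
println("before: " + before.mkString(", "))

val after = uneven.coalesce(2).glom().map(_.length).collect()
println("after:  " + after.mkString(", "))
```

Reordering the entries of layout and rerunning shows whether the merge follows partition sizes or something else, such as partition position.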

Spark RDD - avoiding shuffle - Does partitioning help to process huge files?

I have an application with around 10 flat files, each containing more than 200MM records. The business logic involves joining all of them sequentially.
my environment:
1 master - 3 slaves (for testing i have assigned a 1GB memory to each node)
Most of the code just does the below for each join
RDD1 = sc.textFile(file1).mapToPair(..)
RDD2 = sc.textFile(file2).mapToPair(..)
join = RDD1.join(RDD2).map(peopleObject)
Any suggestions for tuning, like repartitioning or parallelizing? If so, are there any best practices for coming up with a good number of partitions?
With the current config the job takes more than an hour, and I see that the shuffle write for almost every file is > 3GB.
In practice, with large datasets (5, 100G+ each), I have seen that the join works best when you co-partition all the RDDs involved in a series of joins before you start joining them.
RDD1 = sc.textFile(file1).mapToPair(..).partitionBy(new HashPartitioner(2048))
RDD2 = sc.textFile(file2).mapToPair(..).partitionBy(new HashPartitioner(2048))
.
.
.
RDDN = sc.textFile(fileN).mapToPair(..).partitionBy(new HashPartitioner(2048))
//start joins
RDD1.join(RDD2)...join(RDDN)
Side note:
I refer to this kind of join as a tree join (each RDD used once). The rationale is presented in the following beautiful pic taken from the Spark-UI:
[Image: screenshot from the Spark UI]
If we are always joining one RDD (say rdd1) with all the others, the idea is to partition that RDD and then persist it.
Here is a pseudo-Scala implementation (it can easily be converted to Python or Java):
val rdd1 = sc.textFile(file1).mapToPair(..).partitionBy(new HashPartitioner(200)).cache()
Up to here, rdd1 is hashed into 200 partitions. The first time it is evaluated, it will be persisted (cached).
Now let's read two more rdds and join them.
val rdd2 = sc.textFile(file2).mapToPair(..)
val join1 = rdd1.join(rdd2).map(peopleObject)
val rdd3 = sc.textFile(file3).mapToPair(..)
val join2 = rdd1.join(rdd3).map(peopleObject)
Note that we neither partition nor cache the remaining RDDs.
Spark will see that rdd1 is already hash-partitioned and it will use the same partitions for all remaining joins. So rdd2 and rdd3 will shuffle their keys to the same locations where the keys of rdd1 are located.
To make it clearer, let's assume that we don't do the partitioning and use the same code shown in the question: each time we do a join, both RDDs will be shuffled. This means that if we have N joins with rdd1, the non-partitioned version will shuffle rdd1 N times. The partitioned approach will shuffle rdd1 just once.
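The reuse of rdd1's partitioner can be observed in a small sketch. It assumes a local Spark installation with spark.default.parallelism left unset, and the generated data stands in for the text files of the question; all names and sizes are illustrative. With other settings Spark may pick a different partitioner for the join, so the lineage is printed for inspection.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: local mode for illustration; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("copartitioned-join-sketch"))

// rdd1 is hash-partitioned once and cached; it is the side reused in every join.
val rdd1 = sc.parallelize(1 to 1000).keyBy(_ % 100)
  .partitionBy(new HashPartitioner(8)).cache()

// rdd2 and rdd3 are neither partitioned nor cached.
val rdd2 = sc.parallelize(1001 to 2000, 4).keyBy(_ % 100)
val rdd3 = sc.parallelize(2001 to 3000, 4).keyBy(_ % 100)

val join1 = rdd1.join(rdd2)
val join2 = rdd1.join(rdd3)

// If the join reuses rdd1's partitioner, only the other side needs to be shuffled.
println(join1.partitioner == rdd1.partitioner)
println(join2.partitioner == rdd1.partitioner)
println(join1.toDebugString)
```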
