Joining RDDs in Spark per partition to avoid shuffle - apache-spark

I have to perform a join between two RDDs, of the form rdd1.join(rdd2).
In order to avoid shuffling, I have partitioned the two RDDs based on the expected queries. Both of them have the same number of partitions, generated using the same partitioner.
The problem is now reduced to a per-partition join, i.e. I'd like to join partition i from rdd1 with partition i from rdd2 and collect the results.
How can this be achieved (in Scala)?
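One way to do this, as a minimal sketch (not from the original thread; it assumes both RDDs are key-value RDDs already partitioned with the same HashPartitioner, as described above): a plain join() on co-partitioned RDDs is a narrow dependency and avoids a shuffle, while zipPartitions() makes the per-partition pairing explicit.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).persist()
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).persist()

// Option 1: since the partitioners match, this join does not shuffle
val joined = rdd1.join(rdd2)

// Option 2: explicit per-partition join (assumes keys are unique within each rdd2 partition)
val zipped = rdd1.zipPartitions(rdd2) { (it1, it2) =>
  val lookup = it2.toMap
  it1.flatMap { case (k, v1) => lookup.get(k).map(v2 => (k, (v1, v2))) }
}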

Related

Spark's Shuffle Sort Merge Join. One DataFrame is bucketed. Does Spark take advantage of this?

I remember from working with RDDs that if one key-value RDD (rdd1) has a known partitioning, then performing a join with a different, unpartitioned, key-value RDD (rdd2) gives performance benefits. This is because 1) only the data of rdd2 needs to be transferred across the network, and 2) each element of rdd2 only needs to be transferred to one node rather than to all nodes, by applying the partitioning of rdd1's key to rdd2's key.
I'm learning about Shuffle Sort Merge Joins with DataFrames. The example in the book I am reading (Learning Spark, 2nd Edition) is for joining two DataFrames based on user_id columns. The example is attempting to demonstrate the elimination of the Exchange stage from the join operation, so, prior to the join, both DataFrames are bucketed into an equal number of buckets by the column to be joined on.
My question is, what happens if only one of the DataFrames has been bucketed? Clearly the Exchange stage will reappear. But if we know that DataFrame1 is bucketed into N buckets by the column we want to join on, will Spark use this bucketing information to efficiently transfer the rows of DataFrame2 over the network, as in the RDD case? Would Spark leave the rows of DataFrame1 where they are, and just apply an identical bucketing to DataFrame2? (Assuming that N buckets results in a reasonable amount of data in the partitions to be joined by the executors) Or instead, does Spark inefficiently shuffle both DataFrames?
In particular, I can imagine a situation where I have a single 'master' DataFrame against which I will need to perform many independent joins with other supplemental DataFrames on the same column. Surely it should only be necessary to pre-bucket the master DataFrame in order to see the performance benefits for all joins? (Although taking the trouble to bucket the supplemental DataFrames wouldn't hurt either, I think)
https://kb.databricks.com/data/bucketing.html explains it all with some embellishment over the original posting, which I summarize below.
Bottom line:
val t1 = spark.table("unbucketed")
val t2 = spark.table("bucketed")
val t3 = spark.table("bucketed")
Unbucketed - bucketed join. Both sides need to be repartitioned.
t1.join(t2, Seq("key")).explain()
Unbucketed with repartition - bucketed join. Unbucketed side is
correctly repartitioned, and only one shuffle is needed.
t1.repartition(16, $"key").join(t2, Seq("key")).explain()
Unbucketed with incorrect repartition (the default of 200 partitions) - bucketed join.
The unbucketed side is incorrectly repartitioned, and two shuffles are
needed.
t1.repartition($"key").join(t2, Seq("key")).explain()
Bucketed - bucketed join. Ideal case, both sides have the same
bucketing, and no shuffles are needed.
t3.join(t2, Seq("key")).explain()
So, both sides need the same bucketing for optimal performance.
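For reference, a hedged sketch of how the bucketed table used above might be created (the DataFrame name df and the parquet format are assumptions; 16 buckets simply mirrors the repartition(16, $"key") above):

// bucketBy() + saveAsTable() writes a bucketed table; the bucket count and
// column must match on both sides of the join for the shuffle-free case
df.write
  .format("parquet")
  .bucketBy(16, "key")
  .sortBy("key")
  .saveAsTable("bucketed")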

How to preserve partitioning through dataframe operations

Is there a reliable way to predict which Spark dataframe operations will preserve partitioning and which won't?
Specifically, let's say my dataframes are all partitioned with .repartition(500,'field1','field2').
Can I expect an output with 500 partitions arranged by these same fields if I apply:
select()
filter()
groupBy() followed by agg() when grouping happens on 'field1' and 'field2' (as in the above)
join() on 'field1' and 'field2' when both dataframes are partitioned as above
Given the special way my data is pre-partitioned, I'd expect no extra shuffling to take place. However, I always seem to end up with at least a few stages whose number of tasks equals spark.sql.shuffle.partitions. Is there any way to avoid that extra shuffling?
Thanks
This is a well-known issue with Spark. Even if you have re-partitioned the data, Spark will shuffle it again.
What is the problem?
The repartition ensures that all rows with the same value of the partitioning column end up in the same partition.
Good example here:
import spark.implicits._ // needed for toDF and the $"..." column syntax

val people = List(
  (10, "blue"),
  (13, "red"),
  (15, "blue"),
  (99, "red"),
  (67, "blue")
)
val peopleDf = people.toDF("age", "color")
val colorDf = peopleDf.repartition($"color")
Partition 00091
13,red
99,red
Partition 00168
10,blue
15,blue
67,blue
However, Spark doesn't remember this information for subsequent operations. Nor does Spark keep track of how values are distributed across partitions: it knows that a given partition holds the rows for a particular value, but not which other partitions hold rows for the same value. On top of that, the data needs to be sorted within partitions for a sort-merge join to avoid a shuffle.
How can you solve it?
You need to use the Spark bucketing feature to ensure there is no shuffle in subsequent stages.
I found this wiki to be pretty detailed about the bucketing feature:
Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning.
The motivation is to optimize performance of a join query by avoiding shuffles (aka exchanges) of tables participating in the join. Bucketing results in fewer exchanges (and so stages).
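As a rough sketch of how that could look for this question (the table and DataFrame names are made up, and 500 buckets simply mirrors the repartition(500, ...) in the question):

// Bucket both tables by the columns used for grouping and joining, so later
// groupBy/agg and joins on those columns can skip the Exchange stage.
dfA.write.bucketBy(500, "field1", "field2").sortBy("field1", "field2").saveAsTable("tbl_a")
dfB.write.bucketBy(500, "field1", "field2").sortBy("field1", "field2").saveAsTable("tbl_b")

val a = spark.table("tbl_a")
val b = spark.table("tbl_b")
a.groupBy("field1", "field2").count()           // no extra shuffle expected
a.join(b, Seq("field1", "field2")).explain()    // inspect the plan for Exchange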

How to avoid shuffles while joining DataFrames on unique keys?

I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 Million rows
B only has the column id with 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
If B were very small, I know I could do something along the lines of
A.filter($"id" isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness, and I'm afraid it will cause unnecessary shuffles.
What is the optimal method to achieve that task?
If you have not applied any partitioner to DataFrame A, maybe this will help you understand the join and shuffle concepts.
Without a partitioner:
A.join(B, Seq("id"))
By default, this operation hashes all the keys of both DataFrames, sends elements with the same key hash across the network to the same machine, and then joins the elements with the same key on that machine. Notice that both DataFrames are shuffled across the network.
With a HashPartitioner:
Call partitionBy() when building A (this is the pair-RDD API), and Spark will know that it is hash-partitioned; calls to join() on it will take advantage of this information. In particular, when we call A.join(B), Spark will shuffle only the B RDD. Since B has less data than A, you don't need to apply a partitioner to B.
ex:
import org.apache.spark.HashPartitioner

// pair-RDD example; the element types are placeholders from the original post
val A = sc.sequenceFile[Id, (Info1, Info2)]("hdfs://...")
  .partitionBy(new HashPartitioner(100)) // create 100 partitions
  .persist()
A.join(B)
The reference is from the Learning Spark book.
My default advice on how to optimize joins is:
Use a broadcast join if you can (from your question it seems your tables are large, so a broadcast join is not an option).
One option in Spark is to perform a broadcast join (aka map-side join in the Hadoop world). With a broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions), avoiding sending all the data of the large table over the network.
You can use the broadcast function to mark a dataset to be broadcast when used in a join operator. Spark uses the spark.sql.autoBroadcastJoinThreshold setting to control the maximum size of a table that will be broadcast to all worker nodes when performing a join.
Use the same partitioner.
If two RDDs have the same partitioner, the join will not cause a shuffle. Note however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
If the data is huge and/or your cluster cannot grow, to the point where even (2) above leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
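A rough sketch of that two-pass idea (the paths, the derived bucket column, and the bucket count are made up for illustration; it assumes the join key is called id, as in the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val numBuckets = 16
// derive a coarse bucket from the key so both sides can be written partitioned the same way
def withBucket(df: DataFrame): DataFrame =
  df.withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))

withBucket(A).write.partitionBy("bucket").parquet("/tmp/a_parts")
withBucket(B).write.partitionBy("bucket").parquet("/tmp/b_parts")

// join one bucket at a time and append to the same result table
for (b <- 0 until numBuckets) {
  val aPart = spark.read.parquet(s"/tmp/a_parts/bucket=$b")
  val bPart = spark.read.parquet(s"/tmp/b_parts/bucket=$b")
  aPart.join(bPart, Seq("id")).write.mode("append").parquet("/tmp/joined")
}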
If I understand your question correctly, you want to use a broadcast join that replicates DataFrame B on every node so that the semi-join computation (i.e., using a join to filter id from DataFrame A) can be computed independently on every node, instead of requiring information to be shuffled back and forth between nodes (i.e., a shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run filtered_A.explain() to verify that a broadcast join is being used.

Worker Behavior with two (or more) dataframes having the same key

I'm using PySpark (Spark 1.4.1) in a cluster. I have two DataFrames each containing the same key values but different data for the other fields.
I partitioned each DataFrame separately using the key and wrote a parquet file to HDFS. I then read the parquet file back into memory as a new DataFrame. If I join the two DataFrames, will the processing for the join happen on the same workers?
For example:
dfA contains {userid, firstname, lastname} partitioned by userid
dfB contains {userid, activity, job, hobby} partitioned by userid
dfC = dfA.join(dfB, dfA.userid==dfB.userid)
Is dfC already partitioned by userid?
Is dfC already partitioned by userid?
The answer depends on what you mean by partitioned. Records with the same userid should be located on the same partition, but DataFrames don't support partitioning in the sense of having a Partitioner; only pair RDDs (RDD[(T, U)]) can have a partitioner in Spark. This means that for most applications the answer is no: neither the DataFrame nor the underlying RDD is partitioned in that sense.
You'll find more details about DataFrames and partitioning in How to define partitioning of DataFrame? Another question you can follow is Co-partitioned joins in spark SQL.
If I join the two DataFrames, will the processing for the join happen on the same workers?
Once again, it depends on what you mean. Records with the same userid have to be transferred to the same node before the transformed rows can be yielded. If you ask whether that is guaranteed to happen without any network traffic, the answer is no.
To be clear, it would be exactly the same even if the DataFrame had a partitioner. Data co-partitioning is not equivalent to data co-location; it simply means that the join operation can be performed with a one-to-one mapping of partitions rather than a shuffle. You can find more in Daniel Darbos' answer to Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?.
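A quick illustration of the earlier point that only pair RDDs expose a Partitioner, in Scala for consistency with the other snippets (the values are illustrative only):

import org.apache.spark.HashPartitioner

val pairRdd = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
println(pairRdd.partitioner)   // Some(org.apache.spark.HashPartitioner@...)

val df = spark.range(10).toDF("userid")
println(df.rdd.partitioner)    // None - DataFrames do not expose a Partitioner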

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized. I was experimenting with the following code:
import org.apache.spark.HashPartitioner

val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)
I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't; it always reverts back to the default. This is counterintuitive, as we usually assume that a PairRDD should use its first element as the partition key. Is there a way to "force" Spark to merge two PairRDDs using the same partition key?
union is a very efficient operation, because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change, there is no shuffle.
But it necessarily discards the partitioner. A partitioner is constructed for a given number of partitions, and the resulting RDD has a number of partitions different from both rdd1 and rdd2.
After taking the union you can run repartition to shuffle the data and organize it by key.
There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will combine the partitions of the two RDDs pairwise, giving the result the same number of partitions as each input had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
This is no longer true. Iff two RDDs have exactly the same partitioner and number of partitions, the unioned RDD will also have those same partitions. This was introduced in https://github.com/apache/spark/pull/4629 and incorporated into Spark 1.3.
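A small illustration of that exception (the values are made up; note the second RDD is keyed by % 10 here so that both sides share the same partitioner):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(200 to 230).keyBy(_ % 10).partitionBy(p)

val u = a.union(b)
println(u.partitioner)        // Some(...) - the shared partitioner is retained
println(u.getNumPartitions)   // 10, not 20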
