Spark SQL broadcast hash join - apache-spark

I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here:
In that example, the (small) DataFrame is persisted via saveAsTable and then there's a join via spark SQL (i.e. via sqlContext.sql("..."))
The problem I have is that I need to use the sparkSQL API to construct my SQL (I am left joining ~50 tables with an ID list, and don't want to write the SQL by hand).
How do I tell spark to use the broadcast hash join via the API? The issue is that if I load the ID list (from the table persisted via `saveAsTable`) into a `DataFrame` to use in the join, it isn't clear to me if Spark can apply the broadcast hash join.

You can explicitly mark the DataFrame as small enough for broadcasting using the broadcast function.
Python (PySpark):
from pyspark.sql.functions import broadcast
small_df = ...
large_df = ...
large_df.join(broadcast(small_df), ["foo"])
or broadcast hint (Spark >= 2.2):
large_df.join(small_df.hint("broadcast"), ["foo"])
Scala:
import org.apache.spark.sql.functions.broadcast
val smallDF: DataFrame = ???
val largeDF: DataFrame = ???
largeDF.join(broadcast(smallDF), Seq("foo"))
or broadcast hint (Spark >= 2.2):
largeDF.join(smallDF.hint("broadcast"), Seq("foo"))
SQL: you can use hints (Spark >= 2.2):
SELECT /*+ MAPJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCAST(small) */ *
FROM large JOIN small
ON large.foo = small.foo
R (SparkR):
With hint (Spark >= 2.2):
join(large, hint(small, "broadcast"), large$foo == small$foo)
With broadcast (Spark >= 2.3):
join(large, broadcast(small), large$foo == small$foo)
A broadcast join is useful if one of the structures is relatively small. Otherwise it can be significantly more expensive than a full shuffle.
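For the many-table case in the question, the joins can be built in a loop instead of by hand. The following is only a sketch: left_join_all and mark_small are hypothetical names, and it assumes the right-hand tables are the ones small enough to broadcast. It relies only on DataFrame.join(other, on=..., how=...), so pyspark.sql.functions.broadcast can be passed in as mark_small.

```python
from functools import reduce


def left_join_all(base, lookups, key, mark_small=lambda df: df):
    """Left-join `base` with every DataFrame in `lookups` on column `key`.

    `mark_small` wraps each right-hand side before the join. Passing
    pyspark.sql.functions.broadcast requests a broadcast hash join;
    the default leaves the DataFrame untouched.
    """
    return reduce(
        lambda acc, df: acc.join(mark_small(df), on=[key], how="left"),
        lookups,
        base,
    )
```

With real DataFrames this would be called as left_join_all(ids_df, tables, "id", mark_small=broadcast), and the resulting plan can be checked with explain().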

join_rdd = sqlContext.sql( "select * from people_in_india p
join states s
on p.state =")
join_rdd.toDebugString() / join_rdd.explain():
shuffledHashJoin:
All the data for India will be shuffled into only 29 keys, one for each of the states.
Uneven sharding.
Limited parallelism with 29 output partitions.
broadcastHashJoin:
Broadcast the small RDD to all worker nodes.
The parallelism of the large RDD is still maintained, and a shuffle is not even required.
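The limited-parallelism point can be illustrated without Spark. The sketch below is a plain-Python model, not Spark's partitioner: the state names and row counts are made up, and zlib.crc32 stands in for whatever hash function the engine uses.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Stable hash, so the result does not depend on PYTHONHASHSEED.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition when rows are hashed by key."""
    return Counter(partition_of(k, num_partitions) for k in keys)


# Hypothetical data: 29 distinct states, one of them far more frequent.
states = ["state_%d" % i for i in range(29)]
rows = ["state_0"] * 10000 + [s for s in states[1:] for _ in range(100)]

sizes = partition_sizes(rows, num_partitions=200)
busy = len(sizes)              # partitions that received at least one row
largest = max(sizes.values())  # size of the most loaded partition
```

However many partitions are requested, busy can never exceed the 29 distinct keys, and the partition holding the most frequent key carries at least all of that key's rows.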

With a broadcast join, one side of the join equation is materialized and sent to all mappers. It is therefore considered a map-side join.
Because the data set is materialized and sent over the network, it only brings a significant performance improvement if it is considerably small.
So if you are trying to perform smallDF.join(largeDF), there is another constraint: the small side also needs to fit completely into the memory of each executor. It also needs to fit into the memory of the driver!
Broadcast variables are shared among executors using the Torrent protocol, i.e. a peer-to-peer protocol. Its advantage is that peers share blocks of a file among each other, without relying on a central entity holding all the blocks.
The above-mentioned example is sufficient to start playing with broadcast joins.
A broadcast value cannot be modified after creation.
If you try, the change will only be visible on one node.


How Spark broadcast the data in Broadcast Join

How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint it calls this function:
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
}
This internally calls the apply method of Dataset and sets the logicalPlan using ResolvedHint:
val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
But what happens after this? How does this actually work, and where is the code for it?
1. What if we have multiple partitions of the small dataset (which we are going to broadcast)? Does Spark combine all partitions and then broadcast?
2. Does it broadcast to the driver first, and then to the executors?
3. What is BitTorrent?
Regarding 1 & 2: during a broadcast join the data is collected on the driver, and what happens next depends on the join algorithm.
For BroadcastHashJoin (BHJ), the driver builds a hash table, and this table is then distributed to the executors.
For BroadcastNestedLoopJoin, the broadcast dataset is distributed to the executors as an array.
So, as you can see, the initial structure is not kept, and the whole broadcast dataset needs to fit into the driver's memory (otherwise the job will fail with an OOM error on the driver).
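The collect-then-distribute flow described above can be modelled in a few lines of plain Python. This is a toy sketch under simplifying assumptions, not Spark's implementation: rows are dicts, partitions are lists, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_hash_table(small_partitions, key):
    """'Driver' step: collect every partition of the small side into one hash table."""
    table = defaultdict(list)
    for partition in small_partitions:
        for row in partition:
            table[row[key]].append(row)
    return dict(table)


def probe_partition(large_partition, table, key):
    """'Executor' step: inner-join one partition of the large side against the local copy."""
    return [
        {**left, **right}
        for left in large_partition
        for right in table.get(left[key], [])
    ]


# Hypothetical data: the small side is split over two partitions.
small = [[{"id": 1, "name": "a"}], [{"id": 2, "name": "b"}]]
large = [[{"id": 1, "v": 10}, {"id": 3, "v": 30}], [{"id": 2, "v": 20}]]

table = build_hash_table(small, "id")
joined = [probe_partition(p, table, "id") for p in large]
```

The partitioning of the small side disappears once the table is built, while the large side keeps its two partitions and each one is joined independently.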
Regarding 3: what exactly do you want to know?
In Spark there is TorrentBroadcast, which is a BitTorrent-like implementation of broadcast. I don't know much about it (I never had to dig so deep), but if you want to know more I think you can start here:
TorrentBroadcast documentation
TorrentBroadcast source code
HttpBroadcast documentation (the other broadcast algorithm)

Spark seems to think that a particular broadcast variable is large in size

I am trying to do a broadcast join on two tables. The size of the smaller table will vary based upon the parameters but the size of the larger table is close to 2TB.
What I have noticed is that if I don't set spark.sql.autoBroadcastJoinThreshold to 10G, some of these operations do a SortMergeJoin instead of a broadcast join. But the smaller table shouldn't be that big at all: I wrote it to an S3 folder and it took only 12.6 MB of space.
I did some operations on the smaller table so the shuffle size appears on the Spark History Server and the size in memory seemed to be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the smaller table it takes a long time to broadcast, leading me to think that the table might not just be 150 MB in size.
What would be a good way to figure out the actual size that Spark is seeing and deciding whether it crosses the value defined by spark.sql.autoBroadcastJoinThreshold?
Look at the SQL tab in the Spark UI. There you will see the DAG of each job plus the statistics that Spark collects.
For each DataFrame, it will contain the size as Spark sees it.
BTW, you don't have to set spark.sql.autoBroadcastJoinThreshold to a high number to force Spark to use a broadcast join.
You can simply wrap the small DataFrame with org.apache.spark.sql.functions.broadcast(df), and it will force a broadcast only on that specific join.
As mentioned in this question: DataFrame join optimization - Broadcast Hash Join
import org.apache.spark.sql.functions.broadcast
import context.implicits._ // needed for toDF and the $"..." column syntax
val employeesDF = employeesRDD.toDF
val departmentsDF = departmentsRDD.toDF
// materializing the department data
val tmpDepartments = broadcast(departmentsDF.as("departments"))
employeesDF.join(tmpDepartments,
  $"depId" === $"id") // join by employees.depId == departments.id

What Transformation should I apply on Spark DataFrame

I have 2 Spark dataframes (A & B) having a common column/field in both (which is a primary key in DataFrame A but not in B).
For each record/row in dataframe A, there are multiple records in dataframe B.
Based on that common column value I want to fetch all records from dataframe B against each record in dataframe A.
What kind of transformation should I perform in order to collect the records together without doing much shuffling?
To combine the records from two or more Spark DataFrames, a join is necessary.
If your data is not partitioned or bucketed well, it will lead to a shuffle join, in which every node talks to every other node and they share data according to which node has a certain key or set of keys (on which you are joining). These joins are expensive because the network can become congested with traffic.
The shuffle can be avoided if:
Both DataFrames have a known partitioner or are bucketed.
One of the datasets is small enough to fit in memory, in which case we can do a broadcast hash join.
If you partition your data correctly prior to a join, you can end up with much more efficient execution because even if a shuffle is planned, if data from two different DataFrames is already located on the same machine, Spark can avoid the shuffle.
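import org.apache.spark.sql.functions.col
// repartition on the join key; you can optionally specify the number of partitions like:
val df1ById = df1.repartition(10, col("id")) // repartition returns a new DataFrame
// Join DataFrames on the id column
df1ById.join(df2, "id") // joining on the column name avoids duplicate id columns in the output DF.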
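Why pre-partitioning on the join key helps can be shown with a small plain-Python model. It is a sketch only: hash_partition and join_copartitioned are hypothetical helpers, Python's built-in hash on integer keys stands in for Spark's hash partitioning, and the sample rows are invented to match the one-to-many case in the question.

```python
def hash_partition(rows, key, num_partitions):
    """Split rows into num_partitions buckets by hashing the join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def join_copartitioned(left_parts, right_parts, key):
    """Join partition i of the left side only with partition i of the right side."""
    out = []
    for left_part, right_part in zip(left_parts, right_parts):
        index = {}
        for right_row in right_part:
            index.setdefault(right_row[key], []).append(right_row)
        for left_row in left_part:
            for right_row in index.get(left_row[key], []):
                out.append({**left_row, **right_row})
    return out


# Hypothetical data: A has unique ids, B has several rows per id.
a_rows = [{"id": i, "a": i * 10} for i in range(6)]
b_rows = [{"id": i, "b": -i} for i in (1, 1, 4, 7)]

result = join_copartitioned(
    hash_partition(a_rows, "id", 3),
    hash_partition(b_rows, "id", 3),
    "id",
)
```

Because both sides use the same function and the same partition count, matching keys always share a partition index, so no partition ever needs rows from another one.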
Broadcast Hash join
When one of the Datasets is small enough to fit into the memory of a single worker node, we can optimize our join.
Spark will replicate the small DataFrame onto every worker node in the cluster (be it located on one machine or many). Now this sounds expensive. However, what this does is prevent us from performing the all-to-all communication during the entire join process. Instead, the communication happens only once at the beginning, and then each individual worker node performs the work without having to wait for or communicate with any other worker node.
import org.apache.spark.sql.functions.broadcast
// explicitly specify the broadcast hint; Spark may also pick a broadcast join on its own for small tables
df1.join(broadcast(df2), "id")

Can you do a broadcast join with SparkR?

I'm trying to join a large dataframe to a smaller dataframe and I saw that broadcast join is an efficient way to do that, according to this post.
However I couldn't find the broadcast function in the SparkR documentation.
So I'm wondering if you can do a broadcast join with SparkR?
Spark 2.3:
There will be a broadcast function, created in this pull request:
Spark 2.2:
You can provide a custom hint to the query:
head(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))
Reference: this code:
The broadcast function in the Java, Scala and Python APIs is also a wrapper for adding the broadcast hint. A hint means that the optimizer gets additional information: this DataFrame is small, I (the user) guarantee this, and you should broadcast it before joining with other DataFrames.
Side note:
Spark sometimes performs a broadcast join automatically. You can control automatic broadcast joins by setting:
spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")
Here, -1 means that no DataFrame will be broadcast automatically for a broadcast join. You can read more about this topic here
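As a rough mental model of the threshold, the decision can be written as a one-line comparison. This is a simplified sketch, not Spark's planner code: would_auto_broadcast is a hypothetical name, the size passed in is whatever estimate Spark has for the table, and the default of 10 MB is the documented default of spark.sql.autoBroadcastJoinThreshold.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # documented default: 10 MB


def would_auto_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model: broadcast only if the estimate fits under a non-negative threshold."""
    if threshold_bytes < 0:  # -1 disables automatic broadcast joins
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes
```

Under this model a table estimated at 12.6 MB is not broadcast with the default setting, and nothing is broadcast automatically once the threshold is -1. An explicit broadcast hint is a separate mechanism and is not covered by this check.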

How to avoid shuffles while joining DataFrames on unique keys?

I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 Million rows
B only has the column id with 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
If B was very small, I know I would do something along the lines of
A.filter($("id") isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness and I'm afraid will cause unnecessary shuffles.
What is the optimal method to achieve that task?
If you have not applied any partitioner on DataFrame A, maybe this will help you understand the join and shuffle concepts.
Without Partitioner :
A.join(B, Seq("id"))
By default, this operation will hash all the keys of both dataframes, sending elements with the same key hash across the network to the same machine, and then join together the elements with the same key on that machine. Here you have to notice that both dataframes shuffle across the network.
With HashPartitioner:
Call partitionBy() when building A. Spark will then know that it is hash-partitioned, and calls to join() on it will take advantage of this information. In particular, when we call A.join(B, Seq("id")), Spark will shuffle only the B RDD. Since B has less data than A, you don't need to apply a partitioner on B.
val A = sc.sequenceFile[id, info1, info2]("hdfs://...")
.partitionBy(new HashPartitioner(100)) // Create 100 partitions
A.join(B, Seq("id"))
Reference is from Learning Spark book.
My default advice on how to optimize joins is:
1. Use a broadcast join if you can (from your question it seems your tables are large and a broadcast join is not an option).
One option in Spark is to perform a broadcast join (aka map-side join in the Hadoop world). With a broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions) by avoiding sending all data of the large table over the network.
You can use the broadcast function to mark a dataset to be broadcast when used in a join operator. Without that hint, Spark uses the spark.sql.autoBroadcastJoinThreshold setting to control the size of a table that will be broadcast to all worker nodes when performing a join.
2. Use the same partitioner.
If two RDDs have the same partitioner, the join will not cause a shuffle. Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
3. If the data is huge and/or your clusters cannot grow, such that even (2) above leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join sub-partitions serially in a loop, "appending" to the same final result table.
If I understand your question correctly, you want to use a broadcast join that replicates DataFrame B on every node so that the semi-join computation (i.e., using a join to filter id from DataFrame A) can compute independently on every node instead of having to communicate information back-and-forth between each other (i.e., shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run filtered_A.explain() to verify that a broadcast join is being used.
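What the left_semi join computes can be stated in plain Python. The sketch below only illustrates the semantics, with a hypothetical helper and made-up rows; the set plays the role of the broadcast copy of B that each node would hold.

```python
def left_semi_join(a_rows, b_ids):
    """Keep the rows of A whose id appears in B; no columns from B are added."""
    lookup = set(b_ids)  # stands in for the broadcast copy of B
    return [row for row in a_rows if row["id"] in lookup]


# Hypothetical sample rows for A and ids for B.
a_rows = [
    {"id": 1, "info1": "x", "info2": "p"},
    {"id": 2, "info1": "y", "info2": "q"},
    {"id": 3, "info1": "z", "info2": "r"},
]
filtered = left_semi_join(a_rows, b_ids=[3, 1, 99])
```

Each row of A is checked locally against the lookup, which is why no row of A has to move between nodes when B fits in memory.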
