Can you do a broadcast join with SparkR? - apache-spark

I'm trying to join a large dataframe to a smaller dataframe and I saw that broadcast join is an efficient way to do that, according to this post.
However, I couldn't find the broadcast function in the SparkR documentation.
So I'm wondering if you can do a broadcast join with SparkR?

Spark 2.3:
A broadcast function will be added by this pull request: https://github.com/apache/spark/pull/17965/files
Spark 2.2:
You can provide a custom hint to the query:
head(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))
Reference: this code: https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L3740
The broadcast function in the Java, Scala, and Python APIs is also just a wrapper that adds a broadcast hint. A hint gives the optimizer additional information: this DataFrame is small, I (the user) guarantee it, so you should broadcast it before joining it with other DataFrames.
Side note:
Spark sometimes performs broadcast joins automatically. You can control automatic broadcast joins by setting:
spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")
Here, -1 means that no DataFrame will be automatically broadcast for a broadcast join. You can read more about this topic here.
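As a rough PySpark illustration of the setting (assuming an existing SparkSession named spark; the DataFrames are made up for this sketch):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # disable automatic broadcast
large = spark.range(1000000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")
# With the threshold disabled and no hint, the physical plan falls back
# to a sort-merge join; verify with explain().
large.join(small, "key").explain()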

Related

How Spark broadcasts the data in a Broadcast Join

How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint, it calls this function:
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
}
which internally calls the apply method of Dataset and sets the logical plan using ResolvedHint:
val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
But what happens after this? How does it actually work, and where is the code for that?
What if the small dataset (the one we are going to broadcast) has multiple partitions: does Spark combine all the partitions and then broadcast?
Does it broadcast to the driver first and then to the executors?
What is BitTorrent?
Regarding 1 & 2: during a broadcast join, the data is collected on the driver; what happens afterwards depends on the join algorithm:
For BroadcastHashJoin (BHJ), the driver builds a hash table, and this table is then distributed to the executors.
For BroadcastNestedLoopJoin, the broadcast dataset is distributed to the executors as an array.
So as you can see, the initial partitioning is not kept, and the whole broadcast dataset needs to fit into the driver's memory (otherwise the job will fail with an OOM error on the driver).
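A small PySpark sketch (DataFrame names are made up) that makes both strategies visible in the physical plan:

from pyspark.sql.functions import broadcast

big = spark.range(1000000).withColumnRenamed("id", "k")
tiny = spark.range(10).withColumnRenamed("id", "k")
# Equi-join with a broadcast hint -> BroadcastHashJoin: the hinted side
# is collected on the driver, built into a hash table, and shipped out.
big.join(broadcast(tiny), "k").explain()
# No equality keys -> BroadcastNestedLoopJoin: the hinted side is
# shipped to the executors as a plain array instead.
big.join(broadcast(tiny), big["k"] > tiny["k"]).explain()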
Regarding 3: what exactly do you want to know?
In Spark there is TorrentBroadcast, which is a BitTorrent-like implementation of broadcast. I don't know much about it (I never had to dig that deep), but if you want to know more, I think you can start here:
TorrentBroadcast docu
TorrentBroadcast source code
HttpBroadcast docs - another broadcast algorithm

Spark Exception “Cannot broadcast the table that is larger than 8GB”, 'spark.sql.autoBroadcastJoinThreshold': '-1' not working

In one of our PySpark jobs we have a scenario where we join a large data frame to a relatively small data frame. I believe Spark is using a broadcast join, and we ran into the following error:
org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 8 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I tried disabling the broadcast join by setting 'spark.sql.autoBroadcastJoinThreshold': '-1' as part of spark-submit:
/usr/bin/spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 /home/hadoop/scripts/job.py
I tried printing the value of spark.sql.autoBroadcastJoinThreshold using
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
and it returns -1. However, even after this change I am getting the error
org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 8 GB
The Spark version is 2.3.0.
Any help is appreciated.
You are probably using the broadcast function explicitly. Even if you set spark.sql.autoBroadcastJoinThreshold=-1, an explicit broadcast function call will still result in a broadcast join.
Another reason might be that you are doing a Cartesian join or non-equi join, which ends up as a broadcast nested loop join (BNLJ). As mentioned, you should use explain to understand what is happening.
To convert an optimized logical plan into a physical plan, Spark uses a set of strategies. For joins, Spark uses JoinSelection.
The way it works is documented here - https://github.com/apache/spark/blob/aefb2e7/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L326
Join physical operator selection requirements for BroadcastNestedLoopJoinExec:
There are no join keys and one of the following holds:
1) Join type is CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI or ExistenceJoin (i.e. canBuildRight for the input joinType is positive) and right join side can be broadcast
2) Join type is CROSS, INNER or RIGHT OUTER (i.e. canBuildLeft for the input joinType is positive) and left join side can be broadcast
OR
No other join operator has matched already
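A hedged PySpark sketch of that last clause (the DataFrames are made up): a non-equi outer join has no join keys, so even with the auto-broadcast threshold disabled, JoinSelection falls through to BroadcastNestedLoopJoinExec and still broadcasts one side, which is how the 8 GB error can appear despite the setting:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
events = spark.range(1000).withColumnRenamed("id", "ts")
windows = spark.range(10).withColumnRenamed("id", "start")
# LEFT OUTER join without an equality predicate: no other join operator
# matches, so the right side is broadcast regardless of the threshold.
events.join(windows, events["ts"] >= windows["start"], "left").explain()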
In our case, the smaller data frame used in the join was reused in multiple places. Caching that data frame before the join resolved the issue.
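A minimal sketch of that workaround in PySpark (the DataFrames are illustrative):

small_df = spark.range(100).withColumnRenamed("id", "key").cache()
small_df.count()  # materialize the cache once
big_a = spark.range(100000).withColumnRenamed("id", "key")
big_b = spark.range(200000).withColumnRenamed("id", "key")
# Both joins now reuse the cached copy instead of recomputing small_df.
res_a = big_a.join(small_df, "key")
res_b = big_b.join(small_df, "key")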
Why don't you explain the join and look at the physical plan? By default it will use a broadcast join, and if you disable that, it will use a sort-merge join.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
# should give 10 MB as the default
and if you disable it:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
# -1
Again, you should use explain to understand what is happening.

What would be the scope of a broadcast and repartitioned dataframe?

If I have a Spark dataframe (rightdf) of ~4 GB, can this piece of code be used in a join:
leftdf.join(broadcast(rightdf.repartition(2)))?
It doesn't make sense to repartition rightdf before broadcasting it. In fact, when you use a broadcast join, the whole broadcast dataframe (in your case rightdf) will be sent to each node in the cluster.
So calling repartition before broadcast will only decrease your performance.
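In other words, broadcast the DataFrame as-is; a sketch in PySpark, reusing leftdf/rightdf from the question (the join key is a placeholder):

from pyspark.sql.functions import broadcast
# The broadcast copy is shipped whole to every executor, so the number
# of partitions of rightdf is irrelevant; skip the repartition.
result = leftdf.join(broadcast(rightdf), "key")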

broadcast() the same df multiple times. Is it cached?

In our application, we happen to join the same dataframe with several other dataframes multiple times (not always on the same joining column), in separate queries.
This left-hand side df is not very large, so a broadcast hint may be beneficial.
My questions:
If the same df gets broadcast multiple times, will the transfer occur once (the broadcast data somehow cached on the executors), or multiple times?
If the joins concern different columns, will it be cached as well, or does what is broadcast depend on the join key?
The heuristics the Spark optimizer uses change as the product evolves. The only way to know how your join would be handled is via explain. Take a look at this presentation for information on how to optimize your joins.
As for the specific question about repeated broadcast joins of the same dataframe, whether Spark will broadcast the dataframe once or more than once will depend on the memory state of the workers. If the initially broadcast dataframe is still present on the workers, Spark will not broadcast it again (from what I remember of reading the code a while back).
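A sketch of the explain-based check in PySpark (made-up DataFrames): within a single query Spark can reuse a broadcast exchange (visible as a ReusedExchange node when exchange reuse is enabled); across separate queries the plans will not tell you, which is why it comes down to executor memory state:

from pyspark.sql.functions import broadcast
shared = spark.range(100).withColumnRenamed("id", "k")
a = spark.range(1000).withColumnRenamed("id", "k")
b = spark.range(2000).withColumnRenamed("id", "j")
# Same small df broadcast in two separate joins on different columns;
# inspect each physical plan for the broadcast exchange.
a.join(broadcast(shared), "k").explain()
b.join(broadcast(shared), b["j"] == shared["k"]).explain()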

Spark SQL broadcast hash join

I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL%20%26%20DataFrames/05%20BroadcastHashJoin%20-%20scala.html
In that example, the (small) DataFrame is persisted via saveAsTable and then there's a join via spark SQL (i.e. via sqlContext.sql("..."))
The problem I have is that I need to use the sparkSQL API to construct my SQL (I am left joining ~50 tables with an ID list, and don't want to write the SQL by hand).
How do I tell spark to use the broadcast hash join via the API? The issue is that if I load the ID list (from the table persisted via `saveAsTable`) into a `DataFrame` to use in the join, it isn't clear to me if Spark can apply the broadcast hash join.
You can explicitly mark the DataFrame as small enough for broadcasting using the broadcast function:
Python:
from pyspark.sql.functions import broadcast
small_df = ...
large_df = ...
large_df.join(broadcast(small_df), ["foo"])
or broadcast hint (Spark >= 2.2):
large_df.join(small_df.hint("broadcast"), ["foo"])
Scala:
import org.apache.spark.sql.functions.broadcast
val smallDF: DataFrame = ???
val largeDF: DataFrame = ???
largeDF.join(broadcast(smallDF), Seq("foo"))
or broadcast hint (Spark >= 2.2):
largeDF.join(smallDF.hint("broadcast"), Seq("foo"))
SQL
You can use hints (Spark >= 2.2):
SELECT /*+ MAPJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCASTJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCAST(small) */ *
FROM large JOIN small
ON large.foo = small.foo
R (SparkR):
With hint (Spark >= 2.2):
join(large, hint(small, "broadcast"), large$foo == small$foo)
With broadcast (Spark >= 2.3):
join(large, broadcast(small), large$foo == small$foo)
Note:
Broadcast join is useful if one of the structures is relatively small. Otherwise it can be significantly more expensive than a full shuffle.
join_rdd = sqlContext.sql("""select * from people_in_india p
                             join states s
                             on p.state = s.name""")
join_rdd.toDebugString() / join_rdd.explain():
shuffledHashJoin:
all the data for India will be shuffled into only 29 keys, one for each of the states.
Problems:
uneven sharding.
limited parallelism with 29 output partitions.
broadcastHashJoin:
broadcasts the small RDD to all worker nodes.
the parallelism of the large RDD is maintained and a shuffle is not even required.
With a broadcast join, one side of the join equation is materialized and sent to all mappers; it is therefore considered a map-side join.
As the dataset is materialized and sent over the network, it only brings a significant performance improvement if it is considerably small.
So if you are trying to perform smallDF.join(largeDF), wait! Another constraint is that the broadcast side also needs to fit completely into the memory of each executor. It also needs to fit into the memory of the driver!
Broadcast variables are shared among executors using the Torrent protocol, i.e. a peer-to-peer protocol. The advantage of the Torrent protocol is that peers share blocks of a file among each other without relying on a central entity holding all the blocks.
The examples above are sufficient to start playing with broadcast joins.
Note:
You cannot modify the broadcast value after creation.
If you try, the change will only be visible on one node.
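That note is about broadcast variables in general; a tiny PySpark sketch of their read-only semantics (the lookup table is made up):

lookup = spark.sparkContext.broadcast({"CA": "California", "NY": "New York"})
rdd = spark.sparkContext.parallelize(["CA", "NY", "CA"])
# Executors read the shared value; there is no API to update it in place.
print(rdd.map(lambda code: lookup.value.get(code, "unknown")).collect())
# Mutating lookup.value inside a task would only change the local copy
# on that single executor, never the other nodes.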
