How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint it calls this function:
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
}
which internally calls the apply method of Dataset and sets the logical plan using ResolvedHint:
val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
But what happens after this? How does it actually work, and where is the code for that?
What if the small dataset (the one we are going to broadcast) has multiple partitions? Does Spark combine all the partitions and then broadcast?
Does it broadcast to the driver first and then send it to the executors?
What is BitTorrent?
Regarding 1 & 2: during a broadcast join the data is collected on the driver, and what happens later depends on the join algorithm.
For BroadcastHashJoin (BHJ) the driver builds a hash table and then this table is distributed to the executors.
For BroadcastNestedLoopJoin the broadcasted dataset is distributed as an array to the executors.
So as you can see, the initial partitioning is not kept here, and the whole broadcasted dataset needs to fit into the driver's memory (otherwise the job fails with an OOM error on the driver).
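To make points 1 & 2 concrete, here is a minimal sketch (the DataFrames are placeholders I made up) that marks one side for broadcasting and shows the resulting BroadcastHashJoin in the physical plan:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("bhj-sketch").getOrCreate()
val large = spark.range(0, 1000000).toDF("key")   // placeholder "big" side
val small = spark.range(0, 100).toDF("key")       // placeholder "small" side

// The hint marks `small` for broadcasting; for an equi-join the planner then
// picks BroadcastHashJoin, visible as BroadcastExchange + BroadcastHashJoin
// in the physical plan printed by explain().
large.join(broadcast(small), "key").explain()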
Regarding 3: what exactly do you want to know?
In Spark there is TorrentBroadcast, which is a BitTorrent-like implementation of broadcast. I don't know much about it (I never had to dig that deep), but if you want to know more I think you can start here:
TorrentBroadcast docs
TorrentBroadcast source code
HttpBroadcast docs - another broadcast algorithm
Related
I understand the concept of the broadcast optimization.
When one side of the join has little data, it's better to shuffle just the small side. But why isn't it possible to do this shuffle using only the executors? Why do we need to use the driver?
If each executor held a hash table mapping the records between the executors, I think it should work.
In the current implementation of Spark broadcast, the data is collected to the driver and then distributed, and that collect to the driver is the bottleneck I would like to avoid.
Any ideas how to achieve a similar optimization without the bottleneck of the driver's memory?
You are correct: the current implementation requires collecting the data to the driver before sending it across to the executors.
There is already a JIRA ticket SPARK-17556 addressing exactly what you are proposing:
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."
I have copied the proposed solution from an attached document to make this answer self-contained:
"To add a broadcastmethod to RDDto perform broadcastfrom executor, we need some support work as follow:
Construct BroadCastIdfrom driver, BroadCastManager will supply a method to do this.
// Called from driver to create new broadcast id
def newBroadcastId: Long = nextBroadcastId.getAndIncrement()
BroadCastManager should be able to create a broadcast with a specified id and a persist tag to indicate that this broadcast is an executor broadcast and that its data will be backed up on HDFS.
In TorrentBroadcast.writeBlocks, write the block to HDFS; readBlocks reads the block from local, remote, or HDFS, in that order of priority.
When constructing the Broadcast, we can control whether to upload the broadcast data block.
BroadCastManager exposes an API to put broadcast data into the block manager."
Our application uses a long-running Spark context (much like the Spark REPL) to let users perform tasks online. We use Spark broadcasts heavily to process dimensional data. As is common practice, we broadcast the dimension tables and use the DataFrame API to join the fact table with the dimension tables. One of the dimension tables is quite big, about 100k records and 15 MB in memory (Kryo-serialized it is only a few MB smaller).
We see that every Spark job on the denormalized DataFrame causes all the dimensions to be broadcast over and over again. The bigger table takes ~7 seconds every time it is broadcast. We are trying to find a way to have the dimension tables broadcast only once per context lifespan. We tried both SQLContext and SparkContext broadcasting.
Are there any alternatives to Spark broadcasting? Or is there a way to reduce the memory footprint of the DataFrame (compression/serialization etc.; post-Kryo it is still 15 MB)?
Possible Alternative
We use the Apache Ignite Spark integration to load a large amount of data at the start of the job and keep mutating it as needed.
In embedded mode you can start Ignite when the Spark context boots and kill it at the end.
You can read more about it here:
https://ignite.apache.org/features/igniterdd.html
Finally we were able to find a stopgap solution until Spark supports pinning of RDDs in a later version. This is apparently not addressed even in v2.1.0.
The solution relies on RDD mapPartitions; below is a brief summary of the approach (a sketch in code follows the list):
Collect the dimension table records as a map of key-value pairs and broadcast it using the Spark context. You can possibly use RDD.keyBy.
Map the fact rows using the RDD mapPartitions method.
For each fact row, mapPartitions:
collects the dimension ids in the fact row and looks up the dimension records,
yields a new fact row by denormalizing the dimension ids in the fact table.
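A minimal sketch of that approach, with made-up fact and dimension types purely for illustration:
import org.apache.spark.sql.SparkSession

case class Fact(id: Long, dimId: Int, amount: Double)
case class DenormFact(id: Long, dimName: String, amount: Double)

val spark = SparkSession.builder().appName("denorm-sketch").getOrCreate()
val sc = spark.sparkContext

// 1. Collect the dimension table as a key-value map and broadcast it once.
val dimRdd = sc.parallelize(Seq((1, "US"), (2, "EU"), (3, "APAC")))
val dimMap = sc.broadcast(dimRdd.collectAsMap())

// 2. Denormalize the fact rows with mapPartitions, looking dimensions up in the
//    broadcast map instead of performing a join.
val facts = sc.parallelize(Seq(Fact(10L, 1, 99.5), Fact(11L, 3, 12.0)))
val denormalized = facts.mapPartitions { rows =>
  rows.map(f => DenormFact(f.id, dimMap.value.getOrElse(f.dimId, "unknown"), f.amount))
}
denormalized.collect().foreach(println)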
When using the DataFrame broadcast function or the SparkContext broadcast function, what is the maximum object size that can be dispatched to all executors?
broadcast function:
The default is 10 MB, but we have used up to 300 MB; it is controlled by spark.sql.autoBroadcastJoinThreshold.
AFAIK, it all depends on the memory available, so there is no definite answer for this. What I would say is that it should be smaller than the large DataFrame, and you can estimate a DataFrame's size like below:
import org.apache.spark.util.SizeEstimator
logInfo(SizeEstimator.estimate(yourlargeorsmalldataframehere))
Based on this estimate you can pass the broadcast hint to the framework.
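For example, a minimal sketch of that idea (the DataFrames and the 10 MB threshold here are assumptions, and SizeEstimator only gives a rough figure):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.util.SizeEstimator

val spark = SparkSession.builder().appName("estimate-sketch").getOrCreate()
val largeDf = spark.range(0, 1000000).toDF("id")  // placeholder for your big table
val smallDf = spark.range(0, 1000).toDF("id")     // placeholder for your small table

val thresholdBytes = 10L * 1024 * 1024            // assume the default 10 MB threshold

// Rough in-memory estimate of the candidate side, as suggested above.
val estimated = SizeEstimator.estimate(smallDf)

// Only apply the explicit hint when the estimate is below the threshold.
val joined =
  if (estimated < thresholdBytes) largeDf.join(broadcast(smallDf), "id")
  else largeDf.join(smallDf, "id")
joined.explain()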
Also have a look at the Scaladoc in sql/execution/SparkStrategies.scala, which says:
Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides are below the threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
Sort merge: if the matching join keys are sortable.
If there is no joining keys, Join implementations are chosen with the following precedence:
BroadcastNestedLoopJoin: if one side of the join could be broadcasted
CartesianProduct: for Inner join
BroadcastNestedLoopJoin
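A quick way to see which of these strategies was chosen is explain(); a small sketch, assuming a spark-shell style spark session and placeholder DataFrames:
val big = spark.range(0, 1000000).toDF("id")   // placeholder DataFrames
val tiny = spark.range(0, 100).toDF("id")

// Disable auto-broadcast: the planner falls back to SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
big.join(tiny, "id").explain()

// Restore the default 10 MB threshold: the small side is broadcast again.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
big.join(tiny, "id").explain()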
Also have a look at other-configuration-options
SparkContext.broadcast (TorrentBroadcast):
A broadcast shared variable also has a property, spark.broadcast.blockSize, which defaults to 4m.
AFAIK there is no hard limit that I have seen for this either.
For further information please see TorrentBroadcast.scala.
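If you want to tune that block size, it can be set when the session is created; a minimal sketch (the 8m value is just an example, not a recommendation):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("blocksize-sketch")
  // Size of the chunks a broadcast value is split into for TorrentBroadcast; default is 4m.
  .config("spark.broadcast.blockSize", "8m")
  .getOrCreate()

val bc = spark.sparkContext.broadcast(Array.fill(1000000)(1))
println(bc.value.length)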
EDIT:
However, you can have a look at the 2 GB issue, even though it was not officially stated in the docs (I was not able to find anything of this kind in the docs).
Please look at SPARK-6235, which is in the "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.
As of Spark 2.4, there's an upper limit of 8 GB. Source Code
Update:
The 8 GB limit is still valid for Spark 3.2.1. Source Code
Update:
Still valid for Spark 3.4. Source code
As mentioned above, the upper limit is 8 GB. But when you have several files you want to broadcast, Spark pulls all of those data files to the driver. The driver joins those files and pushes the result to the executor nodes. In this process, if the driver's available memory is less than the combined broadcast data, you will end up with an out-of-memory error.
As far as I know, when Spark performs a broadcast join it first collects the smallest (broadcast) RDD to the driver to make a broadcast variable from it, and only then uploads it to each target node.
Sometimes this leads to driver memory overflows if the broadcast RDD is larger than spark.driver.memory.
The question: why does it work this way? It would be more efficient to just shuffle the broadcast data between the target nodes, because the amount of data to transfer is the same but we could avoid the driver overflow.
Example: say you have 3 nodes, 1 GB of data to broadcast on each node, and each node has 1 GB/s throughput.
Spark approach: each node has to upload its piece of data to the driver (1 GB) and download the broadcast variable (3 * 1 GB = 3 GB), so each node transfers 4 GB in total and it takes 4 s.
Shuffle approach: each node has to upload its 1 GB to the 2 other nodes and download 1 GB from each of them. Again, the total is 4 GB per node and it takes 4 s.
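The same arithmetic written out as a tiny sketch, using the numbers from the example above:
// 3 nodes, 1 GB per node, 1 GB/s links (numbers from the example above).
val nodes = 3
val pieceGb = 1.0

// Driver-based broadcast: upload your own piece, then download the full variable.
val driverPerNodeGb = pieceGb + nodes * pieceGb          // 1 + 3 = 4 GB per node

// All-to-all shuffle: send your piece to the other nodes, fetch theirs.
val shufflePerNodeGb = (nodes - 1) * pieceGb * 2         // 2 + 2 = 4 GB per node

println(s"driver approach: $driverPerNodeGb GB, shuffle approach: $shufflePerNodeGb GB")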
First, a broadcast join is used for joining a big table with an extremely small table.
Then, if you use a shuffle instead of collecting the small DataFrame (table) back to the driver and broadcasting it, you only notice that the small DataFrame is shuffled, but actually the big DataFrame is also shuffled at the same time, which is quite time consuming.
"It is more efficient to just shuffle broadcast data between target nodes, because amount of data to shuffle is the same but we can avoid driver overflow."
-- that's right, and the Spark team is working on that:
https://issues.apache.org/jira/browse/SPARK-17556
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."
That is not correct: Spark doesn't use broadcasting for RDD joins.
Spark may use broadcasting for DataFrame joins, but it shouldn't be used to handle large objects; it is better to use a standard hash join for those.
I am going through the Spark programming guide, which says:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Considering the above, what are the use cases of broadcast variables? What problems do broadcast variables solve?
When we create a broadcast variable like below, is the variable reference (here broadcastVar) available on all the nodes in the cluster?
val broadcastVar = sc.broadcast(Array(1, 2, 3))
How long are these variables available in the memory of the nodes?
If you have a huge array that is accessed from Spark closures, for example some reference data, this array will be shipped to each Spark node with the closure. For example, if you have a 10-node cluster with 100 partitions (10 partitions per node), this array will be distributed at least 100 times (10 times to each node).
If you use broadcast, it will be distributed once per node using an efficient p2p protocol.
val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)
And some RDD
val rdd: RDD[Int] = ???
In this case, the array will be shipped with the closure each time:
rdd.map(i => array.contains(i))
whereas with broadcast you'll get a huge performance benefit:
rdd.map(i => broadcasted.value.contains(i))
Broadcast variables are used to send shared data (for example application configuration) across all nodes/executors.
The broadcast value will be cached in all the executors.
Sample Scala code creating a broadcast variable on the driver:
val broadcastedConfig:Broadcast[Option[Config]] = sparkSession.sparkContext.broadcast(objectToBroadcast)
Sample Scala code reading the broadcast variable on the executor side:
val config = broadcastedConfig.value
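Regarding how long they stay in memory: the broadcast value stays cached on the executors until it is unpersisted or destroyed (or the application stops). A small sketch of the lifecycle, assuming a spark-shell style spark session:
val broadcastVar = spark.sparkContext.broadcast(Array(1, 2, 3))

// Used inside tasks; fetched once per executor and cached there.
val hits = spark.sparkContext.parallelize(1 to 10)
  .filter(i => broadcastVar.value.contains(i))
  .count()

// Drop the cached copies on the executors; if the broadcast is used again,
// it will be re-sent.
broadcastVar.unpersist()

// Remove all data and metadata, including on the driver; it cannot be used afterwards.
broadcastVar.destroy()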