what would be the scope of a broadcasted and repartitioned dataframe? - apache-spark

If I have a Spark dataframe (rightdf) of about 4 GB, can this piece of code be used in a join:
leftdf.join(broadcast(rightdf.repartition(2)))?

It doesn't make sense to repartition rightdf before broadcasting it. When you use a broadcast join, the whole broadcast dataframe (in your case rightdf) is sent to every node in the cluster.
So calling repartition before broadcast will only decrease your performance.
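A minimal sketch of that point (the join column "key" is an assumption, since the question doesn't show one): comparing the two physical plans with explain should show the same BroadcastExchange in both, with the repartition variant merely adding an extra Exchange (a shuffle of rightdf) beforehand.
import org.apache.spark.sql.functions.broadcast
// Plan should end in BroadcastExchange + BroadcastHashJoin.
leftdf.join(broadcast(rightdf), Seq("key")).explain()
// Same BroadcastExchange, preceded by an extra round-robin Exchange for repartition(2):
// an additional shuffle of rightdf that changes nothing about the broadcast itself.
leftdf.join(broadcast(rightdf.repartition(2)), Seq("key")).explain()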

Related

How does Spark broadcast the data in a Broadcast Join?

How does Spark broadcast the data when we use a Broadcast Join hint? As far as I can see, when we use the broadcast hint it calls this function:
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
}
which internally calls the apply method of Dataset and sets the logicalPlan using ResolvedHint:
val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
But what happens after this? How does it actually work, and where is the code for that?
1. What if we have multiple partitions of the small dataset (the one we are going to broadcast): does Spark combine all partitions and then broadcast?
2. Does it broadcast to the driver first and then send it to the executors?
3. What is BitTorrent?
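(A quick way to see where this hint ends up, assuming a big dataframe bigDf, a small one smallDf and a join key "id", none of which appear above: extended explain prints the logical plans, where the ResolvedHint shows up, followed by the physical plan with BroadcastExchange and BroadcastHashJoin.)
import org.apache.spark.sql.functions.broadcast
val q = bigDf.join(broadcast(smallDf), Seq("id"))
q.explain(true)  // parsed, analyzed and optimized logical plans, then the physical plan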
Regarding 1 & 2: during a broadcast join the data is collected on the driver, and what happens afterwards depends on the join algorithm:
For BroadcastHashJoin (BHJ), the driver builds a hash table, and this table is then distributed to the executors.
For BroadcastNestedLoopJoin, the broadcast dataset is distributed to the executors as an array.
So, as you can see, the initial structure (partitioning) is not kept here, and the whole broadcast dataset needs to fit into the driver's memory (otherwise the job will fail with an OOM error on the driver).
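A rough conceptual sketch of the BHJ case described above, done by hand at the RDD level (smallDf, bigDf, spark and the key column "id" are illustrative assumptions; this is not Spark's actual implementation):
import org.apache.spark.sql.Row
// 1. Collect the small side to the driver and build a lookup table there
//    (assumes unique "id" values; Spark's real hashed relation handles duplicates).
val smallAsMap: Map[Int, Row] =
  smallDf.collect().map(r => r.getInt(r.fieldIndex("id")) -> r).toMap
// 2. The driver ships that table to every executor as a broadcast variable.
val broadcastMap = spark.sparkContext.broadcast(smallAsMap)
// 3. Every partition of the big side probes the table locally; the big side is never shuffled.
val joined = bigDf.rdd.flatMap { row =>
  broadcastMap.value.get(row.getInt(row.fieldIndex("id"))).map(small => (row, small))
}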
Regarding 3: what exactly do you want to know?
In Spark there is TorrentBroadcast, which is a BitTorrent-like implementation of broadcast. I don't know much about it (I never had to dig that deep), but if you want to know more I think you can start here:
TorrentBroadcast docs
TorrentBroadcast source code
HttpBroadcast docs - another broadcast algorithm

Spark SQL: why doesn't Spark broadcast all the time?

I work on a project with Spark 2.4 on AWS S3 and EMR, and I have a left join between two huge chunks of data. The Spark execution is not stable; it fails frequently with memory issues.
The cluster has 10 machines of type m3.2xlarge; each machine has 16 vCores, 30 GiB of memory and 160 GB of SSD storage.
I have a configuration like this:
"--executor-memory",
"6512M",
"--driver-memory",
"12g",
"--conf",
"spark.driver.maxResultSize=4g",
"--conf",
"spark.sql.autoBroadcastJoinThreshold=1073741824",
The left join happens between a left side of 150 GB and a right side of around 30 GB, so there is a lot of shuffling. My solution would be to cut the right side into pieces small enough (like 1 GB each) so that, instead of a shuffle, the data gets broadcast. The only problem is that after the first left join, the left side already has the new columns from the right side, so the following left joins produce duplicated columns, like col1_right_1, col2_right_1, col1_right_2, col2_right_2, and I have to rename col1_right_1/col1_right_2 to col1_left and col2_right_1/col2_right_2 to col2_left.
So I wonder: why does Spark let the shuffle happen instead of using broadcast everywhere? Shouldn't broadcast always be faster than a shuffle? Why doesn't Spark do the join the way I described, cutting one side into small pieces and broadcasting them?
Let's look at the two options.
If I understood correctly, you are performing a broadcast and a join for each piece of the dataframe, where the size of each piece is the maximum broadcast threshold.
Here the advantage is that you are basically sending just one dataframe over the network, but you are performing multiple joins, and each join has an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it is joined with each partition of the other Dataset. That is, for the values of the join columns for each row (in each partition) of the other Dataset, the corresponding row is fetched from the broadcasted Dataset and the join is performed.
This means that for each batch of the broadcast join, every partition of the other dataset has to be scanned and the join performed again.
Sort-merge and hash joins have to perform a shuffle (if the datasets are not already partitioned the same way), but the joins themselves are much more efficient.
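A minimal sketch of how to compare the two options in practice (the dataframe names and the join column "key" are assumptions): adjust spark.sql.autoBroadcastJoinThreshold and look at the physical plan that explain prints.
// With the raised threshold from the question (1 GB), a right side below it gets broadcast:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1073741824L)
leftDf.join(rightDf, Seq("key"), "left").explain()  // expect BroadcastHashJoin if rightDf fits
// With automatic broadcast disabled, Spark falls back to a shuffle-based sort-merge join:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
leftDf.join(rightDf, Seq("key"), "left").explain()  // expect SortMergeJoin with an Exchange on both sides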

Can you do a broadcast join with SparkR?

I'm trying to join a large dataframe to a smaller dataframe and I saw that broadcast join is an efficient way to do that, according to this post.
However, I couldn't find a broadcast function in the SparkR documentation.
So I'm wondering: can you do a broadcast join with SparkR?
Spark 2.3:
A broadcast function will be added by this pull request: https://github.com/apache/spark/pull/17965/files
Spark 2.2:
You can provide a custom hint to the query:
head(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))
Reference: this code: https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L3740
The broadcast function in the Java, Scala and Python APIs is also just a wrapper for adding a broadcast hint. The hint means the optimizer gets additional information: this DataFrame is small, I, the user, guarantee it, so you should broadcast it before joining it with other DataFrames.
Side note:
Spark sometimes performs a Broadcast Join automatically. You can control the configuration of automatic Broadcast Joins by setting:
spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")
Here, -1 means that no DataFrame will be broadcast automatically for a Broadcast Join. You can read more about this topic here.

broadcast() multiple times the same df. Is it cached?

In our application we happen to join the same dataframe multiple times with several other dataframes (not always on the same join column), in separate queries.
This left-hand side df is not very large, so a broadcast hint may be beneficial.
My questions:
If the same df gets broadcast multiple times, will the transfer occur once (is the broadcast data somehow cached on the executors), or multiple times?
If the joins involve different columns, will it be cached as well, or does what gets broadcast depend on the join key?
The heuristics the Spark optimizer uses change as the product evolves. The only way to know how your join would be handled is via explain. Take a look at this presentation for information on how to optimize your joins.
As for the specific question about repeated broadcast joins of the same dataframe, whether Spark will broadcast the dataframe once or more than once will depend on the memory state of the workers. If the initially broadcast dataframe is still present on the workers, Spark will not broadcast it again (from what I remember of reading the code a while back).
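A small illustration of that advice (factDf1, factDf2, smallDf and the columns "a" and "b" are assumptions, not from the question): hint the same small df in two separate joins and inspect each physical plan; explain shows whether a BroadcastExchange is planned, while reuse of an already-transferred broadcast depends on the executors' state, as described above.
import org.apache.spark.sql.functions.broadcast
val q1 = factDf1.join(broadcast(smallDf), Seq("a"))
val q2 = factDf2.join(broadcast(smallDf), Seq("b"))
q1.explain()  // look for BroadcastExchange / BroadcastHashJoin in the plan
q2.explain()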

Spark broadcast join loads data to driver

As far as I know, when Spark performs a broadcast join it first collects the smallest (broadcast) RDD to the driver to make a broadcast variable from it, and only then uploads it to each target node.
Sometimes this leads to driver memory overflows if the broadcast RDD is larger than spark.driver.memory.
The question: why does it work this way? It would be more efficient to just shuffle the broadcast data between the target nodes, because the amount of data to shuffle is the same but we could avoid the driver overflow.
Example: say you have 3 nodes, 1 GB of data to broadcast on each node, and each node has 1 GB/s of throughput.
Spark approach - each node has to upload its piece of data (1 GB) to the driver and download the broadcast variable (3 * 1 GB = 3 GB), so each node transfers 4 GB in total and it takes 4 s.
Shuffle approach - one node has to upload 1 GB to each of the 2 other nodes and download 1 GB from each of them. Again, the total amount is 4 GB and it takes 4 s.
Firstly, a broadcast join is used for joining a big table with an extremely small table.
Secondly, if you used a shuffle instead of collecting the small df (table) back to the driver and then broadcasting it, you would notice that it's not only the small df that gets shuffled: the big df is shuffled at the same time, which is quite time-consuming.
"It is more efficient to just shuffle broadcast data between target nodes, because amount of data to shuffle is the same but we can avoid driver overflow.
-- that right, spark team is working on that:
https://issues.apache.org/jira/browse/SPARK-17556
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."
That is not correct. Spark doesn't use broadcasting for RDD joins.
Spark may use broadcasting for DataFrame joins, but it shouldn't be used to handle large objects. It is better to use a standard HashJoin for that.
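A minimal sketch of that last suggestion (bigDf, otherDf and the column "key" are assumptions, and the shuffle_hash hint only exists in Spark 3.0+, which is newer than this thread): either stop Spark from broadcasting automatically, or ask for a shuffled hash join explicitly.
// Option 1: make sure no table is broadcast automatically in this session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Option 2 (Spark 3.0+): hint a shuffled hash join explicitly and check the plan.
bigDf.join(otherDf.hint("shuffle_hash"), Seq("key")).explain()  // expect ShuffledHashJoin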

Resources