broadcast() the same df multiple times. Is it cached?

In our application, we join the same dataframe with several other dataframes multiple times (not always on the same joining column), in separate queries.
This left-hand-side df is not very large, so a broadcast hint may be beneficial.
My questions:
If the same df gets broadcast multiple times, will the transfer occur once (i.e., is the broadcast data somehow cached on the executors), or multiple times?
If the joins involve different columns, will it still be cached, or does what gets broadcast depend on the join key?

The heuristics the Spark optimizer uses change as the product evolves, so the only way to know how your join will be handled is via explain(). Take a look at this presentation for information on how to optimize your joins.
As for the specific question about repeated broadcast joins of the same dataframe: whether Spark broadcasts the dataframe once or more than once depends on the memory state of the workers. If the initially broadcast dataframe is still present on the workers, Spark will not broadcast it again (from what I remember of reading the code a while back).
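For illustration, here is a minimal sketch of that explain() check; dimDf, factDf1/factDf2 and the key columns are hypothetical placeholders:
import org.apache.spark.sql.functions.broadcast
// Hint that the small left-hand-side df should be broadcast in both joins.
val joined1 = factDf1.join(broadcast(dimDf), "customer_id")
val joined2 = factDf2.join(broadcast(dimDf), "product_id")
// Look for BroadcastExchange / BroadcastHashJoin nodes in the physical plan.
joined1.explain()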

Related

Spark SQL: why does not Spark do broadcast all the time

I work on a project with Spark 2.4 on AWS S3 and EMR, and I have a left join between two huge datasets. The Spark execution is not stable; it frequently fails with memory issues.
The cluster has 10 machines of type m3.2xlarge; each machine has 16 vCores, 30 GiB of memory, and 160 GB of SSD storage.
I have a configuration like this:
"--executor-memory",
"6512M",
"--driver-memory",
"12g",
"--conf",
"spark.driver.maxResultSize=4g",
"--conf",
"spark.sql.autoBroadcastJoinThreshold=1073741824",
The left join happens between a left side of 150 GB and a right side of around 30 GB, so there is a lot of shuffling. My idea is to cut the right side into pieces small enough to broadcast, say 1 GB each, so that instead of shuffling, the data is broadcast. The only problem is that after the first left join, the left side will already contain the new columns from the right side, so the following left joins produce duplicate columns like col1_right_1, col2_right_1, col1_right_2, col2_right_2, and I have to rename col1_right_1/col1_right_2 to col1_left and col2_right_1/col2_right_2 to col2_left.
So I wonder: why does Spark let the shuffle happen instead of using broadcast everywhere? Shouldn't broadcast always be faster than a shuffle? Why doesn't Spark do the join the way I described, cutting one side into small pieces and broadcasting them?
Let's look at the two options.
If I understood correctly, you are performing a broadcast and a join for each piece of the dataframe, where the size of each piece is the maximum broadcast threshold.
The advantage here is that you are basically sending just one dataframe over the network, but you are performing multiple joins, and each join has an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it is joined with each partition of the other Dataset. That is, for the values of the join columns for each row (in each partition) of the other Dataset, the corresponding row is fetched from the broadcasted Dataset and the join is performed.
This means that for each batch of the broadcast join, each partition would have to scan the whole other dataset and perform the join.
Sort-merge and hash joins have to perform a shuffle (if the datasets are not equally partitioned), but their joins are far more efficient.
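To make the trade-off concrete, here is a hedged sketch of the chunk-and-broadcast idea (inner-join case for simplicity; leftDf, rightDf and the bucket count are illustrative, and left-join semantics would need extra care for unmatched rows):
import org.apache.spark.sql.functions.{abs, broadcast, col, hash}
// Split the right side into slices small enough to broadcast.
val numBuckets = 30
val bucketed = rightDf.withColumn("bucket", abs(hash(col("id"))) % numBuckets)
// One broadcast join per slice, each paying its own join overhead,
// then union the partial results back together.
val joined = (0 until numBuckets)
  .map(b => leftDf.join(broadcast(bucketed.filter(col("bucket") === b)), "id"))
  .reduce(_ union _)
Each pass re-scans leftDf, which is exactly the per-join overhead described above; a single sort-merge join pays the shuffle once instead.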

What Transformation should I apply on Spark DataFrame

I have 2 Spark dataframes (A and B) that share a common column/field (which is a primary key in dataframe A but not in B).
For each record/row in dataframe A, there are multiple records in dataframe B.
Based on that common column's value, I want to fetch all the records from dataframe B that match each record in dataframe A.
What kind of transformation should I perform in order to collect the records together without doing much shuffling?
To combine the records from 2 or more Spark dataframes, a join is necessary.
If your data is not partitioned/bucketed well, it will lead to a shuffle join, in which every node talks to every other node and they exchange data according to which node holds a certain key or set of keys (the ones you are joining on). These joins are expensive because the network can become congested with traffic.
The shuffle can be avoided if:
Both dataframes have a known partitioner or are bucketed.
One of the datasets is small enough to fit in memory, in which case we can do a broadcast hash join.
Partitioning
If you partition your data correctly prior to a join, execution can be much more efficient: even when a shuffle is planned, if data from two different DataFrames is already located on the same machine, Spark can avoid the shuffle.
import org.apache.spark.sql.functions.col
val df1Part = df1.repartition(col("id"))
val df2Part = df2.repartition(col("id"))
// you can optionally specify the number of partitions, like:
val df1Part10 = df1.repartition(10, col("id"))
// Join the dataframes on the id column
df1Part.join(df2Part, "id") // joining by column name avoids duplicate id columns in the output DF.
Broadcast Hash join
When one of the datasets is small enough to fit into the memory of a single worker node, we can optimize our join.
Spark will replicate the small DataFrame onto every worker node in the cluster (be it located on one machine or many). This sounds expensive, but it prevents the all-to-all communication during the join: communication happens only once at the beginning, after which each worker node can perform its work without waiting on or talking to any other worker node.
import org.apache.spark.sql.functions.broadcast
// explicitly specify the broadcast hint, though Spark applies it automatically for tables below spark.sql.autoBroadcastJoinThreshold.
df1.join(broadcast(df2), "id")

join DataFrames within partitions in PySpark

I have two dataframes with a large (millions to tens of millions) number of rows. I'd like to do a join between them.
In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.
Is this a pattern that I need to be following in Spark, or does that not matter? It seems at first glance like a lot of time is wasted shuffling data between partitions, because it hasn't been pre-partitioned correctly.
If it is necessary, then how do I do that?
How to define partitioning of DataFrame?
However, it makes sense only under two conditions:
There are multiple joins within the same application. (Partitioning itself requires a shuffle, so for a single join there is no added value; see the sketch after this list.)
It is a long-lived application where the shuffled data will be reused. (Spark cannot take advantage of the partitioning of data stored in an external format.)
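For condition 1, a minimal sketch in the Scala DataFrame API (the PySpark calls are analogous; dfA, dfB, dfC are placeholders):
import org.apache.spark.sql.functions.col
// Pay the repartition shuffle once, then reuse the layout across joins.
val aPart = dfA.repartition(col("id")).cache()
val j1 = aPart.join(dfB.repartition(col("id")), "id")
val j2 = aPart.join(dfC.repartition(col("id")), "id") // aPart is not reshuffled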

How to avoid shuffles while joining DataFrames on unique keys?

I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 Million rows
B only has the column id with 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
If B were very small, I know I could do something along the lines of
A.filter($"id" isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't exploit the uniqueness, and I'm afraid it would cause unnecessary shuffles.
What is the optimal method to achieve that task?
If you have not applied a partitioner to dataframe A, maybe this will help you understand the join and shuffle concepts.
Without a partitioner:
A.join(B, Seq("id"))
By default, this operation hashes all the keys of both dataframes, sends elements with the same key hash across the network to the same machine, and then joins the elements with the same key on that machine. Notice that both dataframes are shuffled across the network here.
With a HashPartitioner:
If you call partitionBy() when building A, Spark will know that it is hash-partitioned, and calls to join() on it will take advantage of this information. In particular, when you call A.join(B), Spark will shuffle only the B RDD. Since B has less data than A, you don't need to apply a partitioner to B.
ex:
import org.apache.spark.HashPartitioner
// Pair RDD keyed by id (Id and Info are placeholder types); hash-partition
// once and persist so later joins reuse the partitioning.
val A = sc.sequenceFile[Id, Info]("hdfs://...")
  .partitionBy(new HashPartitioner(100)) // create 100 partitions
  .persist()
A.join(B) // RDD join: only B is shuffled
The reference is from the Learning Spark book.
My default advice on how to optimize joins is:
Use a broadcast join if you can (from your question it seems your tables are large and a broadcast join is not an option).
One option in Spark is to perform a broadcast join (aka map-side join in the Hadoop world). With a broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions) by avoiding sending all the data of the large table over the network.
You can use the broadcast function to mark a dataset to be broadcast when used in a join operator. Spark uses the spark.sql.autoBroadcastJoinThreshold setting to control the size of a table that will be automatically broadcast to all worker nodes when performing a join.
Use the same partitioner.
If two RDDs have the same partitioner, the join will not cause a shuffle. Note however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
If the data is huge and/or your clusters cannot grow, such that even (2) above leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table, as in the sketch below.
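A rough sketch of that two-pass approach, with hypothetical paths and a hypothetical bucket column derived from the join key:
import org.apache.spark.sql.functions.{abs, col, hash}
// Pass 1: bucket both sides by the join key and persist as partitioned tables.
val nBuckets = 16
dfA.withColumn("bucket", abs(hash(col("id"))) % nBuckets)
  .write.partitionBy("bucket").parquet("/tmp/a_by_bucket")
dfB.withColumn("bucket", abs(hash(col("id"))) % nBuckets)
  .write.partitionBy("bucket").parquet("/tmp/b_by_bucket")
// Pass 2: join matching sub-partitions one at a time, appending the results.
for (i <- 0 until nBuckets) {
  val a = spark.read.parquet(s"/tmp/a_by_bucket/bucket=$i")
  val b = spark.read.parquet(s"/tmp/b_by_bucket/bucket=$i")
  a.join(b, "id").write.mode("append").parquet("/tmp/joined")
}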
If I understand your question correctly, you want a broadcast join that replicates DataFrame B on every node, so that the semi-join computation (i.e., using a join to filter ids from DataFrame A) can run independently on every node instead of nodes communicating back and forth with each other (i.e., a shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run filtered_A.explain() to verify that a broadcast join is being used.

Spark Broadcasting Alternatives

Our application uses a long-running Spark context (just like the Spark REPL) to enable users to perform tasks online. We use Spark broadcasts heavily to process dimensional data. As is common practice, we broadcast the dimension tables and use the dataframe APIs to join the fact table with the other dimension tables. One of the dimension tables is quite big: about 100k records and 15 MB in memory (Kryo-serialized it is just a few MB smaller).
We see that every Spark job on the denormalized dataframe causes all the dimensions to be broadcast over and over again. The bigger table takes ~7 seconds every time it is broadcast. We are trying to find a way to have the dimension tables broadcast only once per context lifespan. We tried both sqlcontext and sparkcontext broadcasting.
Are there any other alternatives to Spark broadcasting? Or is there a way to reduce the memory footprint of the dataframe (compression/serialization, etc.; post-Kryo it is still 15 MB)?
Possible Alternative
We use the Ignite Spark integration to load a large amount of data at the start of the job and keep mutating it as needed.
In embedded mode you can start Ignite when the Spark context boots and kill it at the end.
You can read more about it here.
https://ignite.apache.org/features/igniterdd.html
Finally, we were able to find a stopgap solution until Spark supports pinning of RDDs in a later version; this is apparently not addressed even in v2.1.0.
The solution relies on RDD mapPartitions; below is a brief summary of the approach.
Collect the dimension table records as a map of key-value pairs and broadcast it using the Spark context (you can possibly use RDD.keyBy).
Map the fact rows using the RDD mapPartitions method. For each fact row:
collect the dimension ids in the fact row and look up the dimension records;
yield a new fact row by denormalizing the dimension ids in the fact table.
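A minimal sketch of that pattern; the Dim/Fact case classes, dimRdd and factRdd are hypothetical:
case class Dim(id: Long, name: String)
case class Fact(dimId: Long, value: Double)
// Build the lookup map once on the driver and broadcast it; it then stays
// on the executors for the lifetime of the context instead of being
// re-broadcast on every job.
val dimMap = dimRdd.keyBy(_.id).mapValues(_.name).collectAsMap()
val dimBc = sc.broadcast(dimMap)
// Denormalize fact rows partition by partition using the broadcast map.
val denorm = factRdd.mapPartitions { rows =>
  val dims = dimBc.value
  rows.map(f => (f.value, dims.getOrElse(f.dimId, "unknown")))
}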
