Spark SQL: why doesn't Spark broadcast all the time?

I work on a project using Spark 2.4 on AWS S3 and EMR, and I have a left join between two huge datasets. The Spark execution is not stable; it frequently fails with memory issues.
The cluster has 10 machines of type m3.2xlarge; each machine has 16 vCores, 30 GiB of memory, and 160 GB of SSD storage.
I have a configuration like this:
"--executor-memory",
"6512M",
"--driver-memory",
"12g",
"--conf",
"spark.driver.maxResultSize=4g",
"--conf",
"spark.sql.autoBroadcastJoinThreshold=1073741824",
The left join happens between a left side of 150 GB and a right side of around 30 GB, so there is a lot of shuffling. My solution would be to cut the right side into pieces that are small enough, like 1 GB, so that the data is broadcast instead of shuffled. The only problem is that after the first left join, the left side already has the new columns from the right side, so the following left joins produce duplicated columns, like col1_right_1, col2_right_1, col1_right_2, col2_right_2, and I have to rename col1_right_1/col1_right_2 to col1_left and col2_right_1/col2_right_2 to col2_left.
So I wonder: why does Spark allow shuffles to happen instead of using broadcast everywhere? Shouldn't a broadcast always be faster than a shuffle? Why doesn't Spark do the join the way I described, cutting one side into small pieces and broadcasting them?

Let's look at the two options.
If I understood correctly, you are performing a broadcast and a join for each piece of the dataframe, where the size of each piece is the maximum broadcast threshold.
The advantage here is that you are basically sending just one dataframe over the network, but you are performing multiple joins. Each join has an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it
is joined with each partition of the other Dataset. That is, for the
values of the join columns for each row (in each partition) of the
other Dataset, the corresponding row is fetched from the broadcasted
Dataset and the join is performed.
This means that for each batch of the broadcast join, you would have to scan the whole other dataset, partition by partition, and perform the join.
Sort-merge or hash joins have to perform a shuffle (if the datasets are not partitioned the same way), but their joins are far more efficient.
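To put rough numbers on this trade-off, here is a back-of-the-envelope sketch. It is a simplified model, not Spark's actual cost model: it assumes every broadcast chunk ends up fully copied on every executor, and that a shuffle sends each row of both sides over the network at most once. The object and method names are made up for illustration.

```scala
object JoinTrafficEstimate {
  // Data replicated over the network when the right side is broadcast in
  // chunks: every chunk is copied to every executor, so the whole right
  // side is replicated once per executor.
  def chunkedBroadcastGb(rightGb: Long, executors: Int): Long =
    rightGb * executors

  // Number of separate joins (and passes over the left side) needed when
  // each broadcast chunk is at most chunkGb in size.
  def passesOverLeft(rightGb: Long, chunkGb: Long): Long =
    (rightGb + chunkGb - 1) / chunkGb // ceiling division

  // Upper bound on data moved by a shuffle join: each row of both sides
  // is sent to at most one other executor.
  def shuffleGb(leftGb: Long, rightGb: Long): Long =
    leftGb + rightGb
}
```

With the sizes from the question (150 GB left, 30 GB right, 1 GB chunks) and a hypothetical 30 executors, `chunkedBroadcastGb(30, 30)` gives 900 GB replicated and `passesOverLeft(30, 1)` gives 30 joins over the 150 GB side, against at most `shuffleGb(150, 30)` = 180 GB for a single shuffle join.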

Related

How does cartesian product join transfer data internally for join in spark?

We have 2 types of nested loop join in Spark:
Broadcast nested loop join
Shuffle-and-replicate nested loop join (cartesian)
I know that in the broadcast one, the smaller table is broadcast to all the nodes for the join.
I am not sure what happens in a cartesian join.
Say we have 2 DataFrames A and B. Then each partition of A needs all the partitions of B for the join. So in a way B will need to be replicated to all the nodes where a partition of A is present. Doesn't that make it the same as B being broadcast?
Please correct my understanding. Thanks!
I will have a go. I encourage edits to improve the answer if needed.
Cartesian (cross) join is a shuffle join. A shuffle is best defined as a computational realignment that results in inter/intra executor core communication or data sharing. Shuffle joins result in worker nodes, and potentially every executor core, communicating with one another during the entire join process. They are damn expensive because the network can easily give in to the traffic congestion caused by excessive communication between the worker nodes.
Note that broadcast does not occur. The driver, using the partition properties at its disposal, plans how each df is read and how the data is distributed across the worker nodes.
To demonstrate that there is no broadcast, let's cross join 2 dfs with 100000000 rows and review the DAG. In this case, the join keys in the 2 dfs had string values and white spaces too, and I have a worker node with 2 cores.
The two dfs are read into the executor. The join keys are partitioned in parallel. These join keys are not returned to the driver. Each executor core holds the partitions in memory and stores them as shuffle files.
Next, the cartesian product occurs. The partitions output above are combined before new partitions are computed.
Because this is happening within one executor core residing on a worker node, there is no further data exchange. Consequently, the partitions are zipped, and the join happens. The result is forwarded to the driver, the driver communicates the result to your application, and it is displayed.
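The way partitions are combined can be sketched in plain Scala, without Spark. This is only an illustration under the assumption that each pair of input partitions produces one output partition, which is how an RDD-level cartesian product is laid out; the names are hypothetical.

```scala
object CartesianSketch {
  type Partition[T] = Seq[T]

  // Cross product of the rows of two partitions.
  def crossRows[A, B](pa: Partition[A], pb: Partition[B]): Partition[(A, B)] =
    for (x <- pa; y <- pb) yield (x, y)

  // Pair every partition of `a` with every partition of `b`; each pair
  // becomes one output partition.
  def cartesian[A, B](a: Seq[Partition[A]], b: Seq[Partition[B]]): Seq[Partition[(A, B)]] =
    for (pa <- a; pb <- b) yield crossRows(pa, pb)
}
```

Two inputs with 2 partitions each give 4 output partitions. Each task only needs one partition of A and one partition of B at a time, so B is never materialized whole on every node the way a broadcast would require.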

How does merge-sort join work in Spark and why can it throw OOM?

I want to understand the concept of merge-sort join in Spark in depth.
I understand the overall idea: this is the same approach as in the merge sort algorithm: take 2 sorted datasets, compare the first rows, write the smallest one, repeat.
I also understand how I can implement a distributed merge sort.
But I cannot get how it is implemented in Spark with respect to the concepts of partitions and executors.
Here is my take.
Say I need to join 2 tables A and B. The tables are read from Hive via Spark SQL, if this matters.
By default Spark uses 200 partitions.
Spark will then calculate the join key range (from minKey(A,B) to maxKey(A,B)) and split it into 200 parts. Both datasets are split by key range into 200 parts: A-partitions and B-partitions.
Each A-partition and each B-partition that relate to the same key range are sent to the same executor and are sorted there separately from each other.
Now 200 executors can join 200 A-partitions with 200 B-partitions, with the guarantee that they share the same key range.
The join happens via the merge-sort algorithm: take the smallest key from the A-partition, compare it with the smallest key from the B-partition, write a match, or iterate.
Finally, I have 200 partitions of my data which are joined.
Does it make sense?
Issues:
Skewed keys. If some key range comprises 50% of the dataset's keys, some executor would suffer, because too many rows would go to the same partition.
It can even fail with OOM while trying to sort a too-big A-partition or B-partition in memory (I cannot get why Spark cannot sort with disk spill, as Hadoop does). Or maybe it fails because it tries to read both partitions into memory for joining?
So, this was my guess. Could you please correct me and help me understand the way Spark works?
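The merge step described above, applied to one pair of already sorted partitions, can be sketched in plain Scala. This is a minimal illustration of the general merge-join technique, not Spark's SortMergeJoinExec; it assumes integer keys and an inner join, and the names are made up.

```scala
object MergeJoinSketch {
  // Inner-join two sequences of (key, value) pairs already sorted by key.
  def mergeJoin[A, B](left: IndexedSeq[(Int, A)],
                      right: IndexedSeq[(Int, B)]): Vector[(Int, A, B)] = {
    val out = Vector.newBuilder[(Int, A, B)]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val lk = left(i)._1
      val rk = right(j)._1
      if (lk < rk) i += 1        // left key is smaller: advance left
      else if (lk > rk) j += 1   // right key is smaller: advance right
      else {
        // Equal keys: find the run of right rows with this key and pair
        // each of them with every left row carrying the same key.
        var jEnd = j
        while (jEnd < right.length && right(jEnd)._1 == lk) jEnd += 1
        while (i < left.length && left(i)._1 == lk) {
          var k = j
          while (k < jEnd) {
            out += ((lk, left(i)._2, right(k)._2))
            k += 1
          }
          i += 1
        }
        j = jEnd
      }
    }
    out.result()
  }
}
```

Note that the run of equal-key rows on one side has to be revisited for every matching row on the other side, so a key with very many duplicates is the expensive case even when both inputs are sorted.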
This is a common problem with joins on MPP databases and Spark is no different. As you say, to perform a join, all the data for the same join key value must be colocated so if you have a skewed distribution on the join key, you have a skewed distribution of data and one node gets overloaded.
If one side of the join is small you could use a map side join. The Spark query planner really ought to do this for you but it is tunable - not sure how current this is but it looks useful.
Did you run ANALYZE TABLE on both tables?
If you have a key on both sides that won't break the join semantics you could include that in the join.
why Spark cannot sort with disk spill, as Hadoop does?
Spark's merge-sort join does spill to disk. Looking at Spark's SortMergeJoinExec class, it uses ExternalAppendOnlyUnsafeRowArray, which is described as:
An append-only array for UnsafeRows that strictly keeps content in an in-memory array until numRowsInMemoryBufferThreshold is reached post which it will switch to a mode which would flush to disk after numRowsSpillThreshold is met (or before if there is excessive memory consumption)
This is consistent with the experience of seeing tasks spilling to disk during a join operation from the Web UI.
why [merge-sort join] can throw OOM?
From the Spark Memory Management overview:
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
That is, in the case of a join, increase spark.sql.shuffle.partitions to reduce the size of the partitions and of the resulting hash table, and correspondingly reduce the risk of OOM.
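The effect of the partition count on task size can be illustrated with a small simulation. It is a sketch only: it assigns integer keys to partitions by a plain non-negative modulo, which stands in for Spark's real hash function, and the names are hypothetical.

```scala
object PartitionSizeSketch {
  // Assign each key to a partition by non-negative modulo and return the
  // row count of the largest partition.
  def maxPartitionSize(keys: Seq[Int], numPartitions: Int): Int = {
    val sizes = keys
      .groupBy(k => Math.floorMod(k, numPartitions))
      .values
      .map(_.size)
    if (sizes.isEmpty) 0 else sizes.max
  }
}
```

For 1000 evenly spread keys, going from 10 to 100 partitions shrinks the largest partition from 100 rows to 10. If 500 of the rows share a single key, the largest partition only drops from 550 to 505 rows: more partitions help with overall size, but all rows for one key still land in the same task.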

What Transformation should I apply on Spark DataFrame

I have 2 Spark dataframes (A & B) having a common column/field in both (which is a primary key in DataFrame A but not in B).
For each record/row in dataframe A, there are multiple records in dataframe B.
Based on that common column value I want to fetch all records from dataframe B against each record in dataframe A.
What kind of transformation should I perform in order to collect the records together without doing much shuffling?
To combine the records from 2 or more Spark DataFrames, a join is necessary.
If your data is not partitioned or bucketed well, it will lead to a shuffle join, in which every node talks to every other node and they share data according to which node has a certain key or set of keys (on which you are joining). These joins are expensive because the network can become congested with traffic.
The shuffle can be avoided if:
Both DataFrames have a known partitioner or are bucketed.
One of the datasets is small enough to fit in memory, in which case we can do a broadcast hash join.
Partitioning
If you partition your data correctly prior to a join, you can end up with much more efficient execution because even if a shuffle is planned, if data from two different DataFrames is already located on the same machine, Spark can avoid the shuffle.
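import org.apache.spark.sql.functions.col
// repartition returns a new DataFrame, so keep the result
val df1ById = df1.repartition(col("id"))
val df2ById = df2.repartition(col("id"))
// you can optionally specify the number of partitions, like:
// df1.repartition(10, col("id"))
// Join the DataFrames on the id column
df1ById.join(df2ById, "id") // joining on the column name avoids duplicate id columns in the output DF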
Broadcast Hash join
When one of the Datasets is small enough to fit into the memory of a single worker node, we can optimize our join.
Spark will replicate the small DataFrame onto every worker node in the cluster (be it located on one machine or many). Now this sounds expensive. However, what this does is prevent us from performing the all-to-all communication during the entire join process. Instead, the communication happens only once at the beginning, and then each individual worker node performs its work without having to wait for or communicate with any other worker node.
import org.apache.spark.sql.functions.broadcast
// explicitly specify the broadcast hint, though Spark may choose a broadcast join on its own.
df1.join(broadcast(df2), "id")
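What each worker does after the small side has been replicated can be sketched without Spark. This is a simplified illustration of a hash join against a broadcast table, assuming integer keys and an inner join; the names are made up and it is not Spark's implementation.

```scala
object BroadcastHashJoinSketch {
  // `small` plays the broadcast side; `largePartitions` are the partitions
  // of the big side, which stay where they are.
  def join[A, B](largePartitions: Seq[Seq[(Int, A)]],
                 small: Seq[(Int, B)]): Seq[Seq[(Int, A, B)]] = {
    // Build the lookup table once; in Spark a copy would live on every executor.
    val table: Map[Int, Seq[B]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    // Probe the table independently for each partition of the large side.
    largePartitions.map { partition =>
      partition.flatMap { case (key, a) =>
        table.getOrElse(key, Seq.empty[B]).map(b => (key, a, b))
      }
    }
  }
}
```

Each partition of the large side is joined using only the local copy of the table, so no rows of the large side move between workers. The price is that the whole table must fit in memory wherever it is copied.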

How to avoid shuffles while joining DataFrames on unique keys?

I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 Million rows
B only has the column id with 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
If B were very small, I know I would do something along the lines of
A.filter($"id".isin(B.select("id").collect().map(_.get(0)): _*))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness and I'm afraid will cause unnecessary shuffles.
What is the optimal method to achieve that task?
If you have not applied any partitioner on DataFrame A, maybe this will help you understand join and shuffle concepts.
Without Partitioner :
A.join(B, Seq("id"))
By default, this operation will hash all the keys of both dataframes, sending elements with the same key hash across the network to the same machine, and then join together the elements with the same key on that machine. Here you have to notice that both dataframes shuffle across the network.
With HashPartitioner:
If you call partitionBy() when building A (as a pair RDD keyed by id), Spark will know that it is hash-partitioned, and calls to join() on it will take advantage of this information. In particular, when we call A.join(B), Spark will shuffle only the B RDD. Since B has less data than A, you don't need to apply a partitioner to B.
ex:
import org.apache.spark.HashPartitioner
// RDD API; the key and value types below are placeholders for your own
val A = sc.sequenceFile[Long, String]("hdfs://...")
  .partitionBy(new HashPartitioner(100)) // Create 100 partitions
  .persist()
A.join(B) // B is a pair RDD keyed by id; only B is shuffled
Reference is from Learning Spark book.
My default advice on how to optimize joins is:
1. Use a broadcast join if you can. (From your question it seems your tables are large and a broadcast join is not an option.)
One option in Spark is to perform a broadcast join (aka map-side join in the Hadoop world). With a broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions) by avoiding sending all the data of the large table over the network.
You can use the broadcast function to mark a dataset to be broadcast when used in a join operator. Spark also uses the spark.sql.autoBroadcastJoinThreshold setting to control the size of a table that will be broadcast automatically to all worker nodes when performing a join.
2. Use the same partitioner.
If two RDDs have the same partitioner, the join will not cause a shuffle. Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
3. If the data is huge and/or your clusters cannot grow, such that even (2) above leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join sub-partitions serially in a loop, "appending" to the same final result table.
If I understand your question correctly, you want to use a broadcast join that replicates DataFrame B on every node so that the semi-join computation (i.e., using a join to filter id from DataFrame A) can compute independently on every node instead of having to communicate information back-and-forth between each other (i.e., shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run filtered_A.explain() to verify that a broadcast join is being used.

Spark broadcast join loads data to driver

As far as I know, when Spark performs a broadcast join it first collects the smallest (broadcast) RDD to the driver to make a broadcast variable from it, and only then uploads it to each target node.
Sometimes this leads to driver memory overflows if the broadcast RDD > spark.driver.memory.
The question: why does it work this way? It would be more efficient to just shuffle the broadcast data between target nodes, because the amount of data to shuffle is the same but we can avoid the driver overflow.
Example: say you have 3 nodes with 1 GB of data to broadcast on each node, and each node has 1 GB/s throughput.
Spark approach: each node has to upload its piece of data (1 GB) to the driver and download the broadcast variable (3 * 1 GB = 3 GB), so each node transfers 4 GB in total and it takes 4 s.
Shuffle approach: one node has to upload 1 GB to each of the 2 other nodes and download 1 GB from each of them. Again, the total amount is 4 GB and it takes 4 s.
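The per-node arithmetic in this example can be written out as follows. This only restates the question's own simplified model (equal pieces on every node, traffic counted per node, the driver's own link ignored); the function names are hypothetical.

```scala
object BroadcastTransferSketch {
  // Collect-then-broadcast: a node uploads its own piece and downloads the
  // full broadcast variable (all `nodes` pieces).
  def viaDriverGbPerNode(nodes: Int, pieceGb: Int): Int =
    pieceGb + nodes * pieceGb

  // Direct exchange: a node uploads its piece to each of the other nodes
  // and downloads one piece from each of them.
  def directExchangeGbPerNode(nodes: Int, pieceGb: Int): Int =
    (nodes - 1) * pieceGb + (nodes - 1) * pieceGb
}
```

For the question's numbers (3 nodes, 1 GB each) both functions return 4 GB per node, which at 1 GB/s matches the 4 s quoted for each approach.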
First, a broadcast join is used for joining a big table with an extremely small table.
If you used a shuffle instead of collecting the small df (table) back to the driver and then broadcasting it, you would notice only that the small df is shuffled, but the big df is actually shuffled at the same time, which is quite time-consuming.
"It is more efficient to just shuffle broadcast data between target nodes, because amount of data to shuffle is the same but we can avoid driver overflow."
That's right, the Spark team is working on that:
https://issues.apache.org/jira/browse/SPARK-17556
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."
It is not correct. Spark doesn't use broadcasting for RDD joins.
Spark may use broadcasting for DataFrame joins, but it shouldn't be used to handle large objects. It is better to use a standard HashJoin for that.
