Spark broadcast join loads data to driver - apache-spark

As far as I know when Spark performs broadcast join it firstly collects smallest (broadcast) RDD to driver to make a broadcast variable from it, and only then uploads it to each target node.
Sometimes it leads to driver memory outflows if broadcasting RDD > spark.driver.memory.
The question: why it works in such way? It is more efficient to just shuffle broadcast data between target nodes, because amount of data to shuffle is the same but we can avoid driver overflow.
Example: Say you have 3 nodes and 1 gb of data to broadcast on each node and each node have 1gb/s throughput.
Spark approach - each node have to upload to driver its piece of data (1gb) and download broadcast variable (3 * 1g = 3gb), so each node should transfer 4 gb total and it takes 4s.
Shuffle approach - one node have to upload 1gb to 2 other nodes and download
1gb from these. Again, total amount is 4 gb and it takes 4s.

Firstly broadcast join is used for joining a big table and an extremely small table.
Then if using shuffle instead of collecting the small df(table) back to driver and then broadcast, you only noticed that the small df is shuffled, but actually the big df is also shuffled at the same time, which is quite time consuming.

"It is more efficient to just shuffle broadcast data between target nodes, because amount of data to shuffle is the same but we can avoid driver overflow.
-- that right, spark team is working on that:
https://issues.apache.org/jira/browse/SPARK-17556
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."

It is not correct. Spark doesn't use broadcasting for RDD joins.
Spark may use broadcasting for DataFrame joins but it shouldn't be used to handle large objects. It is better to use standard HashJoin for that.

Related

What happens if we use broadcast in the larger table?

I wanted to know what will happen if we broadcast the larger table while joining it to smaller. Also, if we have two equally large tables, what will happen when we use broadcast join in that scenario?
There are few things to consider :
Spark Upper Limit : Spark supports upto 8GB of broadcast table. If your broadcast object is more than that, it would fail.
Driver and Executor Memory : Since the table will be copied in to the memory of driver and then to executors, As long as you have enough memory , it should be broadcasted successfully.
Performance : If it is broadcasted, a portion of your memory will be reserved for that. So, whatever left will be used for further operations which might make it slow. (example if executor_memory is 8 gb, broadcasted variable is 6 gb)
So, from your question, behaviour of broadcast depends on what you broadcast, doesn't matter if the Joining table is large or small. Broadcast is an independent functionality. And Spark uses this functionality in Joins.

Does Broadcasting a DataFrame spill the data to disk if the memory isn't available to hold the data?

I have a question around spark Broadcast join.
By default the Broadcast hash join size is 10MB.
case1: we have enough Memory in cluster to hold the Broadcast DF.
If the DF size is greater than the default broadcast join size, say 15 MB is the DF size, and If I broadcast this DF across all the nodes in the cluster, will it still perform a broadcast join?
since 15MB is greater than default broadcast join size, will it go for any other join even though we have broadcast-ed the DF?
case2: Not enough Memory in cluster to hold the Broadcasted DF.
So let us suppose if I have 15MB Data Frame and If I want to Broadcast this Data Frame during the join, and the memory isn't available on say one or few nodes to hold this data.(15MB is a hypothetical number)
Will it fail with out of Memory error or will it spill the data to disk?
If you are trying to broadcast a dataframe larger than spark.sql.autoBroadcastJoinThreshold, Spark will issue an error.
I can't back it up by official documentation but I don't think there will be a spill to disk. You need to make sure both the driver and the worker can accommodate the full dataframe.

Spark SQL: why does not Spark do broadcast all the time

I work on a project with Spark 2.4 on aws s3 and emr and I have a left join with two huge part of data. The spark execution is not stable, it fails frequently for memory issue.
The cluster has 10 machines of type m3.2xlarge, each machine has 16 vCore, 30 GiB memory, 160 SSD GB storage.
I have configuration like this:
"--executor-memory",
"6512M",
"--driver-memory",
"12g",
"--conf",
"spark.driver.maxResultSize=4g",
"--conf",
"spark.sql.autoBroadcastJoinThreshold=1073741824",
The left join happens between a left side of 150GB and right side around 30GB, so there are many shuffle. My solution will be to cut the right side to small enough, like 1G, so instead of shuffle, data will be broadcast. The only problem is after the first left join, the left side will already have the new columns from the right side, so the following left join will have duplication column, like col1_right_1, col2_right_1, col1_right_2, col2_right_2 and I have to rename col1_right_1/col1_right_2 to col1_left, col2_right_1/col2_right_2 to col2_left.
So I wonder, why does Spark allow shuffle to happen, instead of using broadcast everywhere. Shouldn't broadcast always be faster than shuffle? Why does not Spark do join like what I said, cut one side to small piece and broadcast it?
Let’s see the two options.
If I understood correctly You are performing a broadcast and a join for each piece of the dataframe, where the size of the piece is the max broadcast threshold.
Here the advantage is that you are basically sending over the network just one dataframe, but you are performing multiple joins. Each join to be performed has a an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it
is joined with each partition of the other Dataset. That is, for the
values of the join columns for each row (in each partition) of the
other Dataset, the corresponding row is fetched from the broadcasted
Dataset and the join is performed.
This means that for each batch of the broadcast join, in each partition you would have to look the whole other dataset and perform the join.
Sortmerge or hash join have to perform a shuffle (if the datasets are not equally partitioned) but their joins are way more efficients.

How can I optimize my spark application to join two rdd having size greater than cluster memory?

I want to join two RDD each taking 10 GB of memory. But the cluster memory I am having is just 15 GB. Is it possible to optimize the code somehow so that we can join these RDD?
I thought of persisting the RDD in DISK but it seems to be not working.
IS there any optimization technique that we can use to encounter such problem?
It is not a necessary condition that the cluster should have more memory than the dataset. However, that helps to increase the performance.
The persist to DISK_ONLY won't help if you have a single join. In case you are trying to multiple joins you would need to persist and count to force the DAG evaluation.
Anyway, the best way is to increase the Dataset partitions and shuflle partition (200 is default).
spark.sql.shuffle.partitions=5000
and then join.

Apache Spark running out of memory with smaller amount of partitions

I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.

Resources