What happens if we use broadcast on the larger table? - apache-spark

I wanted to know what will happen if we broadcast the larger table while joining it to the smaller one. Also, if we have two equally large tables, what will happen when we use a broadcast join in that scenario?

There are a few things to consider:
Spark upper limit: Spark supports broadcast tables up to 8 GB. If your broadcast object is larger than that, the job will fail.
Driver and executor memory: the table is first copied into the driver's memory and then to the executors', so as long as you have enough memory on both, it should be broadcast successfully.
Performance: if a table is broadcast, a portion of your memory is reserved for it, and whatever is left is used for further operations, which might make them slow (for example, if executor_memory is 8 GB and the broadcast variable is 6 GB, only 2 GB remain).
So, to answer your question: the behaviour of broadcast depends on what you broadcast; it doesn't matter whether the table it is joined to is large or small. Broadcast is an independent mechanism, and Spark simply uses it in joins.
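The memory arithmetic in that last point can be sketched in a few lines of plain Python. This is only an illustration with the figures from the example above; the function name is hypothetical, and real Spark memory accounting additionally reserves fractions for execution and storage (spark.memory.fraction):

```python
# Plain-Python sketch of the executor-memory arithmetic described above.
# Figures are illustrative; the function name is hypothetical, and real
# Spark accounting also reserves execution/storage fractions.
GB = 1024 ** 3

def remaining_after_broadcast(executor_memory, broadcast_size):
    """Memory left for other operations once the broadcast is resident."""
    if broadcast_size > executor_memory:
        raise MemoryError("broadcast does not fit in executor memory")
    return executor_memory - broadcast_size

# The example from the answer: 8 GB executor memory, 6 GB broadcast variable.
print(remaining_after_broadcast(8 * GB, 6 * GB) // GB)  # 2 (GB left over)
```

With only 2 GB left for everything else, subsequent operations on that executor are likely to slow down or spill.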

Related

Shuffle Stage Failing Due To Executor Loss

I get the following error when my Spark job fails: **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead."**
Overview of my Spark job:
input size is ~35 GB
I have broadcast-joined all the smaller tables with the mother table into, say, dataframe1, and then I salted each big table and dataframe1 before joining it with dataframe1 (the left table).
profile used:
#configure(profile=[
'EXECUTOR_MEMORY_LARGE',
'NUM_EXECUTORS_32',
'DRIVER_MEMORY_LARGE',
'SHUFFLE_PARTITIONS_LARGE'
])
Using the above approach and profiles I was able to cut the runtime by 50%, but I still get "Shuffle Stage Failing Due To Executor Loss" issues.
Is there a way I can fix this?
There are multiple things you can try:
Broadcast joins: if you have used broadcast hints to join multiple smaller tables, then the resulting table (built from many smaller tables) might be too large to fit in each executor's memory. So you need to look at the total size of dataframe1.
35 GB is really not huge. Also try the profile "EXECUTOR_CORES_MEDIUM", which increases the parallelism of the data computation. Use dynamic allocation (16 executors should be fine for 35 GB) rather than static allocation; if 32 executors are not available at once, the build doesn't start. "DRIVER_MEMORY_MEDIUM" should be enough.
Spark 3.0 handles skewed joins by itself with Adaptive Query Execution, so you need not use the salting technique. There is a Foundry profile called "ADAPTIVE_ENABLED" that you can use. Other Adaptive Query Execution settings you will have to set manually via the "ctx" Spark context object readily available in Foundry.
Some references for AQE:
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
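Outside Foundry, the equivalent AQE settings can be applied directly on the session. A hedged sketch: the config keys are the standard Spark 3.x ones, and the session setup here is illustrative rather than Foundry-specific:

```python
# Sketch: enabling Adaptive Query Execution and its skew-join handling
# on a plain Spark 3.x session. In Foundry, the "ADAPTIVE_ENABLED"
# profile / the provided "ctx" object would be used instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```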

What happens if a Spark broadcast join is too large?

In doing Spark performance tuning, I've found (unsurprisingly) that doing broadcast joins eliminates shuffles and improves performance. I've been experimenting with broadcasting on larger joins, and I've been able to successfully use far larger broadcast joins than I expected -- e.g. broadcasting a 2GB compressed (and much larger uncompressed) dataset, running on a 60-node cluster with 30GB memory/node.
However, I have concerns about putting this into production, as the size of our data fluctuates, and I'm wondering what will happen if the broadcast becomes "too large". I'm imagining two scenarios:
A) Data is too big to fit in memory, so some of it gets written to disk, and performance degrades slightly. This would be okay. Or,
B) Data is too big to fit in memory, so it throws an OutOfMemoryError and crashes the whole application. Not so okay.
So my question is: What happens when a Spark broadcast join is too large?
Broadcast variables are plain local objects; excluding distribution and serialization, they behave like any other object you use. If they don't fit into memory, you'll get an OOM. Other than memory paging, there is no magic that can prevent that.
So broadcasting is not applicable for objects that may not fit into memory (and you should also leave plenty of free memory for standard Spark operations).

What is the maximum size for a broadcast object in Spark?

When using Dataframe broadcast function or the SparkContext broadcast functions, what is the maximum object size that can be dispatched to all executors?
broadcast function :
The default is 10 MB, but we have used it up to 300 MB; it is controlled by spark.sql.autoBroadcastJoinThreshold.
AFAIK, it all depends on the memory available, so there is no definite answer. What I would say is that it should be smaller than the large dataframe, and you can estimate the size of a large or small dataframe like this:
import org.apache.spark.util.SizeEstimator
// Estimate the in-memory size (in bytes) of the DataFrame you plan to broadcast.
println(SizeEstimator.estimate(yourLargeOrSmallDataFrameHere))
Based on this, you can pass a broadcast hint to the framework.
Also have a look at the Scala doc from sql/execution/SparkStrategies.scala, which says:
Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides are below the threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
Sort merge: if the matching join keys are sortable.
If there are no joining keys, join implementations are chosen with the following precedence:
BroadcastNestedLoopJoin: if one side of the join could be broadcasted
CartesianProduct: for inner join
BroadcastNestedLoopJoin
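The equi-join part of the selection logic quoted above can be sketched in plain Python. This is a deliberate simplification: the real planner in SparkStrategies.scala also considers hint placement, sortability of keys, and more configuration, and the function and return labels here are illustrative:

```python
# Simplified sketch of Spark's equi-join strategy selection, following
# the SparkStrategies.scala doc quoted above. Sizes are in bytes; the
# 10 MB default mirrors spark.sql.autoBroadcastJoinThreshold.
DEFAULT_THRESHOLD = 10 * 1024 * 1024

def choose_join(left_size, right_size, threshold=DEFAULT_THRESHOLD,
                left_hint=False, right_hint=False):
    left_ok = left_size < threshold or left_hint
    right_ok = right_size < threshold or right_hint
    if left_ok and right_ok:
        # Both sides qualify: broadcast the smaller one.
        return "broadcast-left" if left_size <= right_size else "broadcast-right"
    if left_ok:
        return "broadcast-left"
    if right_ok:
        return "broadcast-right"
    # Neither side qualifies for BHJ; fall back (shuffle hash / sort merge).
    return "sort-merge"

print(choose_join(5 * 1024 * 1024, 10 * 1024 ** 3))  # broadcast-left
print(choose_join(10 * 1024 ** 3, 10 * 1024 ** 3))   # sort-merge
```

Note how an explicit hint (the `left_hint`/`right_hint` flags) overrides the size threshold, which is exactly why a hinted broadcast of a too-large table can blow up at runtime.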
Also have a look at other-configuration-options
SparkContext.broadcast (TorrentBroadcast ) :
The broadcast shared variable also has a property, spark.broadcast.blockSize=4M.
AFAIK there is no hard limit I have seen for this either...
For further information, please see TorrentBroadcast.scala.
EDIT :
However, you can have a look at the 2GB issue, even though it was never officially declared in the docs (I was not able to see anything of this kind in the docs).
Please look at SPARK-6235, which is in "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.
As of Spark 2.4, there's an upper limit of 8 GB. Source Code
Update :
The 8GB limit is still valid for Spark 3.2.1 Source Code
Update:
Still valid for Spark 3.4 Source code
As mentioned above, the upper limit is 8 GB. But when you have several files you want to broadcast, Spark pushes all the data files to the driver. The driver joins those files and pushes the result to the executor nodes. In this process, if the driver's available memory is less than the combined broadcast files, you will end up with an out-of-memory error.

Spark broadcast join loads data to driver

As far as I know, when Spark performs a broadcast join it first collects the smallest (broadcast) RDD to the driver to make a broadcast variable from it, and only then uploads it to each target node.
Sometimes this leads to driver memory overflows if the broadcast RDD > spark.driver.memory.
The question: why does it work this way? It would be more efficient to just shuffle the broadcast data between the target nodes, because the amount of data to shuffle is the same but we could avoid the driver overflow.
Example: say you have 3 nodes, 1 GB of data to broadcast to each node, and each node has 1 GB/s throughput.
Spark approach - each node has to upload its piece of data (1 GB) to the driver and download the broadcast variable (3 * 1 GB = 3 GB), so each node transfers 4 GB total and it takes 4 s.
Shuffle approach - one node has to upload 1 GB to the 2 other nodes and download 1 GB from each of them. Again, the total amount is 4 GB and it takes 4 s.
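The arithmetic in this example is easy to check in plain Python (the node count, piece size, and throughput are the figures from the question; the variable names are just for illustration):

```python
# Per-node transfer volume for the two approaches in the example above
# (3 nodes, 1 GB held per node, 1 GB/s throughput per node).
nodes = 3
piece_gb = 1          # data each node contributes to the broadcast
throughput_gbps = 1   # GB/s per node

# Spark approach: upload your piece to the driver, download the full variable.
spark_transfer_gb = piece_gb + nodes * piece_gb      # 1 + 3 = 4 GB
# Shuffle approach: send your piece to every other node, receive theirs.
shuffle_transfer_gb = 2 * (nodes - 1) * piece_gb     # 2 + 2 = 4 GB

print(spark_transfer_gb / throughput_gbps)    # 4.0 seconds
print(shuffle_transfer_gb / throughput_gbps)  # 4.0 seconds
```

So per node the two approaches move the same volume; the asymmetry is that the Spark approach concentrates all 3 GB in the driver first.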
First of all, a broadcast join is used for joining a big table with an extremely small table.
If you used a shuffle instead of collecting the small df (table) back to the driver and then broadcasting it, the small df would indeed be shuffled, but the big df would be shuffled at the same time as well, which is quite time consuming.
"It is more efficient to just shuffle broadcast data between target nodes, because amount of data to shuffle is the same but we can avoid driver overflow."
-- that's right, and the Spark team is working on it:
https://issues.apache.org/jira/browse/SPARK-17556
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."
That is not correct. Spark doesn't use broadcasting for RDD joins.
Spark may use broadcasting for DataFrame joins, but it shouldn't be used to handle large objects. It is better to use a standard HashJoin for that.

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM each, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer). Since not all the data can fit into memory, it cannot be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data is processed for each partition, it needs to be merged (reduce). This is traditional map-reduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark uses a concept called Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated; those RDDs can be treated as intermediate results we don't want to materialize.
Actions are used when you really want to get the data; those RDDs/data are what we actually want, like taking the top N.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets it.
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data." The issue is with large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions results in
less concurrency,
increased memory pressure for transformations that involve a shuffle,
more susceptibility to data skew.
Too many partitions can also have a negative impact:
too much time spent scheduling the many tasks.
When you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
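As a rough illustration, the number of input partitions can be approximated from those properties. This is a simplification of Spark's file-splitting logic (which also packs multiple small files together and folds in spark.default.parallelism), and the function name is hypothetical:

```python
# Rough sketch of how the input-split properties above bound the number
# of partitions. Simplified: real Spark also bin-packs files and
# considers the default parallelism.
import math

MB = 1024 * 1024

def approx_input_partitions(total_file_bytes,
                            max_partition_bytes=128 * MB,  # spark.sql.files.maxPartitionBytes
                            open_cost_bytes=4 * MB,        # spark.sql.files.openCostInBytes
                            num_files=1):
    # Each file contributes an "open cost" on top of its data size.
    padded = total_file_bytes + num_files * open_cost_bytes
    return max(1, math.ceil(padded / max_partition_bytes))

# A single ~1 GB file with the defaults above -> roughly 9 partitions.
print(approx_input_partitions(1024 * MB))
```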
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During a Spark Summit talk, Aaron Davidson gave some tips about partition tuning. He also defined a reasonable number of partitions, summarized in the 3 points below:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but it doesn't get to the point of your question.
When you cache an RDD, all of its partitions are persisted (in terms of storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do it manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, as you have more partitions, Spark keeps fewer of them in the LRU cache at a time so that they do not cause OOM (this may have a negative impact too, like the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data: take the total data size and divide it by X, and that is the memory for each task, executed on each (virtual) core. So it's not hard to see that X = 64 leads to OOM, but X = 1024 does not.
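The divide-by-X reasoning is a one-liner to check; the 300 GB figure below is just a stand-in for the question's "few hundred GBs", and the function name is illustrative:

```python
# Per-task data volume when work is split into (roughly) equal partitions.
GB = 1024 ** 3

def per_task_bytes(total_bytes, num_partitions):
    """Rough per-task data volume for an evenly partitioned dataset."""
    return total_bytes / num_partitions

total = 300 * GB  # stand-in for "a few hundred GBs"
print(per_task_bytes(total, 64) / GB)    # ~4.69 GB per task -> OOM risk
print(per_task_bytes(total, 1024) / GB)  # ~0.29 GB per task -> fits easily
```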
