Spark Broadcast results in increased size of dataframe - apache-spark

I have a dataframe with a single integer column and 1B rows. So ideally the size of the dataframe should be 1B * 4 bytes ~= 4GB. This checks out when I cache the dataframe and look at the size: it is around 4GB.
Now, if I try to broadcast the same dataframe to join with another dataframe, I get an error: Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 14 GB
Why does the size of a broadcasted dataframe increase? I have seen this in other cases as well, where a 300MB dataframe shows up as a 3GB broadcasted dataframe in the Spark UI SQL tab.
Any reasoning or help is appreciated.

The size increases in memory if the dataframe is broadcast across your cluster. How much it increases depends on how many workers you have, because Spark needs to copy your dataframe to every worker to handle your subsequent operations.
Do not broadcast big dataframes; broadcast only small ones, for use in join operations.
As per the documentation:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
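For illustration, a minimal sketch of the broadcast-variable API the quote refers to; the lookup map and names here are hypothetical, and a running SparkSession named spark is assumed:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("broadcast-example").getOrCreate()
val sc = spark.sparkContext
// hypothetical small lookup table, cached read-only on every executor
val countryNames = Map("US" -> "United States", "DE" -> "Germany")
val countryNamesBc = sc.broadcast(countryNames)
// each task reads the cached copy instead of shipping the map with every task
val codes = sc.parallelize(Seq("US", "DE", "US"))
codes.map(code => countryNamesBc.value.getOrElse(code, "unknown")).collect()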

This appears to be more of a bug, according to this issue: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-37321

Related

Join 2 large size tables (50 Gb and 1 billion records)

I have 2 super large tables which I am loading as dataframes in parquet format, with one join key. Now the issues I need help with:
I need to tune it, as I am getting OOM errors due to Java heap space.
I have to apply left join.
There will not be any null values, so it might not improve performance.
What should I do to achieve this scenario?
FYI: while loading this parquet data I have already applied repartitioning based on a column.
I have loaded both df1 and df2
When I tried caching it, it failed, but since it needs to be used multiple times, caching is required; persisting is not an option.
Applied repartitioning on both dataframes to distribute the data evenly.
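A minimal sketch of the setup described above, assuming hypothetical paths, a hypothetical join key named joinKey, and a partition count you would tune for your own cluster:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel
val spark = SparkSession.builder.getOrCreate()
// load both sides and repartition on the join key so matching keys are co-located
val df1 = spark.read.parquet("s3://bucket/table1").repartition(400, col("joinKey"))
val df2 = spark.read.parquet("s3://bucket/table2").repartition(400, col("joinKey"))
// persist with a level that can spill to disk, since a pure in-memory cache hit OOM
df1.persist(StorageLevel.MEMORY_AND_DISK)
val joined = df1.join(df2, Seq("joinKey"), "left")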

Does Broadcasting a DataFrame spill the data to disk if the memory isn't available to hold the data?

I have a question around spark Broadcast join.
By default the broadcast hash join threshold is 10MB.
Case 1: we have enough memory in the cluster to hold the broadcast DF.
If the DF size is greater than the default broadcast join threshold, say the DF is 15 MB, and I broadcast this DF across all the nodes in the cluster, will it still perform a broadcast join?
Since 15 MB is greater than the default threshold, will it fall back to some other join even though we have broadcast the DF?
Case 2: not enough memory in the cluster to hold the broadcast DF.
Suppose I have a 15 MB dataframe and I want to broadcast it during the join, but the memory isn't available on one or a few nodes to hold this data (15 MB is a hypothetical number).
Will it fail with an out-of-memory error or will it spill the data to disk?
If you are trying to broadcast a dataframe larger than spark.sql.autoBroadcastJoinThreshold, Spark will issue an error.
I can't back this up with official documentation, but I don't think there will be a spill to disk. You need to make sure both the driver and the workers can accommodate the full dataframe.
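For reference, a small sketch of inspecting and raising the threshold from a SparkSession named spark; the 15 MB value is just an example, not a recommendation:
// current threshold in bytes; the default is 10 MB (10485760)
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
// raise it above the estimated size of the small table, e.g. to 15 MB,
// or set it to -1 to disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 15L * 1024 * 1024)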

Spark seems to think that a particular broadcast variable is large in size

I am trying to do a broadcast join on two tables. The size of the smaller table will vary based upon the parameters but the size of the larger table is close to 2TB.
What I have noticed is that if I don't set the spark.sql.autoBroadcastJoinThreshold to 10G some of these operations do a SortMergeJoin instead of a broadcast join. But the size of the smaller table shouldn't be this big at all. I wrote the smaller table to a s3 folder and it took only 12.6 MB of space.
I did some operations on the smaller table so the shuffle size appears on the Spark History Server and the size in memory seemed to be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the smaller table it takes a long time to broadcast, leading me to think that the table might not just be 150 MB in size.
What would be a good way to figure out the actual size that Spark is seeing and deciding whether it crosses the value defined by spark.sql.autoBroadcastJoinThreshold?
Look at the SQL tab in the Spark UI. There you will see the DAG of each job plus the statistics that Spark collects.
For each dataframe, it will contain the size as Spark sees it.
BTW, you don't have to set spark.sql.autoBroadcastJoinThreshold to a high number to force Spark to use the broadcast join.
You can simply wrap the small df with org.apache.spark.sql.functions.broadcast(df) and it will force a broadcast only on that specific join.
As mentioned in this question: DataFrame join optimization - Broadcast Hash Join
import org.apache.spark.sql.functions.broadcast
import context.implicits._ // needed for .toDF and the $ column syntax
val employeesDF = employeesRDD.toDF
val departmentsDF = departmentsRDD.toDF
// mark the department data for broadcasting to every executor
val tmpDepartments = broadcast(departmentsDF.as("departments"))
employeesDF.join(tmpDepartments,
  $"depId" === $"id", // join by employees.depId == departments.id
  "inner").show()

Why join in spark in local mode is so slow?

I am using Spark in local mode and a simple join is taking too long. I have fetched two dataframes, A (8 columns and 2.3 million rows) and B (8 columns and 1.2 million rows), joined them using A.join(B, condition, 'left'), and called an action at the end. It creates a single job with three stages: one for extracting each of the two dataframes and one for the join. Surprisingly, the stage that extracts dataframe A takes around 8 minutes, the one for dataframe B takes 1 minute, and the join happens within seconds. My important configuration settings are:
spark.master local[*]
spark.driver.cores 8
spark.executor.memory 30g
spark.driver.memory 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 16
The only executor is the driver itself. While extracting the dataframes, I have partitioned them into 32 parts (also tried 16, 64, 50, 100, 200). I have seen the shuffle write to be 100 MB for the stage that extracts dataframe A. So to avoid the shuffle I made 16 initial partitions for both dataframes and broadcasted dataframe B (the smaller one), but it is not helping; there is still shuffle write. I have used the broadcast(B) syntax for this. Am I doing something wrong? Why is there still shuffling? Also, when I look at the event timeline it shows only four cores processing at any point in time, although I have a 2-core * 4-processor machine. Why is that so?
In short, "Join"<=>Shuffling, the big question here is how uniformly are your data distributed over partitions (see for example https://0x0fff.com/spark-architecture-shuffle/ , https://www.slideshare.net/SparkSummit/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning and just Google the problem).
Few possibilities to improve efficiency:
think more about your data (A and B) and partition data wisely;
analyze, are your data skewed?;
go into UI and look at the tasks timing;
choose such keys for partitions that during "join" only few partitions from dataset A shuffle with few partitions of B;
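A rough sketch of that last point, with hypothetical dataframes dfA and dfB and a hypothetical join key joinKey; repartitioning both sides on the key keeps matching keys in the same partitions, and explain() shows whether the planner still inserts an exchange:
import org.apache.spark.sql.functions.col
// partition both inputs by the join key before the join
val aByKey = dfA.repartition(16, col("joinKey"))
val bByKey = dfB.repartition(16, col("joinKey"))
val joined = aByKey.join(bByKey, Seq("joinKey"), "left")
joined.explain() // look for Exchange nodes to see whether a shuffle is still planned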

Spark broadcast join loads data to driver

As far as I know, when Spark performs a broadcast join it first collects the smallest (broadcast) RDD to the driver to make a broadcast variable from it, and only then uploads it to each target node.
Sometimes this leads to driver memory overflows if the broadcast RDD > spark.driver.memory.
The question: why does it work this way? It would be more efficient to just shuffle the broadcast data between the target nodes, because the amount of data to shuffle is the same but we could avoid the driver overflow.
Example: say you have 3 nodes, 1 GB of data to broadcast on each node, and each node has 1 GB/s throughput.
Spark approach: each node has to upload its piece of data to the driver (1 GB) and download the broadcast variable (3 * 1 GB = 3 GB), so each node transfers 4 GB total and it takes 4 s.
Shuffle approach: each node has to upload its 1 GB to the 2 other nodes and download 1 GB from each of them. Again, the total amount is 4 GB and it takes 4 s.
Firstly, a broadcast join is used for joining a big table with an extremely small table.
Then, if you shuffle instead of collecting the small df (table) back to the driver and then broadcasting it, you only notice that the small df is shuffled, but actually the big df is also shuffled at the same time, which is quite time consuming.
"It is more efficient to just shuffle the broadcast data between target nodes, because the amount of data to shuffle is the same but we can avoid the driver overflow."
-- that's right, and the Spark team is working on it:
https://issues.apache.org/jira/browse/SPARK-17556
"Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors."
That is not correct. Spark doesn't use broadcasting for RDD joins.
Spark may use broadcasting for DataFrame joins, but it shouldn't be used to handle large objects. It is better to use a standard hash join for that.
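To make the mechanism under discussion concrete, here is a hand-rolled map-side join over RDDs; smallRdd and bigRdd are hypothetical RDD[(Int, String)] inputs. The small side must be collected to the driver before it can be broadcast, which is exactly where a too-large table runs past spark.driver.memory:
// collect the small side to the driver...
val smallLookup: Map[Int, String] = smallRdd.collect().toMap
// ...then ship it to every executor as a broadcast variable
val lookupBc = sc.broadcast(smallLookup)
// map-side join: the big RDD is not shuffled, each task reads the cached copy
val joined = bigRdd.map { case (id, value) =>
  (id, value, lookupBc.value.getOrElse(id, "missing"))
}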
