Delay in Pyspark when Joining - apache-spark

I have a broadcast hash join operation in PySpark, but whenever I execute the join there is always a delay of about 30 minutes (in both the execution timeline and the log) before the execution actually runs:
nothing happens from 10:14 to 10:44, and nothing happens in the log either.
I am merely doing a normal broadcast hash join, and the size of the table is nothing extraordinary.
I even tried .cache().count() right before the broadcast join, but the mysterious delay is always there.
Any clue how to solve this? I am using Spark 2.4.8 on AWS EMR.
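For reference, a minimal sketch of how one might make the broadcast explicit and confirm it in the plan while debugging a delay like this; fact_table, dim_table and the join key "id" are hypothetical stand-ins for the tables described above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bhj-delay-debug").getOrCreate()

# Make the broadcast explicit rather than relying on autoBroadcastJoinThreshold,
# and materialize the small side first so its build time shows up on its own.
dim_df = spark.table("dim_table").cache()
print(dim_df.count())

joined = spark.table("fact_table").join(F.broadcast(dim_df), on="id", how="inner")
joined.explain()  # confirm that BroadcastHashJoin actually appears in the plan

# If the broadcast build really is slow, the broadcast timeout (default 300 s)
# may also need attention: spark.conf.set("spark.sql.broadcastTimeout", 1200)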

Related

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines to Spark 3.x and take advantage of all its performance improvements. My company is using Spark 2.4.0, but we are targeting 3.1.1 as the official version for all Spark SQL data pipelines, without AQE enabled yet. The primary goal is to keep everything the same but use the newest version; later on, we can easily enable AQE for all data pipelines.
For a specific case, right after the spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and, looking at the Spark UI logs, noticed a change in the query plan as follows:
Spark 2.4.0:
Spark 2.4.0 uses the default SortMergeJoin to perform the join between tbl_a and tbl_b, but when we look at the query plan from the new Spark 3.1.1:
We can see that instead of SortMergeJoin it is using BroadcastHashJoin to join tbl_a and tbl_b. Not only that, but if I'm not wrong, the BroadcastExchange operation is occurring on the big-table side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10Mb
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
Do you guys have any clue on why this is happening? My questions are:
Why has Spark 3 changed the join approach in this situation, given that AQE is disabled and spark.sql.autoBroadcastJoinThreshold is much smaller than the dataset size?
Is this the expected behavior, or could this represent a potential bug in Spark 3.x?
Please, let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into the Spark code for some days and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, then it looks for the totalSize table property, as we can see in the following source code:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L69
During my tests, this property held a very small number (way lower than the autoBroadcastJoinThreshold), making the Spark optimizer think it was safe to broadcast the right relation. When the actual broadcast operation happened, however, it turned out to be much bigger, approximately the size shown in the picture for the right relation, causing the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE') COMPUTE STATISTICS;
The rawDataSize is now zero, so Spark 3 uses the totalSize (which has a reasonable value) as the relation size and, consequently, no longer uses BHJ in this situation.
The remaining issue is figuring out why the rawDataSize was so small (or even zero) in the first place, given that the Hive property hive.stats.autogather is true by default (it auto-calculates the statistics for every DML command), but that seems to be a separate problem.
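For reference, a hedged sketch of how one might inspect the statistics Spark picks up and apply the workarounds the error message itself suggests; tbl_b is used as a placeholder table name.

# Inspect what Spark sees for the Hive table.
spark.sql("SHOW TBLPROPERTIES tbl_b").show(truncate=False)   # rawDataSize / totalSize
spark.sql("DESCRIBE EXTENDED tbl_b") \
     .filter("col_name = 'Statistics'") \
     .show(truncate=False)                                   # size Spark will use

# Workarounds from the error message, until the statistics are corrected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   # disable automatic BHJ
spark.conf.set("spark.sql.broadcastTimeout", 1200)           # or raise the 300 s timeout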
Spark has made many improvements around joins.
One of them is:
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it's better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true)
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#converting-sort-merge-join-to-broadcast-join
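For completeness, a small sketch of the configuration knobs that quotation refers to; the property names are as documented for Spark 3.1.x, and the values shown are illustrative only.

spark.conf.set("spark.sql.adaptive.enabled", True)                     # turn AQE on
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", True)  # default: true
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")         # BHJ threshold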

Stream-Static join without aggregation still results in accumulating spark state

I'm using Spark Structured Streaming (PySpark) to join a Kafka Stream with a Static DataFrame using a MinHashLSH approxSimilarityJoin (which under the hood does a SortMergeJoin).
The size and volume of these messages is fairly high, which results in a state of multiple GBs after a couple of hours. After a while this causes OOM errors and crashes the process.
According to the docs, stream-static joins are stateless so I would expect no state to be accumulated over time. In this use case, we are also not interested in preserving any state other than the Kafka offset.
The image below shows the state size after a few minutes; the left side of the SortMergeJoin is the static frame, while the right side is the Kafka stream. No watermarks or dropDuplicates have been added along the way.
Does anyone have an idea on how I can reduce the size of this state or even get rid of it all together?
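A simplified sketch of the setup described above, for reference; the broker, topic, path and join key are placeholders, the MinHashLSH featurization step is omitted, and the stream side is assumed to expose an "id" column after parsing.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

static_df = spark.read.parquet("/data/static")        # static side, read once

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "events")
             .load())
# ... parse the Kafka value column into real fields (including "id") here ...

# A plain stream-static equi-join like this should be stateless according to the
# Structured Streaming guide; state in the UI would come from other operators.
joined = stream_df.join(static_df, on="id", how="inner")

query = (joined.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/stream-static")
         .start())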

Spark SQL: why doesn't Spark broadcast all the time?

I work on a project with Spark 2.4 on AWS S3 and EMR, and I have a left join between two huge sets of data. The Spark execution is not stable; it fails frequently with memory issues.
The cluster has 10 machines of type m3.2xlarge; each machine has 16 vCores, 30 GiB of memory and 160 GB of SSD storage.
I have configuration like this:
"--executor-memory",
"6512M",
"--driver-memory",
"12g",
"--conf",
"spark.driver.maxResultSize=4g",
"--conf",
"spark.sql.autoBroadcastJoinThreshold=1073741824",
The left join happens between a left side of 150 GB and a right side of around 30 GB, so there is a lot of shuffle. My idea is to cut the right side into pieces small enough, say 1 GB, so that instead of shuffling, the data will be broadcast. The only problem is that after the first left join, the left side will already have the new columns from the right side, so the following left joins produce duplicated columns, like col1_right_1, col2_right_1, col1_right_2, col2_right_2, and I have to rename col1_right_1/col1_right_2 to col1_left and col2_right_1/col2_right_2 to col2_left.
So I wonder: why does Spark allow the shuffle to happen instead of using broadcast everywhere? Shouldn't broadcast always be faster than shuffle? Why doesn't Spark do the join the way I described, cutting one side into small pieces and broadcasting them?
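As a side note on the renaming problem above, a minimal sketch of prefixing the right side's non-key columns before each join so no duplicated names appear; left_df, right_df and the key column "id" are hypothetical.

from pyspark.sql import functions as F

# Prefix every non-key column of the right side so repeated joins of the same
# table cannot produce ambiguous or auto-suffixed column names.
def with_prefix(df, prefix, key="id"):
    return df.select(
        key, *[F.col(c).alias(f"{prefix}{c}") for c in df.columns if c != key]
    )

step1 = left_df.join(with_prefix(right_df, "r1_"), on="id", how="left")
step2 = step1.join(with_prefix(right_df, "r2_"), on="id", how="left")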
Let’s see the two options.
If I understood correctly, you are performing a broadcast and a join for each piece of the DataFrame, where the size of each piece is the maximum broadcast threshold.
The advantage here is that you are sending just one DataFrame over the network, but you are performing multiple joins, and each join has an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it is joined with each partition of the other Dataset. That is, for the values of the join columns for each row (in each partition) of the other Dataset, the corresponding row is fetched from the broadcasted Dataset and the join is performed.
This means that for each batch of the broadcast join, in each partition you would have to scan the whole other dataset and perform the join.
Sort-merge and shuffled hash joins have to perform a shuffle (if the datasets are not already partitioned the same way), but the joins themselves are far more efficient.
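For comparison, forcing the broadcast described in the question can be done with the broadcast hint rather than raising spark.sql.autoBroadcastJoinThreshold; big_df and small_df are hypothetical stand-ins for the 150 GB and 30 GB sides.

from pyspark.sql import functions as F

# Hint Spark to broadcast the right side regardless of autoBroadcastJoinThreshold.
# The ~30 GB side must still fit in memory on the driver and every executor,
# which is exactly why Spark does not simply broadcast everything by default.
joined = big_df.join(F.broadcast(small_df), on="id", how="left")
joined.explain()  # BroadcastHashJoin should appear in the plan if the hint is honored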

How to execute streaming-static join faster than normal to be in sync with batch trigger duration?

I am using spark-sql-2.4.1v for streaming in my PoC.
I have a scenario as below:
Dataset<Row> staticDs = ...;    // previous data from an HDFS/Cassandra table
Dataset<Row> streamingDs = ...; // data from a Kafka topic, for streaming
Dataset<Row> joinDs = streamingDs.join(staticDs, streamingDs.col("companyId").equalTo(staticDs.col("company_id")), "inner");
Even though this is working fine, I have an issue with the timing of the join.
Currently my streaming trigger interval is around 10 seconds, whereas this join runs for almost 1 minute, so I am not getting the results in the expected time.
How can I make the join keep up with the 10-second trigger?
Thank you.
In your case, to perform the join Spark needs to read all of the data from Cassandra, and this is slow. As I mentioned before, you need to use DSE Analytics if you want to perform an efficient join on the Dataset/DataFrame, or use joinWithCassandra/leftJoinWithCassandra from the RDD API.
Update, September 2020: support for joins with Cassandra in DataFrames was added in Spark Cassandra Connector 2.5.0.
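A sketch of the DataFrame-level join mentioned in that update; the keyspace, table and column names are placeholders, and streaming_df stands in for the Kafka stream.

# With Spark Cassandra Connector 2.5.0+ the direct-join optimization applies when
# the connector extensions are enabled, e.g.
# spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions,
# typically when the join is on the Cassandra table's partition key.
static_df = (spark.read
             .format("org.apache.spark.sql.cassandra")
             .options(keyspace="my_ks", table="company")
             .load())

joined = streaming_df.join(
    static_df,
    streaming_df["companyId"] == static_df["company_id"],
    "inner",
)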

Spark foreachPartition connection improvements

I have written a Spark job which does the operations below:
1. Reads data from HDFS text files.
2. Does a distinct() call to filter duplicates.
3. Does a mapToPair phase and generates a pair RDD.
4. Does a reduceByKey call.
5. Does the aggregation logic for the grouped tuples.
Then it calls a foreach on the result of step 5, where it:
makes a call to the Cassandra DB,
creates an AWS SNS and SQS client connection,
does some JSON record formatting,
and publishes the record to SNS/SQS.
When I run this job it creates three Spark stages:
First stage - performs the distinct, takes nearly 45 sec.
Second stage - mapToPair and reduceByKey, takes 1.5 min.
Third stage - takes 19 min.
What I did:
I turned off the Cassandra call to see whether the DB hit was the cause - that part takes little time.
The offending part I found is creating the SNS/SQS connection for each partition:
it's taking more than 60% of the entire job time.
I am creating the SNS/SQS connection within foreachPartition to keep the number of connections down. Is there an even better way?
I cannot create the connection objects on the driver, as they are not serializable.
I am now using 9 executors, 15 executor cores, 2g driver memory and 5g executor memory.
Each machine has 16 cores and 64 GB of memory.
Cluster size: 1 master and 9 slaves, all with the same configuration.
EMR deployment, Spark 1.6.
It sounds like you would want to set up exactly one SNS/SQS connection per node and then use it to process all of your data on each node.
I think foreachPartition is the right idea here, but you might want to coalesce your RDD beforehand. This will collapse partitions on the same node without shuffling, and will allow you to avoid starting extra SNS/SQS connections.
See here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD#coalesce(numPartitions:Int,shuffle:Boolean,partitionCoalescer:Option[org.apache.spark.rdd.PartitionCoalescer])(implicitord:Ordering[T]):org.apache.spark.rdd.RDD[T]
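A rough sketch of that pattern in PySpark; the topic ARN, region, target partition count and record shape are all hypothetical, and the original job is Java on Spark 1.6, so this is only illustrative.

import json
import boto3  # assumes boto3 is available on the executors

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:my-topic"  # placeholder

def publish_partition(rows):
    # One client per partition, created on the executor; clients are not
    # serializable, so they cannot be built on the driver and shipped out.
    sns = boto3.client("sns", region_name="us-east-1")
    for row in rows:
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(row.asDict()))

# Coalesce first so each node handles fewer, larger partitions and therefore
# opens fewer SNS connections; coalesce avoids a full shuffle.
aggregated_df.coalesce(num_target_partitions).rdd.foreachPartition(publish_partition)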
