I'm trying to migrate my Spark application from Spark 2.x to 3.x, but I'm seeing something weird.
In my application there is a job with many joins (maybe 40 ~ 50 DataFrames joined against the same base DataFrame). Everything is fine on Spark 2.x, but on Spark 3.x no DAG is generated and there are no error logs either; the application seems to hang and I have no idea why.
I tried to force-split those joins into multiple jobs, and things turn out OK when each job has 5 joins.
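For reference, here is a minimal Scala sketch of that batching workaround, assuming spark-shell; baseDf, dfsToJoin, and joinKey are hypothetical stand-ins, and checkpointing is just one way to turn each batch into its own job:

import org.apache.spark.sql.DataFrame

// Assumes spark-shell, so `spark` (SparkSession) is in scope.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// baseDf and dfsToJoin stand in for the base DataFrame and the ~40-50 DataFrames
// being joined onto it; joinKey is a hypothetical common column name.
def joinInBatches(baseDf: DataFrame,
                  dfsToJoin: Seq[DataFrame],
                  joinKey: String,
                  batchSize: Int = 5): DataFrame =
  dfsToJoin.grouped(batchSize).foldLeft(baseDf) { (acc, batch) =>
    val joined = batch.foldLeft(acc)((left, right) => left.join(right, Seq(joinKey), "left"))
    // checkpoint() materializes this batch as its own job and truncates the logical plan,
    // so the optimizer never has to reason about all 40-50 joins at once.
    joined.checkpoint()
  }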
I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines to Spark 3.x and take advantage of all the performance improvements it brings. My company is using Spark 2.4.0, but we are targeting 3.1.1 as the official version for all Spark SQL data pipelines, without AQE enabled yet. The primary goal is to keep everything the same but use the newer version. Later on, we can easily enable AQE for all data pipelines.
For one specific case, right after the Spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and, looking at the Spark UI and logs, we noticed a change in the query plan, as follows:
Spark 2.4.0:
Spark 2.4.0 is using the default SortMergeJoin for the join between tbl_a and tbl_b, but when we look at the query plan from the new Spark 3.1.1:
We can see that instead of a SortMergeJoin it is using a BroadcastHashJoin for the join between tbl_a and tbl_b. Not only that, but, if I'm not mistaken, the BroadcastExchange operation is occurring on the big-table side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10Mb
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
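As an immediate mitigation we could apply the settings the error message itself suggests; this is just a sketch of that workaround at the session level, not what we want as a long-term fix:

// Session-level mitigations suggested by the error message itself:
// give the broadcast more time, or disable automatic broadcast joins entirely.
spark.conf.set("spark.sql.broadcastTimeout", "600")          // seconds, default is 300
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // -1 disables auto broadcast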
Do you have any clue why this is happening? My questions are:
Why has Spark 3 changed the join approach in this situation, given that AQE is disabled and spark.sql.autoBroadcastJoinThreshold is much smaller than the dataset size?
Is this the expected behavior, or could this represent a potential bug in Spark 3.x?
Please, let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into the Spark code for some days and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, then it looks for the totalSize table property, as we can see in the following source code:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L69
During my tests, this property had a very small value (way lower than the autoBroadcastJoinThreshold property), making the Spark optimizer think it was safe to broadcast the right relation. But when the actual broadcast operation happened, the relation turned out to be much bigger, approximately the size shown in the picture for the right relation, causing the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE') COMPUTE STATISTICS;
The rawDataSize is now zero and Spark 3 uses the totalSize (which has a reasonable value) as the relation size, and consequently does not use BHJ in this situation.
Now the issue is figuring out why the rawDataSize is so small in the first place, or even zero, given that the Hive property hive.stats.autogather is true by default (it auto-calculates the statistics for every DML command), but that seems to be a separate problem.
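For anyone hitting a similar case, the size estimate the optimizer actually works with can be inspected up front; a minimal sketch, assuming spark-shell and using tbl_b as in the plans above:

// Metastore-level statistics (totalSize / rawDataSize appear here when they exist)
spark.sql("DESCRIBE TABLE EXTENDED tbl_b").show(200, truncate = false)

// The size the optimizer assumes for the relation, in bytes; if it is below
// spark.sql.autoBroadcastJoinThreshold, Spark plans a BroadcastHashJoin.
val assumedSize = spark.table("tbl_b").queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Optimizer size estimate for tbl_b: $assumedSize bytes")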
Spark has made many improvements around joins.
One of them is:
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if spark.sql.adaptive.localShuffleReader.enabled is true)
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#converting-sort-merge-join-to-broadcast-join
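If you later enable AQE, the conversion described above is governed by these settings (shown here with the 3.1.1 defaults made explicit, as a hedged illustration):

// Enable AQE so sort-merge joins can be converted to broadcast hash joins at runtime,
// and keep the local shuffle reader on so the converted join reads shuffle files locally.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true") // default true in 3.1.1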
I am using spark-sql-2.4.1v for streaming in my PoC.
I have a scenario as below:
Dataset<Row> staticDs = ...;    // previous data from an HDFS/Cassandra table
Dataset<Row> streamingDs = ...; // data from a Kafka topic for streaming
Dataset<Row> joinDs = streamingDs.join(staticDs, streamingDs.col("companyId").equalTo(staticDs.col("company_id")), "inner");
Even though this is working, I have an issue with the timing of the join.
Currently my streaming trigger interval is around 10 seconds, but this join runs for almost 1 minute, so I am not getting the results in the expected time.
How can I make my join complete within every 10-second trigger?
Thank you.
In your case, to perform the join Spark needs to read all of the data from Cassandra, and this is slow. As I mentioned before, you need to use DSE Analytics if you want to perform an efficient join on a Dataset/DataFrame, or use joinWithCassandraTable/leftJoinWithCassandraTable from the RDD API.
Update from September 2020: support for joins with Cassandra in DataFrames was added in Spark Cassandra Connector 2.5.0.
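For reference, a minimal Scala sketch of the RDD-level join; the keyspace, table, and the CompanyKey case class are placeholders, and it assumes the connector is on the classpath and spark.cassandra.connection.host is configured:

import com.datastax.spark.connector._

// Hypothetical case class matching the partition key of the Cassandra table.
case class CompanyKey(company_id: String)

// Assumes spark-shell or a job where `spark` is the SparkSession.
val keysRdd = spark.sparkContext.parallelize(Seq(CompanyKey("c1"), CompanyKey("c2")))

// Only the Cassandra partitions matching the keys are read, instead of scanning
// the whole table the way a plain DataFrame join would before connector 2.5.0.
val joined = keysRdd.joinWithCassandraTable("my_keyspace", "company")
joined.collect().foreach(println)

From connector 2.5.0 onwards a similar "direct join" can also happen at the DataFrame level, which is what the update above refers to.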
Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? My first preference is Python, then Scala and Java.
The courses of action, in order of preference, are:
Using a thread pool: run different functions doing different aggregations on different threads. I have not seen an example that does this.
Using cluster mode on YARN, submitting different jars. Is this possible? If yes, is it possible in PySpark?
Using Kafka: run different spark-submits on the DataFrame streaming through Kafka.
I am quite new to Spark, and my experience so far is running Spark on YARN for ETL, doing multiple aggregations serially. I was wondering whether it is possible to run these aggregations in parallel, as they are mostly independent.
Considering your broad question, here is a broad answer:
Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
For the rest, it doesn't seem to be clear what you are asking.
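To make the "thread pool" option from the question concrete, here is a hedged Scala sketch (the DataFrame and the aggregations are placeholders): Spark's scheduler accepts jobs submitted concurrently from multiple driver threads, and each Future below triggers its own independent job.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.functions.{col, max, sum}

// Driver-side thread pool: each Future submits an independent Spark job.
// Assumes spark-shell, so `spark` (SparkSession) is in scope.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))

val df = spark.range(0, 1000000).toDF("id") // placeholder dataset
df.cache()                                  // reuse the same data across all jobs

val sumJob    = Future { df.agg(sum("id")).collect() }
val bucketJob = Future { df.groupBy((col("id") % 10).as("bucket")).count().collect() }
val maxJob    = Future { df.agg(max("id")).collect() }

val results = Await.result(Future.sequence(Seq(sumJob, bucketJob, maxJob)), 10.minutes)
results.foreach(rows => println(rows.mkString(", ")))

The same idea works from PySpark with concurrent.futures.ThreadPoolExecutor, since each thread just submits actions against the same SparkContext; setting spark.scheduler.mode=FAIR helps the concurrent jobs share executors more evenly.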
Spark newb question: I'm making exactly the same Spark SQL query in spark-sql and in spark-shell. The spark-shell version takes about 10 seconds, while the spark-sql version takes about 20.
The spark-sql REPL gets the query directly:
spark-sql> SELECT .... FROM .... LIMIT 20
The spark-shell REPL commands are like this:
scala> val df = sqlContext.sql("SELECT ... FROM ... LIMIT 20 ")
scala> df.show()
In both cases, it's exactly the same query. Also, the query returns only a few rows because of the explicit LIMIT 20.
What's different about how the same query is executed from the different CLIs?
I'm running on Hortonworks sandbox VM (Linux CentOS) if that helps.
I think it comes down to two things.
First, it could be related to the order of execution. If you run the query in spark-sql first, Spark has to build the query plan from scratch; if you run the same query again, whether from spark-shell or spark-sql, it can take less time than the first run because the plan (and the metadata it needs) is easier to retrieve.
Second, it could be related to how each CLI acquires resources. I have seen this multiple times: spark-shell gets its resources and starts processing faster than spark-sql. You can check this from the Spark UI, or with top, where you will see that spark-shell actually starts working sooner than spark-sql.
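One way to narrow it down is to time only the query itself, so REPL startup and resource acquisition are excluded from the comparison; a minimal sketch for the spark-shell side (the SELECT is the same placeholder as in the question):

// Time only the query execution, not the shell startup.
val start = System.nanoTime()
val df = sqlContext.sql("SELECT ... FROM ... LIMIT 20") // same placeholder query as above
df.show()
println(s"Query took ${(System.nanoTime() - start) / 1000000} ms")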
How do I do a repartitionByCassandraReplica or joinWithCassandraTable with the PySpark embedded in DSE (DataStax Enterprise 4.8)?
First, repartitionByCassandraReplica is only available for RDDs, not DataFrames (and consequently is not available from PySpark).
joinWithCassandraTable, which pushes the join down to Cassandra, is likewise only available on RDDs, not DataFrames (and consequently not possible from PySpark either).
Sometimes, writing your Spark jobs in plain Scala is still the best way to get these optimizations and perform join and predicate push-down.
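For completeness, a hedged Scala sketch of what that looks like with the RDD API; the keyspace, table, and the UserKey case class are placeholders, and exact defaults may vary between connector versions:

import com.datastax.spark.connector._

// Hypothetical case class matching the partition key of the target table.
case class UserKey(user_id: String)

// `sc` is the SparkContext available in the DSE spark-shell.
val keys = sc.parallelize(Seq(UserKey("u1"), UserKey("u2")))

// Move each key to a Spark partition co-located with the Cassandra replica that owns it,
// then join, so every task only talks to its local node.
val locallyJoined = keys
  .repartitionByCassandraReplica("my_keyspace", "users", partitionsPerHost = 10)
  .joinWithCassandraTable("my_keyspace", "users")

locallyJoined.collect().foreach(println)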