I am trying to understand how Adaptive Query Execution (AQE) and spark.sql.shuffle.partitions interact in Spark 2.4 (though if this changed in Spark 3.0, that would be interesting to note as well).
If I set AQE to true (it is false by default in Spark 2.4, and it only became enabled by default in Spark 3.2), can it choose both higher and lower numbers of partitions? Or does that depend on whether I set spark.sql.adaptive.coalescePartitions.enabled to true?
In my rather large application, my code used to crash until I specified enough partitions. AQE was enabled but somehow wasn't able to do that for me. After setting the partitions explicitly (with AQE still on), the code works reliably.
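One way to picture the behaviour described above, as a rough sketch rather than Spark's actual implementation: the partition-coalescing step starts from the configured number of shuffle partitions and only merges adjacent small partitions toward a target size. Under that assumption it can lower the partition count but never raise it. The function name `coalesce_partitions` and the sizes are made up for illustration.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily merge adjacent shuffle partitions up to target_size.

    Simplified model: a partition that is already larger than target_size
    is kept as-is, never split.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes:
        # Close the current group when adding this partition would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical post-shuffle partition sizes in MB, with a 64 MB target.
merged = coalesce_partitions([10, 20, 30, 100, 5, 5], target_size=64)
print(len(merged), merged)
```

In this toy run six partitions collapse into three, and the oversized 100 MB partition is left untouched. That would be consistent with still having to raise spark.sql.shuffle.partitions by hand when individual partitions are too large.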

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

I'm currently working on a Spark migration project that aims to move all our Spark SQL pipelines to Spark 3.x and take advantage of its performance improvements. My company is using Spark 2.4.0, and we are targeting 3.1.1 as the official version for all Spark SQL data pipelines, but without AQE enabled yet. The primary goal is to keep everything the same but use the newest version. Later on, we can easily enable AQE for all data pipelines.
For a specific case, right after the Spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and, looking at the Spark UI, noticed a change in the query plan, as follows:
Spark 2.4.0:
Spark 2.4.0 is using the default SortMergeJoin to do the join operation between tbl_a and tbl_b, but when we look at the query plan from the new Spark 3.1.1:
We can see that instead of SortMergeJoin it is using BroadcastHashJoin to do the join between tbl_a and tbl_b. Not only that, but if I'm not mistaken, the BroadcastExchange operation is occurring on the big table side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10MB
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
Do you guys have any clue why this is happening? My questions are:
Why has Spark 3 changed the join approach in this situation, given that AQE is disabled and spark.sql.autoBroadcastJoinThreshold is much smaller than the data set size?
Is this the expected behavior, or could this represent a potential bug in Spark 3.x?
Please, let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into the Spark code for some days and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, then it looks for the totalSize table property, as we can see in the following source code:
During my tests, this property held a very small number (way lower than spark.sql.autoBroadcastJoinThreshold), making the Spark optimizer think it was safe to broadcast the right relation. When the actual broadcast operation happened, though, the relation turned out to be much bigger, approximately the size shown in the picture for the right relation, causing the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
The rawDataSize is now zero, so Spark 3 uses totalSize (which holds a reasonable number) as the relation size and consequently no longer uses BHJ in this situation.
Now the issue is figuring out why rawDataSize is so small in the first place, or even zero, given that the Hive property hive.stats.autogather is true by default (it automatically calculates the statistics for every DML command), but that seems to be a separate problem.
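The selection logic described in this update can be modelled in a few lines. This is a simplified sketch of the behaviour as reported here, not Spark's source code. The helper names `estimated_size` and `looks_broadcastable` and the property values are hypothetical.

```python
from typing import Dict, Optional

AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB, as configured for the job


def estimated_size(partition_params: Dict[str, str]) -> Optional[int]:
    """Pick a size estimate the way the update describes it.

    Prefer rawDataSize when it is present and positive, otherwise fall back
    to totalSize; return None when neither is usable.
    """
    for key in ("rawDataSize", "totalSize"):
        value = partition_params.get(key)
        if value is not None and int(value) > 0:
            return int(value)
    return None


def looks_broadcastable(partition_params: Dict[str, str]) -> bool:
    """True when the estimate fits under the broadcast threshold."""
    size = estimated_size(partition_params)
    return size is not None and size <= AUTO_BROADCAST_JOIN_THRESHOLD


# Hypothetical properties: a tiny, stale rawDataSize hides the real on-disk size.
stale = {"rawDataSize": "2048", "totalSize": "5368709120"}
fixed = {"rawDataSize": "0", "totalSize": "5368709120"}
print(looks_broadcastable(stale), looks_broadcastable(fixed))
```

With the stale value the relation looks broadcastable even though totalSize is in the gigabytes. Once rawDataSize is zero, the estimate falls back to totalSize and the broadcast is no longer chosen.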
Spark has made many improvements around joins.
One of them is:
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it’s better than continuing with the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true).

Tungsten encoding in Spark SQL?

I am running a Spark application that has a series of Spark SQL statements that are executed one after the other. The SQL queries are quite complex and the application is working (generating output). These days, I am working towards improving the performance of processing within Spark.
Please suggest whether Tungsten encoding has to be enabled separately or whether it kicks in automatically when running Spark SQL.
I am using Cloudera 5.13 for my cluster (2 nodes).
It is enabled by default in Spark 2.x (and maybe 1.6, but I'm not sure about that).
In any case, you can enable it explicitly on spark-submit as follows:
spark-submit --conf spark.sql.tungsten.enabled=true
Tungsten should be enabled if you see a * next to the operators in the plan:
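As a rough aid for spotting those markers, the sketch below scans the text of an explain() dump and lists the operators prefixed with `*`, which is how Spark 2.x flags operators compiled by whole-stage code generation. The helper `codegen_operators` is hypothetical, and the sample plan is hand-written for illustration, not captured from a real job.

```python
import re
from typing import List

# Optional tree-drawing prefix, then "*", an optional "(n)" stage id, then the operator name.
_CODEGEN_OPERATOR = re.compile(r"^[\s:|+\-]*\*(?:\(\d+\)\s*)?(\w+)")


def codegen_operators(plan_text: str) -> List[str]:
    """Return the names of operators marked with '*' in an explain() dump."""
    found = []
    for line in plan_text.splitlines():
        match = _CODEGEN_OPERATOR.match(line)
        if match:
            found.append(match.group(1))
    return found


# Hand-written plan text for illustration only.
sample_plan = """== Physical Plan ==
*(2) HashAggregate(keys=[id#0L], functions=[count(1)])
+- Exchange hashpartitioning(id#0L, 200)
   +- *(1) HashAggregate(keys=[id#0L], functions=[partial_count(1)])
      +- *(1) Range (0, 10, step=1, splits=8)
"""
print(codegen_operators(sample_plan))
```

An empty result for a non-trivial query would suggest that no operator went through code generation. Operators such as Exchange are never starred, so a plan mixing starred and unstarred lines is normal.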
Also see: How to enable Tungsten optimization in Spark 2?
Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled = true.
Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.
To make sure your code benefits as much as possible from Tungsten optimizations, try to use the default Dataset API with Scala (instead of RDDs).
Dataset brings the best of both worlds with a mix of relational (DataFrame) and functional (RDD) transformations. The Dataset API is the most up to date and adds type safety along with better error handling and far more readable unit tests.

How to enable tungsten sort shuffle in Spark 2.1?

In previous versions, there was a configuration called spark.shuffle.manager which was used to determine the type of shuffle algorithm in Spark. Since Spark 2.0, this configuration has been removed. The default shuffle algorithm is sort-based. As I understand it, the Tungsten shuffle is enabled only if all the requirements are satisfied. How can I know whether the current job uses the original sort-based shuffle or Tungsten sort?
Thank you very much.
SortShuffleManager is the one and only ShuffleManager in Apache Spark.
In other words, there's no way you could use any other ShuffleManager but SortShuffleManager (unless you enabled one using the spark.shuffle.manager property).
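Within SortShuffleManager, the write path is chosen per shuffle. The sketch below is a simplified model of that choice as I understand it from the SortShuffleManager documentation: the serialized (Tungsten) path needs a serializer that supports relocation of serialized objects, no map-side aggregation, and at most 16,777,216 output partitions. The function `pick_shuffle_path` and its parameter names are hypothetical, and the exact conditions may differ between releases.

```python
MAX_SERIALIZED_MODE_PARTITIONS = 1 << 24  # 16,777,216
DEFAULT_BYPASS_MERGE_THRESHOLD = 200      # default of spark.shuffle.sort.bypassMergeThreshold


def pick_shuffle_path(num_partitions: int,
                      map_side_combine: bool,
                      serializer_supports_relocation: bool,
                      bypass_merge_threshold: int = DEFAULT_BYPASS_MERGE_THRESHOLD) -> str:
    """Return which write path this simplified model picks for one shuffle."""
    # Few partitions and no map-side aggregation: write one file per partition, then concatenate.
    if not map_side_combine and num_partitions <= bypass_merge_threshold:
        return "bypass-merge-sort"
    # All requirements for sorting serialized records are met.
    if (serializer_supports_relocation
            and not map_side_combine
            and num_partitions <= MAX_SERIALIZED_MODE_PARTITIONS):
        return "serialized (Tungsten) sort"
    # Fallback: sort deserialized objects.
    return "deserialized sort"


print(pick_shuffle_path(1000, map_side_combine=False, serializer_supports_relocation=True))
print(pick_shuffle_path(1000, map_side_combine=True, serializer_supports_relocation=True))
```

In this model the first call lands on the serialized path, and the second falls back to the deserialized sort because of the map-side aggregation.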

Does Spark ensure data locality?

When I submit my Spark job to a YARN cluster with --num-executors=4, I can see in the Spark UI that 4 executors are allocated on 4 nodes in the cluster. In my Spark application I am taking inputs from various HDFS locations in various steps, but the allocated executors remain the same throughout the execution.
My question is whether Spark does anything for data locality, since it selects the nodes at the very beginning, irrespective of where the input data is situated (at least in the case of HDFS).
I know MapReduce does it to some extent.
Yes, it does. Spark still uses the Hadoop InputFormat and RecordReader interfaces and appropriate implementations such as TextInputFormat. So Spark's behaviour in this case is very similar to common MapReduce. The Spark driver retrieves the block locations of the file and assigns tasks to executors with regard to data locality.
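To make the trade-off from the question concrete: with a fixed set of executors, the scheduler can only prefer executors that happen to sit on, or near, a host holding a replica of the block. The toy model below illustrates that idea. It is not Spark's scheduler, the host and rack names are invented, and only the level names NODE_LOCAL, RACK_LOCAL and ANY are borrowed from Spark's task locality levels.

```python
from typing import Dict, List


def best_locality(block_hosts: List[str],
                  executor_hosts: List[str],
                  rack_of: Dict[str, str]) -> str:
    """Return the best locality level the given executors can offer for one block."""
    # An executor runs on a host that stores a replica of the block.
    if set(block_hosts) & set(executor_hosts):
        return "NODE_LOCAL"
    # Otherwise, check whether any executor shares a rack with a replica.
    block_racks = {rack_of[h] for h in block_hosts if h in rack_of}
    executor_racks = {rack_of[h] for h in executor_hosts if h in rack_of}
    if block_racks & executor_racks:
        return "RACK_LOCAL"
    return "ANY"


# Hypothetical cluster layout: four executors, blocks replicated elsewhere.
racks = {"node1": "r1", "node2": "r1", "node3": "r1", "node4": "r1",
         "node7": "r2", "node8": "r2", "node9": "r1"}
executors = ["node1", "node2", "node3", "node4"]
print(best_locality(["node2", "node7", "node8"], executors, racks))
print(best_locality(["node7", "node8"], executors, racks))
```

In this layout the first block has a replica on an executor host, while the second can only be read remotely. That is the cost of executors being placed before the input locations are known.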

Does Spark SQL include a table streaming optimization for joins?

Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?
When doing joins, Hive assumes the last table is the largest one. As a join optimization, it will attempt to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */ hint which tells it the table that should be streamed. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
This question has been asked for normal RDD processing, outside of Spark SQL, here. The answer does not apply to Spark SQL where the developer has no control of explicit cache operations.
I looked for an answer to this question some time ago, and all I could come up with was setting the spark.sql.autoBroadcastJoinThreshold parameter, which is 10 MB by default. Spark will then attempt to automatically broadcast all tables smaller than the limit you set. Join order plays no role for this setting.
If you are interested in further improving join performance, I highly recommend this presentation.
This answer is based on the upcoming Spark 2.3 (RC2 is being voted on for the next release).
As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
It does not in the latest (and voted to be released soon) Spark 2.3 either.
There is no support for the STREAMTABLE hint, but given the recent change (in SPARK-20857 Generic resolved hint node) to build a hint framework, one should be fairly easy to write.
You'd have to write some Spark optimizations and possibly physical plan(s) that would support STREAMTABLE (which seems like a lot of work) but it's possible. The tools are there.
Regarding join optimizations, in the upcoming Spark 2.3 there are two main logical optimizations:
CostBasedJoinReorder (exclusively for cost-based optimization)
