Cost Based Optimizer (CBO) stats are not used while evaluating query plans in Spark SQL

We are trying to leverage the CBO to get better query plans for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver. The following settings were added to spark-defaults.conf:
{code}
spark.sql.cbo.enabled true
spark.experimental.extrastrategies intervaljoin
spark.sql.cbo.joinReorder.enabled true
{code}
The tables that we are using are not partitioned.
Please let me know if you need further details.

You provide little detail. Please check that all of the steps set out below have been followed.
The following is quoted from this excellent article, which covered Spark 2.2 when I last looked at it: https://www.waitingforcode.com/apache-spark-sql/spark-sql-cost-based-optimizer/read
Spark SQL implementation
At the time of writing (2.2.0 released) Spark SQL Cost Based Optimization is disabled by default and can be activated through spark.sql.cbo.enabled property. When enabled, it applies in: filtering, projection, joins and aggregations, as we can see in corresponding estimation objects from org.apache.spark.sql.catalyst.plans.logical.statsEstimation package: FilterEstimation, ProjectEstimation, JoinEstimation and AggregateEstimation.
Even if at first glance the use of estimation objects seems to be conditioned only by the configuration property, it's not always the case. The Spark's CBO is applied only when the statistics about manipulated data are known (read more about them in the post devoted to Statistics in Spark SQL). This condition is expressed by EstimationUtils method:
def rowCountsExist(conf: SQLConf, plans: LogicalPlan*): Boolean =
plans.forall(_.stats(conf).rowCount.isDefined)
Filtering is an exception, because it checks directly whether the row count of its child exists:
if (childStats.rowCount.isEmpty) return None
The statistics can be gathered by running the ANALYZE TABLE $TABLE_NAME COMPUTE STATISTICS command before the processing starts. When the ANALYZE command is called, it is executed by org.apache.spark.sql.execution.command.AnalyzeTableCommand#run(SparkSession), which updates the org.apache.spark.sql.catalyst.catalog.SessionCatalog statistics of the processed data.
The only problem with ANALYZE command is that it can be called only for Hive and in-memory data stores.
Also, CBO does not work properly with Partitioned Hive Parquet tables; CBO only gives the size and not the number of rows estimated.
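As a rough illustration of the statistics step described above, the sketch below only builds the statement strings one might submit through spark-sql or the Thrift server. The helper name analyze_statements and the table and column names are hypothetical; the FOR COLUMNS form is the variant that collects column-level statistics, and DESCRIBE EXTENDED is one way to check what ended up in the catalog.

```python
def analyze_statements(table, columns=()):
    """Build statements that collect, then display, table statistics.

    `table` and `columns` are placeholders used for illustration only.
    """
    # Table-level statistics: size in bytes and row count.
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        # Column-level statistics need a separate FOR COLUMNS statement.
        cols = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {cols}"
        )
    # DESCRIBE EXTENDED lets you inspect whether statistics were recorded.
    statements.append(f"DESCRIBE EXTENDED {table}")
    return statements


if __name__ == "__main__":
    for stmt in analyze_statements("sales", ["customer_id", "amount"]):
        print(stmt + ";")
```

If the row count is still missing after running these statements, the rowCountsExist check quoted above would keep the join and aggregate estimations from being applied.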

Related

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

I'm currently working on a migration project that aims to move all our Spark SQL pipelines to Spark 3.x and take advantage of its performance improvements. My company is using Spark 2.4.0, and we are targeting 3.1.1 as the official version for all Spark SQL data pipelines, but without AQE enabled yet. The primary goal is to keep everything the same while using the newest version. Later on, we can easily enable AQE for all data pipelines.
For a specific case, right after the spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and, looking at the Spark UI, noticed a change in the query plan, as follows:
Spark 2.4.0: [query plan screenshot]
Spark 2.4.0 is using the default SortMergeJoin to do the join operation between tbl_a and tbl_b, but when we look at the query plan from the new Spark 3.1.1: [query plan screenshot]
We can see that instead of SortMergeJoin it is using BroadcastHashJoin to do the join between tbl_a and tbl_b. Not only that, but if I'm not wrong, the BroadcastExchange operation is occurring on the big table side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10MB
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
Do you have any clue why this is happening? My questions are:
Why has Spark 3 changed the join approach in this situation, given that AQE is disabled and spark.sql.autoBroadcastJoinThreshold is much smaller than the data set size?
Is this the expected behavior, or could this represent a potential bug in Spark 3.x?
Please let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into the Spark code for some days and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, then it looks for the totalSize table property, as we can see in the following source code:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L69
During my tests, this property held a very small number (way lower than spark.sql.autoBroadcastJoinThreshold), making the Spark optimizer think it was safe to broadcast the right relation. When the actual broadcast operation happened, however, the relation turned out to be much bigger, approximately the size shown in the picture for the right relation, which caused the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE') COMPUTE STATISTICS;
The rawDataSize is now zero, so Spark 3 uses the totalSize (which has a reasonable value) as the relation size and consequently does not use BHJ in this situation.
The remaining issue is figuring out why the rawDataSize was so small in the first place, or even zero, given that the Hive property hive.stats.autogather is true by default (it automatically calculates the statistics for every DML command), but that seems to be a separate problem.
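The effect described in this update can be modelled in a few lines. This is only a sketch of the behavior as described above (prefer rawDataSize when it is positive, otherwise fall back to totalSize, then compare the estimate with the broadcast threshold), not Spark's actual code. The function names and the sample parameter values are made up for illustration.

```python
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB, as in the question


def estimated_partition_size(params):
    """Pick a size from Hive partition parameters, preferring rawDataSize.

    Models the order described above: rawDataSize if positive,
    otherwise totalSize if positive, otherwise None (unknown).
    """
    raw = int(params.get("rawDataSize", 0))
    total = int(params.get("totalSize", 0))
    if raw > 0:
        return raw
    if total > 0:
        return total
    return None


def would_broadcast(params, threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """True when the estimate is known and fits under the threshold."""
    size = estimated_partition_size(params)
    # A negative threshold (-1) disables broadcast joins entirely.
    return threshold >= 0 and size is not None and size <= threshold


if __name__ == "__main__":
    # Hypothetical values: a misleadingly small rawDataSize next to a large totalSize.
    stale = {"rawDataSize": "4096", "totalSize": "5368709120"}
    # After ANALYZE, rawDataSize is zero and totalSize takes over.
    analyzed = {"rawDataSize": "0", "totalSize": "5368709120"}
    print(would_broadcast(stale), would_broadcast(analyzed))
```

With the stale parameters the tiny rawDataSize wins and the relation looks broadcastable. Once rawDataSize is zero the estimate falls back to totalSize and the relation no longer fits under the threshold.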
Spark has made many improvements around joins.
One of them is :
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it's better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true)
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#converting-sort-merge-join-to-broadcast-join

Spark 2.4.6 + JDBC Reader: When predicate pushdown set to false, is data read in parallel by spark from the engine?

I am trying to extract data from a big table in SAP HANA, which is around 1.5 TB in size, and the best way is to run the extraction in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but in order to actually extract in parallel it requires the partition column, lower/upper bound and number of partitions options to be set. To make the extraction easier to operate, I considered adding a partition column computed with the row_number() function and using MIN() and MAX() as the lower and upper bounds respectively. The operations team would then only be required to provide the number of partitions.
The problem is that HANA runs out of memory, and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the where filters and retrieve the corresponding chunk.
So my question is: if I disable the predicate pushdown option, how does Spark behave? Is the data read by only one executor, with the filters then applied on the Spark side? Or does it do some magic to split the fetching part from the DB?
What would you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.
Before executing your primary query from Spark, run a pre-ingestion query to fetch the extent of the dataset being loaded, i.e. the Min(), Max() etc. that you mentioned.
Assuming the data is uniformly distributed between the Min and Max keys, you can partition the read across executors in Spark by providing the Min, the Max and the number of partitions.
In this case you don't need (or want) to change your primary data source by adding extra columns to support data ingestion.
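To make the Min/Max suggestion concrete, here is a simplified model of how a numeric range can be cut into one WHERE clause per partition once a partition column, lower bound, upper bound and number of partitions are known. It is a sketch, not Spark's implementation: the function name and the column name id are hypothetical, and the stride arithmetic is deliberately naive.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE-clause string per partition (simplified model)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower bound must be smaller than upper bound")
    # Never create more partitions than there are values in the range.
    num_partitions = min(num_partitions, upper - lower)
    if num_partitions == 1:
        return []  # a single partition reads everything, no predicate needed
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {bound}" if i > 0 else None
        bound += stride
        upper_clause = f"{column} < {bound}" if i < num_partitions - 1 else None
        if lower_clause is None:
            # First partition is open-ended below and also collects NULL keys.
            predicates.append(f"{upper_clause} OR {column} IS NULL")
        elif upper_clause is None:
            # Last partition is open-ended above.
            predicates.append(lower_clause)
        else:
            predicates.append(f"{lower_clause} AND {upper_clause}")
    return predicates


if __name__ == "__main__":
    for predicate in partition_predicates("id", 0, 100, 4):
        print(predicate)
```

Because the first and last clauses are open-ended, the bounds only decide how the range is cut and do not filter rows out. Each partition issues its own query, so a skewed key distribution leaves some partitions much heavier than others, which is why the uniform-distribution assumption matters.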

The basic statistics estimation for the tables of Spark SQL

I know we could explicitly ANALYZE the table in Spark SQL so we could get some exact statistics.
However, is there some utility in Catalyst that does not require explicitly scanning the entire table but can still give me some rough statistics? I don't really care about the real size of a table, only about the relative size between tables, so that I can use this information to decide which table is larger than the others during query compilation.
There are two utilities in Catalyst:
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.{BasicStatsPlanVisitor,SizeInBytesOnlyStatsPlanVisitor}
But it looks like they both require explicitly scanning the table.
Thanks.
There are two ways: either the stats are taken from the metastore, which requires running ANALYZE in advance (a scan over the data), or the stats (actually only SizeInBytes) are estimated using InMemoryFileIndex, which does not require scanning the data; instead, Spark gathers the size of each file through the Hadoop API.
Which of these methods is used depends on further settings. For example, if SizeInBytes is available in the metastore and CBO (cost based optimization) is enabled by the configuration setting spark.sql.cbo.enabled, Spark will take it from the metastore. If CBO is off (which is the default in Spark 2.4), Spark will use InMemoryFileIndex. If SizeInBytes is not available in the metastore, Spark can still use either CatalogFileIndex or InMemoryFileIndex. CatalogFileIndex will be used, for example, if your table is partitioned, more specifically if this condition is satisfied (taken directly from the Spark source code):
val useCatalogFileIndex = sparkSession.sqlContext.conf.manageFilesourcePartitions && catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog && catalogTable.get.partitionColumnNames.nonEmpty
In this case, if the stats are not in the metastore, Spark will use defaultSizeInBytes from the configuration setting spark.sql.defaultSizeInBytes, which is Long.MaxValue by default, so the size will be overestimated to the maximum value. I guess this is the worst scenario: the stats are not in the metastore, but Spark looks for them there using CatalogFileIndex, does not find them, and thus uses a very large, unrealistic value.
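The decision described in this answer can be summarised as a small function. This is only a model of the answer's description, not Spark's code, and the order of the checks is one reading of it (in particular, that a metastore size is ignored when CBO is off). Actual behavior varies between Spark versions. All names below are hypothetical.

```python
LONG_MAX = 2**63 - 1  # Long.MaxValue, the default quoted for spark.sql.defaultSizeInBytes


def estimate_size_in_bytes(cbo_enabled, metastore_size, uses_catalog_file_index,
                           file_sizes, default_size=LONG_MAX):
    """Return (size, source) following the cases described in the answer.

    metastore_size: size recorded by ANALYZE, or None when absent.
    uses_catalog_file_index: True for catalog-tracked partitioned tables.
    file_sizes: per-file sizes that a file listing would report.
    """
    if cbo_enabled and metastore_size is not None:
        # Stats were computed in advance and CBO is allowed to use them.
        return metastore_size, "metastore"
    if uses_catalog_file_index:
        # Spark looks in the catalog, finds nothing, and falls back to the default.
        return default_size, "defaultSizeInBytes"
    # No scan of the data: sum the file sizes reported by the file system.
    return sum(file_sizes), "InMemoryFileIndex"


if __name__ == "__main__":
    files = [128_000_000, 64_000_000]
    print(estimate_size_in_bytes(True, 150_000_000, True, files))
    print(estimate_size_in_bytes(False, None, True, files))
    print(estimate_size_in_bytes(False, None, False, files))
```

For comparing relative table sizes, the last branch is the cheap one, while the middle branch is the worst case the answer warns about: every such table gets the same enormous estimate.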

Is there way to get a rowcount on a query using Snowflake and its Spark Connector?

I am running a query in my Spark application that returns a substantially large amount of data. I would like to know how many rows are being queried, for logging purposes. I can't seem to find a way to get the number of rows without either manually counting them or calling a method to count them for me, and as the data is fairly large this gets expensive just for logging. Is there a place where the row count is saved and available to grab?
I have read here that the Python connector saves the row count into the object model, but I can't seem to find any equivalent for the Spark Connector or its underlying JDBC.
The fastest way I have found is rdd.collect().size on the RDD that Spark provides. It is about 15% faster than calling rdd.count()
Any help is appreciated 😃
The limitation is within Spark's APIs, which do not directly offer metrics for a completed distributed operation, such as a row count after a save to a table or file. Snowflake's Spark Connector is limited to the calls Apache Spark offers for its integration, and the cursor attributes otherwise available in the Snowflake Python and JDBC Connectors are not accessible through Py/Spark.
The simpler form of the question, counting an executed result with the Snowflake specifics removed, has been discussed previously with solutions: Spark: how to get the number of written rows?

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark dataframes by adding Ignite on top of them. The following code is how we currently read the dataframe:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark dataframe from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
import org.apache.ignite.spark.IgniteDataFrameSettings

val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE) // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person") // Table to read
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config
  .load()
df.createOrReplaceTempView("person")
SQL queries (such as select a, b, c from table where x) on the Ignite dataframe work, but performance is much slower than with Spark alone (i.e. without Ignite, querying the Spark DF directly). A SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" but a smaller result is processed faster. Overall, Ignite's dataframe support seems to be a simple wrapper on top of Spark, so in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in XML (I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch Parquet data, cache it locally in Spark, and only then execute the query. In case of Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
