Disable Spark Catalyst optimizer - apache-spark

To give some background, I am trying to run the TPC-DS benchmark on Spark with and without Spark's Catalyst optimizer. For complicated queries on smaller datasets, we might be spending more time optimizing the plans than actually executing them, so I wanted to measure the performance impact of the optimizer on the overall execution of the query.
Is there a way to disable some or all of the Spark Catalyst optimization rules?

This ability was added in Spark 2.4.0 as part of SPARK-24802.
val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
  .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
    "specified by their rule names and separated by comma. It is not guaranteed that all the " +
    "rules in this configuration will eventually be excluded, as some rules are necessary " +
    "for correctness. The optimizer will log the rules that have indeed been excluded.")
  .stringConf
  .createOptional
You can find the list of optimizer rules here.
But ideally, we shouldn't be disabling the rules, since most of them provide performance benefits. We should identify the rule that consumes the time, check whether it is actually useful for the query, and only then disable it.
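For example, a minimal sketch of excluding rules at runtime in spark-shell (the two rule names below are just illustrative picks; substitute the ones you have identified as costly):
// Assumes a SparkSession available as `spark` (as in spark-shell).
// Rules required for correctness will not be excluded even if listed here;
// the optimizer logs which rules were actually excluded.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.LikeSimplification," +
    "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")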

I know it's not an exact answer, but it can help you.
This assumes your driver is not multithreaded (a hint for optimization if Catalyst is slow? :) ).
If you want to measure time spent in Catalyst, just go to Spark UI and check how much time your executors are idle, or check the list of stages/jobs.
If you have a job that started at 15:30 with a duration of 30 seconds, and the next one starts at 15:32, it probably means Catalyst is taking 1:30 to optimize (assuming no driver-heavy work is done).
Or even better, put a log statement before calling every action in Spark and then check how much time passes until the task is actually sent to the executor.
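A minimal sketch of that logging approach, assuming a spark-shell session and some DataFrame df you want to test with:
// Log wall-clock time right before the action, then compare it with the job
// submission time shown in the Spark UI; the gap is roughly the driver-side
// planning/optimization overhead.
val before = System.currentTimeMillis()
println(s"Calling action at $before")
df.collect() // or any other action
println(s"Action returned after ${System.currentTimeMillis() - before} ms (planning + execution)")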

Just to complete the answer: I asked on the pull request for SPARK-24802 how to do it, and Takeshi Yamamuro kindly answered:
scala> Seq("abc", "def").toDF("v").write.saveAsTable("t")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND EndsWith(v#18, bc))
                                    ^^^^^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
scala> sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LikeSimplification")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND v#18 LIKE %bc)
                                         ^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
I hope this helps.

Related

Why would finding an aggregate of a partition column in Spark 3 take very long time?

I'm trying to query the MIN(dt) in a table partitioned by the dt column, using the following query in both Spark 2 and Spark 3:
SELECT MIN(dt) FROM table_name
The table is stored in parquet format in S3, where each dt is a separate folder, so this seems like a pretty simple operation. There's about 3,200 days of data.
In Spark 2, this query completes in ~1 minute, while in Spark 3, the query takes over an hour (not sure how long exactly since it hasn't finished yet).
In Spark 3, the execution plan is:
AdaptiveSparkPlan (10)
+- == Current Plan ==
   HashAggregate (6)
   +- ShuffleQueryStage (5)
      +- Exchange (4)
         +- * HashAggregate (3)
            +- * ColumnarToRow (2)
               +- Scan parquet table_name (1)
+- == Initial Plan ==
   HashAggregate (9)
   +- Exchange (8)
      +- HashAggregate (7)
         +- Scan parquet table_name (1)
It's confusing to me how this would take a long time, as the data is already partitioned by dt. Spark only needs to determine which partitions have any rows and return the min of those.
What you're suggesting was implemented once as the OptimizeMetadataOnlyQuery optimizer rule, via JIRA SPARK-15752 "Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators".
However, it was found to cause correctness issues sometimes, when some of the partitions contained zero-row files, see JIRA SPARK-26709 "OptimizeMetadataOnlyQuery does not correctly handle the files with zero record".
Along with the fix, an internal Spark config spark.sql.optimizer.metadataOnly was added to provide a way to circumvent full-table scans "at your own risk", i.e. when you are certain that none of your partitions are empty. Possibly, in your Spark 2 you have it set to true (or your Spark 2 doesn't include the fix at all). See also SPARK-34194 for additional discussion around it.
Spark 3.0 deprecated this config (SPARK-31647), so most likely it is set to false in your environment, which causes Spark to scan all table partitions before aggregating the result to find the min. But for the time being, you can still try setting it to true to speed up your query; just beware of the consequences.
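If you want to try it anyway, a minimal sketch (at your own risk, as said above; assumes a spark-shell session):
// Deprecated in Spark 3.0 and only safe if no partition contains zero-row files.
// With it enabled, MIN(dt) can be answered from partition metadata instead of a full scan.
spark.sql("SET spark.sql.optimizer.metadataOnly=true")
spark.sql("SELECT MIN(dt) FROM table_name").show()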

How does spark calculate the number of reducers in a hash shuffle?

I am trying to understand hash shuffle in Spark. I am reading this article
Hash Shuffle:
Each mapper task creates separate file for each separate reducer, resulting in M * R total files on the cluster, where M is the number of “mappers” and R is the number of “reducers”. With high amount of mappers and reducers this causes big problems, both with the output buffer size, amount of open files on the filesystem, speed of creating and dropping all these files.
The logic of this shuffler is pretty dumb: it calculates the amount of “reducers” as the amount of partitions on the “reduce” side
Can you help me understand the emboldened part? How does it know the amount of partitions on the reduce side or, what does "amount of partitions on the reduce side" even mean? Is it equal to spark.sql.shuffle.partitions? If it is indeed equal to that, then what is even there to calculate? A very small example would be very helpful.
spark.sql.shuffle.partitions is just the default used when the number of partitions for a shuffle isn't set explicitly. So the "calculation", at a minimum, would involve checking whether a specific number of partitions was requested or whether Spark should use the default.
Quick example:
scala> df.repartition(400,col("key")).groupBy("key").avg("value").explain()
== Physical Plan ==
*(2) HashAggregate(keys=[key#178], functions=[avg(value#164)])
+- *(2) HashAggregate(keys=[key#178], functions=[partial_avg(value#164)])
   +- Exchange hashpartitioning(key#178, 400) <<<<< shuffle across increased number of partitions
      +- *(1) Project [key#178, value#164]
         +- *(1) FileScan parquet adb.atable[value#164,key#178,othercolumns...] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ns1/hive/adb/atable/key=123..., PartitionCount: 3393, PartitionFilters: [isnotnull(key#178), (cast(key#178 as string) > 100)], PushedFilters: [], ReadSchema: struct<value:double,othercolumns...>
scala>
In Spark 3 and up, Adaptive Query Execution (AQE) could also interject and revise that number, in an attempt to optimize execution by coalescing, preserving (e.g. ENSURE_REQUIREMENTS), or increasing partitions.
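For reference, a minimal sketch of the AQE knobs involved (Spark 3.x; the initial-partition value is illustrative):
// AQE is enabled by default in recent Spark 3.x releases.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Allow AQE to coalesce small shuffle partitions after the map stage completes.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// The partition count AQE starts from before coalescing (illustrative value).
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "400")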
EDIT: A side note -- your article is quite old (2015 was ages ago :)) and talks about pre-SparkSQL/pre-dataframe times. I'd try to find something more relevant.
EDIT 2: ...But even there, in the comments section, author rightfully says: In fact, here the question is more general. For most of the transformations in Spark you can manually specify the desired amount of output partitions, and this would be your amount of “reducers”...
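To illustrate that remark, a minimal sketch with the RDD API (the data is made up): the explicit numPartitions argument is your "amount of reducers" for that shuffle.
// The second argument to reduceByKey sets the number of "reduce"-side
// partitions for this shuffle explicitly.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _, 8) // 8 reducers for this shuffle
println(reduced.getNumPartitions) // 8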

Should Spark JDBC partitionColumns be recognized as DataFrame partitions?

I've used partitionColumn options to read a 300 million row table, hoping to achieve low memory/disk requirements for my ETL job (in Spark 3.0.1).
However, the explain plan shows at the start/leaf:
+- Exchange hashpartitioning(partitionCol#1, 200), true, [id=#201]
   +- *(1) Scan JDBCRelation(table)[numPartitions=200] (partitionCol#1, time#2)...
I would have expected that shuffling was not necessary here, since the partitionCol was specified in the JDBC option.
There's a whole lot going on in the full plan, but every window operation partitions by partitionCol first and then other columns.
I've tried:
Ensuring my columns are declared not-null (since I saw Sort[partitionCol#1 ASC NULLS FIRST...] being injected and thought that might be an issue)
Checking dataframe partitioning: jdbcDF.rdd.partitioner is None (which seems to confirm it's not understood)
Following "How to join two JDBC tables and avoid Exchange?", which leads to the DataSource V2 partitioning reporting interface (fixed in 2.3.1), but perhaps that doesn't extend to JDBC loading?
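For reference, a minimal sketch of the kind of partitioned JDBC read being described (the URL, table, column and bounds are placeholder assumptions, not the actual values):
// The JDBC source splits the read into 200 range-based partitions, but it does
// not report any partitioning to Catalyst, which is why a hashpartitioning
// Exchange can still show up later and why jdbcDF.rdd.partitioner is None.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "table")
  .option("partitionColumn", "partitionCol")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "200")
  .load()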

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

I always thought that the Dataset and DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right?
So I have a very simple case:
case class Player(playerID: String, birthYear: Int)

val playersDs: Dataset[Player] = session.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv(PeopleCsv)
  .as[Player]

// Let's try to find players born in 1999.
// This will work, you have compile time safety... but it will not use predicate pushdown!!!
playersDs.filter(_.birthYear == 1999).explain()

// This will work as expected and use predicate pushdown!!!
// But you can't have compile time safety with this :(
playersDs.filter('birthYear === 1999).explain()
The explain output from the first example shows that it's NOT doing predicate pushdown (notice the empty PushedFilters):
== Physical Plan ==
*(1) Filter <function1>.apply
+- *(1) FileScan csv [...] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:People.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<playerID:string,birthYear:int,birthMonth:int,birthDay:int,birthCountry:string,birthState:s...
While the second example does it correctly (notice the PushedFilters):
== Physical Plan ==
*(1) Project [.....]
+- *(1) Filter (isnotnull(birthYear#11) && (birthYear#11 = 1999))
   +- *(1) FileScan csv [...] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:People.csv], PartitionFilters: [], PushedFilters: [IsNotNull(birthYear), EqualTo(birthYear,1999)], ReadSchema: struct<playerID:string,birthYear:int,birthMonth:int,birthDay:int,birthCountry:string,birthState:s...
So the question is: how can I use the Dataset API and have compile-time safety, with predicate pushdown working as expected?
Is it possible? If not, does this mean that the Dataset API gives you compile-time safety at the cost of performance? (The DataFrame will be much faster in this case, especially when processing large Parquet files.)
This is the line in your physical plan you should remember, because it shows the real difference between Dataset[T] and DataFrame (which is Dataset[Row]):
Filter <function1>.apply
I keep saying that people should stay away from the typed Dataset API and keep using the untyped DataFrame API, as the Scala code becomes a black box to the optimizer in too many places. You've just hit one of those places. Think also about the deserialization of all the objects that Spark SQL keeps off the JVM heap to avoid GC pressure: every time you touch those objects you literally ask Spark SQL to deserialize them and load them onto the JVM, which puts a lot of pressure on the GC (and the GC will be triggered more often with the typed Dataset API than with the untyped DataFrame API).
See UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice.
Quoting Reynold Xin after I asked the very same question on dev#spark.a.o mailing list:
The UDF is a black box so Spark can't know what it is dealing with. There are simple cases in which we can analyze the UDFs byte code and infer what it is doing, but it is pretty difficult to do in general.
There is a JIRA ticket for such cases SPARK-14083 Analyze JVM bytecode and turn closures into Catalyst expressions, but as someone said (I think it was Adam B. on twitter) it'd be a kind of joke to expect it any time soon.
One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions.
// Let's try to find players born in 1999.
// This will work, you have compile time safety... but it will not use predicate pushdown!!!
playersDs.filter(_.birthYear == 1999).explain()
The above code is equivalent to the following:
val someCodeSparkSQLCannotDoMuchOutOfIt = (p: Player) => p.birthYear == 1999
playersDs.filter(someCodeSparkSQLCannotDoMuchOutOfIt).explain()
someCodeSparkSQLCannotDoMuchOutOfIt is exactly where you put optimizations aside and let Spark Optimizer skip it.
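One practical compromise, which the second example in the question already hints at: express the predicate as a Column (so Catalyst can see it and push it down) while keeping the element type Player for the rest of the pipeline. A minimal sketch, assuming import spark.implicits._ as in spark-shell:
// filter(condition: Column) on a Dataset[Player] still returns Dataset[Player],
// so downstream code keeps typed access to Player's fields, while the predicate
// itself is a Catalyst expression that can be pushed down to the scan.
val players1999: Dataset[Player] = playersDs.filter($"birthYear" === 1999)
players1999.map(_.playerID).show() // typed access after the pushed-down filter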

How to know the number of Spark jobs and stages in (broadcast) join query?

I use Spark 2.1.2.
I am trying to understand what the various Spark UI tabs display as a job runs. I use spark-shell --master local and run the following join query:
val df = Seq(
  (55, "Canada", -1, "", 0),
  (77, "Ontario", 55, "/55", 1),
  (100, "Toronto", 77, "/55/77", 2),
  (104, "Brampton", 100, "/55/77/100", 3)
).toDF("id", "name", "parentId", "path", "depth")

val dfWithPar = df.as("df1").
  join(df.as("df2"), $"df1.parentId" === $"df2.Id", "leftouter").
  select($"df1.*", $"df2.name" as "parentName")

dfWithPar.show
This is the physical query plan:
== Physical Plan ==
*Project [Id#11, name#12, parentId#13, path#14, depth#15, name#25 AS parentName#63]
+- *BroadcastHashJoin [parentId#13], [Id#24], LeftOuter, BuildRight
   :- LocalTableScan [Id#11, name#12, parentId#13, path#14, depth#15]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- LocalTableScan [Id#24, name#25]
I've got two questions about the query execution.
Why are there two jobs for the query?
Why is the stage view shown for both jobs identical? Below is a screenshot of the stage view of job id 1, which is exactly the same as that of job id 0. Why?
I use Spark 2.3.0 to answer your question (2.3.1-SNAPSHOT actually) since it is the latest and greatest at the time of this writing. That changes very little about query execution (if anything important) as the physical query plans in your 2.1.2 and my 2.3.0 are exactly the same (except the per-query codegen stage ID in round brackets).
After dfWithPar.show the structured query (that you built using Spark SQL's Dataset API for Scala) is optimized to the following physical query plan (I'm including it in my answer for better comprehension).
scala> dfWithPar.explain
== Physical Plan ==
*(1) Project [Id#11, name#12, parentId#13, path#14, depth#15, name#24 AS parentName#58]
+- *(1) BroadcastHashJoin [parentId#13], [Id#23], LeftOuter, BuildRight
   :- LocalTableScan [Id#11, name#12, parentId#13, path#14, depth#15]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [Id#23, name#24]
Number of Spark Jobs
Why are there two jobs for the query?
I'd say there are even three Spark jobs.
tl;dr One Spark job is for BroadcastHashJoinExec physical operator whereas the other two are for Dataset.show.
In order to understand the query execution and the number of Spark jobs of a structured query, it is important to understand the difference between structured queries (described using Dataset API) and RDD API.
Spark SQL's Datasets and Spark Core's RDDs both describe distributed computations in Spark. RDDs are the Spark "assembler" language (akin to the JVM bytecode) while Datasets are higher-level descriptions of structured queries using SQL-like language (akin to JVM languages like Scala or Java as compared to the JVM bytecode I used earlier).
What's important is that structured queries using Dataset API eventually end up as a RDD-based distributed computation (which could be compared to how the Java or Scala compilers transform the higher-level languages to the JVM bytecode).
The Dataset API is an abstraction over the RDD API, and when you call an action on a DataFrame or Dataset, that action is translated into RDD operations.
With that, you should not be surprised that Dataset.show will in the end call an RDD action that in turn will run zero, one or many Spark jobs.
Dataset.show (with numRows equal to 20 by default) in the end calls showString, which does take(numRows + 1) to get an Array[Row].
val takeResult = newDf.select(castCols: _*).take(numRows + 1)
In other words, dfWithPar.show() is equivalent to dfWithPar.take(21), which in turn is equivalent to dfWithPar.head(21), as far as the number of Spark jobs is concerned.
You can see them and their number of jobs in the SQL tab. They should all be equal.
show or take or head all lead to collectFromPlan that triggers the Spark jobs (by calling executeCollect).
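If you want to see that for yourself, a minimal sketch that counts the jobs an action triggers (assumes spark-shell and the dfWithPar from above):
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Count job submissions while the action runs; purely illustrative.
val jobCount = new AtomicInteger(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = { jobCount.incrementAndGet() }
})
dfWithPar.show()
println(s"Jobs triggered by show(): ${jobCount.get()}")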
To answer your question about the number of jobs, you have to know how all the physical operators in the query work, i.e. their behaviour at runtime and whether they trigger Spark jobs at all.
BroadcastHashJoinExec and BroadcastExchangeExec Physical Operators
The BroadcastHashJoinExec binary physical operator is used when the right side of a join can be broadcast (which is controlled by spark.sql.autoBroadcastJoinThreshold, 10M by default).
The BroadcastExchangeExec unary physical operator is used to broadcast rows (of a relation) to worker nodes (to support BroadcastHashJoinExec).
When BroadcastHashJoinExec is executed (to generate a RDD[InternalRow]), it creates a broadcast variable that in turn executes BroadcastExchangeExec (on a separate thread).
That's why the run at ThreadPoolExecutor.java:1149 Spark job 0 was run.
You could see the single Spark job 0 ran if you executed the following:
// Just a single Spark job for the broadcast variable
val r = dfWithPar.rdd
That requires the structured query to be executed to produce an RDD, which is then the target of the action that gives the final result.
You would not have had the Spark job if you had not ended up with a broadcast join query.
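One way to see that, as a hedged sketch: disable automatic broadcast joins and re-run the query; the join should fall back to a non-broadcast strategy and the broadcast-variable job should disappear.
// Setting the threshold to -1 disables automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
dfWithPar.explain() // should no longer show BroadcastHashJoin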
RDD.take Operator
What I missed at first when I answered the question was that the Dataset operators, i.e. show, take and head, will in the end lead to RDD.take.
take(num: Int): Array[T] Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Please note that take says "It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit." That's the key to understanding the number of Spark jobs in your broadcast join query.
Every iteration (in the description above) is a separate Spark job, starting with the very first partition and scanning 4 times as many partitions in every following iteration:
// RDD.take
def take(num: Int): Array[T] = withScope {
  ...
  while (buf.size < num && partsScanned < totalParts) {
    ...
    val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
    ...
  }
}
Have a look at the following RDD.take with 21 rows.
// The other two Spark jobs
r.take(21)
You will get 2 Spark jobs as in your query.
Guess how many Spark jobs you will have if you executed dfWithPar.show(1).
Why Are Stages Identical?
Why are the stage view shown for both jobs identical? Below is a screenshot of the stage view of job id 1 which is exactly the same of job id 0. Why?
That's easy to answer since both Spark jobs come from the same RDD.take(21).
The first Spark job scans the first partition, and since it did not have enough rows, another Spark job was launched to scan more partitions.