steps in spark physical plan not assigned to DAG step - apache-spark

I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan Spark has generated (with the Catalyst engine) looks to be corrupted: some of the steps in the physical plan have not been assigned an order id, and thus all evaluation on the right side of the join is not completed in the Spark query.
Here is the example query:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')
joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count() ### shows incorrect value ###
joined_df.explain(True)
Here is an example of the physical plan returned by Spark:
== Physical Plan ==
SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(encntr_id#0, 200)
: +- *(1) Filter isnotnull(encntr_id#0)
: +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#a6df563
+- Sort [encntr_id#12 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(encntr_id#12, 200)
+- Filter isnotnull(encntr_id#12)
+- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#60dd22d9
Note that the data source scan, filter, exchange and sort on the right side of the join have not been assigned an order id.
Can anyone shed some light on this issue for me? Why would the physical plan, which looks correct, not be assigned an evaluation order id?

Figured this out internally.
It turns out the Spark optimization routine can be affected by the configuration setting
spark.sql.codegen.maxFields
which has implications for how Spark optimizes the read from 'fat' (wide) tables.
In my case the setting was set low, which meant the DAG stages of the read from the right side of the join (the "fat" table) were performed without being assigned to a whole-stage codegen block.
Important to note that the read of the Hive data in either case returned the same results, just with a different optimization applied to the physical plan.
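For anyone hitting the same thing, here is a minimal sketch (in Scala; the PySpark spark.conf API has the same shape, and this assumes a Spark version where this internal setting is readable through the runtime conf) of inspecting and raising the limit so wide tables still qualify for whole-stage codegen:
// Whole-stage codegen is skipped for plans with more fields than this limit
// (the default is 100); the value 200 below is only an illustrative choice.
println(spark.conf.get("spark.sql.codegen.maxFields"))
spark.conf.set("spark.sql.codegen.maxFields", "200")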

Related

What is shufflequerystage in spark DAG?

What is the shufflequerystage box that I see in the Spark DAGs? How is it different from the exchange box in the Spark stages?
There is already a nice answer here, but this is just to give you some more info on what this shufflequerystage actually is by looking at the source code.
What is a Shuffle Query Stage?
If we look at Spark's source code for the ShuffleQueryStageExec case class, we see the following:
case class ShuffleQueryStageExec(
    override val id: Int,
    override val plan: SparkPlan,
    override val _canonicalized: SparkPlan) extends QueryStageExec {
  ...
}
So ShuffleQueryStageExec extends QueryStageExec. Let's have a look at QueryStageExec then. The code comments are enlightening:
A query stage is an independent subgraph of the query plan. Query stage materializes its output
before proceeding with further operators of the query plan. The data statistics of the
materialized output can be used to optimize subsequent query stages.
There are 2 kinds of query stages:
1. Shuffle query stage. This stage materializes its output to shuffle files, and Spark launches another job to execute the further operators.
2. Broadcast query stage. This stage materializes its output to an array in driver JVM. Spark broadcasts the array before executing the further operators.
So in (very) short, a ShuffleQueryStage is a part of your total query plan whose data statistics can be used to optimize subsequent query stages. This is all part of Adaptive Query Execution (AQE).
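As a quick aside (a minimal sketch, assuming Spark 3.x where AQE exists), these query stage nodes only show up when adaptive execution is switched on:
// AQE is off by default before Spark 3.2 and on by default from 3.2 onwards;
// ShuffleQueryStage / BroadcastQueryStage nodes only appear when it is enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")
println(spark.conf.get("spark.sql.adaptive.enabled"))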
How is such a Shuffle Query Stage made?
To get a better feeling of how this all works, we can try to understand how the shuffle query stage is made. The AdaptiveSparkPlanExec case class is the interesting location for this.
There are a bunch of actions (collect, take, tail, execute, ...) that trigger the withFinalPlanUpdate function, which in turn triggers the getFinalPhysicalPlan function. In this function, the createQueryStages function gets called and this is where it gets interesting.
The createQueryStages function is a recursive function that travels through the whole plan tree and it looks a bit like this:
private def createQueryStages(plan: SparkPlan): CreateStageResult = plan match {
  case e: Exchange =>
    // First have a quick check in the `stageCache` without having to traverse down the node.
    context.stageCache.get(e.canonicalized) match {
      case Some(existingStage) if conf.exchangeReuseEnabled =>
        ...
      case _ =>
        val result = createQueryStages(e.child)
        val newPlan = e.withNewChildren(Seq(result.newPlan)).asInstanceOf[Exchange]
        // Create a query stage only when all the child query stages are ready.
        if (result.allChildStagesMaterialized) {
          var newStage = newQueryStage(newPlan)
          ...
        }
So you see, if we bounce onto an Exchange that was already executed and we want to reuse it, we just do that. But if that is not the case, we will create a new plan and call the newQueryStage function.
This is where the story ends. The newQueryStage function looks like this:
private def newQueryStage(e: Exchange): QueryStageExec = {
  val optimizedPlan = optimizeQueryStage(e.child, isFinalStage = false)
  val queryStage = e match {
    case s: ShuffleExchangeLike =>
      ...
      ShuffleQueryStageExec(currentStageId, newShuffle, s.canonicalized)
    case b: BroadcastExchangeLike =>
      ...
      BroadcastQueryStageExec(currentStageId, newBroadcast, b.canonicalized)
  }
  ...
}
So there we see the ShuffleQueryStageExec being made! So for each Exchange that has not been executed yet (or when you're not reusing exchanges), AQE will add either a ShuffleQueryStageExec or a BroadcastQueryStageExec.
Hope this brings more insight to what this is :)
Shufflequerystage nodes are connected to AQE; they are added after each stage with an exchange and are used to materialize results after each stage and optimize the remaining plan based on statistics.
So imo short answer is:
Exchange - here your data are shuffled
Shufflequerystage - added for AQE purposes to use runtime statistics and reoptimize plan
In the example below I am trying to show this mechanism.
Here is sample code:
import org.apache.spark.sql.functions._

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", true)

val input = spark.read
  .format("csv")
  .option("header", "true")
  .load(
    "dbfs:/FileStore/shared_uploads/**#gmail.com/city_temperature.csv"
  )

val dataForInput2 = Seq(
  ("Algeria", "3"),
  ("Germany", "3"),
  ("France", "5"),
  ("Poland", "7"),
  ("test55", "86")
)
val input2 = dataForInput2
  .toDF("Country", "Value")
  .withColumn("test", lit("test"))

val joinedDfs = input.join(input2, Seq("Country"))
val finalResult =
  joinedDfs.filter(input("Country") === "Poland").repartition(200)

finalResult.show
I am reading data from a file, but you can replace it with a small DataFrame created in code, because I added a line to disable broadcast. I added some withColumn and repartition calls to make it more interesting.
First let's take a look at the plan with AQE disabled:
== Physical Plan ==
CollectLimit (11)
+- Exchange (10)
+- * Project (9)
+- * SortMergeJoin Inner (8)
:- Sort (4)
: +- Exchange (3)
: +- * Filter (2)
: +- Scan csv (1)
+- Sort (7)
+- Exchange (6)
+- LocalTableScan (5)
Now with AQE enabled:
== Physical Plan ==
AdaptiveSparkPlan (25)
+- == Final Plan ==
CollectLimit (16)
+- ShuffleQueryStage (15), Statistics(sizeInBytes=1447.8 KiB, rowCount=9.27E+3, isRuntime=true)
+- Exchange (14)
+- * Project (13)
+- * SortMergeJoin Inner (12)
:- Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4), Statistics(sizeInBytes=1158.3 KiB, rowCount=9.27E+3, isRuntime=true)
: +- Exchange (3)
: +- * Filter (2)
: +- Scan csv (1)
+- Sort (11)
+- AQEShuffleRead (10)
+- ShuffleQueryStage (9), Statistics(sizeInBytes=56.0 B, rowCount=1, isRuntime=true)
+- Exchange (8)
+- LocalTableScan (7)
The code is the same, the only difference is AQE, but now you can see that ShuffleQueryStage popped up after each exchange.
Let's take a look at the DAG visualisation, as in your example.
First let's take a look at job 3, which included the join.
Then there is job 4, which just reuses what was computed previously but adds an additional 4th stage with ShuffleQueryStage, similar to your case.

What is SubqueryAlias node in analyzed logical plan?

I have a simple SQL query as follows:
test("SparkSQLTest") {
val spark = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")
df.explain(true)
}
The output for the analyzed logical plan is:
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- Filter (id#0L = cast(10 as bigint))
+- SubqueryAlias t1 ////don't understand here
+- Range (1, 100, step=1, splits=Some(1))
Why does the SubqueryAlias show up in the logical plan? In my SQL, I don't have any alias-related operations.
Could someone help explain? Thanks!
SubqueryAlias is a unary logical operator that gives an alias for the (child) subquery it was created for. The alias can be used in another part of a structured query for a correlated subquery.
SubqueryAlias (and aliases in general) are available until Spark Optimizer has finished query optimization phase (using EliminateSubqueryAliases optimization rule).
Quoting EliminateSubqueryAliases optimization:
Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
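In essence (a paraphrased sketch, not the exact Spark source, and assuming a Spark version where SubqueryAlias is a two-argument case class), the rule simply strips these nodes from the plan:
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}

// Drop every SubqueryAlias node; after analysis the alias carries no
// information the optimizer or the physical planner needs.
def eliminateSubqueryAliases(plan: LogicalPlan): LogicalPlan =
  plan.transformUp { case SubqueryAlias(_, child) => child }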
In your query the subquery is the part before createOrReplaceTempView("t1").
spark.range(1, 100).createOrReplaceTempView("t1")
You could rewrite the above structured query into the following, which changes nothing but gives a more elaborate explanation.
val q = spark.range(1, 100)
q.createOrReplaceTempView("t1")
So, q could be any other structured query and hence the need for an alias to reference any output attribute from the subquery.
When you explain the query after optimization you won't see any SubqueryAlias nodes (and that's not only because the logical query plan gets planned into a physical query plan where physical operators are used; the optimizer has already removed the aliases).
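A quick way to see this for yourself (a small sketch, assuming a running SparkSession called spark): the alias is present in the analyzed plan but gone from the optimized one.
spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")
// The analyzed plan still carries the SubqueryAlias t1 node ...
println(df.queryExecution.analyzed.numberedTreeString)
// ... while the optimized plan does not, because EliminateSubqueryAliases removed it.
println(df.queryExecution.optimizedPlan.numberedTreeString)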

What is Project node in execution query plan?

What is the meaning of a Project node in Spark's execution plan?
I have a plan containing the following:
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, dm_country#population#country#839, population#17 AS dm_country#population#population#844]
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12 AS dm_country#population#country#839, population#17]
+- Project [6a1ad864-235f-4761-9a6d-0ca2a2b40686#22 AS dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12, population#17]
+- RepartitionByExpression [country#12], 1000
+- Union
:- Project [ind#7 AS 6a1ad864-235f-4761-9a6d-0ca2a2b40686#22, country#12, population#17]
: +- Project [ind#7, country#12, population#2 AS population#17]
: +- Project [ind#7, country#1 AS country#12, population#2]
: +- Project [ind#0 AS ind#7, country#1, population#2]
: +- Relation[ind#0,country#1,population#2] JDBCRelation(schema_dbadmin.t_350) [numPartitions=100]
+- LogicalRDD [ind#45, country#46, population#47]
NOTE: Since the plan uses a RepartitionByExpression node, it must be a logical query plan.
A Project node in a logical query plan stands for the Project unary logical operator and is created whenever you use some kind of projection explicitly or implicitly.
Quoting Wikipedia's Projection (relational algebra):
In practical terms, it can be roughly thought of as picking a subset of all available columns.
A Project node can appear in a logical query plan explicitly for the following:
Dataset operators, e.g. joinWith, select, unionByName
KeyValueGroupedDataset operators, e.g. keys, mapValues
SQL's SELECT queries
A Project node can also appear during the analysis and optimization phases.
In Spark SQL, the Dataset API gives the high-level operators, e.g. select, filter or groupBy, that ultimately build a Catalyst logical plan of a structured query. In other words, this simple-looking Dataset.select operator is just there to create a LogicalPlan with a Project node.
val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))
(You could have used query.explain(extended = true) for the above but that would have given you all the 4 plans which may have hidden the point)
You could have a look at the code of Dataset.select operator.
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
This simple-looking select operator is a mere wrapper around Catalyst operators to build a Catalyst tree of logical operators that gives a logical plan.
NOTE What's nice about Spark SQL's Catalyst is that it uses this recursive LogicalPlan abstraction that represents a logical operator or a tree of logical operators.
NOTE The same applies to the good ol' SQL where, after being parsed, the SQL text is transformed into an AST of logical operators. See the example below.
Project nodes can come and go, since projection only controls the columns in the output, and they may or may not appear in your plans and queries.
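To illustrate (a small sketch, assuming a running SparkSession called spark): each select adds its own Project node to the logical plan, and the optimizer collapses them again.
import org.apache.spark.sql.functions.col

val q = spark.range(4).select(col("id").as("a")).select(col("a").as("b"))
// Two stacked Project nodes in the parsed/logical plan ...
println(q.queryExecution.logical.numberedTreeString)
// ... but a single Project over Range after optimization.
println(q.queryExecution.optimizedPlan.numberedTreeString)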
Catalyst DSL
You can use Spark SQL's Catalyst DSL (in org.apache.spark.sql.catalyst.dsl package object) for constructing Catalyst data structures using Scala implicit conversions. That could be very useful if you are into Spark testing.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.catalyst.dsl.plans._ // <-- gives table and select
import org.apache.spark.sql.catalyst.dsl.expressions.star
val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
Good ol' SQL
scala> spark.range(4).createOrReplaceTempView("nums")
scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | nums| true|
+--------+---------+-----------+
scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`
== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
+- Range (0, 4, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 4, step=1, splits=8)

Does spark.sql.autoBroadcastJoinThreshold work for joins using Dataset's join operator?

I'd like to know if the spark.sql.autoBroadcastJoinThreshold property can be useful for broadcasting the smaller table to all worker nodes (while making the join) even when the join is expressed using the Dataset API join instead of Spark SQL.
If my bigger table is 250 GB and the smaller one is 20 GB, do I need to set this config: spark.sql.autoBroadcastJoinThreshold = 21 GB (maybe) in order to send the whole table / Dataset to all worker nodes?
Examples:
Dataset API join
val result = rawBigger.as("b").join(
  broadcast(smaller).as("s"),
  rawBigger(FieldNames.CAMPAIGN_ID) === smaller(FieldNames.CAMPAIGN_ID),
  "left_outer"
)
SQL
select *
from rawBigger_table b, smaller_table s
where b.campaign_id = s.campaign_id;
First of all, spark.sql.autoBroadcastJoinThreshold and the broadcast hint are separate mechanisms. Even if autoBroadcastJoinThreshold is disabled, setting the broadcast hint will take precedence. With default settings:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
String = 10485760
val df1 = spark.range(100)
val df2 = spark.range(100)
Spark will use autoBroadcastJoinThreshold and automatically broadcast data:
df1.join(df2, Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=Some(8))
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 100, step=1, splits=Some(8))
When we disable auto broadcast, Spark will use the standard SortMergeJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
df1.join(df2, Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *SortMergeJoin [id#0L], [id#3L], Inner
:- *Sort [id#0L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0L, 200)
: +- *Range (0, 100, step=1, splits=Some(8))
+- *Sort [id#3L ASC NULLS FIRST], false, 0
+- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200)
but it can be forced to use BroadcastHashJoin with the broadcast hint:
df1.join(broadcast(df2), Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=Some(8))
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 100, step=1, splits=Some(8))
SQL has its own hints format (similar to the one used in Hive):
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql(
"SELECT /*+ MAPJOIN(df2) */ * FROM df1 JOIN df2 ON df1.id = df2.id"
).explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 100, step=1, splits=8)
So to answer your question - autoBroadcastJoinThreshold is applicable when working with Dataset API, but it is not relevant when using explicit broadcast hints.
Furthermore, broadcasting large objects is unlikely to provide any performance boost, and in practice will often degrade performance and result in stability issues. Remember that the broadcasted object has to first be fetched to the driver, then sent to each worker, and finally loaded into memory.
Just to share more details (from the code) in addition to the great answer from user6910411.
Quoting the source code (formatting mine):
spark.sql.autoBroadcastJoinThreshold configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
By setting this value to -1 broadcasting can be disabled.
Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run, and file-based data source tables where the statistics are computed directly on the files of data.
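For example, a sketch of computing those statistics (the database and table names here are made up for illustration):
// NOSCAN computes size-only statistics (no row counts), which is enough
// for the size-based broadcast decision.
spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS NOSCAN")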
spark.sql.autoBroadcastJoinThreshold defaults to 10M (i.e. 10L * 1024 * 1024) and Spark will check what join to use (see JoinSelection execution planning strategy).
There are 6 different join selections and among them is broadcasting (using BroadcastHashJoinExec or BroadcastNestedLoopJoinExec physical operators).
BroadcastHashJoinExec will get chosen when there are joining keys and one of the following holds:
Join is one of CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI and right join side can be broadcast, i.e. the size is less than spark.sql.autoBroadcastJoinThreshold
Join is one of CROSS, INNER and RIGHT OUTER and left join side can be broadcast, i.e. the size is less than spark.sql.autoBroadcastJoinThreshold
BroadcastNestedLoopJoinExec will get chosen when there are no joining keys and one of the above conditions of BroadcastHashJoinExec holds.
In other words, Spark will automatically select the right join, including BroadcastHashJoinExec based on spark.sql.autoBroadcastJoinThreshold property (among other requirements) but also the join type.
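Roughly, the size check involved looks like this (a simplified sketch of Spark's JoinSelection logic, not the exact source):
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.internal.SQLConf

// A join side can be broadcast when its estimated size is known (non-negative)
// and does not exceed spark.sql.autoBroadcastJoinThreshold.
def canBroadcast(plan: LogicalPlan, conf: SQLConf): Boolean =
  plan.stats.sizeInBytes >= 0 &&
    plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold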
There is a limitation that the maximum size of a DataFrame that can be broadcast is 8 GB.

How to enable Catalyst Query Optimiser in Spark SQL?

Whether I use Spark SQL directly or the Spark shell, I have no idea how to check the operation of the Spark Catalyst Query Optimizer in an explicit way.
For example, let us assume that I created a HiveContext as follows:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then, when I try to process a query as:
hiveContext.sql("""
| SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,SUM(ss_ext_sales_price) sum_agg
| FROM date_dim dt, store_sales, item
| WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
| AND store_sales.ss_item_sk = item.i_item_sk
| AND item.i_manufact_id = 128
| AND dt.d_moy=11
| GROUP BY dt.d_year, item.i_brand, item.i_brand_id
| ORDER BY dt.d_year, sum_agg desc, brand_id
| LIMIT 100
""").collect().foreach(println)
Is there a way to check the existence of the Catalyst optimizer?
If it does not exist, how can we enable the Catalyst optimizer for HiveContext?
Catalyst Query Optimizer is always enabled in Spark 2.0. It is a part of the optimizations you get for free when you work with Spark 2.0's Datasets (and one of the many reasons you should really be using Datasets before going low level with RDDs).
If you want to see the optimizations Catalyst Query Optimizer applied to your query, use TRACE logging level for SparkOptimizer in conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=TRACE
With that, whenever you trigger execution of your query (through show, collect or a mere explain), you'll see tons of logs showing the work the Catalyst Query Optimizer is doing for you.
Let's see Column Pruning optimization rule in action.
// the business object
case class Person(id: Long, name: String, city: String)
// the dataset to query over
val dataset = Seq(Person(0, "Jacek", "Warsaw")).toDS
// the query
// Note that we work with names only (out of 3 attributes in Person)
val query = dataset.groupBy(upper('name) as 'name).count
scala> query.explain(extended = true)
...
TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L] Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
!+- LocalRelation [id#125L, name#126, city#127] +- Project [name#126]
! +- LocalRelation [id#125L, name#126, city#127]
...
== Parsed Logical Plan ==
'Aggregate [upper('name) AS name#160], [upper('name) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Analyzed Logical Plan ==
name: string, count: bigint
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Optimized Logical Plan ==
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [name#126]
== Physical Plan ==
*HashAggregate(keys=[upper(name#126)#171], functions=[count(1)], output=[name#160, count#166L])
+- Exchange hashpartitioning(upper(name#126)#171, 200)
+- *HashAggregate(keys=[upper(name#126) AS upper(name#126)#171], functions=[partial_count(1)], output=[upper(name#126)#171, count#173L])
+- LocalTableScan [name#126]
