What is SubqueryAlias node in analyzed logical plan? - apache-spark

I have a simple SQL query as follows:
test("SparkSQLTest") {
val spark = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")
df.explain(true)
}
The output for the analyzed logical plan is:
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- Filter (id#0L = cast(10 as bigint))
   +- SubqueryAlias t1 ////don't understand here
      +- Range (1, 100, step=1, splits=Some(1))
Why does the SubqueryAlias show up in the logical plan? My SQL does not contain any alias-related operations.
Could someone help explain? Thanks!

SubqueryAlias is a unary logical operator that gives an alias to the (child) subquery it was created for. The alias can be used in another part of a structured query, e.g. for a correlated subquery.
SubqueryAlias nodes (and aliases in general) stay in the logical plan until the Spark Optimizer removes them with the EliminateSubqueryAliases optimization rule.
Quoting EliminateSubqueryAliases optimization:
Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
In your query the subquery is the part before createOrReplaceTempView("t1").
spark.range(1, 100).createOrReplaceTempView("t1")
You could rewrite the above structured query into the following, which changes nothing but makes the explanation easier to follow.
val q = spark.range(1, 100)
q.createOrReplaceTempView("t1")
So, q could be any other structured query and hence the need for an alias to reference any output attribute from the subquery.
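When you explain the query, you won't see any SubqueryAlias nodes in the optimized logical plan or the physical plan. That is because EliminateSubqueryAliases has already removed them, not only because the logical query plan gets planned into a physical query plan where physical operators are used.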
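The removal can also be checked programmatically instead of reading the explain output. The following is a minimal sketch, assuming a local SparkSession (Spark 2.x or later); the app name and the variable names `inAnalyzed` and `inOptimized` are illustrative and not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias

// Assumption: a throwaway local session, as in the question's test.
val spark = SparkSession.builder().master("local[1]").appName("SubqueryAliasDemo").getOrCreate()

spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")

// Collect every SubqueryAlias node found in each logical plan.
val inAnalyzed  = df.queryExecution.analyzed.collect { case s: SubqueryAlias => s }
val inOptimized = df.queryExecution.optimizedPlan.collect { case s: SubqueryAlias => s }

println(s"analyzed plan:  ${inAnalyzed.size} SubqueryAlias node(s)")
println(s"optimized plan: ${inOptimized.size} SubqueryAlias node(s)")
```

The analyzed plan should still carry the alias created for the temporary view `t1`, while the optimized plan should contain none.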

Related

steps in spark physical plan not assigned to DAG step

I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan Spark has generated (with the Catalyst engine) looks corrupted: some of the steps in the physical plan have not been assigned an order id, so the evaluation of the right side of the join does not seem to complete.
Here is the example query:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')
joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count() ### shows incorrect value ###
joined_df.explain(True)
Here is an example of the physical plan returned by Spark:
== Physical Plan ==
SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(encntr_id#0, 200)
:     +- *(1) Filter isnotnull(encntr_id#0)
:        +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#a6df563
+- Sort [encntr_id#12 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(encntr_id#12, 200)
      +- Filter isnotnull(encntr_id#12)
         +- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#60dd22d9
Note that the data source scan, filter, exchange and sort on the right side of the join have not been assigned an order id.
Can anyone shed some light on this issue for me? Why would the physical plan, which looks correct, not be assigned an evaluation order id?
Figured this out internally.
It turns out that Spark's optimization can be affected by the configuration setting
spark.sql.codegen.maxFields
which has implications for how Spark optimizes reads from "fat" tables.
In my case the setting was set low, which meant that the stages reading the right side of the join (the "fat" table) were executed without whole-stage code generation. The *(n) prefix in the physical plan marks operators that belong to whole-stage codegen stage n; it is not an evaluation order id.
It is important to note that the read of the Hive data returned the same results in either case, just with a different optimization applied to the physical plan.
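The effect of that setting can be reproduced on a small query. The sketch below is written in Scala rather than the PySpark of the question, and everything in it is an illustrative assumption: the three-column query, the artificially low limit of 2, and the helper names `projectIsCodegened` and `threeColumns`. It only checks whether the Project operator ends up inside a whole-stage codegen stage.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.{ProjectExec, SparkPlan, WholeStageCodegenExec}

val spark = SparkSession.builder().master("local[1]").appName("CodegenMaxFieldsDemo").getOrCreate()

// True when a Project operator sits directly under a whole-stage codegen stage.
def projectIsCodegened(plan: SparkPlan): Boolean =
  plan.collect { case w: WholeStageCodegenExec => w.child }.exists(_.isInstanceOf[ProjectExec])

// Build a fresh DataFrame per call so it is planned under the current setting.
def threeColumns(): DataFrame =
  spark.range(10).selectExpr("id AS a", "id + 1 AS b", "id + 2 AS c")

val originalLimit = spark.conf.get("spark.sql.codegen.maxFields")
val withDefault = projectIsCodegened(threeColumns().queryExecution.executedPlan)

// Hypothetical, artificially low limit: three output fields now exceed it.
spark.conf.set("spark.sql.codegen.maxFields", "2")
val lowLimitPlan = threeColumns().queryExecution.executedPlan
val withLowLimit = projectIsCodegened(lowLimitPlan)

println(s"Project in a codegen stage with the original limit: $withDefault")
println(s"Project in a codegen stage with maxFields=2:        $withLowLimit")
println(lowLimitPlan.treeString)

// Restore the original value so later queries are unaffected.
spark.conf.set("spark.sql.codegen.maxFields", originalLimit)
```

In the printed tree, operators without a `*(n)` prefix are the ones that were left out of whole-stage code generation, which matches what the right side of the join looks like in the plan above.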

What is Project node in execution query plan?

What is the meaning of the Project node in Spark's execution plan?
I have a plan containing the following:
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, dm_country#population#country#839, population#17 AS dm_country#population#population#844]
   +- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12 AS dm_country#population#country#839, population#17]
      +- Project [6a1ad864-235f-4761-9a6d-0ca2a2b40686#22 AS dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12, population#17]
         +- RepartitionByExpression [country#12], 1000
            +- Union
               :- Project [ind#7 AS 6a1ad864-235f-4761-9a6d-0ca2a2b40686#22, country#12, population#17]
               :  +- Project [ind#7, country#12, population#2 AS population#17]
               :     +- Project [ind#7, country#1 AS country#12, population#2]
               :        +- Project [ind#0 AS ind#7, country#1, population#2]
               :           +- Relation[ind#0,country#1,population#2] JDBCRelation(schema_dbadmin.t_350) [numPartitions=100]
               +- LogicalRDD [ind#45, country#46, population#47]
NOTE: Since the plan uses RepartitionByExpression node it must be a logical query plan.
Project node in a logical query plan stands for Project unary logical operator and is created whenever you use some kind of projection explicitly or implicitly.
Quoting Wikipedia's Projection (relational algebra):
In practical terms, it can be roughly thought of as picking a subset of all available columns.
A Project node can appear in a logical query plan explicitly for the following:
Dataset operators, i.e. joinWith, select, unionByName
KeyValueGroupedDataset operators, i.e. keys, mapValues
SQL's SELECT queries
A Project node can also appear implicitly, during the analysis and optimization phases.
In Spark SQL, the Dataset API gives the high-level operators, e.g. select, filter or groupBy, that ultimately build a Catalyst logical plan of a structured query. In other words, this simple-looking Dataset.select operator is just to create a LogicalPlan with Project node.
val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))
(You could have used query.explain(extended = true) for the above but that would have given you all the 4 plans which may have hidden the point)
You could have a look at the code of Dataset.select operator.
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
This simple-looking select operator is a mere wrapper around Catalyst operators to build a Catalyst tree of logical operators that gives a logical plan.
NOTE: What's nice about Spark SQL's Catalyst is that it uses this recursive LogicalPlan abstraction that represents a logical operator or a tree of logical operators.
NOTE: The same applies to the good ol' SQL where, after being parsed, the SQL text is transformed into an AST of logical operators. See the example below.
Project nodes can come and go: a projection only determines which columns appear in the output, so the optimizer may add or remove Project nodes, and they may or may not appear in your plans and queries.
Catalyst DSL
You can use Spark SQL's Catalyst DSL (in org.apache.spark.sql.catalyst.dsl package object) for constructing Catalyst data structures using Scala implicit conversions. That could be very useful if you are into Spark testing.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.catalyst.dsl.plans._ // <-- gives table and select
import org.apache.spark.sql.catalyst.dsl.expressions.star
val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
Good ol' SQL
scala> spark.range(4).createOrReplaceTempView("nums")
scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |     nums|       true|
+--------+---------+-----------+
scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`
== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
   +- Range (0, 4, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 4, step=1, splits=8)

Union in Spark SQL query removing duplicates from Dataset

I am using the Java API for Apache Spark, and I have two Datasets, A and B.
The schema for both is the same: PhoneNumber, Name, Age, Address
There is one record in both Datasets that has the same PhoneNumber, but the other columns in this record are different.
I run the following SQL query on these two Datasets (by registering them as temporary tables):
A.createOrReplaceTempView("A");
B.createOrReplaceTempView("B");
String query = "Select * from A UNION Select * from B";
Dataset<Row> result = sparkSession.sql(query);
result.show();
Surprisingly, the result has only one record with the same PhoneNumber, and the other is removed.
I know UNION in SQL is intended to remove duplicates, but then it also needs to know the primary key on the basis of which it decides what is a duplicate.
How does this query infer the "Primary key" of my Dataset? (There is no concept of a primary key in Spark.)
You can use either UNION ALL:
Seq((1L, "foo")).toDF.createOrReplaceTempView("a")
Seq((1L, "bar"), (1L, "foo")).toDF.createOrReplaceTempView("b")
spark.sql("SELECT * FROM a UNION ALL SELECT * FROM b").explain
== Physical Plan ==
Union
:- LocalTableScan [_1#152L, _2#153]
+- LocalTableScan [_1#170L, _2#171]
or Dataset.union method:
spark.table("a").union(spark.table("b")).explain
== Physical Plan ==
Union
:- LocalTableScan [_1#152L, _2#153]
+- LocalTableScan [_1#170L, _2#171]
How does this query infer the "Primary key" of my Dataset?
It doesn't, or at least not in the current version. It just applies HashAggregate using all available columns, so only rows that are identical in every column are treated as duplicates:
spark.sql("SELECT * FROM a UNION SELECT * FROM b").explain
== Physical Plan ==
*HashAggregate(keys=[_1#152L, _2#153], functions=[])
+- Exchange hashpartitioning(_1#152L, _2#153, 200)
   +- *HashAggregate(keys=[_1#152L, _2#153], functions=[])
      +- Union
         :- LocalTableScan [_1#152L, _2#153]
         +- LocalTableScan [_1#170L, _2#171]
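The row counts make the difference concrete. Here is a small sketch with made-up data (the phone number, names and ages are invented for illustration, and the Address column is left out for brevity), written in Scala rather than the Java of the question. It assumes a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("UnionDemo").getOrCreate()
import spark.implicits._

// Both datasets share a phone number; only the "Ann" row is identical in every column.
val a = Seq(("555-0100", "Ann", 30)).toDF("PhoneNumber", "Name", "Age")
val b = Seq(("555-0100", "Ann", 30), ("555-0100", "Bob", 41)).toDF("PhoneNumber", "Name", "Age")
a.createOrReplaceTempView("A")
b.createOrReplaceTempView("B")

// UNION ALL keeps every row.
val unionAllCount = spark.sql("SELECT * FROM A UNION ALL SELECT * FROM B").count()
// UNION removes rows that match on all columns, not on PhoneNumber alone.
val unionCount = spark.sql("SELECT * FROM A UNION SELECT * FROM B").count()
// Dataset equivalent of SQL UNION: union followed by distinct.
val datasetCount = a.union(b).distinct().count()
// De-duplicating on a chosen "key" column has to be requested explicitly.
val byKeyCount = a.union(b).dropDuplicates("PhoneNumber").count()

println(s"UNION ALL: $unionAllCount, UNION: $unionCount, union+distinct: $datasetCount, by key: $byKeyCount")
```

The "Bob" row survives the UNION even though it shares a PhoneNumber with the "Ann" row, because the two rows differ in other columns.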

How to enable Catalyst Query Optimiser in Spark SQL?

Whether I use Spark SQL directly or the Spark shell, I have no idea how to check explicitly whether the Spark Catalyst Query Optimizer is operating.
For example, let's assume that I created a HiveContext as follows:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then, when I try to process a query as:
hiveContext.sql("""
| SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,SUM(ss_ext_sales_price) sum_agg
| FROM date_dim dt, store_sales, item
| WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
| AND store_sales.ss_item_sk = item.i_item_sk
| AND item.i_manufact_id = 128
| AND dt.d_moy=11
| GROUP BY dt.d_year, item.i_brand, item.i_brand_id
| ORDER BY dt.d_year, sum_agg desc, brand_id
| LIMIT 100
""").collect().foreach(println)
Is there a way to check whether the Catalyst optimizer is in use?
If it is not, how can we enable the Catalyst optimizer for HiveContext?
Catalyst Query Optimizer is always enabled in Spark 2.0. It is a part of the optimizations you get for free when you work with Spark 2.0's Datasets (and one of the many reasons you should really be using Datasets before going low level with RDDs).
If you want to see the optimizations Catalyst Query Optimizer applied to your query, use TRACE logging level for SparkOptimizer in conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=TRACE
With that whenever you trigger execution of your query (through show, collect or a mere explain) you'll see tons of logs with the work Catalyst Query Optimizer is doing for you every time you execute a query.
Let's see Column Pruning optimization rule in action.
// the business object
case class Person(id: Long, name: String, city: String)
// the dataset to query over
val dataset = Seq(Person(0, "Jacek", "Warsaw")).toDS
// the query
// Note that we work with names only (out of 3 attributes in Person)
val query = dataset.groupBy(upper('name) as 'name).count
scala> query.explain(extended = true)
...
TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
 Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]   Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
!+- LocalRelation [id#125L, name#126, city#127]                                       +- Project [name#126]
!                                                                                        +- LocalRelation [id#125L, name#126, city#127]
...
== Parsed Logical Plan ==
'Aggregate [upper('name) AS name#160], [upper('name) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Analyzed Logical Plan ==
name: string, count: bigint
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Optimized Logical Plan ==
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [name#126]
== Physical Plan ==
*HashAggregate(keys=[upper(name#126)#171], functions=[count(1)], output=[name#160, count#166L])
+- Exchange hashpartitioning(upper(name#126)#171, 200)
   +- *HashAggregate(keys=[upper(name#126) AS upper(name#126)#171], functions=[partial_count(1)], output=[upper(name#126)#171, count#173L])
      +- LocalTableScan [name#126]
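If TRACE logging is too noisy, the effect of column pruning can also be seen by comparing the analyzed and optimized plans directly. The following is a sketch under a few assumptions: a local SparkSession, a tuple-based DataFrame standing in for the Person dataset above, and illustrative variable names (`columnsBefore`, `columnsAfter`). It inspects the output of the operator that feeds the Aggregate.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

val spark = SparkSession.builder().master("local[1]").appName("ColumnPruningDemo").getOrCreate()
import spark.implicits._

// Stand-in for the Person dataset: three columns, of which the query uses only one.
val people = Seq((0L, "Jacek", "Warsaw")).toDF("id", "name", "city")
val query = people.groupBy(upper($"name") as "name").count()

// The Aggregate is the root of both plans; look at what its child produces.
val columnsBefore = query.queryExecution.analyzed.children.head.output.map(_.name)
val columnsAfter  = query.queryExecution.optimizedPlan.children.head.output.map(_.name)

println(s"input to Aggregate before optimization: ${columnsBefore.mkString(", ")}")
println(s"input to Aggregate after optimization:  ${columnsAfter.mkString(", ")}")
```

Before optimization the Aggregate is fed all three columns; afterwards only `name` remains, which is the same change the ColumnPruning trace above reports.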

Why is Spark SQL in Spark 1.6.1 not using broadcast join in CTAS?

I have a query in Spark SQL which is using broadcast join as expected as my table b is smaller than spark.sql.autoBroadcastJoinThreshold.
However, if I put the exact same select query into a CTAS query then it's NOT using broadcast join for some reason.
The select query looks like this:
select id,name from a join b on a.name = b.bname;
And the explain for this looks like this:
== Physical Plan ==
Project [id#1,name#2]
+- BroadcastHashJoin [name#2], [bname#3], BuildRight
   :- Scan ParquetRelation: default.a[id#1,name#2] InputPaths: ...
   +- ConvertToUnsafe
      +- HiveTableScan [bname#3], MetastoreRelation default, b, Some(b)
Then my CTAS looks like this:
create table c as select id,name from a join b on a.name = b.bname;
And the explain for this one returns:
== Physical Plan ==
ExecutedCommand CreateTableAsSelect [Database:default}, TableName: c, InsertIntoHiveTable]
+- Project [id#1,name#2]
   +- Join Inner, Some((name#2 = bname#3))
      :- Relation[id#1,name#2] ParquetRelation: default.a
      +- MetastoreRelation default, b, Some(b)
Is it expected to NOT use broadcast join for the select query that's part of a CTAS query? If not, is there a way to force CTAS to use broadcast join?
If your question is about the reason why Spark creates two different physical plans then this answer won't be helpful. I have observed plenty of sensitivity in Spark's optimizer where the same SQL snippets result in meaningfully different physical plans even if it is not obvious why that is the case.
However, if your question is ultimately about how to execute the CTAS with a broadcast join then here is a simple workaround I have used many times: register the query with the plan you like as a temporary table (or view if you are using the SQL console) and then use SELECT * from tmp_tbl as the query to feed the CTAS.
In other words, something like:
sql("select id, name from a join b on a.name = b.bname").registerTempTable("tmp_joined")
sql("create table c as select * from tmp_joined")
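Another option, not taken from the answer above, is to request the broadcast explicitly with the `broadcast` function from `org.apache.spark.sql.functions` and write the result through the DataFrame API. The sketch below assumes a newer Spark with SparkSession (2.x or later) rather than 1.6.1, and uses tiny in-memory stand-ins for tables a and b. Automatic broadcasting is switched off so that only the explicit hint can produce a broadcast join.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[1]").appName("BroadcastHintDemo").getOrCreate()
import spark.implicits._

// Disable size-based automatic broadcasting so only the hint can trigger it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical stand-ins for tables a and b from the question.
val a = Seq((1, "x"), (2, "y")).toDF("id", "name")
val b = Seq("x", "z").toDF("bname")

val plain  = a.join(b, $"name" === $"bname").select("id", "name")
val hinted = a.join(broadcast(b), $"name" === $"bname").select("id", "name")

// Compare the physical plans as text.
val plainPlan  = plain.queryExecution.executedPlan.toString
val hintedPlan = hinted.queryExecution.executedPlan.toString
println(hintedPlan)

// hinted.write.saveAsTable("c") would then take the place of the CTAS statement.
```

Only the hinted query should show a BroadcastHashJoin operator. Whether the hint survives inside a SQL CTAS statement on 1.6.1 is not something this sketch verifies.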
