What is Project node in execution query plan? - apache-spark

What is the meaning of the Project node in Spark's execution plan?
I have a plan containing the following:
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, dm_country#population#country#839, population#17 AS dm_country#population#population#844]
   +- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12 AS dm_country#population#country#839, population#17]
      +- Project [6a1ad864-235f-4761-9a6d-0ca2a2b40686#22 AS dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12, population#17]
         +- RepartitionByExpression [country#12], 1000
            +- Union
               :- Project [ind#7 AS 6a1ad864-235f-4761-9a6d-0ca2a2b40686#22, country#12, population#17]
               :  +- Project [ind#7, country#12, population#2 AS population#17]
               :     +- Project [ind#7, country#1 AS country#12, population#2]
               :        +- Project [ind#0 AS ind#7, country#1, population#2]
               :           +- Relation[ind#0,country#1,population#2] JDBCRelation(schema_dbadmin.t_350) [numPartitions=100]
               +- LogicalRDD [ind#45, country#46, population#47]

NOTE: Since the plan uses a RepartitionByExpression node, it must be a logical query plan.
A Project node in a logical query plan stands for the Project unary logical operator and is created whenever you use some kind of projection, explicitly or implicitly.
Quoting Wikipedia's Projection (relational algebra):
In practical terms, it can be roughly thought of as picking a subset of all available columns.
A Project node can appear in a logical query plan explicitly for the following:
Dataset operators, i.e. joinWith, select, unionByName
KeyValueGroupedDataset operators, i.e. keys, mapValues
SQL's SELECT queries
A Project node can also appear in the analysis and optimization phases.
In Spark SQL, the Dataset API gives you high-level operators, e.g. select, filter or groupBy, that ultimately build a Catalyst logical plan of a structured query. In other words, the simple-looking Dataset.select operator merely creates a LogicalPlan with a Project node.
val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))
(You could have used query.explain(extended = true) for the above, but that would have given you all four plans, which may have hidden the point.)
You could have a look at the code of Dataset.select operator.
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
This simple-looking select operator is a mere wrapper around Catalyst operators to build a Catalyst tree of logical operators that gives a logical plan.
NOTE What's nice about Spark SQL's Catalyst is that it uses this recursive LogicalPlan abstraction that represents a logical operator or a tree of logical operators.
NOTE The same applies to the good ol' SQL where, after being parsed, the SQL text is transformed to an AST of logical operators. See the example below.
Project nodes can come and go: a projection only determines which columns appear in the output, so it may or may not show up in your plans and queries.
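To watch Project nodes come and go, here is a minimal sketch that renames a column twice, much like the chain of renames in the plan from the question, and counts the Project nodes in the analyzed and optimized logical plans. It assumes a throwaway local SparkSession, and the column names ind, a and b and the helper projects are made up for illustration. The expectation (not checked against every Spark version) is that the optimizer collapses adjacent projections, so the optimized plan holds no more Project nodes than the analyzed one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("project-nodes")
  .getOrCreate()

// Every rename is a projection with an alias, so each call adds a Project node.
val renamed = spark.range(4).toDF("ind")
  .withColumnRenamed("ind", "a")
  .withColumnRenamed("a", "b")

// Hypothetical helper: count the Project operators in a logical plan.
def projects(plan: LogicalPlan): Int =
  plan.collect { case p: Project => p }.size

val inAnalyzed  = projects(renamed.queryExecution.analyzed)
val inOptimized = projects(renamed.queryExecution.optimizedPlan)

println(s"Project nodes: analyzed=$inAnalyzed, optimized=$inOptimized")
```

Printing renamed.queryExecution.analyzed and renamed.queryExecution.optimizedPlan side by side shows which projections were merged.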
Catalyst DSL
You can use Spark SQL's Catalyst DSL (in org.apache.spark.sql.catalyst.dsl package object) for constructing Catalyst data structures using Scala implicit conversions. That could be very useful if you are into Spark testing.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.catalyst.dsl.plans._ // <-- gives table and select
import org.apache.spark.sql.catalyst.dsl.expressions.star
val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
Good ol' SQL
scala> spark.range(4).createOrReplaceTempView("nums")
scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |     nums|       true|
+--------+---------+-----------+
scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`
== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
   +- Range (0, 4, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 4, step=1, splits=8)

Related

steps in spark physical plan not assigned to DAG step

I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan Spark has generated (with the Catalyst engine) looks corrupted: some of the steps in the physical plan have not been assigned an order id, and thus evaluation on the right side of the join is not completed in the Spark query.
Here is the example query:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')
joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count() ### shows incorrect value ###
joined_df.explain(True)
Here is an example of the physical plan returned by Spark:
== Physical Plan ==
SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(encntr_id#0, 200)
:     +- *(1) Filter isnotnull(encntr_id#0)
:        +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#a6df563
+- Sort [encntr_id#12 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(encntr_id#12, 200)
      +- Filter isnotnull(encntr_id#12)
         +- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#60dd22d9
Note that the data source scan, filter, exchange and sort on the right side of the join have not been assigned an order id.
Can anyone shed some light on this issue for me? Why would a physical plan that looks correct not be assigned an evaluation order id?
Figured this out internally.
It turns out the Spark optimization routine can be affected by the configuration setting
spark.sql.codegen.maxFields
which can have implications for how Spark will optimize the read from "fat" tables.
In my case the setting was set low, which meant that the DAG stages reading from the right side of the join (the "fat" table) were executed without being assigned to a whole-stage codegen stage.
It is important to note that the read of the Hive data returned the same results in either case, just with a different optimization applied to the physical plan.
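A minimal sketch of how this setting shows up in a plan, assuming a local SparkSession and a made-up four-column DataFrame (wide is a hypothetical name). The *(n) prefix in explain output marks operators that belong to whole-stage codegen stage n; it is not an evaluation order. Lowering spark.sql.codegen.maxFields below the number of fields in an operator's schema should leave that operator outside whole-stage codegen, so it is printed without the prefix. The value 100 used to restore the setting is the default as far as I know.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("codegen-max-fields")
  .getOrCreate()

// Hypothetical "fat" query with four output columns.
// A def, so every call builds and plans a fresh Dataset.
def wide = spark.range(4).selectExpr("id", "id + 1 AS a", "id + 2 AS b", "id + 3 AS c")

// Allow at most 2 fields per operator: the four-column projection
// should be printed without the *(n) whole-stage codegen marker.
spark.conf.set("spark.sql.codegen.maxFields", "2")
wide.explain()

// Back to 100 (assumed default): the projection should be marked with *(n) again.
spark.conf.set("spark.sql.codegen.maxFields", "100")
wide.explain()
```

Comparing the two printed plans shows the same operators in both; only the codegen markers differ, which matches the observation that the results are identical either way.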

What is SubqueryAlias node in analyzed logical plan?

I have a simple SQL query as follows:
test("SparkSQLTest") {
  val spark = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
  spark.range(1, 100).createOrReplaceTempView("t1")
  val df = spark.sql("select id from t1 where t1.id = 10")
  df.explain(true)
}
The output for the analyzed logical plan is:
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- Filter (id#0L = cast(10 as bigint))
   +- SubqueryAlias t1 ////don't understand here
      +- Range (1, 100, step=1, splits=Some(1))
Why does the SubqueryAlias show up in the logical plan? In my SQL, I don't have any alias-related operations.
Could someone help explain? Thanks!
SubqueryAlias is a unary logical operator that gives an alias for the (child) subquery it was created for. The alias can be used in another part of a structured query for a correlated subquery.
SubqueryAlias (and aliases in general) are available until the Spark Optimizer removes them during the query optimization phase (using the EliminateSubqueryAliases optimization rule).
Quoting EliminateSubqueryAliases optimization:
Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
In your query the subquery is the part before createOrReplaceTempView("t1").
spark.range(1, 100).createOrReplaceTempView("t1")
You could rewrite the above structured query into the following, which changes nothing but makes the explanation more explicit.
val q = spark.range(1, 100)
q.createOrReplaceTempView("t1")
So, q could be any other structured query, hence the need for an alias to reference any output attribute from the subquery.
When you explain the query you won't see any SubqueryAlias nodes past the analyzed logical plan (and that's not only because the logical query plan gets planned to a physical query plan where physical operators are used).
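A minimal sketch to check this, assuming a local SparkSession and reusing the query from the question: it looks for SubqueryAlias nodes in the analyzed and in the optimized logical plan. The value names inAnalyzed and inOptimized are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("subquery-alias")
  .getOrCreate()

spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")

// The analyzer wraps the view's plan in SubqueryAlias t1 so that t1.id resolves.
val inAnalyzed =
  df.queryExecution.analyzed.collect { case s: SubqueryAlias => s }.nonEmpty

// The optimizer drops the alias once attribute resolution no longer needs it.
val inOptimized =
  df.queryExecution.optimizedPlan.collect { case s: SubqueryAlias => s }.nonEmpty

println(s"SubqueryAlias in analyzed plan: $inAnalyzed, in optimized plan: $inOptimized")
```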

Spark Plan for collect_list

This is with reference to Jacek's answer to How to get the size of result generated using concat_ws?
The DSL query in the answer calls collect_list twice with concat and size separately.
input.groupBy($"col1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  size(collect_list($"COL2".cast("string"))) as "size"
)
With an optimizedPlan like:
Aggregate [COL1#9L],
[COL1#9L,
concat_ws(,,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#470a4e26),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS concat#13,
size((hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#2602f45),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS size#14]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
   +- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
How will it be different performance-wise, if I use collect_list only once and then withColumn API to generate the other two columns?
input
.groupBy("COL1")
.agg(collect_list($"COL2".cast("string")).as("list") )
.withColumn("concat", concat_ws("," , $"list"))
.withColumn("size", size($"list"))
.drop("list")
Which has an optimizedPlan like:
Project [COL1#9L,
concat_ws(,,list#17) AS concat#18,
size(list#17) AS size#19]
+- Aggregate [COL1#9L],
[COL1#9L,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#5cb88b6b),
cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false) AS list#17]
   +- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
      +- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
I see collect_list being called twice in the former example but just wanted to know if there are any significant differences apart from that. Using Spark 1.6.

Does spark.sql.autoBroadcastJoinThreshold work for joins using Dataset's join operator?

I'd like to know if the spark.sql.autoBroadcastJoinThreshold property can be useful for broadcasting the smaller table to all worker nodes (while making the join) even when the join uses the Dataset API join instead of Spark SQL.
If my bigger table is 250 Gigs and the smaller one is 20 Gigs, do I need to set this config: spark.sql.autoBroadcastJoinThreshold = 21 Gigs (maybe) in order to send the whole table / Dataset to all worker nodes?
Examples:
Dataset API join
val result = rawBigger.as("b").join(
  broadcast(smaller).as("s"),
  rawBigger(FieldNames.CAMPAIGN_ID) === smaller(FieldNames.CAMPAIGN_ID),
  "left_outer"
)
SQL
select *
from rawBigger_table b, smaller_table s
where b.campaign_id = s.campaign_id;
First of all, spark.sql.autoBroadcastJoinThreshold and the broadcast hint are separate mechanisms. Even if autoBroadcastJoinThreshold is disabled, setting a broadcast hint will take precedence. With default settings:
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
String = 10485760
val df1 = spark.range(100)
val df2 = spark.range(100)
Spark will use autoBroadcastJoinThreshold and automatically broadcast data:
df1.join(df2, Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
   :- *Range (0, 100, step=1, splits=Some(8))
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range (0, 100, step=1, splits=Some(8))
When we disable auto broadcast Spark will use standard SortMergeJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
df1.join(df2, Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *SortMergeJoin [id#0L], [id#3L], Inner
   :- *Sort [id#0L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#0L, 200)
   :     +- *Range (0, 100, step=1, splits=Some(8))
   +- *Sort [id#3L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200)
but it can be forced to use BroadcastHashJoin with a broadcast hint:
df1.join(broadcast(df2), Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
   :- *Range (0, 100, step=1, splits=Some(8))
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range (0, 100, step=1, splits=Some(8))
SQL has its own hints format (similar to the one used in Hive):
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql(
"SELECT /*+ MAPJOIN(df2) */ * FROM df1 JOIN df2 ON df1.id = df2.id"
).explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 100, step=1, splits=8)
So to answer your question - autoBroadcastJoinThreshold is applicable when working with Dataset API, but it is not relevant when using explicit broadcast hints.
Furthermore, broadcasting large objects is unlikely to provide any performance boost, and in practice will often degrade performance and result in stability issues. Remember that a broadcast object first has to be fetched to the driver, then sent to each worker, and finally loaded into memory.
Just to share more details (from the code) in addition to the great answer from @user6910411.
Quoting the source code (formatting mine):
spark.sql.autoBroadcastJoinThreshold configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
By setting this value to -1 broadcasting can be disabled.
Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run, and file-based data source tables where the statistics are computed directly on the files of data.
spark.sql.autoBroadcastJoinThreshold defaults to 10M (i.e. 10L * 1024 * 1024) and Spark will check what join to use (see JoinSelection execution planning strategy).
There are 6 different join selections and among them is broadcasting (using BroadcastHashJoinExec or BroadcastNestedLoopJoinExec physical operators).
BroadcastHashJoinExec will get chosen when there are joining keys and one of the following holds:
Join is one of CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI and right join side can be broadcast, i.e. the size is less than spark.sql.autoBroadcastJoinThreshold
Join is one of CROSS, INNER and RIGHT OUTER and left join side can be broadcast, i.e. the size is less than spark.sql.autoBroadcastJoinThreshold
BroadcastNestedLoopJoinExec will get chosen when there are no joining keys and one of the above conditions of BroadcastHashJoinExec holds.
In other words, Spark will automatically select the right join, including BroadcastHashJoinExec, based on the spark.sql.autoBroadcastJoinThreshold property (among other requirements) but also on the join type.
There is a limitation: the maximum size of a DataFrame that can be broadcast is 8 GB.
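The sizes mentioned in this thread can be put side by side with a little arithmetic. This is only a sketch: it treats the "Gigs" from the question as GiB (an assumption), takes the 8 GB limit from the remark above at face value, and all value names are made up for illustration.

```scala
// All sizes in bytes.
val MiB = 1024L * 1024
val GiB = 1024L * MiB

val defaultThreshold = 10L * MiB // the default of spark.sql.autoBroadcastJoinThreshold quoted above
val smallerTable     = 20L * GiB // the smaller join side from the question
val broadcastLimit   = 8L * GiB  // the upper bound mentioned above

// Would the smaller table be broadcast automatically with default settings?
val fitsDefaultThreshold = smallerTable <= defaultThreshold

// Could it be broadcast at all, given the limit?
val fitsBroadcastLimit = smallerTable <= broadcastLimit

println(s"default threshold = $defaultThreshold bytes")
println(s"fits default threshold: $fitsDefaultThreshold, fits broadcast limit: $fitsBroadcastLimit")
```

Under these assumptions the 20 Gigs table is above both numbers, so raising the threshold to 21 Gigs would not make it a workable broadcast candidate.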

How to enable Catalyst Query Optimiser in Spark SQL?

Whether I use Spark-SQL directly or Spark-Shell, I have no idea how to check the operation of the Spark Catalyst Query Optimizer in an explicit way.
For example, let's assume that I created a HiveContext as follows:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then, when I try to process a query as:
hiveContext.sql("""
| SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,SUM(ss_ext_sales_price) sum_agg
| FROM date_dim dt, store_sales, item
| WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
| AND store_sales.ss_item_sk = item.i_item_sk
| AND item.i_manufact_id = 128
| AND dt.d_moy=11
| GROUP BY dt.d_year, item.i_brand, item.i_brand_id
| ORDER BY dt.d_year, sum_agg desc, brand_id
| LIMIT 100
""").collect().foreach(println)
Is there a way to check whether the Catalyst optimizer is present?
If it is not, then how can we enable the Catalyst optimizer for HiveContext?
Catalyst Query Optimizer is always enabled in Spark 2.0. It is a part of the optimizations you get for free when you work with Spark 2.0's Datasets (and one of the many reasons you should really be using Datasets before going low level with RDDs).
If you want to see the optimizations Catalyst Query Optimizer applied to your query, use TRACE logging level for SparkOptimizer in conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=TRACE
With that whenever you trigger execution of your query (through show, collect or a mere explain) you'll see tons of logs with the work Catalyst Query Optimizer is doing for you every time you execute a query.
Let's see Column Pruning optimization rule in action.
// the business object
case class Person(id: Long, name: String, city: String)
// the dataset to query over
val dataset = Seq(Person(0, "Jacek", "Warsaw")).toDS
// the query
// Note that we work with names only (out of 3 attributes in Person)
val query = dataset.groupBy(upper('name) as 'name).count
scala> query.explain(extended = true)
...
TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
 Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]   Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
!+- LocalRelation [id#125L, name#126, city#127]                                       +- Project [name#126]
!                                                                                        +- LocalRelation [id#125L, name#126, city#127]
...
== Parsed Logical Plan ==
'Aggregate [upper('name) AS name#160], [upper('name) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Analyzed Logical Plan ==
name: string, count: bigint
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Optimized Logical Plan ==
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [name#126]
== Physical Plan ==
*HashAggregate(keys=[upper(name#126)#171], functions=[count(1)], output=[name#160, count#166L])
+- Exchange hashpartitioning(upper(name#126)#171, 200)
   +- *HashAggregate(keys=[upper(name#126) AS upper(name#126)#171], functions=[partial_count(1)], output=[upper(name#126)#171, count#173L])
      +- LocalTableScan [name#126]
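The same effect can be checked without TRACE logging. The following is a sketch, assuming a local SparkSession, that rebuilds the query above and compares the columns read at the leaves of the analyzed and the optimized logical plans. The value names before and after are made up for illustration, and the exact shape of the optimized plan may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

// the business object
case class Person(id: Long, name: String, city: String)

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("column-pruning")
  .getOrCreate()
import spark.implicits._

val dataset = Seq(Person(0, "Jacek", "Warsaw")).toDS
val query = dataset.groupBy(upper($"name") as "name").count

// Column names produced by the leaf operators before and after optimization.
val before = query.queryExecution.analyzed.collectLeaves().flatMap(_.output.map(_.name))
val after  = query.queryExecution.optimizedPlan.collectLeaves().flatMap(_.output.map(_.name))

println(s"leaf columns before optimization: ${before.mkString(", ")}")
println(s"leaf columns after optimization:  ${after.mkString(", ")}")
```

The analyzed plan still carries all three Person attributes at its leaf, while the optimized plan should carry fewer, since the query only needs name.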

Resources