Spark Plan for collect_list - apache-spark

This is with reference to Jacek's answer to "How to get the size of result generated using concat_ws?".
The DSL query in the answer calls collect_list twice, once inside concat_ws and once inside size.
input.groupBy($"col1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  size(collect_list($"COL2".cast("string"))) as "size"
)
This gives an optimizedPlan like:
Aggregate [COL1#9L],
[COL1#9L,
concat_ws(,,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#470a4e26),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS concat#13,
size((hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#2602f45),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS size#14]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
How will it be different performance-wise if I use collect_list only once and then the withColumn API to generate the other two columns?
input
  .groupBy("COL1")
  .agg(collect_list($"COL2".cast("string")).as("list"))
  .withColumn("concat", concat_ws(",", $"list"))
  .withColumn("size", size($"list"))
  .drop("list")
This has an optimizedPlan like:
Project [COL1#9L,
concat_ws(,,list#17) AS concat#18,
size(list#17) AS size#19]
+- Aggregate [COL1#9L],
[COL1#9L,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#5cb88b6b),
cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false) AS list#17]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
I see collect_list being called twice in the former example, but I just wanted to know whether there are any other significant differences apart from that. I am using Spark 1.6.
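For reference, this is roughly how the two optimizedPlans above can be printed side by side without triggering execution (a sketch, assuming a Spark 1.6 shell; the range-based input DataFrame is an assumption reconstructed from the Project [(id % 2) AS COL1, id AS COL2] node in the plans):
import sqlContext.implicits._
import org.apache.spark.sql.functions.{collect_list, concat_ws, size}

val input = sqlContext.range(10).select($"id" % 2 as "COL1", $"id" as "COL2")

val twoCalls = input.groupBy($"COL1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  size(collect_list($"COL2".cast("string"))) as "size")

val oneCall = input
  .groupBy("COL1")
  .agg(collect_list($"COL2".cast("string")).as("list"))
  .withColumn("concat", concat_ws(",", $"list"))
  .withColumn("size", size($"list"))
  .drop("list")

// compare the optimized plans shown above
println(twoCalls.queryExecution.optimizedPlan)
println(oneCall.queryExecution.optimizedPlan)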

Related

Applying Pandas UDF without shuffling

I am trying to apply a pandas UDF to each partition of a Spark (3.3.0) DataFrame separately so as to avoid any shuffling requirements. However, when I run the query below, a lot of data is getting shuffled around. The execution plan contains a SORT stage; this might be the culprit.
import pandas as pd
from pyspark.sql.functions import spark_partition_id

query = df.groupBy(spark_partition_id())\
    .applyInPandas(lambda x: pd.DataFrame([x.shape]), "n_rows long, n_cols long")
query.explain()
Output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapGroupsInPandas [SPARK_PARTITION_ID()#1562], <lambda>(id#0L, date#1L, feature#2, partition_id#926)#1561, [nr#1563L, nc#1564L]
+- Sort [SPARK_PARTITION_ID()#1562 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(SPARK_PARTITION_ID()#1562, 200), ENSURE_REQUIREMENTS, [id=#748]
+- Project [SPARK_PARTITION_ID() AS SPARK_PARTITION_ID()#1562, id#0L, date#1L, feature#2, partition_id#926]
+- Scan ExistingRDD[id#0L,date#1L,feature#2,partition_id#926]
In contrast, if I request the execution plan for a very similar query below, the SORT stage is not there and I detect no shuffling upon execution.
df.groupBy(spark_partition_id()).count().explain()
Output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[_nondeterministic#1532], functions=[count(1)])
+- Exchange hashpartitioning(_nondeterministic#1532, 200), ENSURE_REQUIREMENTS, [id=#704]
+- HashAggregate(keys=[_nondeterministic#1532], functions=[partial_count(1)])
+- Project [SPARK_PARTITION_ID() AS _nondeterministic#1532]
+- Scan ExistingRDD[id#0L,date#1L,feature#2,partition_id#926]
What is happening here, and how do I achieve the goal I stated? Thank you!
After some tinkering, it seems I am able to do what I want as follows, although it is probably not ideal.
import pandas as pd

spark_session.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "0")

def get_shape(iterator):
    for pdf in iterator:
        yield pd.DataFrame([pdf.shape])

df.mapInPandas(get_shape, "nr long, nc long").toPandas()
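As a quick sanity check (a sketch, assuming the same df and get_shape as above): the plan for the mapInPandas query should contain no Exchange or Sort, since mapInPandas places no distribution requirement on its input and simply runs the function over each existing partition.
# expect MapInPandas sitting directly on top of the scan, with no Exchange or Sort
df.mapInPandas(get_shape, "nr long, nc long").explain()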

Steps in Spark physical plan not assigned to DAG step

I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance, the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan Spark has generated (with the Catalyst engine) looks to be corrupted: some of the steps in the physical plan have not been assigned an order id, and thus all evaluation on the right side of the join is not completed in the Spark query.
Here is the example query:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')
joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count() ### shows incorrect value ###
joined_df.explain(True)
Here is an example of the physical plan returned by Spark:
== Physical Plan ==
SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(encntr_id#0, 200)
: +- *(1) Filter isnotnull(encntr_id#0)
: +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#a6df563
+- Sort [encntr_id#12 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(encntr_id#12, 200)
+- Filter isnotnull(encntr_id#12)
+- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#60dd22d9
Note that the data source scan, filter, exchange, and sort on the right side of the join have not been assigned an order id.
Can anyone shed some light on this issue for me? Why would the physical plan, which looks correct, not be assigned an evaluation order id?
Figured this out internally.
It turns out the Spark optimization routine can be affected by the configuration setting
spark.sql.codegen.maxFields
which has implications for how Spark will optimize the read from "fat" (wide) tables.
In my case the setting was set low, which means the DAG stages of the read from the right side of the join (the "fat" table) were performed without being assigned to a whole-stage codegen stage.
It is important to note that the read of the Hive data returned the same results in either case, just with a different optimization applied to the physical plan.
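For reference, a minimal sketch (assuming the spark session and filter_1/filter_2 DataFrames from the question; the value 200 is purely illustrative) of how the setting can be inspected and raised before re-planning the join:
# whole-stage codegen is skipped for plans wider than this limit (default 100)
print(spark.conf.get("spark.sql.codegen.maxFields"))

# raise the limit so the wide ("fat") side of the join can be included, then
# rebuild the query so planning picks up the new value
spark.conf.set("spark.sql.codegen.maxFields", "200")
joined_df = filter_1.alias('o').join(filter_2.alias('po'),
                                     filter_1.encntr_id == filter_2.encntr_id,
                                     how='inner')
joined_df.explain(True)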

Spark 1.6 DataFrame optimize join partitioning

I have a question about Spark DataFrame partitioning. I'm currently using Spark 1.6 due to project requirements. This is my code excerpt:
sqlContext.getConf("spark.sql.shuffle.partitions") // 6
val df = sc.parallelize(List(("A",1),("A",4),("A",2),("B",5),("C",2),("D",2),("E",2),("B",7),("C",9),("D",1))).toDF("id_1","val_1")
df.rdd.getNumPartitions // 4
val df2 = sc.parallelize(List(("B",1),("E",4),("H",2),("J",5),("C",2),("D",2),("F",2))).toDF("id_2","val_2")
df2.rdd.getNumPartitions // 4
val df3 = df.join(df2,$"id_1" === $"id_2")
df3.rdd.getNumPartitions // 6
val df4 = df3.repartition(3,$"id_1")
df4.rdd.getNumPartitions // 3
df4.explain(true)
The following is the explain plan that was created:
== Parsed Logical Plan ==
'RepartitionByExpression ['id_1], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Analyzed Logical Plan ==
id_1: string, val_1: int, id_2: string, val_2: int
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Optimized Logical Plan ==
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Physical Plan ==
TungstenExchange hashpartitioning(id_1#42,3), None
+- SortMergeJoin [id_1#42], [id_2#46]
:- Sort [id_1#42 ASC], false, 0
: +- TungstenExchange hashpartitioning(id_1#42,6), None
: +- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- Scan ExistingRDD[_1#40,_2#41]
+- Sort [id_2#46 ASC], false, 0
+- TungstenExchange hashpartitioning(id_2#46,6), None
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- Scan ExistingRDD[_1#44,_2#45]
As far as I know, a DataFrame represents an abstraction interface over an RDD, so partitioning should be delegated to the Catalyst optimizer.
In fact, compared to the RDD API, where many transformations accept a number-of-partitions parameter in order to optimize co-partitioning and co-location whenever possible, with DataFrames the only way to alter partitioning is to invoke the repartition method; otherwise, the number of partitions for joins and aggregations is inferred from the configuration parameter spark.sql.shuffle.partitions.
From what I can see and understand from the explain plan above, it seems there is a useless repartition (so a shuffle, indeed) to 6 (the configured spark.sql.shuffle.partitions value) before repartitioning again to the final value imposed by the repartition method.
I believe the optimizer could have set the number of partitions of the join directly to the final value of 3.
Could someone help me clarify that point? Maybe I am missing something.
If you use Spark SQL, the number of shuffle partitions is always equal to spark.sql.shuffle.partitions. But if you enable spark.sql.adaptive.enabled, Spark will add an ExchangeCoordinator. Right now, the job of this coordinator is to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or multiple stages.
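For completeness, a minimal sketch of enabling it (assuming the Spark 1.6 sqlContext, df and df2 from the question; the feature was experimental in 1.6 and the target size below is purely illustrative):
// let the ExchangeCoordinator derive post-shuffle partition counts from the
// actual shuffle statistics instead of the fixed spark.sql.shuffle.partitions
sqlContext.setConf("spark.sql.adaptive.enabled", "true")
// target bytes per post-shuffle partition (64 MB here)
sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", (64 * 1024 * 1024).toString)

val df3 = df.join(df2, $"id_1" === $"id_2")
df3.rdd.getNumPartitions // typically fewer than the fixed 6 for such a tiny input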

What is SubqueryAlias node in analyzed logical plan?

I have a simple SQL query as follows:
test("SparkSQLTest") {
val spark = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")
df.explain(true)
}
The output for the analyzed logical plan is:
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- Filter (id#0L = cast(10 as bigint))
+- SubqueryAlias t1 ////don't understand here
+- Range (1, 100, step=1, splits=Some(1))
Why does the SubqueryAlias show up in the logical plan? In my SQL, I don't have any alias-related operations.
Could someone help explain? Thanks!
SubqueryAlias is a unary logical operator that gives an alias to the (child) subquery it was created for. The alias can be used in another part of a structured query to reference it, e.g. from a correlated subquery.
SubqueryAlias nodes (and aliases in general) are only available until the Spark Optimizer has finished the query optimization phase; they are removed by the EliminateSubqueryAliases optimization rule.
Quoting the EliminateSubqueryAliases optimization rule:
Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
In your query the subquery is the part before createOrReplaceTempView("t1").
spark.range(1, 100).createOrReplaceTempView("t1")
You could rewrite the above structured query into the following, which changes nothing but makes the structure more explicit.
val q = spark.range(1, 100)
q.createOrReplaceTempView("t1")
So, q could be any other structured query, hence the need for an alias to reference the output attributes of the subquery.
When you explain the query, you won't see any SubqueryAlias nodes in the optimized logical plan or the physical plan (and that's not only because the logical query plan gets planned into a physical query plan with physical operators; the optimizer has already eliminated the aliases by then).
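You can see this directly by printing the plans before and after optimization (a small sketch reusing the df from the test above):
println(df.queryExecution.analyzed)      // still contains SubqueryAlias t1
println(df.queryExecution.optimizedPlan) // SubqueryAlias is gone after EliminateSubqueryAliases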

What is Project node in execution query plan?

What is the meaning of a Project node in Spark's execution plan?
I have a plan containing the following:
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, dm_country#population#country#839, population#17 AS dm_country#population#population#844]
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12 AS dm_country#population#country#839, population#17]
+- Project [6a1ad864-235f-4761-9a6d-0ca2a2b40686#22 AS dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12, population#17]
+- RepartitionByExpression [country#12], 1000
+- Union
:- Project [ind#7 AS 6a1ad864-235f-4761-9a6d-0ca2a2b40686#22, country#12, population#17]
: +- Project [ind#7, country#12, population#2 AS population#17]
: +- Project [ind#7, country#1 AS country#12, population#2]
: +- Project [ind#0 AS ind#7, country#1, population#2]
: +- Relation[ind#0,country#1,population#2] JDBCRelation(schema_dbadmin.t_350) [numPartitions=100]
+- LogicalRDD [ind#45, country#46, population#47]
NOTE: Since the plan uses a RepartitionByExpression node, it must be a logical query plan.
A Project node in a logical query plan stands for the Project unary logical operator and is created whenever you use some kind of projection, explicitly or implicitly.
Quoting Wikipedia's Projection (relational algebra):
In practical terms, it can be roughly thought of as picking a subset of all available columns.
A Project node can appear in a logical query plan explicitly for the following:
Dataset operators, e.g. joinWith, select, unionByName
KeyValueGroupedDataset operators, e.g. keys, mapValues
SQL's SELECT queries
A Project node can also appear during the analysis and optimization phases.
In Spark SQL, the Dataset API gives you high-level operators, e.g. select, filter or groupBy, that ultimately build a Catalyst logical plan of a structured query. In other words, the simple-looking Dataset.select operator merely creates a LogicalPlan with a Project node.
val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))
(You could have used query.explain(extended = true) for the above, but that would have given you all four plans, which might have hidden the point.)
You could have a look at the code of the Dataset.select operator:
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
This simple-looking select operator is a mere wrapper around Catalyst operators to build a Catalyst tree of logical operators that gives a logical plan.
NOTE: What's nice about Spark SQL's Catalyst is that it uses this recursive LogicalPlan abstraction to represent a logical operator or a tree of logical operators.
NOTE: The same applies to good ol' SQL, where, after being parsed, the SQL text is transformed into an AST of logical operators. See the example below.
Project can come and go, since projection is about the columns in the output and may or may not appear in your plans and queries.
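A small illustration of that (a sketch): chaining selects produces nested Project nodes in the parsed plan, which the optimizer later collapses, or removes entirely when the projection keeps every column its child already outputs.
val q = spark.range(4).select("id").select("id")
// parsed/logical plan: nested Project nodes on top of Range
println(q.queryExecution.logical.numberedTreeString)
// optimized plan: Range already outputs only id, so the projections disappear
println(q.queryExecution.optimizedPlan.numberedTreeString)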
Catalyst DSL
You can use Spark SQL's Catalyst DSL (in org.apache.spark.sql.catalyst.dsl package object) for constructing Catalyst data structures using Scala implicit conversions. That could be very useful if you are into Spark testing.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.catalyst.dsl.plans._ // <-- gives table and select
import org.apache.spark.sql.catalyst.dsl.expressions.star
val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
Good ol' SQL
scala> spark.range(4).createOrReplaceTempView("nums")
scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | nums| true|
+--------+---------+-----------+
scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`
== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
+- Range (0, 4, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
