Can anyone explain the behaviour below in a Spark SQL join? Regardless of whether I use full join, full outer join, left join, or left outer join, the physical plan always shows that an Inner join is being used.
q1 = spark.sql("select count(*) from table_t1 t1 full join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q1.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#19, item_id#20, store_id#23], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#19 ASC NULLS FIRST, item_id#20 ASC NULLS FIRST, store_id#23 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#19, item_id#20, store_id#23, 200)
+- *(3) Project [anchor_page_id#19, item_id#20, store_id#23]
+- *(3) Filter ((isnotnull(anchor_page_id#19) && isnotnull(item_id#20)) && isnotnull(store_id#23))
+- *(3) FileScan parquet table_t1[anchor_page_id#19,item_id#20,store_id#23,date_id#36] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#36), (date_id#36 = 20220324)], PushedFilters: [IsNotNull(anchor_page_id), IsNotNull(item_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
>>>
q2 = spark.sql("select count(*) from table_t1 t1 full outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q2.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#42, item_id#43, store_id#46], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#42 ASC NULLS FIRST, item_id#43 ASC NULLS FIRST, store_id#46 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#42, item_id#43, store_id#46, 200)
+- *(3) Project [anchor_page_id#42, item_id#43, store_id#46]
+- *(3) Filter ((isnotnull(store_id#46) && isnotnull(anchor_page_id#42)) && isnotnull(item_id#43))
+- *(3) FileScan parquet table_t1[anchor_page_id#42,item_id#43,store_id#46,date_id#59] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#59), (date_id#59 = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(anchor_page_id), IsNotNull(item_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
>>>
q3 = spark.sql("select count(*) from table_t1 t1 left join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q3.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#65, item_id#66, store_id#69], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#65 ASC NULLS FIRST, item_id#66 ASC NULLS FIRST, store_id#69 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#65, item_id#66, store_id#69, 200)
+- *(3) Project [anchor_page_id#65, item_id#66, store_id#69]
+- *(3) Filter ((isnotnull(item_id#66) && isnotnull(store_id#69)) && isnotnull(anchor_page_id#65))
+- *(3) FileScan parquet table_t1[anchor_page_id#65,item_id#66,store_id#69,date_id#82] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#82), (cast(date_id#82 as int) = 20220324)], PushedFilters: [IsNotNull(item_id), IsNotNull(store_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
q4 = spark.sql("select count(*) from table_t1 t1 left outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q4.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#88, item_id#89, store_id#92], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#88 ASC NULLS FIRST, item_id#89 ASC NULLS FIRST, store_id#92 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#88, item_id#89, store_id#92, 200)
+- *(3) Project [anchor_page_id#88, item_id#89, store_id#92]
+- *(3) Filter ((isnotnull(store_id#92) && isnotnull(item_id#89)) && isnotnull(anchor_page_id#88))
+- *(3) FileScan parquet table_t1[anchor_page_id#88,item_id#89,store_id#92,date_id#105] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#105), (cast(date_id#105 as int) = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(item_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
FULL JOIN is simply shorthand for FULL OUTER JOIN, so q1 and q2 are the same query.
A WHERE clause on the outer side of an outer join is rewritten by the optimizer into an inner join: a predicate such as t2.date_id = '20220324' can never evaluate to true on the NULL-padded rows the outer join would add, so those rows would be filtered out anyway.
In other words, a WHERE predicate on any 'outer' table effectively makes it an 'inner' table, because only rows where that predicate evaluates to true survive the filter.
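If you actually need outer-join behaviour between the two date partitions, one way to keep it (a sketch based on the queries above, not something from the original post) is to apply each date filter before the join, e.g. inside subqueries, so there is no post-join WHERE predicate left for the optimizer to turn into an inner-join condition:
q_outer = spark.sql("""
    select count(*)
    from (select * from table_t1 where date_id = '20220323') t1
    full outer join (select * from table_t1 where date_id = '20220324') t2
      on t1.anchor_page_id = t2.anchor_page_id
     and t1.item_id = t2.item_id
     and t1.store_id = t2.store_id
""")
q_outer.explain()  # the SortMergeJoin should now report FullOuter instead of Inner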
I have two PySpark dataframes, dist_stores and dist_brands; samples of both are shown below. One holds a date field and all of the distinct storeid values, and the other holds the same date field and all of the distinct brand ids. Both dataframes are derived from the same original dataframe, smp_train_df. I add a column 'jnky' with the single value 'a' to each of the two derived dataframes so that I can join dist_stores and dist_brands. My end goal is a dataframe with every combination of storeid and brand id; max_dt is always the same. When I run the code below to join dist_stores and dist_brands I get the error message below. Does anyone see what the issue is, and can you suggest how to fix it? Or is there a better way to get all storeid and brand id combinations?
code:
# get all store brand combos
from pyspark.sql.functions import lit

# getting distinct stores and adding join key
dist_stores = smp_train_df[['storeid', 'max_dt']].distinct().withColumn('jnky', lit('a'))
# getting distinct brands and adding join key
dist_brands = smp_train_df[['tz_brand_id', 'max_dt']].distinct().withColumn('jnky', lit('a'))
dist_stores.show()
+-------+----------+----+
|storeid| max_dt|jnky|
+-------+----------+----+
| 85|2020-05-03| a|
| 127|2020-05-03| a|
| 130|2020-05-03| a|
| 87|2020-05-03| a|
| 77|2020-05-03| a|
+-------+----------+----+
dist_brands.show()
+-----------+----------+----+
|tz_brand_id| max_dt|jnky|
+-----------+----------+----+
| 107|2020-05-03| a|
| 3476|2020-05-03| a|
| 3463|2020-05-03| a|
| 358|2020-05-03| a|
| 612|2020-05-03| a|
| 227|2020-05-03| a|
| 3452|2020-05-03| a|
| 36|2020-05-03| a|
| 99|2020-05-03| a|
| 3432|2020-05-03| a|
| 4167|2020-05-03| a|
| 2909|2020-05-03| a|
| 104|2020-05-03| a|
| 141|2020-05-03| a|
| 3618|2020-05-03| a|
| 5290|2020-05-03| a|
| 248|2020-05-03| a|
| 203|2020-05-03| a|
| 3519|2020-05-03| a|
| 221|2020-05-03| a|
+-----------+----------+----+
code:
from pyspark.sql.functions import col

# getting all combinations of store and brand
store_brand = dist_stores.alias('a')\
    .join(dist_brands.alias('b'),
          (col('a.jnky') == col('b.jnky')),
          how='inner')\
    .select(col('a.storeid'),
            col('a.max_dt'),
            col('b.tz_brand_id'))
Error:
An error was encountered:
'Resolved attribute(s) max_dt#22095 missing from filter_date#407,min_dt#421,tz_brand_id#573,storeid#569,current_date#401,max_dt#429,qty#499,dateclosed#514 in operator !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;;\nJoin Inner, (jnky#22085 = jnky#22091)\n:- SubqueryAlias `a`\n: +- Project [storeid#292, max_dt#429, a AS jnky#22085]\n: +- Deduplicate [storeid#292, max_dt#429]\n: +- Project [storeid#292, max_dt#429]\n: +- Project [tz_brand_id#296, min_dt#421, max_dt#429, coalesce((brand_qty#470 / total_qty#452), cast(0 as double)) AS norm_qty#596, storeid#292]\n: +- Join LeftOuter, (storeid#292 = storeid#569)\n: :- SubqueryAlias `a`\n: : +- Project [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296, sum(qty)#463 AS brand_qty#470]\n: : +- Aggregate [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296], [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296, sum(qty#222) AS sum(qty)#463]\n: : +- Filter ((dateclosed#237 > min_dt#421) && (dateclosed#237 <= max_dt#429))\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n: : +- Filter (dateclosed#237 > filter_date#407)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n: : +- Filter storeid#292 IN (85,130,77,127,87)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n: : +- Filter isnotnull(tz_brand_id#296)\n: : +- Filter NOT (storeid#292 = 230)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237]\n: : +- Filter (producttype#211 = EDIBLE)\n: : +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, productbrand#214, productname#215, classification#216, tier#217, weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 
75 more fields], false\n: +- SubqueryAlias `b`\n: +- Project [storeid#569, sum(qty)#446 AS total_qty#452]\n: +- Aggregate [storeid#569, min_dt#421, max_dt#429], [storeid#569, min_dt#421, max_dt#429, sum(qty#499) AS sum(qty)#446]\n: +- Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#429))\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n: +- Filter (dateclosed#514 > filter_date#407)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n: +- Filter storeid#569 IN (85,130,77,127,87)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n: +- Filter isnotnull(tz_brand_id#573)\n: +- Filter NOT (storeid#569 = 230)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514]\n: +- Filter (producttype#488 = EDIBLE)\n: +- LogicalRDD [cbd_perc#476, thc_perc#477, register#478, customer_type#479, type#480, customer_state#481, customer_city#482, zip_code#483, age#484, age_group#485, cashier#486, approver#487, producttype#488, productsubtype#489, productattributes#490, productbrand#491, productname#492, classification#493, tier#494, weight#495, unitofmeasure#496, size#497, priceunit#498, qty#499, ... 75 more fields], false\n+- SubqueryAlias `b`\n +- Project [tz_brand_id#296, max_dt#22095, a AS jnky#22091]\n +- Deduplicate [tz_brand_id#296, max_dt#22095]\n +- Project [tz_brand_id#296, max_dt#22095]\n +- Project [tz_brand_id#296, min_dt#421, max_dt#22095, coalesce((brand_qty#470 / total_qty#452), cast(0 as double)) AS norm_qty#596, storeid#292]\n +- Join LeftOuter, (storeid#292 = storeid#569)\n :- SubqueryAlias `a`\n : +- Project [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296, sum(qty)#463 AS brand_qty#470]\n : +- Aggregate [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296], [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296, sum(qty#222) AS sum(qty)#463]\n : +- Filter ((dateclosed#237 > min_dt#421) && (dateclosed#237 <= max_dt#22095))\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#22095]\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n : +- Filter (dateclosed#237 > filter_date#407)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n : +- Filter storeid#292 IN (85,130,77,127,87)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n : +- Filter isnotnull(tz_brand_id#296)\n : +- Filter NOT (storeid#292 = 230)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237]\n : +- Filter (producttype#211 = EDIBLE)\n : +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, 
productbrand#214, productname#215, classification#216, tier#217, weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 75 more fields], false\n +- SubqueryAlias `b`\n +- Project [storeid#569, sum(qty)#446 AS total_qty#452]\n +- !Aggregate [storeid#569, min_dt#421, max_dt#22095], [storeid#569, min_dt#421, max_dt#22095, sum(qty#499) AS sum(qty)#446]\n +- !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095))\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n +- Filter (dateclosed#514 > filter_date#407)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n +- Filter storeid#569 IN (85,130,77,127,87)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n +- Filter isnotnull(tz_brand_id#573)\n +- Filter NOT (storeid#569 = 230)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514]\n +- Filter (producttype#488 = EDIBLE)\n +- LogicalRDD [cbd_perc#476, thc_perc#477, register#478, customer_type#479, type#480, customer_state#481, customer_city#482, zip_code#483, age#484, age_group#485, cashier#486, approver#487, producttype#488, productsubtype#489, productattributes#490, productbrand#491, productname#492, classification#493, tier#494, weight#495, unitofmeasure#496, size#497, priceunit#498, qty#499, ... 75 more fields], false\n'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1049, in join
jdf = self._jdf.join(other._jdf, on, how)
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Resolved attribute(s) max_dt#22095 missing from filter_date#407,min_dt#421,tz_brand_id#573,storeid#569,current_date#401,max_dt#429,qty#499,dateclosed#514 in operator !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;; ...' (the logical plan in the exception is identical to the one shown above)
This is happening because the column max_dt exists in both dataframes (they are derived from the same parent), so the join cannot resolve which max_dt is meant. Rename it in one of them before performing the join:
dist_brands = dist_brands.withColumnRenamed('max_dt', 'max_dt_brands')
Then perform the join:
dist_stores = dist_stores.join(dist_brands, 'jnky', 'left')
dist_stores.show()
A left join on the constant key gives every storeid/brand combination. You can then select just the columns you need:
dist_stores = dist_stores.select('storeid', 'max_dt', 'tz_brand_id')
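For completeness, here is a minimal end-to-end sketch of that fix, reusing the dataframe and column names from the question (the crossJoin variant at the end is an alternative I am adding, not part of the original answer):
from pyspark.sql.functions import lit

# distinct keys from the same parent dataframe, with the ambiguous max_dt renamed on one side
dist_stores = smp_train_df.select('storeid', 'max_dt').distinct().withColumn('jnky', lit('a'))
dist_brands = (smp_train_df.select('tz_brand_id', 'max_dt').distinct()
               .withColumnRenamed('max_dt', 'max_dt_brands')
               .withColumn('jnky', lit('a')))

# all storeid / tz_brand_id combinations via the constant join key
store_brand = (dist_stores.join(dist_brands, 'jnky', 'left')
               .select('storeid', 'max_dt', 'tz_brand_id'))

# alternative: drop the dummy key entirely and use an explicit cross join
store_brand_alt = dist_stores.drop('jnky').crossJoin(dist_brands.drop('jnky', 'max_dt_brands'))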