Can anyone explain the behaviour below in a Spark SQL join? Regardless of whether I use full join, full outer join, left join, or left outer join, the physical plan always shows that an Inner join is being used.
q1 = spark.sql("select count(*) from table_t1 t1 full join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q1.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#19, item_id#20, store_id#23], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#19 ASC NULLS FIRST, item_id#20 ASC NULLS FIRST, store_id#23 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#19, item_id#20, store_id#23, 200)
+- *(3) Project [anchor_page_id#19, item_id#20, store_id#23]
+- *(3) Filter ((isnotnull(anchor_page_id#19) && isnotnull(item_id#20)) && isnotnull(store_id#23))
+- *(3) FileScan parquet table_t1[anchor_page_id#19,item_id#20,store_id#23,date_id#36] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#36), (date_id#36 = 20220324)], PushedFilters: [IsNotNull(anchor_page_id), IsNotNull(item_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
>>>
q2 = spark.sql("select count(*) from table_t1 t1 full outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q2.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#42, item_id#43, store_id#46], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#42 ASC NULLS FIRST, item_id#43 ASC NULLS FIRST, store_id#46 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#42, item_id#43, store_id#46, 200)
+- *(3) Project [anchor_page_id#42, item_id#43, store_id#46]
+- *(3) Filter ((isnotnull(store_id#46) && isnotnull(anchor_page_id#42)) && isnotnull(item_id#43))
+- *(3) FileScan parquet table_t1[anchor_page_id#42,item_id#43,store_id#46,date_id#59] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#59), (date_id#59 = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(anchor_page_id), IsNotNull(item_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
>>>
q3 = spark.sql("select count(*) from table_t1 t1 left join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q3.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#65, item_id#66, store_id#69], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#65 ASC NULLS FIRST, item_id#66 ASC NULLS FIRST, store_id#69 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#65, item_id#66, store_id#69, 200)
+- *(3) Project [anchor_page_id#65, item_id#66, store_id#69]
+- *(3) Filter ((isnotnull(item_id#66) && isnotnull(store_id#69)) && isnotnull(anchor_page_id#65))
+- *(3) FileScan parquet table_t1[anchor_page_id#65,item_id#66,store_id#69,date_id#82] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#82), (cast(date_id#82 as int) = 20220324)], PushedFilters: [IsNotNull(item_id), IsNotNull(store_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
q4 = spark.sql("select count(*) from table_t1 t1 left outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q4.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#88, item_id#89, store_id#92], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#88 ASC NULLS FIRST, item_id#89 ASC NULLS FIRST, store_id#92 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#88, item_id#89, store_id#92, 200)
+- *(3) Project [anchor_page_id#88, item_id#89, store_id#92]
+- *(3) Filter ((isnotnull(store_id#92) && isnotnull(item_id#89)) && isnotnull(anchor_page_id#88))
+- *(3) FileScan parquet table_t1[anchor_page_id#88,item_id#89,store_id#92,date_id#105] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#105), (cast(date_id#105 as int) = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(item_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
FULL JOIN is simply shorthand for FULL OUTER JOIN, so q1 and q2 are the same query.
A WHERE clause on the outer side of an outer join is rewritten by the optimizer into an inner join: a predicate such as t2.date_id = '20220324' can never evaluate to true on the NULL-padded rows the outer join would add, so those rows would be filtered out anyway.
In other words, a WHERE predicate on any 'outer' table effectively makes it an 'inner' table, because only rows where that predicate evaluates to true survive the filter.
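If you actually need outer-join behaviour between the two date partitions, one way to keep it (a sketch based on the queries above, not something from the original post) is to apply each date filter before the join, e.g. inside subqueries, so there is no post-join WHERE predicate left for the optimizer to turn into an inner-join condition:
q_outer = spark.sql("""
    select count(*)
    from (select * from table_t1 where date_id = '20220323') t1
    full outer join (select * from table_t1 where date_id = '20220324') t2
      on t1.anchor_page_id = t2.anchor_page_id
     and t1.item_id = t2.item_id
     and t1.store_id = t2.store_id
""")
q_outer.explain()  # the SortMergeJoin should now report FullOuter instead of Inner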
I have two PySpark dataframes, dist_stores and dist_brands; samples of both are shown below. One holds a date field and all of the distinct storeid values, and the other holds the same date field and all of the distinct brand ids. Both dataframes are derived from the same original dataframe, smp_train_df. I add a column 'jnky' with the single value 'a' to each of the two derived dataframes so that I can join dist_stores and dist_brands. My end goal is a dataframe with every combination of storeid and brand id; max_dt is always the same. When I run the code below to join dist_stores and dist_brands I get the error message below. Does anyone see what the issue is, and can you suggest how to fix it? Or is there a better way to get all storeid and brand id combinations?
code:
# get all store brand combos
from pyspark.sql.functions import lit

# getting distinct stores and adding join key
dist_stores = smp_train_df[['storeid', 'max_dt']].distinct().withColumn('jnky', lit('a'))
# getting distinct brands and adding join key
dist_brands = smp_train_df[['tz_brand_id', 'max_dt']].distinct().withColumn('jnky', lit('a'))
dist_stores.show()
+-------+----------+----+
|storeid| max_dt|jnky|
+-------+----------+----+
| 85|2020-05-03| a|
| 127|2020-05-03| a|
| 130|2020-05-03| a|
| 87|2020-05-03| a|
| 77|2020-05-03| a|
+-------+----------+----+
dist_brands.show()
+-----------+----------+----+
|tz_brand_id| max_dt|jnky|
+-----------+----------+----+
| 107|2020-05-03| a|
| 3476|2020-05-03| a|
| 3463|2020-05-03| a|
| 358|2020-05-03| a|
| 612|2020-05-03| a|
| 227|2020-05-03| a|
| 3452|2020-05-03| a|
| 36|2020-05-03| a|
| 99|2020-05-03| a|
| 3432|2020-05-03| a|
| 4167|2020-05-03| a|
| 2909|2020-05-03| a|
| 104|2020-05-03| a|
| 141|2020-05-03| a|
| 3618|2020-05-03| a|
| 5290|2020-05-03| a|
| 248|2020-05-03| a|
| 203|2020-05-03| a|
| 3519|2020-05-03| a|
| 221|2020-05-03| a|
+-----------+----------+----+
code:
from pyspark.sql.functions import col

# getting all combinations of store and brand
store_brand = dist_stores.alias('a')\
    .join(dist_brands.alias('b'),
          (col('a.jnky') == col('b.jnky')),
          how='inner')\
    .select(col('a.storeid'),
            col('a.max_dt'),
            col('b.tz_brand_id'))
Error:
An error was encountered:
'Resolved attribute(s) max_dt#22095 missing from filter_date#407,min_dt#421,tz_brand_id#573,storeid#569,current_date#401,max_dt#429,qty#499,dateclosed#514 in operator !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;;\nJoin Inner, (jnky#22085 = jnky#22091)\n:- SubqueryAlias `a`\n: +- Project [storeid#292, max_dt#429, a AS jnky#22085]\n: +- Deduplicate [storeid#292, max_dt#429]\n: +- Project [storeid#292, max_dt#429]\n: +- Project [tz_brand_id#296, min_dt#421, max_dt#429, coalesce((brand_qty#470 / total_qty#452), cast(0 as double)) AS norm_qty#596, storeid#292]\n: +- Join LeftOuter, (storeid#292 = storeid#569)\n: :- SubqueryAlias `a`\n: : +- Project [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296, sum(qty)#463 AS brand_qty#470]\n: : +- Aggregate [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296], [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296, sum(qty#222) AS sum(qty)#463]\n: : +- Filter ((dateclosed#237 > min_dt#421) && (dateclosed#237 <= max_dt#429))\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n: : +- Filter (dateclosed#237 > filter_date#407)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n: : +- Filter storeid#292 IN (85,130,77,127,87)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n: : +- Filter isnotnull(tz_brand_id#296)\n: : +- Filter NOT (storeid#292 = 230)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237]\n: : +- Filter (producttype#211 = EDIBLE)\n: : +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, productbrand#214, productname#215, classification#216, tier#217, weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 
75 more fields], false\n: +- SubqueryAlias `b`\n: +- Project [storeid#569, sum(qty)#446 AS total_qty#452]\n: +- Aggregate [storeid#569, min_dt#421, max_dt#429], [storeid#569, min_dt#421, max_dt#429, sum(qty#499) AS sum(qty)#446]\n: +- Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#429))\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n: +- Filter (dateclosed#514 > filter_date#407)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n: +- Filter storeid#569 IN (85,130,77,127,87)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n: +- Filter isnotnull(tz_brand_id#573)\n: +- Filter NOT (storeid#569 = 230)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514]\n: +- Filter (producttype#488 = EDIBLE)\n: +- LogicalRDD [cbd_perc#476, thc_perc#477, register#478, customer_type#479, type#480, customer_state#481, customer_city#482, zip_code#483, age#484, age_group#485, cashier#486, approver#487, producttype#488, productsubtype#489, productattributes#490, productbrand#491, productname#492, classification#493, tier#494, weight#495, unitofmeasure#496, size#497, priceunit#498, qty#499, ... 75 more fields], false\n+- SubqueryAlias `b`\n +- Project [tz_brand_id#296, max_dt#22095, a AS jnky#22091]\n +- Deduplicate [tz_brand_id#296, max_dt#22095]\n +- Project [tz_brand_id#296, max_dt#22095]\n +- Project [tz_brand_id#296, min_dt#421, max_dt#22095, coalesce((brand_qty#470 / total_qty#452), cast(0 as double)) AS norm_qty#596, storeid#292]\n +- Join LeftOuter, (storeid#292 = storeid#569)\n :- SubqueryAlias `a`\n : +- Project [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296, sum(qty)#463 AS brand_qty#470]\n : +- Aggregate [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296], [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296, sum(qty#222) AS sum(qty)#463]\n : +- Filter ((dateclosed#237 > min_dt#421) && (dateclosed#237 <= max_dt#22095))\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#22095]\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n : +- Filter (dateclosed#237 > filter_date#407)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n : +- Filter storeid#292 IN (85,130,77,127,87)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n : +- Filter isnotnull(tz_brand_id#296)\n : +- Filter NOT (storeid#292 = 230)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237]\n : +- Filter (producttype#211 = EDIBLE)\n : +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, 
productbrand#214, productname#215, classification#216, tier#217, weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 75 more fields], false\n +- SubqueryAlias `b`\n +- Project [storeid#569, sum(qty)#446 AS total_qty#452]\n +- !Aggregate [storeid#569, min_dt#421, max_dt#22095], [storeid#569, min_dt#421, max_dt#22095, sum(qty#499) AS sum(qty)#446]\n +- !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095))\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n +- Filter (dateclosed#514 > filter_date#407)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n +- Filter storeid#569 IN (85,130,77,127,87)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n +- Filter isnotnull(tz_brand_id#573)\n +- Filter NOT (storeid#569 = 230)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514]\n +- Filter (producttype#488 = EDIBLE)\n +- LogicalRDD [cbd_perc#476, thc_perc#477, register#478, customer_type#479, type#480, customer_state#481, customer_city#482, zip_code#483, age#484, age_group#485, cashier#486, approver#487, producttype#488, productsubtype#489, productattributes#490, productbrand#491, productname#492, classification#493, tier#494, weight#495, unitofmeasure#496, size#497, priceunit#498, qty#499, ... 75 more fields], false\n'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1049, in join
jdf = self._jdf.join(other._jdf, on, how)
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Resolved attribute(s) max_dt#22095 missing from filter_date#407,min_dt#421,tz_brand_id#573,storeid#569,current_date#401,max_dt#429,qty#499,dateclosed#514 in operator !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;; ...' (the logical plan in the exception is identical to the one shown above)
This is happening because the column max_dt exists in both dataframes (they are derived from the same parent), so the join cannot resolve which max_dt is meant. Rename it in one of them before performing the join:
dist_brands = dist_brands.withColumnRenamed('max_dt', 'max_dt_brands')
Then perform the join:
dist_stores = dist_stores.join(dist_brands, 'jnky', 'left')
dist_stores.show()
A left join on the constant key gives every storeid/brand combination. You can then select just the columns you need:
dist_stores = dist_stores.select('storeid', 'max_dt', 'tz_brand_id')
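For completeness, here is a minimal end-to-end sketch of that fix, reusing the dataframe and column names from the question (the crossJoin variant at the end is an alternative I am adding, not part of the original answer):
from pyspark.sql.functions import lit

# distinct keys from the same parent dataframe, with the ambiguous max_dt renamed on one side
dist_stores = smp_train_df.select('storeid', 'max_dt').distinct().withColumn('jnky', lit('a'))
dist_brands = (smp_train_df.select('tz_brand_id', 'max_dt').distinct()
               .withColumnRenamed('max_dt', 'max_dt_brands')
               .withColumn('jnky', lit('a')))

# all storeid / tz_brand_id combinations via the constant join key
store_brand = (dist_stores.join(dist_brands, 'jnky', 'left')
               .select('storeid', 'max_dt', 'tz_brand_id'))

# alternative: drop the dummy key entirely and use an explicit cross join
store_brand_alt = dist_stores.drop('jnky').crossJoin(dist_brands.drop('jnky', 'max_dt_brands'))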