create all possible combinations of values in columns from two dataframes - python-3.x

I have two PySpark dataframes, dist_stores and dist_brands; samples of both are shown below. One has a date field plus all of the distinct store IDs, and the other has the same date field plus all of the distinct brand IDs. Both dataframes are created from the same original dataframe, smp_train_df. I add a column 'jnky' with the single value 'a' to both dist_stores and dist_brands so that I can join them. My end goal is to create a dataframe with all combinations of storeid and tz_brand_id; max_dt is always the same. When I run the code below to join dist_stores and dist_brands, I get the error message shown further down. Does anyone see what the issue is and how to fix it? Or is there a better way to accomplish the goal of getting all storeid and brand id combinations?
code:
# imports used throughout
from pyspark.sql.functions import lit, col

# get all store/brand combos
# getting distinct stores and adding a constant join key
dist_stores = smp_train_df[['storeid', 'max_dt']].distinct().withColumn('jnky', lit('a'))
# getting distinct brands and adding a constant join key
dist_brands = smp_train_df[['tz_brand_id', 'max_dt']].distinct().withColumn('jnky', lit('a'))
dist_stores.show()
+-------+----------+----+
|storeid| max_dt|jnky|
+-------+----------+----+
| 85|2020-05-03| a|
| 127|2020-05-03| a|
| 130|2020-05-03| a|
| 87|2020-05-03| a|
| 77|2020-05-03| a|
+-------+----------+----+
dist_brands.show()
+-----------+----------+----+
|tz_brand_id| max_dt|jnky|
+-----------+----------+----+
| 107|2020-05-03| a|
| 3476|2020-05-03| a|
| 3463|2020-05-03| a|
| 358|2020-05-03| a|
| 612|2020-05-03| a|
| 227|2020-05-03| a|
| 3452|2020-05-03| a|
| 36|2020-05-03| a|
| 99|2020-05-03| a|
| 3432|2020-05-03| a|
| 4167|2020-05-03| a|
| 2909|2020-05-03| a|
| 104|2020-05-03| a|
| 141|2020-05-03| a|
| 3618|2020-05-03| a|
| 5290|2020-05-03| a|
| 248|2020-05-03| a|
| 203|2020-05-03| a|
| 3519|2020-05-03| a|
| 221|2020-05-03| a|
+-----------+----------+----+
code:
# getting all combinations of store and brand via the constant join key
store_brand = dist_stores.alias('a') \
    .join(dist_brands.alias('b'),
          col('a.jnky') == col('b.jnky'),
          how='inner') \
    .select(col('a.storeid'),
            col('a.max_dt'),
            col('b.tz_brand_id'))
Error:
An error was encountered:
'Resolved attribute(s) max_dt#22095 missing from filter_date#407,min_dt#421,tz_brand_id#573,storeid#569,current_date#401,max_dt#429,qty#499,dateclosed#514 in operator !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;;\nJoin Inner, (jnky#22085 = jnky#22091)\n:- SubqueryAlias `a`\n: +- Project [storeid#292, max_dt#429, a AS jnky#22085]\n: +- Deduplicate [storeid#292, max_dt#429]\n: +- Project [storeid#292, max_dt#429]\n: +- Project [tz_brand_id#296, min_dt#421, max_dt#429, coalesce((brand_qty#470 / total_qty#452), cast(0 as double)) AS norm_qty#596, storeid#292]\n: +- Join LeftOuter, (storeid#292 = storeid#569)\n: :- SubqueryAlias `a`\n: : +- Project [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296, sum(qty)#463 AS brand_qty#470]\n: : +- Aggregate [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296], [storeid#292, min_dt#421, max_dt#429, tz_brand_id#296, sum(qty#222) AS sum(qty)#463]\n: : +- Filter ((dateclosed#237 > min_dt#421) && (dateclosed#237 <= max_dt#429))\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n: : +- Filter (dateclosed#237 > filter_date#407)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n: : +- Filter storeid#292 IN (85,130,77,127,87)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n: : +- Filter isnotnull(tz_brand_id#296)\n: : +- Filter NOT (storeid#292 = 230)\n: : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237]\n: : +- Filter (producttype#211 = EDIBLE)\n: : +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, productbrand#214, productname#215, classification#216, tier#217, weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 
75 more fields], false\n: +- SubqueryAlias `b`\n: +- Project [storeid#569, sum(qty)#446 AS total_qty#452]\n: +- Aggregate [storeid#569, min_dt#421, max_dt#429], [storeid#569, min_dt#421, max_dt#429, sum(qty#499) AS sum(qty)#446]\n: +- Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#429))\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n: +- Filter (dateclosed#514 > filter_date#407)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n: +- Filter storeid#569 IN (85,130,77,127,87)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n: +- Filter isnotnull(tz_brand_id#573)\n: +- Filter NOT (storeid#569 = 230)\n: +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514]\n: +- Filter (producttype#488 = EDIBLE)\n: +- LogicalRDD [cbd_perc#476, thc_perc#477, register#478, customer_type#479, type#480, customer_state#481, customer_city#482, zip_code#483, age#484, age_group#485, cashier#486, approver#487, producttype#488, productsubtype#489, productattributes#490, productbrand#491, productname#492, classification#493, tier#494, weight#495, unitofmeasure#496, size#497, priceunit#498, qty#499, ... 75 more fields], false\n+- SubqueryAlias `b`\n +- Project [tz_brand_id#296, max_dt#22095, a AS jnky#22091]\n +- Deduplicate [tz_brand_id#296, max_dt#22095]\n +- Project [tz_brand_id#296, max_dt#22095]\n +- Project [tz_brand_id#296, min_dt#421, max_dt#22095, coalesce((brand_qty#470 / total_qty#452), cast(0 as double)) AS norm_qty#596, storeid#292]\n +- Join LeftOuter, (storeid#292 = storeid#569)\n :- SubqueryAlias `a`\n : +- Project [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296, sum(qty)#463 AS brand_qty#470]\n : +- Aggregate [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296], [storeid#292, min_dt#421, max_dt#22095, tz_brand_id#296, sum(qty#222) AS sum(qty)#463]\n : +- Filter ((dateclosed#237 > min_dt#421) && (dateclosed#237 <= max_dt#22095))\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#22095]\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n : +- Filter (dateclosed#237 > filter_date#407)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n : +- Filter storeid#292 IN (85,130,77,127,87)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n : +- Filter isnotnull(tz_brand_id#296)\n : +- Filter NOT (storeid#292 = 230)\n : +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237]\n : +- Filter (producttype#211 = EDIBLE)\n : +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, 
productbrand#214, productname#215, classification#216, tier#217, weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 75 more fields], false\n +- SubqueryAlias `b`\n +- Project [storeid#569, sum(qty)#446 AS total_qty#452]\n +- !Aggregate [storeid#569, min_dt#421, max_dt#22095], [storeid#569, min_dt#421, max_dt#22095, sum(qty#499) AS sum(qty)#446]\n +- !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095))\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, min_dt#421, date_add(filter_date#407, 60) AS max_dt#429]\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, filter_date#407, date_add(filter_date#407, 0) AS min_dt#421]\n +- Filter (dateclosed#514 > filter_date#407)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, current_date#401, date_add(current_date#401, -120) AS filter_date#407]\n +- Filter storeid#569 IN (85,130,77,127,87)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514, to_date(cast(unix_timestamp(2020-07-02 14:57:04, yyyy-MM-dd, None) as timestamp), None) AS current_date#401]\n +- Filter isnotnull(tz_brand_id#573)\n +- Filter NOT (storeid#569 = 230)\n +- Project [tz_brand_id#573, storeid#569, qty#499, dateclosed#514]\n +- Filter (producttype#488 = EDIBLE)\n +- LogicalRDD [cbd_perc#476, thc_perc#477, register#478, customer_type#479, type#480, customer_state#481, customer_city#482, zip_code#483, age#484, age_group#485, cashier#486, approver#487, producttype#488, productsubtype#489, productattributes#490, productbrand#491, productname#492, classification#493, tier#494, weight#495, unitofmeasure#496, size#497, priceunit#498, qty#499, ... 75 more fields], false\n'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1049, in join
jdf = self._jdf.join(other._jdf, on, how)
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/livy/appcache/application_1587410022410_0092/container_1587410022410_0092_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Resolved attribute(s) max_dt#22095 missing from filter_date#407,min_dt#421,tz_brand_id#573,storeid#569,current_date#401,max_dt#429,qty#499,dateclosed#514 in operator !Filter ((dateclosed#514 > min_dt#421) && (dateclosed#514 <= max_dt#22095)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;;' (followed by the same logical plan as above)

This is happening because of the duplicate max_dt column name in the two dataframes. Rename it before performing the join:
dist_brands = dist_brands.withColumnRenamed('max_dt', 'max_dt_brands')
Then perform the join:
dist_stores = dist_stores.join(dist_brands, 'jnky', 'left')
dist_stores.show()
A left join on the constant key gives you all of the combinations. You can then select just the columns you need:
dist_stores = dist_stores.select('storeid', 'max_dt', 'tz_brand_id')
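As for a better way: a plain crossJoin gives the same set of combinations without the constant 'jnky' column at all. A minimal sketch, assuming the same smp_train_df and column names as in the question; if the 'Resolved attribute(s) ... missing' error still shows up because both sides share the same lineage, the rename above still applies:
# distinct stores (carrying the constant max_dt) and distinct brands
dist_stores = smp_train_df.select('storeid', 'max_dt').distinct()
dist_brands = smp_train_df.select('tz_brand_id').distinct()

# crossJoin produces every storeid/tz_brand_id pair; max_dt rides along
# from the stores side since it is the same for every row
store_brand = dist_stores.crossJoin(dist_brands) \
    .select('storeid', 'max_dt', 'tz_brand_id')

store_brand.show()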

Related

pyspark: filter values in one dataframe based on array values in another dataframe

I have a pyspark dataframe like this:
+--------------------+------------+-----------+
|                name|segment_list|  rung_list|
+--------------------+------------+-----------+
|          Campaign 1|  [1.0, 5.0]|   [L2, L3]|
|          Campaign 1|       [1.1]|       [L1]|
|          Campaign 2|       [1.2]|       [L2]|
|          Campaign 2|       [1.1]|   [L4, L5]|
+--------------------+------------+-----------+
I have another pyspark dataframe that has segment and rung for every customer:
+-----------+---------------+---------+
|customer_id| segment |rung |
+-----------+---------------+---------+
| 124001823| 1.0| L2|
| 166001989| 5.0| L2|
| 768002266| 1.1| L1|
+-----------+---------------+---------+
What I want is a final output that finds, for each campaign, the customers whose segment and rung appear in that campaign's lists. The final output should look like the following:
+--------------------+------------+
|                name| customer_id|
+--------------------+------------+
|          Campaign 1|   124001823|
|          Campaign 1|   166001989|
|          Campaign 1|   768002266|
+--------------------+------------+
I tried using a udf but that approach didn't quite work. I would like to avoid a for loop over a collect(), or any other row-by-row processing; I am primarily looking for a groupby-style operation on the name column.
So I want a better way to do the following:
for row in x.collect():
    y = eligible.filter(eligible.segment.isin(row['segment_list'])) \
                .filter(eligible.rung.isin(row['rung_list']))
You could try using array_contains in the join conditions. Here's an example (func is pyspark.sql.functions):
from pyspark.sql import functions as func

data1_sdf. \
    join(data2_sdf,
         func.expr('array_contains(segment_list, segment)') &
         func.expr('array_contains(rung_list, rung)'),
         'left'). \
    select('name', 'customer_id'). \
    dropDuplicates(). \
    show(truncate=False)
# +----------+-----------+
# |name |customer_id|
# +----------+-----------+
# |Campaign 1|166001989 |
# |Campaign 1|124001823 |
# |Campaign 1|768002266 |
# |Campaign 2|null |
# +----------+-----------+
Pasting the query plan Spark produced:
== Parsed Logical Plan ==
Deduplicate [name#123, customer_id#129]
+- Project [name#123, customer_id#129]
+- Join LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- LogicalRDD [name#123, segment_list#124, rung_list#125], false
+- LogicalRDD [customer_id#129, segment#130, rung#131], false
== Analyzed Logical Plan ==
name: string, customer_id: string
Deduplicate [name#123, customer_id#129]
+- Project [name#123, customer_id#129]
+- Join LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- LogicalRDD [name#123, segment_list#124, rung_list#125], false
+- LogicalRDD [customer_id#129, segment#130, rung#131], false
== Optimized Logical Plan ==
Aggregate [name#123, customer_id#129], [name#123, customer_id#129]
+- Project [name#123, customer_id#129]
+- Join LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- LogicalRDD [name#123, segment_list#124, rung_list#125], false
+- Filter (isnotnull(segment#130) AND isnotnull(rung#131))
+- LogicalRDD [customer_id#129, segment#130, rung#131], false
== Physical Plan ==
*(4) HashAggregate(keys=[name#123, customer_id#129], functions=[], output=[name#123, customer_id#129])
+- Exchange hashpartitioning(name#123, customer_id#129, 200), ENSURE_REQUIREMENTS, [id=#267]
+- *(3) HashAggregate(keys=[name#123, customer_id#129], functions=[], output=[name#123, customer_id#129])
+- *(3) Project [name#123, customer_id#129]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- *(1) Scan ExistingRDD[name#123,segment_list#124,rung_list#125]
+- BroadcastExchange IdentityBroadcastMode, [id=#261]
+- *(2) Filter (isnotnull(segment#130) AND isnotnull(rung#131))
+- *(2) Scan ExistingRDD[customer_id#129,segment#130,rung#131]
It does not seem particularly well optimized (note the BroadcastNestedLoopJoin), so there may be better-optimized approaches.
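One such alternative (a sketch, not part of the original answer, reusing the data1_sdf / data2_sdf names from above): if the arrays are reasonably small, exploding them turns the array_contains predicates into plain equality conditions, so Spark can use an ordinary equi-join instead of a BroadcastNestedLoopJoin.
from pyspark.sql import functions as func

# explode both arrays so each (name, segment, rung) pair becomes its own row,
# then equi-join on the exploded columns
exploded_sdf = data1_sdf \
    .withColumn('segment', func.explode('segment_list')) \
    .withColumn('rung', func.explode('rung_list'))

exploded_sdf \
    .join(data2_sdf, on=['segment', 'rung'], how='left') \
    .select('name', 'customer_id') \
    .dropDuplicates() \
    .show(truncate=False)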

Spark Query using Inner join instead of full join

Can anyone explain the behaviour below in Spark SQL joins? It does not matter whether I use full join / full outer / left / left outer: the physical plan always shows that an Inner join is being used.
q1 = spark.sql("select count(*) from table_t1 t1 full join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q1.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#19, item_id#20, store_id#23], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#19 ASC NULLS FIRST, item_id#20 ASC NULLS FIRST, store_id#23 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#19, item_id#20, store_id#23, 200)
+- *(3) Project [anchor_page_id#19, item_id#20, store_id#23]
+- *(3) Filter ((isnotnull(anchor_page_id#19) && isnotnull(item_id#20)) && isnotnull(store_id#23))
+- *(3) FileScan parquet table_t1[anchor_page_id#19,item_id#20,store_id#23,date_id#36] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#36), (date_id#36 = 20220324)], PushedFilters: [IsNotNull(anchor_page_id), IsNotNull(item_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
>>>
q2 = spark.sql("select count(*) from table_t1 t1 full outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q2.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#42, item_id#43, store_id#46], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#42 ASC NULLS FIRST, item_id#43 ASC NULLS FIRST, store_id#46 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#42, item_id#43, store_id#46, 200)
+- *(3) Project [anchor_page_id#42, item_id#43, store_id#46]
+- *(3) Filter ((isnotnull(store_id#46) && isnotnull(anchor_page_id#42)) && isnotnull(item_id#43))
+- *(3) FileScan parquet table_t1[anchor_page_id#42,item_id#43,store_id#46,date_id#59] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#59), (date_id#59 = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(anchor_page_id), IsNotNull(item_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
>>>
q3 = spark.sql("select count(*) from table_t1 t1 left join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q3.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#65, item_id#66, store_id#69], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#65 ASC NULLS FIRST, item_id#66 ASC NULLS FIRST, store_id#69 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#65, item_id#66, store_id#69, 200)
+- *(3) Project [anchor_page_id#65, item_id#66, store_id#69]
+- *(3) Filter ((isnotnull(item_id#66) && isnotnull(store_id#69)) && isnotnull(anchor_page_id#65))
+- *(3) FileScan parquet table_t1[anchor_page_id#65,item_id#66,store_id#69,date_id#82] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#82), (cast(date_id#82 as int) = 20220324)], PushedFilters: [IsNotNull(item_id), IsNotNull(store_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
q4 = spark.sql("select count(*) from table_t1 t1 left outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q4.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#88, item_id#89, store_id#92], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#88 ASC NULLS FIRST, item_id#89 ASC NULLS FIRST, store_id#92 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#88, item_id#89, store_id#92, 200)
+- *(3) Project [anchor_page_id#88, item_id#89, store_id#92]
+- *(3) Filter ((isnotnull(store_id#92) && isnotnull(item_id#89)) && isnotnull(anchor_page_id#88))
+- *(3) FileScan parquet table_t1[anchor_page_id#88,item_id#89,store_id#92,date_id#105] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#105), (cast(date_id#105 as int) = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(item_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
Full join is full outer join.
A WHERE clause predicate on a table that sits on an 'outer' side of the join causes the optimizer to convert the 'outer join' into an 'inner join'.
A WHERE clause predicate on any 'outer' table makes it an 'inner' table: only rows for which that predicate can be evaluated pass the filter, which throws away exactly the null-extended rows the outer join would have produced. If you want a side to stay outer, move its predicate into the ON clause instead, as sketched below.
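A hedged sketch (not from the original post): with the t2 filter moved from WHERE into the join condition, unmatched t1 rows no longer have to satisfy a predicate on t2 columns, so the join is no longer collapsed to Inner. Same hypothetical table_t1 as in the question.
q5 = spark.sql("""
    select count(*)
    from table_t1 t1
    full join table_t1 t2
      on  t1.anchor_page_id = t2.anchor_page_id
      and t1.item_id        = t2.item_id
      and t1.store_id       = t2.store_id
      and t2.date_id        = '20220324'
    where t1.date_id = '20220323'
""")
q5.explain()
# expect something like: SortMergeJoin [...], [...], LeftOuter
# (the t1 predicate still in WHERE reduces the FULL join to a LEFT OUTER join,
#  but it is no longer turned into an Inner join)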

Spark joinWith repartitions an already partitioned Dataset

Let's say we have two partitioned datasets
val partitionedPersonDS = personDS.repartition(200, personDS("personId"))
val partitionedTransactionDS = transactionDS.repartition(200, transactionDS("personId"))
And we try to join them using joinWith on the same key over which they are partitioned
val transactionPersonDS: Dataset[(Transaction, Person)] = partitionedTransactionDS
  .joinWith(
    partitionedPersonDS,
    partitionedTransactionDS.col("personId") === partitionedPersonDS.col("personId")
  )
The physical plan shows that the already-partitioned Datasets were repartitioned as part of the sort merge join:
InMemoryTableScan [_1#14, _2#15]
+- InMemoryRelation [_1#14, _2#15], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(5) SortMergeJoin [_1#14.personId], [_2#15.personId], Inner
:- *(2) Sort [_1#14.personId ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(_1#14.personId, 200)
: +- *(1) Project [named_struct(transactionId, transactionId#8, personId, personId#9, itemList, itemList#10) AS _1#14]
: +- Exchange hashpartitioning(personId#9, 200)
: +- LocalTableScan [transactionId#8, personId#9, itemList#10]
+- *(4) Sort [_2#15.personId ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(_2#15.personId, 200)
+- *(3) Project [named_struct(personId, personId#2, name, name#3) AS _2#15]
+- Exchange hashpartitioning(personId#2, 200)
+- LocalTableScan [personId#2, name#3]
But when we perform the join using join, the already-partitioned Datasets are NOT repartitioned, only sorted, as part of the sort merge join:
val transactionPersonDS: DataFrame = partitionedTransactionDS
  .join(
    partitionedPersonDS,
    partitionedTransactionDS("personId") === partitionedPersonDS("personId")
  )
InMemoryTableScan [transactionId#8, personId#9, itemList#10, personId#2, name#3]
+- InMemoryRelation [transactionId#8, personId#9, itemList#10, personId#2, name#3], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(3) SortMergeJoin [personId#9], [personId#2], Inner
:- *(1) Sort [personId#9 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(personId#9, 200)
: +- LocalTableScan [transactionId#8, personId#9, itemList#10]
+- *(2) Sort [personId#2 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(personId#2, 200)
+- LocalTableScan [personId#2, name#3]
Why does joinWith fail to honor a pre-partitioned Dataset, unlike join?

Understand the plan tree string representation

I have a simple join query:
test("SparkSQLTest 0005") {
val spark = SparkSession.builder().master("local").appName("SparkSQLTest 0005").getOrCreate()
spark.range(100, 100000).createOrReplaceTempView("t1")
spark.range(2000, 10000).createOrReplaceTempView("t2")
val df = spark.sql("select count(1) from t1 join t2 on t1.id = t2.id")
df.explain(true)
}
The output is as follows. I have marked five questions in it as Q0~Q4; could someone help explain them? Thanks!
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)] //Q0, Why the first line has no +- or :-
+- 'Join Inner, ('t1.id = 't2.id) //Q1, What does +- mean
:- 'UnresolvedRelation `t1` //Q2 What does :- mean
+- 'UnresolvedRelation `t2`
== Analyzed Logical Plan ==
count(1): bigint
Aggregate [count(1) AS count(1)#9L]
+- Join Inner, (id#0L = id#2L)
:- SubqueryAlias t1
: +- Range (100, 100000, step=1, splits=Some(1)) //Q3 What does : +- mean?
+- SubqueryAlias t2
+- Range (2000, 10000, step=1, splits=Some(1))
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#9L]
+- Project
+- Join Inner, (id#0L = id#2L)
:- Range (100, 100000, step=1, splits=Some(1)) //Q4 These two Ranges are both Join's children, why one is :- and the other is +-
+- Range (2000, 10000, step=1, splits=Some(1)) //Q4
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#9L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#11L])
+- *(2) Project
+- *(2) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight
:- *(2) Range (100, 100000, step=1, splits=1)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *(1) Range (2000, 10000, step=1, splits=1)
They are bullet points simply representing ordered, nested operations. The following nesting:
Header
  Child 1
    Grandchild 1
  Child 2
    Grandchild 2
    Grandchild 3
  Child 3
would be written as
Header
:- Child 1
: +- Grandchild 1
:- Child 2
: :- Grandchild 2
: +- Grandchild 3
+- Child 3
+- A direct child, usually the last
:- A sibling of a direct child, but not the last
: +- The last grandchild, whose parent has a sibling
: :- A grandchild with a sibling, whose parent is non-final and also has a sibling
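A small sketch (not from the original answer) that produces a node with several siblings, so the ':-' versus '+-' prefixes can be seen directly; it assumes an active SparkSession named spark:
# three-way union: the Union node ends up with three children
df = spark.range(3).union(spark.range(3)).union(spark.range(3))
df.explain(True)
# the union's children print roughly as
# Union
# :- Range (0, 3, ...)
# :- Range (0, 3, ...)
# +- Range (0, 3, ...)
# i.e. ':-' for every child except the last, '+-' for the last one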

Performance implications of Spark Pipelines

Using SQLTransformer we can create new columns in a dataframe, and we can build a Pipeline of several SQLTransformers as well. We can do the same thing using multiple selectExpr calls on dataframes.
But are the performance optimizations that are applied to the selectExpr calls also applied to a pipeline of SQLTransformers?
For example, consider the two snippets of code below:
#Method 1
df = spark.table("transactions")
df = df.selectExpr("*","sum(amt) over (partition by account) as acc_sum")
df = df.selectExpr("*","sum(amt) over (partition by dt) as dt_sum")
df.show(10)
#Method 2
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

df = spark.table("transactions")
trans1 = SQLTransformer(statement="SELECT *, sum(amt) over (partition by account) as acc_sum from __THIS__")
trans2 = SQLTransformer(statement="SELECT *, sum(amt) over (partition by dt) as dt_sum from __THIS__")
pipe = Pipeline(stages=[trans1, trans2])
transPipe = pipe.fit(df)
transPipe.transform(df).show(10)
Will the performance for both of these ways of computing the same thing be the same?
Or will there be some extra optimizations that are applied to method 1 that are not used in method 2?
No additional optimizations. As always, when in doubt, check the execution plan:
df = spark.createDataFrame([(1, 1, 1)], ("amt", "account", "dt"))
(df
.selectExpr("*","sum(amt) over (partition by account) as acc_sum")
.selectExpr("*","sum(amt) over (partition by dt) as dt_sum")
.explain(True))
generates:
== Parsed Logical Plan ==
'Project [*, 'sum('amt) windowspecdefinition('dt, unspecifiedframe$()) AS dt_sum#165]
+- AnalysisBarrier Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
== Analyzed Logical Plan ==
amt: bigint, account: bigint, dt: bigint, acc_sum: bigint, dt_sum: bigint
Project [amt#22L, account#23L, dt#24L, acc_sum#158L, dt_sum#165L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L, dt_sum#165L, dt_sum#165L]
+- Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L, acc_sum#158L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
+- Project [amt#22L, account#23L, dt#24L]
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Optimized Logical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Physical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- *Sort [dt#24L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(dt#24L, 200)
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
+- *Sort [account#23L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(account#23L, 200)
+- Scan ExistingRDD[amt#22L,account#23L,dt#24L]
while
trans2.transform(trans1.transform(df)).explain(True)
generates
== Parsed Logical Plan ==
'Project [*, 'sum('amt) windowspecdefinition('dt, unspecifiedframe$()) AS dt_sum#150]
+- 'UnresolvedRelation `SQLTransformer_4318bd7007cefbf17a97_826abb6c003c`
== Analyzed Logical Plan ==
amt: bigint, account: bigint, dt: bigint, acc_sum: bigint, dt_sum: bigint
Project [amt#22L, account#23L, dt#24L, acc_sum#120L, dt_sum#150L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L, dt_sum#150L, dt_sum#150L]
+- Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L]
+- SubqueryAlias sqltransformer_4318bd7007cefbf17a97_826abb6c003c
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L, acc_sum#120L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
+- Project [amt#22L, account#23L, dt#24L]
+- SubqueryAlias sqltransformer_4688bba599a7f5a09c39_f5e9d251099e
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Optimized Logical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
+- LogicalRDD [amt#22L, account#23L, dt#24L], false
== Physical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- *Sort [dt#24L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(dt#24L, 200)
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
+- *Sort [account#23L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(account#23L, 200)
+- Scan ExistingRDD[amt#22L,account#23L,dt#24L]
As you can see, the optimized and physical plans are identical.
