Why am I getting different results between PySpark and SQL? - apache-spark

I am trying to translate the SQL below into PySpark using two different syntaxes, but the two versions give different outputs, and neither matches the SQL output. I cannot see where the actual difference between these pieces of code lies.
select count(*) from (
select afpo.charg as Batch_Number,
mara1.matkl as Material_Group,
mara1.zzmanu_stg as Mfg_Stage_Code,
mkpf.budat as WCB_261_Posting_Date,
mch1.hsdat as Manufacturing_Date
from
opssup_dev_wrk_sap.src_sap_afpo afpo
inner join opssup_dev_wrk_sap.src_sap_mara mara1 on afpo.matnr=mara1.matnr
inner join opssup_dev_wrk_sap.src_sap_mseg mseg on afpo.aufnr=mseg.aufnr
inner join opssup_dev_wrk_sap.src_sap_mkpf mkpf on mseg.mblnr=mkpf.mblnr
inner join opssup_dev_wrk_sap.src_sap_mara mara on mseg.matnr=mara.matnr
inner join opssup_dev_wrk_sap.src_sap_mch1 mch1 on afpo.charg=mch1.charg
where mara.zzmanu_stg='WCB'
and mseg.bwart='261')
-- it returns 2505 rows
The execution plan of the above SQL query:
*(15) Project [charg#72 AS Batch_Number#327407, matkl#126 AS Material_Group#327408, zzmanu_stg#275 AS Mfg_Stage_Code#327409, budat#511 AS WCB_261_Posting_Date#327410, hsdat#571 AS Manufacturing_Date#327411]
+- *(15) SortMergeJoin [charg#72], [charg#543], Inner
:- *(12) Sort [charg#72 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(charg#72, 200)
: +- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
: +- *(11) BroadcastHashJoin [matnr#321], [matnr#327416], Inner, BuildRight, false
: :- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, matnr#321, budat#511]
: : +- *(11) SortMergeJoin [mblnr#313], [mblnr#505], Inner
: : :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(mblnr#313, 200)
: : : +- *(6) Project [charg#72, matkl#126, zzmanu_stg#275, mblnr#313, matnr#321]
: : : +- *(6) ...
I have converted this SQL to PySpark as below:
afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo_df')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1_df')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg_df')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf_df')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara_df')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1_df')
temp12_df = afpo_df \
    .join(mara1_df, (afpo_df.matnr == mara1_df.matnr)) \
    .join(mseg_df, (afpo_df.aufnr == mseg_df.aufnr)) \
    .join(mkpf_df, (mseg_df.mblnr == mkpf_df.mblnr)) \
    .join(mara_df, (mseg_df.matnr == mara_df.matnr)) \
    .join(mch1_df, (afpo_df.charg == mch1_df.charg)) \
    .filter("mseg_df.bwart=='261' AND mara_df.zzmanu_stg=='WCB'") \
    .select(afpo_df.charg.alias('Batch_Number'), mara1_df.matkl.alias('Material_Group'), mara1_df.zzmanu_stg.alias('Mfg_Stage_Code'),
            mkpf_df.budat.alias('WCB_261_Posting_Date'), mch1_df.hsdat.alias('Manufacturing_Date'))
target_df = temp12_df
print(target_df.count())
It returns around 13 lakh (roughly 1.3 million) rows.
The corresponding query plan for the above code:
== Physical Plan ==
*(15) Project [charg#72 AS Batch_Number#322732, matkl#126 AS Material_Group#322733, zzmanu_stg#275 AS Mfg_Stage_Code#322734, budat#511 AS WCB_261_Posting_Date#322735, hsdat#571 AS Manufacturing_Date#322736]
+- *(15) BroadcastNestedLoopJoin BuildRight, Inner
:- *(15) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511, hsdat#571]
: +- *(15) SortMergeJoin [charg#72], [charg#543], Inner
: :- *(11) Sort [charg#72 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(charg#72, 200)
: : +- *(10) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
: : +- *(10) SortMergeJoin [mblnr#313], [mblnr#505], Inner
: : :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(mblnr#313, 200)
: : : +- *(6) Project [charg#72, matkl#126, zzmanu_stg#275, mblnr#313]
: : : +- *(6) SortMergeJoin [aufnr#14, matnr#116], [aufnr#368, matnr#321], Inner
: : : :- *(3) Sort [aufnr#14 ASC NULLS FIRST, matnr#116 ASC NULLS FIRST], false, 0
: : : : +- Exchange hashpartitioning(aufnr#14, matnr#116, 200)
: : : : +- *(2) Project [aufnr#14, charg#72, matnr#116, matkl#126, zzmanu_stg#275]
: : : : +- *(2) BroadcastHashJoin [matnr#33], [matnr#116], Inner, BuildRight, false
: : : : :- *(2) Project [aufnr#14, matnr#33, charg#72]
: : : : : +- *(2) Filter ((isnotnull(matnr#33) && isnotnull(aufnr#14)) && isnotnull(charg#72))
: : : : : +- *(2) FileScan parquet opssup_dev_wrk_sap.src_sap_afpo[aufnr#14,matnr#33,charg#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_afpo], PartitionFilters: [], PushedFilters: [IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(charg)], ReadSchema: struct<aufnr:string,matnr:string,charg:string>
: : : : +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
: : : : +- *(1) Project [matnr#116, matkl#126, zzmanu_stg#275]
: : : : +- *(1) Filter isnotnull(matnr#116)
: : : : +- *(1) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#116,matkl#126,zzmanu_stg#275] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(matnr)], ReadSchema: struct<matnr:string,matkl:string,zzmanu_stg:string>
: : : +- *(5) Sort [aufnr#368 ASC NULLS FIRST, matnr#321 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(aufnr#368, matnr#321, 200)
: : : +- *(4) Project [mblnr#313, matnr#321, aufnr#368]
: : : +- *(4) Filter ((((isnotnull(bwart#319) && (bwart#319 = 261)) && isnotnull(matnr#321)) && isnotnull(aufnr#368)) && isnotnull(mblnr#313))
: : : +- *(4) FileScan parquet opssup_dev_wrk_sap.src_sap_mseg[mblnr#313,bwart#319,matnr#321,aufnr#368] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mseg], PartitionFilters: [], PushedFilters: [IsNotNull(bwart), EqualTo(bwart,261), IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,bwart:string,matnr:string,aufnr:string>
: : +- *(9) Sort [mblnr#505 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(mblnr#505, 200)
: : +- *(8) Project [mblnr#505, budat#511]
: : +- *(8) Filter isnotnull(mblnr#505)
: : +- *(8) FileScan parquet opssup_dev_wrk_sap.src_sap_mkpf[mblnr#505,budat#511] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mkpf], PartitionFilters: [], PushedFilters: [IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,budat:string>
: +- *(13) Sort [charg#543 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(charg#543, 200)
: +- *(12) Project [charg#543, hsdat#571]
: +- *(12) Filter isnotnull(charg#543)
: +- *(12) FileScan parquet opssup_dev_wrk_sap.src_sap_mch1[charg#543,hsdat#571] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mch1], PartitionFilters: [], PushedFilters: [IsNotNull(charg)], ReadSchema: struct<charg:string,hsdat:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(14) Project
+- *(14) Filter (isnotnull(zzmanu_stg#318210) && (zzmanu_stg#318210 = WCB))
+- *(14) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[zzmanu_stg#318210] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(zzmanu_stg), EqualTo(zzmanu_stg,WCB)], ReadSchema: struct<zzmanu_stg:string>
I have also tried:
afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo_df')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1_df')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg_df')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf_df')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara_df')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1_df')
temp12_df = afpo_df \
    .join(mara1_df, "matnr") \
    .join(mseg_df, "aufnr") \
    .join(mkpf_df, "mblnr") \
    .join(mara_df, "matnr") \
    .join(mch1_df, "charg") \
    .filter("mseg_df.bwart=='261' AND mara_df.zzmanu_stg=='WCB'") \
    .select(afpo_df.charg.alias('Batch_Number'), mara1_df.matkl.alias('Material_Group'), mara1_df.zzmanu_stg.alias('Mfg_Stage_Code'),
            mkpf_df.budat.alias('WCB_261_Posting_Date'), mch1_df.hsdat.alias('Manufacturing_Date'))
target_df = temp12_df
print(target_df.count())
It returns 1804 rows.
The execution plan of the above code:
== Physical Plan ==
*(15) Project [charg#72 AS Batch_Number#301751, matkl#126 AS Material_Group#301752, zzmanu_stg#275 AS Mfg_Stage_Code#301753, budat#511 AS WCB_261_Posting_Date#301754, hsdat#571 AS Manufacturing_Date#301755]
+- *(15) SortMergeJoin [charg#72], [charg#543], Inner
:- *(12) Sort [charg#72 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(charg#72, 200)
: +- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
: +- *(11) BroadcastHashJoin [matnr#33], [matnr#300069], Inner, BuildRight, false
: :- *(11) Project [matnr#33, charg#72, matkl#126, zzmanu_stg#275, budat#511]
: : +- *(11) SortMergeJoin [mblnr#313], [mblnr#505], Inner
: : :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(mblnr#313, 200)
: : : +- *(6) Project [matnr#33, charg#72, matkl#126, zzmanu_stg#275, mblnr#313]
: : : +- *(6) SortMergeJoin [aufnr#14], [aufnr#368], Inner
: : : :- *(3) Sort [aufnr#14 ASC NULLS FIRST], false, 0
: : : : +- Exchange hashpartitioning(aufnr#14, 200)
: : : : +- *(2) Project [matnr#33, aufnr#14, charg#72, matkl#126, zzmanu_stg#275]
: : : : +- *(2) BroadcastHashJoin [matnr#33], [matnr#116], Inner, BuildRight, false
: : : : :- *(2) Project [aufnr#14, matnr#33, charg#72]
: : : : : +- *(2) Filter ((isnotnull(matnr#33) && isnotnull(aufnr#14)) && isnotnull(charg#72))
: : : : : +- *(2) FileScan parquet opssup_dev_wrk_sap.src_sap_afpo[aufnr#14,matnr#33,charg#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_afpo], PartitionFilters: [], PushedFilters: [IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(charg)], ReadSchema: struct<aufnr:string,matnr:string,charg:string>
: : : : +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
: : : : +- *(1) Project [matnr#116, matkl#126, zzmanu_stg#275]
: : : : +- *(1) Filter isnotnull(matnr#116)
: : : : +- *(1) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#116,matkl#126,zzmanu_stg#275] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(matnr)], ReadSchema: struct<matnr:string,matkl:string,zzmanu_stg:string>
: : : +- *(5) Sort [aufnr#368 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(aufnr#368, 200)
: : : +- *(4) Project [mblnr#313, aufnr#368]
: : : +- *(4) Filter (((isnotnull(bwart#319) && (bwart#319 = 261)) && isnotnull(aufnr#368)) && isnotnull(mblnr#313))
: : : +- *(4) FileScan parquet opssup_dev_wrk_sap.src_sap_mseg[mblnr#313,bwart#319,aufnr#368] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mseg], PartitionFilters: [], PushedFilters: [IsNotNull(bwart), EqualTo(bwart,261), IsNotNull(aufnr), IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,bwart:string,aufnr:string>
: : +- *(9) Sort [mblnr#505 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(mblnr#505, 200)
: : +- *(8) Project [mblnr#505, budat#511]
: : +- *(8) Filter isnotnull(mblnr#505)
: : +- *(8) FileScan parquet opssup_dev_wrk_sap.src_sap_mkpf[mblnr#505,budat#511] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mkpf], PartitionFilters: [], PushedFilters: [IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,budat:string>
: +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
: +- *(10) Project [matnr#300069]
: +- *(10) Filter ((isnotnull(zzmanu_stg#300228) && (zzmanu_stg#300228 = WCB)) && isnotnull(matnr#300069))
: +- *(10) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#300069,zzmanu_stg#300228] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(zzmanu_stg), EqualTo(zzmanu_stg,WCB), IsNotNull(matnr)], ReadSchema: struct<matnr:string,zzmanu_stg:string>
+- *(14) Sort [charg#543 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(charg#543, 200)
+- *(13) Project [charg#543, hsdat#571]
+- *(13) Filter isnotnull(charg#543)
+- *(13) FileScan parquet opssup_dev_wrk_sap.src_sap_mch1[charg#543,hsdat#571] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mch1], PartitionFilters: [], PushedFilters: [IsNotNull(charg)], ReadSchema: struct<charg:string,hsdat:string>
Why is this happening, and what is the best way to convert the above SQL query to PySpark?
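For reference, here is a minimal sketch of the same joins written with explicit, alias-qualified col() references, so that the two reads of src_sap_mara and the shared key columns cannot resolve ambiguously. This is only an assumption about one possible rewrite (it reuses the sqlContext and sap_source_schema objects from above), not a confirmed equivalent of the SQL; the most faithful translation is simply to run the original SQL string through sqlContext.sql(...).
from pyspark.sql import functions as F

# Sketch only (assumption): qualify every column through the string aliases
# rather than the DataFrame attributes, so 'mara' vs 'mara1' and the shared
# key columns stay unambiguous.
afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1')

temp12_df = (afpo_df
    .join(mara1_df, F.col('afpo.matnr') == F.col('mara1.matnr'))
    .join(mseg_df, F.col('afpo.aufnr') == F.col('mseg.aufnr'))
    .join(mkpf_df, F.col('mseg.mblnr') == F.col('mkpf.mblnr'))
    .join(mara_df, F.col('mseg.matnr') == F.col('mara.matnr'))
    .join(mch1_df, F.col('afpo.charg') == F.col('mch1.charg'))
    .filter((F.col('mseg.bwart') == '261') & (F.col('mara.zzmanu_stg') == 'WCB'))
    .select(F.col('afpo.charg').alias('Batch_Number'),
            F.col('mara1.matkl').alias('Material_Group'),
            F.col('mara1.zzmanu_stg').alias('Mfg_Stage_Code'),
            F.col('mkpf.budat').alias('WCB_261_Posting_Date'),
            F.col('mch1.hsdat').alias('Manufacturing_Date')))
print(temp12_df.count())
Comparing the explain() output of this version against the SQL plan should show whether any join condition is still being resolved to a different column than intended.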

Related

DatabricksSQL: package.TreeNodeException: execute, tree: ShuffleQueryStage 26, Statistics(sizeInBytes=21.5 MiB, isRuntime=true)

I have created 5 temp views from PySpark DataFrames to run a query performing some joins and other operations, such as aggregation on numerical columns.
When I run the query in SSMS I'm able to get the desired output, but when I run the same in Databricks SQL I get the following error.
> package.TreeNodeException: execute, tree: ShuffleQueryStage 26,
> Statistics(sizeInBytes=21.5 MiB, isRuntime=true)
> +- Exchange hashpartitioning(Product_Line_Description#333640, _groupingexpression#333910, _groupingexpression#333911, _groupingexpression#333912, _groupingexpression#333913, _groupingexpression#333914, 512), ENSURE_REQUIREMENTS, [id=#568961] +- *(26) HashAggregate(keys=[Product_Line_Description#333640, _groupingexpression#333910, knownfloatingpointnormalized(normalizenanandzero(_groupingexpression#333911))
> AS _groupingexpression#333911, _groupingexpression#333912,
> _groupingexpression#333913, knownfloatingpointnormalized(normalizenanandzero(_groupingexpression#333914))
> AS _groupingexpression#333914],
> functions=[partial_sum(Offered_customer#333839) AS sum#333925],
> output=[Product_Line_Description#333640, _groupingexpression#333910,
> _groupingexpression#333911, _groupingexpression#333912, _groupingexpression#333913, _groupingexpression#333914, sum#333925])
> +- Union
> :- *(24) HashAggregate(keys=[Product_Line_Description#333640, Region#333832, Date#333834, Product_Group_Name#333638,
> Business_Type_Name#333641], functions=[finalmerge_sum(merge
> sum#333927) AS sum(cast(Offered_customer#333879 as double))#333887],
> output=[Product_Line_Description#333640, Offered_customer#333839,
> _groupingexpression#333910, _groupingexpression#333911, _groupingexpression#333912, _groupingexpression#333913, _groupingexpression#333914])
> : +- CustomShuffleReader coalesced
> : +- ShuffleQueryStage 23, Statistics(sizeInBytes=17.6 MiB, rowCount=1.62E+5, isRuntime=true)
> : +- Exchange hashpartitioning(Product_Line_Description#333640, Region#333832,
> Date#333834, Product_Group_Name#333638, Business_Type_Name#333641,
> 512), ENSURE_REQUIREMENTS, [id=#568455]
> : +- *(18) HashAggregate(keys=[Product_Line_Description#333640, Region#333832,
> Date#333834, Product_Group_Name#333638, Business_Type_Name#333641],
> functions=[partial_sum(cast(Offered_customer#333879 as double)) AS
> sum#333927], output=[Product_Line_Description#333640, Region#333832,
> Date#333834, Product_Group_Name#333638, Business_Type_Name#333641,
> sum#333927])
> : +- Union
> : :- *(16) Project [Product_Line_Description#333640, AMS AS Region#333832,
> cast(CheckOverflow((promote_precision(Offered#333676) -
> promote_precision(Cleared_Count#333683)), DecimalType(38,0), true) as
> string) AS Offered_customer#333879, partition_calendar_date#333687 AS
> Date#333834, Product_Group_Name#333638, Business_Type_Name#333641]
> : : +- *(16) BroadcastHashJoin [target_virtual_queue#333536], [Resource_Name#333632], Inner,
> BuildRight, false
> : : :- *(16) Project [Offered#333676, Cleared_Count#333683, partition_calendar_date#333687,
> target_virtual_queue#333536]
> : : : +- *(16) BroadcastHashJoin [knownfloatingpointnormalized(normalizenanandzero(cast(Resource_Key#333675
> as double)))],
> [knownfloatingpointnormalized(normalizenanandzero(cast(queue_resource_key#333533
> as double)))], Inner, BuildRight, false
> : : : :- CustomShuffleReader local
> : : : : +- ShuffleQueryStage 0, Statistics(sizeInBytes=285.4 MiB, rowCount=4.16E+6, isRuntime=true)
> : : : : +- Exchange hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(Resource_Key#333675
> as double))), 512), ENSURE_REQUIREMENTS, [id=#565246]
> : : : : +- *(1) Filter isnotnull(Resource_Key#333675)
> : : : : +- *(1) ColumnarToRow
> : : : : +- FileScan parquet
> [Resource_Key#333675,Offered#333676,Cleared_Count#333683,partition_calendar_date#333687]
> Batched: true, DataFilters: [isnotnull(Resource_Key#333675)], Format:
> Parquet, Location:
> PreparedDeltaFileIndex[abfss://transformed-domain-data#ebproebiadls.dfs.core.windows.net/services...,
> PartitionFilters: [], PushedFilters: [IsNotNull(Resource_Key)],
> ReadSchema:
> struct<Resource_Key:decimal(10,0),Offered:decimal(38,0),Cleared_Count:decimal(38,0),partition_cal...
> : : : +- BroadcastQueryStage 16, Statistics(sizeInBytes=64.1 MiB, rowCount=3.21E+3, isRuntime=true)
> : : : +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(knownfloatingpointnormalized(normalizenanandzero(cast(input[0,
> string, true] as double)))),false), [id=#566599]
> : : : +- CustomShuffleReader local
> : : : +- ShuffleQueryStage 1, Statistics(sizeInBytes=228.4 KiB, rowCount=3.21E+3, isRuntime=true)
> : : : +- Exchange hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(queue_resource_key#333533
> as double))), 512), ENSURE_REQUIREMENTS, [id=#565266]
> : : : +- *(2) Project [queue_resource_key#333533, target_virtual_queue#333536]
> : : : +- *(2) Filter ((isnotnull(resource_type_name#333535) AND
> isnotnull(queue_resource_key#333533)) AND
> isnotnull(target_virtual_queue#333536))
> : : : +- *(2) ColumnarToRow
> : : : +- FileScan parquet
> [queue_resource_key#333533,resource_type_name#333535,target_virtual_queue#333536]
> Batched: true, DataFilters: [isnotnull(resource_type_name#333535),
> isnotnull(queue_resource_key#333533), isnotnull(target_vir..., Format:
> Parquet, Location:
> InMemoryFileIndex[dbfs:/innovation-services#ebstgebiadls.dfs.core.windows.net/rr-call-volume-fore...,
> PartitionFilters: [], PushedFilters: [IsNotNull(resource_type_name),
> IsNotNull(queue_resource_key), IsNotNull(target_virtual_queue)],
> ReadSchema:
> struct<queue_resource_key:string,resource_type_name:string,target_virtual_queue:string>
> : : +- BroadcastQueryStage 15, Statistics(sizeInBytes=64.0 MiB, rowCount=312, isRuntime=true)
> : : +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string,
> true]),false), [id=#566096]
> : : +- CustomShuffleReader local
> : : +- ShuffleQueryStage 2, Statistics(sizeInBytes=40.5 KiB, rowCount=312, isRuntime=true)
> : : +- Exchange hashpartitioning(Resource_Name#333632, 512), ENSURE_REQUIREMENTS,
> [id=#565292]
> : : +- *(3) Project [Resource_Name#333632, Product_Group_Name#333638,
> Product_Line_Description#333640, Business_Type_Name#333641]
> : : +- *(3) Filter ((((((((isnotnull(Region_Code#333634) AND
> isnotnull(Forecasting_flag#333645)) AND
> isnotnull(Product_Line_Description#333640)) AND (Region_Code#333634 =
> AMS)) AND (Forecasting_flag#333645 = Yes)) AND NOT
> Contains(Product_Line_Description#333640, LAR)) AND NOT
> Contains(Product_Line_Description#333640, Brazil)) AND NOT
> Contains(Product_Line_Description#333640, MX)) AND
> isnotnull(Resource_Name#333632))
> : : +- *(3) ColumnarToRow
> : : +- FileScan parquet
> [Resource_Name#333632,Region_Code#333634,Product_Group_Name#333638,Product_Line_Description#333640,Business_Type_Name#333641,Forecasting_flag#333645] Batched: true, DataFilters: [isnotnull(Region_Code#333634),
> isnotnull(Forecasting_flag#333645), isnotnull(Product_Line_Descri...,
> Format: Parquet, Location:
> InMemoryFileIndex[dbfs:/innovation-services#ebstgebiadls.dfs.core.windows.net/rr-call-volume-fore...,
> PartitionFilters: [], PushedFilters: [IsNotNull(Region_Code),
> IsNotNull(Forecasting_flag), IsNotNull(Product_Line_Description),
> EqualT..., ReadSchema:
> struct<Resource_Name:string,Region_Code:string,Product_Group_Name:string,Product_Line_Description...
> : +- *(17) Project [Product_Line_Description#333591, AMS AS Region#333868,
> Offered_customer#333592, cast(Calendar_Date#333589 as date) AS
> Date#333870, Product_Group_Name#333638, Business_Type_Name#333641]
Note: I have read in some forums that the issue could be due to AutoBroadcast, and I tried disabling it with the following code, but the issue still persists.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
It would be really helpful if anyone can guide me on how to resolve this issue.
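Not a confirmed fix, but since ShuffleQueryStage nodes come from Adaptive Query Execution, a hedged sketch of the session settings that are sometimes toggled for this class of error (in addition to the broadcast threshold already tried above) would look like this:
# Sketch only, assumptions rather than a verified fix for this error:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # disable automatic broadcast joins (already tried above)
spark.conf.set("spark.sql.adaptive.enabled", "false")       # disable AQE, the source of ShuffleQueryStage nodes
# In a pure Databricks SQL context the equivalent is a SQL SET statement,
# e.g. SET spark.sql.adaptive.enabled = false; whether the warehouse permits
# overriding these settings is an assumption.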

Spark SQL physical plan doesn't reuse exchange

I'm trying to optimize the physical plan for the following transformation.
1. Read data from 'pad' and 'pi'.
2. Find rows in 'pad' that have a reference in 'pi' and transform some columns.
3. Find rows in 'pad' that don't have a reference in 'pi' and transform some columns.
4. Merge the rows from steps 2 and 3.
val pad_in_pi = pad
  .join(
    pi
    , $"pad.ReferenceKeyCode" === $"pi.PurchaseInvoiceKeyCode"
    , "inner"
  )
  .selectExpr(
    "pad.AccountingDocumentKeyCode"
    , "pad.RegionId"
    , "pi.PurchaseInvoiceLineNumber as DocumentLineNumber"
    , "pi.CodingBlockSequentialNumber"
  )

val pad_not_in_pi = pad
  .join(
    pi
    , $"pad.ReferenceKeyCode" === $"pi.PurchaseInvoiceKeyCode"
    , "anti"
  )
  .selectExpr(
    "pad.AccountingDocumentKeyCode"
    , "pad.RegionId"
    , "pad.AccountingDocumentLineNumber as DocumentLineNumber"
    , "0001 as CodingBlockSequentialNumber"
  )

pad_in_pi.union(pad_not_in_pi)
Branches 2 and 3 use the same join expression and could thus reuse the exchange, but the current physical plan doesn't. What could be the reason?
== Physical Plan ==
Union
:- *(3) Project [AccountingDocumentKeyCode#491, RegionId#539, PurchaseInvoiceLineNumber#205 AS DocumentLineNumber#954, CodingBlockSequentialNumber#203]
: +- *(3) SortMergeJoin [ReferenceKeyCode#538], [PurchaseInvoiceKeyCode#235], Inner
: :- Sort [ReferenceKeyCode#538 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ReferenceKeyCode#538, 200), true, [id=#684]
: : +- *(1) Project [AccountingDocumentKeyCode#491, ReferenceKeyCode#538, RegionId#539]
: : +- *(1) Filter ((isnotnull(RegionId#539) AND (RegionId#539 = R)) AND isnotnull(ReferenceKeyCode#538))
: : +- *(1) ColumnarToRow
: : +- FileScan parquet default.purchaseaccountingdocument_delta[AccountingDocumentKeyCode#491,ReferenceKeyCode#538,RegionId#539] Batched: true, DataFilters: [isnotnull(RegionId#539), (RegionId#539 = R), isnotnull(ReferenceKeyCode#538)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionId), EqualTo(RegionId,R), IsNotNull(ReferenceKeyCode)], ReadSchema: struct<AccountingDocumentKeyCode:string,ReferenceKeyCode:string,RegionId:string>
: +- Sort [PurchaseInvoiceKeyCode#235 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(PurchaseInvoiceKeyCode#235, 200), true, [id=#692]
: +- *(2) Project [CodingBlockSequentialNumber#203, PurchaseInvoiceLineNumber#205, PurchaseInvoiceKeyCode#235]
: +- *(2) Filter ((isnotnull(RegionId#207) AND (RegionId#207 = R)) AND isnotnull(PurchaseInvoiceKeyCode#235))
: +- *(2) ColumnarToRow
: +- FileScan parquet default.purchaseinvoice_delta[CodingBlockSequentialNumber#203,PurchaseInvoiceLineNumber#205,RegionID#207,PurchaseInvoiceKeyCode#235] Batched: true, DataFilters: [isnotnull(RegionID#207), (RegionID#207 = R), isnotnull(PurchaseInvoiceKeyCode#235)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionID), EqualTo(RegionID,R), IsNotNull(PurchaseInvoiceKeyCode)], ReadSchema: struct<CodingBlockSequentialNumber:string,PurchaseInvoiceLineNumber:string,RegionID:string,Purcha...
+- *(6) Project [AccountingDocumentKeyCode#491, RegionId#539, AccountingDocumentLineNumber#492 AS DocumentLineNumber#1208, 1 AS CodingBlockSequentialNumber#1210]
+- SortMergeJoin [ReferenceKeyCode#538], [PurchaseInvoiceKeyCode#235], LeftAnti
:- Sort [ReferenceKeyCode#538 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ReferenceKeyCode#538, 200), true, [id=#703]
: +- *(4) Project [AccountingDocumentKeyCode#491, AccountingDocumentLineNumber#492, ReferenceKeyCode#538, RegionId#539]
: +- *(4) Filter (isnotnull(RegionId#539) AND (RegionId#539 = R))
: +- *(4) ColumnarToRow
: +- FileScan parquet default.purchaseaccountingdocument_delta[AccountingDocumentKeyCode#491,AccountingDocumentLineNumber#492,ReferenceKeyCode#538,RegionId#539] Batched: true, DataFilters: [isnotnull(RegionId#539), (RegionId#539 = R)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionId), EqualTo(RegionId,R)], ReadSchema: struct<AccountingDocumentKeyCode:string,AccountingDocumentLineNumber:string,ReferenceKeyCode:stri...
+- Sort [PurchaseInvoiceKeyCode#235 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(PurchaseInvoiceKeyCode#235, 200), true, [id=#710]
+- *(5) Project [PurchaseInvoiceKeyCode#235]
+- *(5) Filter ((isnotnull(RegionId#207) AND (RegionId#207 = R)) AND isnotnull(PurchaseInvoiceKeyCode#235))
+- *(5) ColumnarToRow
+- FileScan parquet default.purchaseinvoice_delta[RegionID#207,PurchaseInvoiceKeyCode#235] Batched: true, DataFilters: [isnotnull(RegionID#207), (RegionID#207 = R), isnotnull(PurchaseInvoiceKeyCode#235)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionID), EqualTo(RegionID,R), IsNotNull(PurchaseInvoiceKeyCode)], ReadSchema: struct<RegionID:string,PurchaseInvoiceKeyCode:string>
This doesn't directly answer the question about exchange reuse, but try a left outer join to get rid of the union:
pad.join(
    pi
    , $"pad.ReferenceKeyCode" === $"pi.PurchaseInvoiceKeyCode"
    , "left_outer"
  )
  .selectExpr(
    "pad.AccountingDocumentKeyCode"
    , "pad.RegionId"
    , "coalesce(pi.PurchaseInvoiceLineNumber, pad.AccountingDocumentLineNumber) as DocumentLineNumber"
    , "coalesce(pi.CodingBlockSequentialNumber, '0001') as CodingBlockSequentialNumber"
  )

Why is this getting converted to a cross join in spark? [duplicate]

I want to join data twice as below:
rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val'])
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
res1 = rdd1.join(rdd2, on=[rdd1['idx'] == rdd2['key1']])
res2 = res1.join(rdd1, on=[res1['key2'] == rdd1['idx']])
res2.show()
Then I get this error:
pyspark.sql.utils.AnalysisException: u'Cartesian joins could be
prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'
But I think this is not a cross join
UPDATE:
res2.explain()
== Physical Plan ==
CartesianProduct
:- *SortMergeJoin [idx#0L, idx#0L], [key1#5L, key2#6L], Inner
: :- *Sort [idx#0L ASC, idx#0L ASC], false, 0
: : +- Exchange hashpartitioning(idx#0L, idx#0L, 200)
: : +- *Filter isnotnull(idx#0L)
: : +- Scan ExistingRDD[idx#0L,val#1]
: +- *Sort [key1#5L ASC, key2#6L ASC], false, 0
: +- Exchange hashpartitioning(key1#5L, key2#6L, 200)
: +- *Filter ((isnotnull(key2#6L) && (key2#6L = key1#5L)) && isnotnull(key1#5L))
: +- Scan ExistingRDD[key1#5L,key2#6L,val#7L]
+- Scan ExistingRDD[idx#40L,val#41]
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition:
res2.explain()
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Join Inner, ((idx#204L = key1#209L) && (key2#210L = idx#204L))
:- Filter isnotnull(idx#204L)
: +- LogicalRDD [idx#204L, val#205]
+- Filter ((isnotnull(key2#210L) && (key2#210L = key1#209L)) && isnotnull(key1#209L))
+- LogicalRDD [key1#209L, key2#210L, val#211L]
and
LogicalRDD [idx#235L, val#236]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
In cases like this you should use aliases:
from pyspark.sql.functions import col
rdd1 = spark.createDataFrame(...).alias('rdd1')
rdd2 = spark.createDataFrame(...).alias('rdd2')
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx')).explain()
== Physical Plan ==
*SortMergeJoin [key2#297L], [idx#360L], Inner
:- *Sort [key2#297L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key2#297L, 200)
: +- *SortMergeJoin [idx#290L], [key1#296L], Inner
: :- *Sort [idx#290L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(idx#290L, 200)
: : +- *Filter isnotnull(idx#290L)
: : +- Scan ExistingRDD[idx#290L,val#291]
: +- *Sort [key1#296L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key1#296L, 200)
: +- *Filter (isnotnull(key2#297L) && isnotnull(key1#296L))
: +- Scan ExistingRDD[key1#296L,key2#297L,val#298L]
+- *Sort [idx#360L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idx#360L, 200)
+- *Filter isnotnull(idx#360L)
+- Scan ExistingRDD[idx#360L,val#361]
For details see SPARK-6459.
I was also successful when I persisted the dataframe before the second join.
Something like:
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).persist()
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))
Persisting did not work for me.
I overcame it with aliases on the DataFrames:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))

Optimizing large join in PySpark

I am currently working on the StackOverflow dataset from Google BigQuery Public datasets.
I want to find the top users by number of questions/answers in a given country and tag (perhaps even with their question favourites, answer score, etc.).
I think the query would be sped up if I first compute a data set with userid, tag, numQuestions, numAnswers, etc.
So I did:
usersQuestionsDf = spark.sql("""
SELECT uid, tag, COUNT(*) AS numQuestions, SUM(favorite_count) AS favs
FROM (
SELECT u.id AS uid, explode(q.tags) AS tag, q.favorite_count
FROM usersFixedCountries u
INNER JOIN questions q
ON q.owner_user_id = u.id
WHERE country IS NOT NULL
)
GROUP BY uid, tag
""")
usersAnswersDf = spark.sql("""
SELECT uid, tag, SUM(score) AS score, COUNT(*) AS numAnswers
FROM (
SELECT u.id AS uid, explode(q.tags) AS tag, a.score
FROM usersFixedCountries u
INNER JOIN answers a
ON a.owner_user_id = u.id
INNER JOIN questions q
ON q.id = a.parent_id
WHERE country is NOT NULL
)
GROUP BY uid, tag
""")
Then I tried to do:
usersAnswersDf.createOrReplaceTempView("usersAnswers")
usersQuestionsDf.createOrReplaceTempView("usersQuestions")
usersTagScoreDf = spark.sql("""
SELECT q.uid, q.tag, numAnswers, score AS answersScore, numQuestions, favs AS questionsFavs, u.display_name, u.up_votes
FROM usersAnswers a
FULL OUTER JOIN usersQuestions q
ON q.uid = a.uid
AND q.tag = a.tag
INNER JOIN usersFixedCountries u
ON u.id = q.uid
WHERE u.country IS NOT NULL
""")
The problem is that the last part throws an error; I think it ran out of memory. So I am wondering how I might optimize this. Maybe my query is just inefficient? (See the configuration sketch after the stack trace below.)
An error occurred while calling o132.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(uid#350, 200)
+- *Project [score#391L, numAnswers#392L, uid#350, tag#355, numQuestions#352L, favs#353L]
+- *BroadcastHashJoin [uid#389, tag#394], [uid#350, tag#355], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, true], input[1, string, true]))
: +- InMemoryTableScan [uid#389, tag#394, score#391L, numAnswers#392L]
: +- InMemoryRelation [uid#389, tag#394, score#391L, numAnswers#392L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[sum(cast(score#33 as bigint)), count(1)], output=[uid#175, tag#180, score#177L, numAnswers#178L])
: +- Exchange hashpartitioning(uid#175, tag#180, 200)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[partial_sum(cast(score#33 as bigint)), partial_count(1)], output=[uid#175, tag#180, sum#189L, count#190L])
: +- *Project [id#82 AS uid#175, tag#180, score#33]
: +- Generate explode(tags#2), true, false, [tag#180]
: +- *Project [id#82, score#33, tags#2]
: +- *SortMergeJoin [parent_id#32], [id#0], Inner
: :- *Sort [parent_id#32 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(parent_id#32, 200)
: : +- *Project [id#82, parent_id#32, score#33]
: : +- *SortMergeJoin [id#82], [owner_user_id#31], Inner
: : :- *Sort [id#82 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#82, 200)
: : : +- *Project [id#82]
: : : +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: : : +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
: : +- *Sort [owner_user_id#31 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(owner_user_id#31, 200)
: : +- *Project [owner_user_id#31, parent_id#32, score#33]
: : +- *Filter (isnotnull(owner_user_id#31) && isnotnull(parent_id#32))
: : +- *FileScan parquet [owner_user_id#31,parent_id#32,score#33,creation_year#36] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/answers.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id), IsNotNull(parent_id)], ReadSchema: struct<owner_user_id:int,parent_id:int,score:int>
: +- *Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200)
: +- *Project [id#0, tags#2]
: +- *Filter isnotnull(id#0)
: +- *FileScan parquet [id#0,tags#2,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,tags:array<string>>
+- *Filter isnotnull(uid#350)
+- InMemoryTableScan [uid#350, tag#355, numQuestions#352L, favs#353L], [isnotnull(uid#350)]
+- InMemoryRelation [uid#350, tag#355, numQuestions#352L, favs#353L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[count(1), sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, numQuestions#114L, favs#115L])
+- Exchange hashpartitioning(uid#112, tag#117, 200)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[partial_count(1), partial_sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, count#126L, sum#127L])
+- *Project [id#82 AS uid#112, tag#117, favorite_count#9]
+- Generate explode(tags#2), true, false, [tag#117]
+- *Project [id#82, tags#2, favorite_count#9]
+- *SortMergeJoin [id#82], [owner_user_id#3], Inner
:- *Sort [id#82 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#82, 200)
: +- *Project [id#82]
: +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
+- *Sort [owner_user_id#3 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(owner_user_id#3, 200)
+- *Project [tags#2, owner_user_id#3, favorite_count#9]
+- *Filter isnotnull(owner_user_id#3)
+- *FileScan parquet [tags#2,owner_user_id#3,favorite_count#9,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id)], ReadSchema: struct<tags:array<string>,owner_user_id:int,favorite_count:int>
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegenExec.scala:244)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.joins.SortMergeJoinExec.inputRDDs(SortMergeJoinExec.scala:377)
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:263)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:128)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:38)
at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:46)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:331)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:372)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 55 more
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 336, in show
print(self._jdf.showString(n, 20))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o132.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(uid#350, 200)
+- *Project [score#391L, numAnswers#392L, uid#350, tag#355, numQuestions#352L, favs#353L]
+- *BroadcastHashJoin [uid#389, tag#394], [uid#350, tag#355], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, true], input[1, string, true]))
: +- InMemoryTableScan [uid#389, tag#394, score#391L, numAnswers#392L]
: +- InMemoryRelation [uid#389, tag#394, score#391L, numAnswers#392L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[sum(cast(score#33 as bigint)), count(1)], output=[uid#175, tag#180, score#177L, numAnswers#178L])
: +- Exchange hashpartitioning(uid#175, tag#180, 200)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[partial_sum(cast(score#33 as bigint)), partial_count(1)], output=[uid#175, tag#180, sum#189L, count#190L])
: +- *Project [id#82 AS uid#175, tag#180, score#33]
: +- Generate explode(tags#2), true, false, [tag#180]
: +- *Project [id#82, score#33, tags#2]
: +- *SortMergeJoin [parent_id#32], [id#0], Inner
: :- *Sort [parent_id#32 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(parent_id#32, 200)
: : +- *Project [id#82, parent_id#32, score#33]
: : +- *SortMergeJoin [id#82], [owner_user_id#31], Inner
: : :- *Sort [id#82 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#82, 200)
: : : +- *Project [id#82]
: : : +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: : : +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
: : +- *Sort [owner_user_id#31 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(owner_user_id#31, 200)
: : +- *Project [owner_user_id#31, parent_id#32, score#33]
: : +- *Filter (isnotnull(owner_user_id#31) && isnotnull(parent_id#32))
: : +- *FileScan parquet [owner_user_id#31,parent_id#32,score#33,creation_year#36] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/answers.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id), IsNotNull(parent_id)], ReadSchema: struct<owner_user_id:int,parent_id:int,score:int>
: +- *Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200)
: +- *Project [id#0, tags#2]
: +- *Filter isnotnull(id#0)
: +- *FileScan parquet [id#0,tags#2,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,tags:array<string>>
+- *Filter isnotnull(uid#350)
+- InMemoryTableScan [uid#350, tag#355, numQuestions#352L, favs#353L], [isnotnull(uid#350)]
+- InMemoryRelation [uid#350, tag#355, numQuestions#352L, favs#353L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[count(1), sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, numQuestions#114L, favs#115L])
+- Exchange hashpartitioning(uid#112, tag#117, 200)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[partial_count(1), partial_sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, count#126L, sum#127L])
+- *Project [id#82 AS uid#112, tag#117, favorite_count#9]
+- Generate explode(tags#2), true, false, [tag#117]
+- *Project [id#82, tags#2, favorite_count#9]
+- *SortMergeJoin [id#82], [owner_user_id#3], Inner
:- *Sort [id#82 ASC NULLS FIRST], false, 0
: +- Exchange
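The root cause in the trace above is java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] inside BroadcastExchangeExec, i.e. a broadcast timeout rather than a plain out-of-memory error. Below is a hedged sketch of two session settings commonly adjusted for this symptom; whether either actually resolves this particular query is an assumption.
# Sketch only: raise the broadcast timeout, or disable automatic broadcasting
# so the planner falls back to shuffle joins.
spark.conf.set("spark.sql.broadcastTimeout", 1200)          # default is 300 seconds
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # disable auto-broadcast joins
usersTagScoreDf.show()  # re-run the failing action after changing the settings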
