Error TreeNodeException: execute, tree in PipelineModel.transform using Pyspark - apache-spark

I am doing one-hot encoding in a pipeline and calling the fit method on it.
My DataFrame has categorical as well as numerical columns, so I index the categorical columns with StringIndexer and then one-hot encode them.
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ['IncomeDetails','B2C','Gender','Occupation','POA_Status']
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
label_stringIdx = StringIndexer(inputCol = 'target', outputCol = 'label')
stages += [label_stringIdx]
#new_col_array.remove("client_id")
numericCols = new_col_array
numericCols.append('age')
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(new_df1)
new_df1 = pipelineModel.transform(new_df1)
selectedCols = ['label', 'features'] + cols
I am getting this error:
Py4JJavaError: An error occurred while calling o2053.fit.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(client_id#*****, 200)
+- *(4) HashAggregate(keys=[client_id#*****], functions=[], output=[client_id#*****])
+- Exchange hashpartitioning(client_id#*****, 200)
+- *(3) HashAggregate(keys=[client_id#*****], functions=[], output=[client_id#*****])
+- *(3) HashAggregate(keys=[client_id#*****, event_name#27993], functions=[], output=[client_id#27980])
+- Exchange hashpartitioning(client_id#*****, event_name#27993, 200)
+- *(2) HashAggregate(keys=[client_id#*****, event_name#27993], functions=[], output=[client_id#*****, event_name#27993])
+- *(2) Project [client_id#*****, event_name#27993]
+- *(2) BroadcastHashJoin [client_id#*****], [Party_Code#*****], LeftSemi, BuildRight, false
:- *(2) Project [client_id#*****, event_name#27993]
: +- *(2) Filter isnotnull(client_id#*****)
: +- *(2) FileScan orc dbo.dp_clickstream_[client_id#*****,event_name#27993,dt#28010] Batched: true, Format: ORC, Location: **PrunedInMemoryFileIndex**[s3n://processed/db-dbo-..., PartitionCount: 6, PartitionFilters: [isnotnull(dt#28010), (cast(dt#28010 as timestamp) >= 1610409600000000), (cast(dt#28010 as timest..., PushedFilters: [IsNotNull(client_id)], ReadSchema: struct<client_id:string,event_name:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
My Spark version is 2.4.3

Related

Spark SQL physical plan doesn't reuse exchange

I'm trying to optimize the physical plan for the following transformation:
1. Read data from 'pad' and 'pi'.
2. Find rows in 'pad' that have a reference in 'pi' and transform some columns.
3. Find rows in 'pad' that don't have a reference in 'pi' and transform some columns.
4. Merge rows from 2 and 3.
val pad_in_pi = pad
  .join(
    pi
    , $"pad.ReferenceKeyCode" === $"pi.PurchaseInvoiceKeyCode"
    , "inner"
  )
  .selectExpr(
    "pad.AccountingDocumentKeyCode"
    , "pad.RegionId"
    , "pi.PurchaseInvoiceLineNumber as DocumentLineNumber"
    , "pi.CodingBlockSequentialNumber"
  )

val pad_not_in_pi = pad
  .join(
    pi
    , $"pad.ReferenceKeyCode" === $"pi.PurchaseInvoiceKeyCode"
    , "anti"
  )
  .selectExpr(
    "pad.AccountingDocumentKeyCode"
    , "pad.RegionId"
    , "pad.AccountingDocumentLineNumber as DocumentLineNumber"
    , "0001 as CodingBlockSequentialNumber"
  )

pad_in_pi.union(pad_not_in_pi)
Branches 2 and 3 use the same join expression and could therefore reuse the exchange, but the current physical plan doesn't. What could be the reason?
== Physical Plan ==
Union
:- *(3) Project [AccountingDocumentKeyCode#491, RegionId#539, PurchaseInvoiceLineNumber#205 AS DocumentLineNumber#954, CodingBlockSequentialNumber#203]
: +- *(3) SortMergeJoin [ReferenceKeyCode#538], [PurchaseInvoiceKeyCode#235], Inner
: :- Sort [ReferenceKeyCode#538 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ReferenceKeyCode#538, 200), true, [id=#684]
: : +- *(1) Project [AccountingDocumentKeyCode#491, ReferenceKeyCode#538, RegionId#539]
: : +- *(1) Filter ((isnotnull(RegionId#539) AND (RegionId#539 = R)) AND isnotnull(ReferenceKeyCode#538))
: : +- *(1) ColumnarToRow
: : +- FileScan parquet default.purchaseaccountingdocument_delta[AccountingDocumentKeyCode#491,ReferenceKeyCode#538,RegionId#539] Batched: true, DataFilters: [isnotnull(RegionId#539), (RegionId#539 = R), isnotnull(ReferenceKeyCode#538)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionId), EqualTo(RegionId,R), IsNotNull(ReferenceKeyCode)], ReadSchema: struct<AccountingDocumentKeyCode:string,ReferenceKeyCode:string,RegionId:string>
: +- Sort [PurchaseInvoiceKeyCode#235 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(PurchaseInvoiceKeyCode#235, 200), true, [id=#692]
: +- *(2) Project [CodingBlockSequentialNumber#203, PurchaseInvoiceLineNumber#205, PurchaseInvoiceKeyCode#235]
: +- *(2) Filter ((isnotnull(RegionId#207) AND (RegionId#207 = R)) AND isnotnull(PurchaseInvoiceKeyCode#235))
: +- *(2) ColumnarToRow
: +- FileScan parquet default.purchaseinvoice_delta[CodingBlockSequentialNumber#203,PurchaseInvoiceLineNumber#205,RegionID#207,PurchaseInvoiceKeyCode#235] Batched: true, DataFilters: [isnotnull(RegionID#207), (RegionID#207 = R), isnotnull(PurchaseInvoiceKeyCode#235)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionID), EqualTo(RegionID,R), IsNotNull(PurchaseInvoiceKeyCode)], ReadSchema: struct<CodingBlockSequentialNumber:string,PurchaseInvoiceLineNumber:string,RegionID:string,Purcha...
+- *(6) Project [AccountingDocumentKeyCode#491, RegionId#539, AccountingDocumentLineNumber#492 AS DocumentLineNumber#1208, 1 AS CodingBlockSequentialNumber#1210]
+- SortMergeJoin [ReferenceKeyCode#538], [PurchaseInvoiceKeyCode#235], LeftAnti
:- Sort [ReferenceKeyCode#538 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ReferenceKeyCode#538, 200), true, [id=#703]
: +- *(4) Project [AccountingDocumentKeyCode#491, AccountingDocumentLineNumber#492, ReferenceKeyCode#538, RegionId#539]
: +- *(4) Filter (isnotnull(RegionId#539) AND (RegionId#539 = R))
: +- *(4) ColumnarToRow
: +- FileScan parquet default.purchaseaccountingdocument_delta[AccountingDocumentKeyCode#491,AccountingDocumentLineNumber#492,ReferenceKeyCode#538,RegionId#539] Batched: true, DataFilters: [isnotnull(RegionId#539), (RegionId#539 = R)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionId), EqualTo(RegionId,R)], ReadSchema: struct<AccountingDocumentKeyCode:string,AccountingDocumentLineNumber:string,ReferenceKeyCode:stri...
+- Sort [PurchaseInvoiceKeyCode#235 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(PurchaseInvoiceKeyCode#235, 200), true, [id=#710]
+- *(5) Project [PurchaseInvoiceKeyCode#235]
+- *(5) Filter ((isnotnull(RegionId#207) AND (RegionId#207 = R)) AND isnotnull(PurchaseInvoiceKeyCode#235))
+- *(5) ColumnarToRow
+- FileScan parquet default.purchaseinvoice_delta[RegionID#207,PurchaseInvoiceKeyCode#235] Batched: true, DataFilters: [isnotnull(RegionID#207), (RegionID#207 = R), isnotnull(PurchaseInvoiceKeyCode#235)], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:..., PartitionFilters: [], PushedFilters: [IsNotNull(RegionID), EqualTo(RegionID,R), IsNotNull(PurchaseInvoiceKeyCode)], ReadSchema: struct<RegionID:string,PurchaseInvoiceKeyCode:string>
This doesn't directly answer the question about exchange reuse, but you can try a left outer join to get rid of the union:
pad.join(
    pi
    , $"pad.ReferenceKeyCode" === $"pi.PurchaseInvoiceKeyCode"
    , "left_outer"
  )
  .selectExpr(
    "pad.AccountingDocumentKeyCode"
    , "pad.RegionId"
    , "coalesce(pi.PurchaseInvoiceLineNumber, pad.AccountingDocumentLineNumber) as DocumentLineNumber"
    , "coalesce(pi.CodingBlockSequentialNumber, '0001') as CodingBlockSequentialNumber"
  )
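To see why a single left outer join can stand in for the inner-plus-anti union, here is a minimal plain-Python model of both query shapes. It is only a sketch: the sample rows, the shortened field names (`doc`, `ref`, `line`, `key`, `seq`) and the helper functions are made up for illustration, and no Spark is involved.

```python
from collections import Counter

# Hypothetical sample rows; field names are shortened stand-ins for the real columns.
pad = [
    {"doc": "D1", "ref": "K1", "line": "10"},
    {"doc": "D2", "ref": "K2", "line": "20"},
    {"doc": "D3", "ref": None, "line": "30"},  # a NULL key never matches
]
pi = [
    {"key": "K1", "line": "1", "seq": "0007"},
    {"key": "K1", "line": "2", "seq": "0008"},
]


def hits(p, pi_rows):
    """pi rows whose key equals p's reference key (SQL-style: NULL matches nothing)."""
    return [i for i in pi_rows if p["ref"] is not None and p["ref"] == i["key"]]


def inner_union_anti(pad_rows, pi_rows):
    """Original shape: inner join branch unioned with the left anti join branch."""
    inner = [(p["doc"], i["line"], i["seq"]) for p in pad_rows for i in hits(p, pi_rows)]
    anti = [(p["doc"], p["line"], "0001") for p in pad_rows if not hits(p, pi_rows)]
    return inner + anti


def left_outer_coalesce(pad_rows, pi_rows):
    """Suggested shape: one left outer join, with coalesce supplying the fallbacks."""
    out = []
    for p in pad_rows:
        # An unmatched pad row is paired with an all-NULL pi row.
        for i in hits(p, pi_rows) or [{"line": None, "seq": None}]:
            line = i["line"] if i["line"] is not None else p["line"]
            seq = i["seq"] if i["seq"] is not None else "0001"
            out.append((p["doc"], line, seq))
    return out


# Compare as multisets, since row order is not guaranteed by either query shape.
same = Counter(inner_union_anti(pad, pi)) == Counter(left_outer_coalesce(pad, pi))
print(same)
```

One caveat this model makes visible: the two shapes only agree when matched 'pi' rows never carry NULL in the coalesced columns. If a matched row has a NULL line number, coalesce falls back to the 'pad' value, whereas the inner join branch would have kept the NULL.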

What is optimal in spark: union then join or join then union?

Consider three DataFrames: df1 and df2, which have the same schema, and df3. All three have one field in common.
Also consider that df1 and df2 have around 42 million records each and df3 has around 100k records.
What is optimal in Spark:
1. Union df1 and df2, then join with df3?
2. Join df1 with df3, join df2 with df3, then union these two DataFrames?
In all honesty, with these volumes it does not really matter.
Looking at the .explain() output for both approaches, there is not much in it.
A broadcast join is evident in both cases. In addition, union does not cause a shuffle, and nothing in your question implies other transformations that would.
That is to say, performance is, or at least should be, about equal. The simulated DataFrames below demonstrate the points discussed; there is not enough difference between the plans to decide otherwise.
Approach 1
import org.apache.spark.sql.functions.{sha1, rand, col}
val randomDF1 = (spark.range(1, 42000000)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1($"random_value"))
  .drop("random_value")
).toDF("id", "hash")
val randomDF2 = (spark.range(1, 42000000)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1($"random_value"))
  .drop("random_value")
).toDF("id", "hash")
val randomDF3 = (spark.range(1, 100000)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1($"random_value"))
  .drop("random_value")
).toDF("id", "hash")
val u = randomDF1.union(randomDF2)
val u2 = u.join(randomDF3, "id").explain()
== Physical Plan ==
*(4) Project [id#25284L, hash#25296, hash#25326]
+- *(4) BroadcastHashJoin [id#25284L], [id#25314L], Inner, BuildRight
:- Union
: :- *(1) Project [id#25284L, sha1(cast(random_value#25286 as binary)) AS hash#25296]
: : +- *(1) Project [id#25284L, cast(rand(10) as string) AS random_value#25286]
: : +- *(1) Range (1, 42000000, step=1, splits=2)
: +- *(2) Project [id#25299L, sha1(cast(random_value#25301 as binary)) AS hash#25311]
: +- *(2) Project [id#25299L, cast(rand(10) as string) AS random_value#25301]
: +- *(2) Range (1, 42000000, step=1, splits=2)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#13264]
+- *(3) Project [id#25314L, sha1(cast(random_value#25316 as binary)) AS hash#25326]
+- *(3) Project [id#25314L, cast(rand(10) as string) AS random_value#25316]
+- *(3) Range (1, 100000, step=1, splits=2)
Approach 2
import org.apache.spark.sql.functions.{sha1, rand, col}
val randomDF1 = (spark.range(1, 42000000)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1($"random_value"))
  .drop("random_value")
).toDF("id", "hash")
val randomDF2 = (spark.range(1, 42000000)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1($"random_value"))
  .drop("random_value")
).toDF("id", "hash")
val randomDF3 = (spark.range(1, 100000)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1($"random_value"))
  .drop("random_value")
).toDF("id", "hash")
val u1 = randomDF1.join(randomDF3, "id")
val u2 = randomDF2.join(randomDF3, "id")
val u3 = u1.union(u2).explain()
== Physical Plan ==
Union
:- *(2) Project [id#25335L, hash#25347, hash#25377]
: +- *(2) BroadcastHashJoin [id#25335L], [id#25365L], Inner, BuildRight
: :- *(2) Project [id#25335L, sha1(cast(random_value#25337 as binary)) AS hash#25347]
: : +- *(2) Project [id#25335L, cast(rand(10) as string) AS random_value#25337]
: : +- *(2) Range (1, 42000000, step=1, splits=2)
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#13409]
: +- *(1) Project [id#25365L, sha1(cast(random_value#25367 as binary)) AS hash#25377]
: +- *(1) Project [id#25365L, cast(rand(10) as string) AS random_value#25367]
: +- *(1) Range (1, 100000, step=1, splits=2)
+- *(4) Project [id#25350L, hash#25362, hash#25377]
+- *(4) BroadcastHashJoin [id#25350L], [id#25365L], Inner, BuildRight
:- *(4) Project [id#25350L, sha1(cast(random_value#25352 as binary)) AS hash#25362]
: +- *(4) Project [id#25350L, cast(rand(10) as string) AS random_value#25352]
: +- *(4) Range (1, 42000000, step=1, splits=2)
+- ReusedExchange [id#25365L, hash#25377], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#13409]
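Both orderings also return the same rows, which is why the choice comes down to the plan alone. A small plain-Python sketch of that equivalence, assuming tiny made-up tables and a hypothetical `broadcast_style_join` helper that hashes the small side much like the build side of a broadcast hash join (none of this is Spark code):

```python
from collections import Counter


def broadcast_style_join(big, small):
    """Inner join on the first tuple field, probing a hash table built from the small side."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    return [(key, value, match) for key, value in big for match in lookup.get(key, [])]


# Made-up stand-ins for df1, df2 (same schema) and the small df3.
df1 = [(1, "a"), (2, "b"), (3, "c")]
df2 = [(2, "x"), (3, "y"), (4, "z")]
df3 = [(2, "s2"), (3, "s3")]

union_then_join = broadcast_style_join(df1 + df2, df3)
join_then_union = broadcast_style_join(df1, df3) + broadcast_style_join(df2, df3)

# Compare as multisets because union keeps duplicates and row order is not significant.
same_rows = Counter(union_then_join) == Counter(join_then_union)
print(same_rows)
```

In the sketch the small side is hashed once for the first ordering and twice for the second; in the Spark plan above, the ReusedExchange node shows the broadcast being shared between the two joins instead.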

Why is this getting converted to a cross join in spark? [duplicate]

I want to join data twice as below:
rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val'])
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
res1 = rdd1.join(rdd2, on=[rdd1['idx'] == rdd2['key1']])
res2 = res1.join(rdd1, on=[res1['key2'] == rdd1['idx']])
res2.show()
Then I get this error:
pyspark.sql.utils.AnalysisException: u'Cartesian joins could be
prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'
But I don't think this is a cross join.
UPDATE:
res2.explain()
== Physical Plan ==
CartesianProduct
:- *SortMergeJoin [idx#0L, idx#0L], [key1#5L, key2#6L], Inner
: :- *Sort [idx#0L ASC, idx#0L ASC], false, 0
: : +- Exchange hashpartitioning(idx#0L, idx#0L, 200)
: : +- *Filter isnotnull(idx#0L)
: : +- Scan ExistingRDD[idx#0L,val#1]
: +- *Sort [key1#5L ASC, key2#6L ASC], false, 0
: +- Exchange hashpartitioning(key1#5L, key2#6L, 200)
: +- *Filter ((isnotnull(key2#6L) && (key2#6L = key1#5L)) && isnotnull(key1#5L))
: +- Scan ExistingRDD[key1#5L,key2#6L,val#7L]
+- Scan ExistingRDD[idx#40L,val#41]
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition:
res2.explain()
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Join Inner, ((idx#204L = key1#209L) && (key2#210L = idx#204L))
:- Filter isnotnull(idx#204L)
: +- LogicalRDD [idx#204L, val#205]
+- Filter ((isnotnull(key2#210L) && (key2#210L = key1#209L)) && isnotnull(key1#209L))
+- LogicalRDD [key1#209L, key2#210L, val#211L]
and
LogicalRDD [idx#235L, val#236]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
In cases like this you should use aliases:
from pyspark.sql.functions import col
rdd1 = spark.createDataFrame(...).alias('rdd1')
rdd2 = spark.createDataFrame(...).alias('rdd2')
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx')).explain()
== Physical Plan ==
*SortMergeJoin [key2#297L], [idx#360L], Inner
:- *Sort [key2#297L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key2#297L, 200)
: +- *SortMergeJoin [idx#290L], [key1#296L], Inner
: :- *Sort [idx#290L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(idx#290L, 200)
: : +- *Filter isnotnull(idx#290L)
: : +- Scan ExistingRDD[idx#290L,val#291]
: +- *Sort [key1#296L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key1#296L, 200)
: +- *Filter (isnotnull(key2#297L) && isnotnull(key1#296L))
: +- Scan ExistingRDD[key1#296L,key2#297L,val#298L]
+- *Sort [idx#360L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idx#360L, 200)
+- *Filter isnotnull(idx#360L)
+- Scan ExistingRDD[idx#360L,val#361]
For details see SPARK-6459.
I was also successful when I persisted the DataFrame before the second join.
Something like:
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).persist()
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))
Persisting did not work for me.
I overcame it with aliases on the DataFrames:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
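One way to picture why the reference-based condition collapses is to track which side of the second join each attribute in the condition belongs to. The following is a simplified mental model in plain Python, not Spark's actual analyzer: the `sides_referenced` helper is hypothetical, and the attribute IDs are shortened forms of those in the first plan above (`idx#0L` and so on).

```python
def sides_referenced(condition, left_attrs, right_attrs):
    """Report which join inputs an equality condition actually touches."""
    refs = set(condition)
    touches_left = bool(refs & left_attrs)
    touches_right = bool(refs & right_attrs)
    if touches_left and touches_right:
        return "both"
    if touches_left:
        return "left only"
    if touches_right:
        return "right only"
    return "neither"


# res1 already carries rdd1's original attributes plus rdd2's.
res1_attrs = {"idx#0", "val#1", "key1#5", "key2#6", "val#7"}
# When rdd1 is joined a second time, its attributes show up with fresh IDs.
rdd1_again_attrs = {"idx#40", "val#41"}

# res1['key2'] == rdd1['idx']: the column object still points at idx#0,
# which lives inside res1, so the condition never reaches the right side.
by_reference = ("key2#6", "idx#0")

# With aliases the name is looked up at join time, modelled here as idx#40.
by_alias = ("key2#6", "idx#40")

print(sides_referenced(by_reference, res1_attrs, rdd1_again_attrs))
print(sides_referenced(by_alias, res1_attrs, rdd1_again_attrs))
```

Under this model, a condition that touches only the left input behaves like a filter on that input, leaving the join itself without a condition. That matches the `key2#6L = key1#5L` filter and the CartesianProduct node in the first plan.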

Does spark optimize identical but independent DAGs in pyspark?

Consider the following pyspark code
def transformed_data(spark):
    df = spark.read.json('data.json')
    df = expensive_transformation(df) # (A)
    return df
df1 = transformed_data(spark)
df = transformed_data(spark)
df1 = foo_transform(df1)
df = bar_transform(df)
return df.join(df1)
My question is: are the operations marked (A) in transformed_data optimized in the final view, so that they are only performed once?
Note that this code is not equivalent to
df1 = transformed_data(spark)
df = df1
df1 = foo_transform(df1)
df = bar_transform(df)
df.join(df1)
(at least from Python's point of view, since id(df1) == id(df) in this case).
The broader question is: what does spark consider when optimizing two equal DAGs: whether the DAGs (as defined by their edges and nodes) are equal, or whether their object ids (df = df1) are equal?
Kinda. It relies on Spark having enough information to infer a dependence.
For instance, I replicated your example as described:
from pyspark.sql.functions import hash
def f(spark, filename):
    df=spark.read.csv(filename)
    df2=df.select(hash('_c1').alias('hashc2'))
    df3=df2.select(hash('hashc2').alias('hashc3'))
    df4=df3.select(hash('hashc3').alias('hashc4'))
    return df4
filename = 'some-valid-file.csv'
df_a = f(spark, filename)
df_b = f(spark, filename)
assert df_a != df_b
df_joined = df_a.join(df_b, df_a.hashc4==df_b.hashc4, how='left')
If I explain this resulting dataframe using df_joined.explain(extended=True), I see the following four plans:
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hash(hash(_c1#33, 42), 42), 42) AS hashc4#42]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Physical Plan ==
SortMergeJoin [hashc4#20], [hashc4#42], LeftOuter
:- *(2) Sort [hashc4#20 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#20, 200)
: +- *(1) Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- *(1) FileScan csv [_c1#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file: some-valid-file.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c1:string>
+- *(4) Sort [hashc4#42 ASC NULLS FIRST], false, 0
+- ReusedExchange [hashc4#42], Exchange hashpartitioning(hashc4#20, 200)
The physical plan above only reads the CSV once and re-uses all the computation, since Spark detects that the two FileScans are identical (i.e. Spark knows that they are not independent).
Now consider if I replace the read.csv with hand-crafted independent, yet identical RDDs.
from pyspark.sql.functions import hash
def g(spark):
    df=spark.createDataFrame([('a', 'a'), ('b', 'b'), ('c', 'c')], ["_c1", "_c2"])
    df2=df.select(hash('_c1').alias('hashc2'))
    df3=df2.select(hash('hashc2').alias('hashc3'))
    df4=df3.select(hash('hashc3').alias('hashc4'))
    return df4
df_c = g(spark)
df_d = g(spark)
df_joined = df_c.join(df_d, df_c.hashc4==df_d.hashc4, how='left')
In this case, Spark's physical plan scans two different RDDs. Here's the output of running df_joined.explain(extended=True) to confirm.
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- LogicalRDD [_c1#10, _c2#11], false
== Physical Plan ==
SortMergeJoin [hashc4#8], [hashc4#18], LeftOuter
:- *(2) Sort [hashc4#8 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#8, 200)
: +- *(1) Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- Scan ExistingRDD[_c1#0,_c2#1]
+- *(4) Sort [hashc4#18 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(hashc4#18, 200)
+- *(3) Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- Scan ExistingRDD[_c1#10,_c2#11]
This isn't really PySpark-specific behaviour.
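When comparing plans like the two above, it can help to count the node types instead of reading the tree by eye. The sketch below is a plain-Python text helper, not a Spark API: `node_counts` is a hypothetical name, and the two plan strings are abridged copies of the physical plans shown above.

```python
import re
from collections import Counter

# Skip the tree-drawing prefix (":", "+-", "*(2)" and so on), then capture the node name.
_NODE = re.compile(r"^[\s:+\-*()\d]*([A-Za-z]\w*)")


def node_counts(plan_text):
    """Count physical plan nodes by name in explain()-style text."""
    counts = Counter()
    for line in plan_text.splitlines():
        match = _NODE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Abridged from the CSV-based plan above.
csv_plan = """SortMergeJoin [hashc4#20], [hashc4#42], LeftOuter
:- *(2) Sort [hashc4#20 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(hashc4#20, 200)
+- *(4) Sort [hashc4#42 ASC NULLS FIRST], false, 0
   +- ReusedExchange [hashc4#42], Exchange hashpartitioning(hashc4#20, 200)"""

# Abridged from the createDataFrame-based plan above.
rdd_plan = """SortMergeJoin [hashc4#8], [hashc4#18], LeftOuter
:- *(2) Sort [hashc4#8 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(hashc4#8, 200)
+- *(4) Sort [hashc4#18 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(hashc4#18, 200)"""

csv_counts = node_counts(csv_plan)
rdd_counts = node_counts(rdd_plan)
print(csv_counts["Exchange"], csv_counts["ReusedExchange"])
print(rdd_counts["Exchange"], rdd_counts["ReusedExchange"])
```

Only the leading node name on each line is counted, so the `Exchange hashpartitioning(...)` text that trails a ReusedExchange line is not mistaken for a second shuffle.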
