Since upgrading our job to pyspark 3.3.0, we have had issues with cached ps.DataFrame objects that are then concatenated using pyspark pandas: ps.concat([df1, df2]).
The issue is that the concatenated DataFrame does not use the cached data but re-reads the source data, which in our case causes an authentication issue at the source.
This was not the behavior we had with pyspark 3.2.3.
The following minimal code reproduces the issue.
import pyspark.pandas as ps
import pyspark
from pyspark.sql import SparkSession
import sys
import os
os.environ["PYSPARK_PYTHON"] = sys.executable
spark = SparkSession.builder.appName('bug-pyspark3.3').getOrCreate()
df1 = ps.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, columns=['col1', 'col2'])
df2 = ps.DataFrame(data={'col3': [5, 6]}, columns=['col3'])
cached_df1 = df1.spark.cache()
cached_df2 = df2.spark.cache()
cached_df1.count()
cached_df2.count()
merged_df = ps.concat([cached_df1,cached_df2], ignore_index=True)
merged_df.head()
merged_df.spark.explain()
Output of explain() on pyspark 3.2.3:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [(cast(_we0#1300 as bigint) - 1) AS __index_level_0__#1298L, col1#1291L, col2#1292L, col3#1293L]
+- Window [row_number() windowspecdefinition(_w0#1299L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _we0#1300], [_w0#1299L ASC NULLS FIRST]
+- Sort [_w0#1299L ASC NULLS FIRST], false, 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=356]
+- Project [col1#1291L, col2#1292L, col3#1293L, monotonically_increasing_id() AS _w0#1299L]
+- Union
:- Project [col1#941L AS col1#1291L, col2#942L AS col2#1292L, null AS col3#1293L]
: +- InMemoryTableScan [col1#941L, col2#942L]
: +- InMemoryRelation [__index_level_0__#940L, col1#941L, col2#942L, __natural_order__#946L], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(1) Project [__index_level_0__#940L, col1#941L, col2#942L, monotonically_increasing_id() AS __natural_order__#946L]
: +- *(1) Scan ExistingRDD[__index_level_0__#940L,col1#941L,col2#942L]
+- Project [null AS col1#1403L, null AS col2#1404L, col3#952L]
+- InMemoryTableScan [col3#952L]
+- InMemoryRelation [__index_level_0__#951L, col3#952L, __natural_order__#955L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [__index_level_0__#951L, col3#952L, monotonically_increasing_id() AS __natural_order__#955L]
+- *(1) Scan ExistingRDD[__index_level_0__#951L,col3#952L]
We can see that the cache is used in the planned execution (InMemoryTableScan).
Output of explain() on pyspark 3.3.0:
== Physical Plan ==
AttachDistributedSequence[__index_level_0__#771L, col1#762L, col2#763L, col3#764L] Index: __index_level_0__#771L
+- Union
:- *(1) Project [col1#412L AS col1#762L, col2#413L AS col2#763L, null AS col3#764L]
: +- *(1) Scan ExistingRDD[__index_level_0__#411L,col1#412L,col2#413L]
+- *(2) Project [null AS col1#804L, null AS col2#805L, col3#423L]
+- *(2) Scan ExistingRDD[__index_level_0__#422L,col3#423L]
We can see that on this version of pyspark the Union is performed by scanning the data (Scan ExistingRDD) instead of performing an InMemoryTableScan.
Is this difference expected? Is there any way to force the concat to use the cached DataFrames?
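The comparison between the two plans can also be checked mechanically. The sketch below is only an illustration: `plan_reads_from_cache` and `CACHE_MARKERS` are hypothetical names, not part of PySpark, and the plan strings are short excerpts copied from the two outputs above. It simply looks for the cache operators in the plan text.

```python
# Hypothetical helper (not part of PySpark): inspects the text printed by explain().
CACHE_MARKERS = ("InMemoryTableScan", "InMemoryRelation")


def plan_reads_from_cache(plan_text: str) -> bool:
    """Return True if the physical plan text mentions a cache operator."""
    return any(marker in plan_text for marker in CACHE_MARKERS)


# Excerpts copied from the 3.2.3 and 3.3.0 plans shown in the question.
plan_3_2_3 = """\
+- Union
:- Project [col1#941L AS col1#1291L, col2#942L AS col2#1292L, null AS col3#1293L]
: +- InMemoryTableScan [col1#941L, col2#942L]
"""
plan_3_3_0 = """\
+- Union
:- *(1) Project [col1#412L AS col1#762L, col2#413L AS col2#763L, null AS col3#764L]
: +- *(1) Scan ExistingRDD[__index_level_0__#411L,col1#412L,col2#413L]
"""

print(plan_reads_from_cache(plan_3_2_3))  # True
print(plan_reads_from_cache(plan_3_3_0))  # False
```

In a live session the plan text would have to be captured first, for example by redirecting stdout around merged_df.spark.explain(); that wiring is an assumption and is left out here.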
I cannot explain the difference in the planned execution output between pyspark 3.2.3 and 3.3.0, but I believe that despite this difference the cache is being used. I ran some benchmarks with and without caching using an example very similar to yours, and the average time for a merge operation to be performed is shorter when we cache the DataFrames.
import time

import numpy as np
import pyspark.pandas as ps

# Assumes an active SparkSession named `spark`, as in the question.

def test_merge_without_cache(n=5, size=10**5):
    np.random.seed(44)
    total_run_times = []
    for i in range(n):
        data = np.random.rand(size, 2)
        data2 = np.random.rand(size, 2)
        df1 = ps.DataFrame(data, columns=['col1', 'col2'])
        df2 = ps.DataFrame(data2, columns=['col3', 'col4'])
        # Only the ps.concat call itself is timed.
        start_time = time.time()
        merged_df = ps.concat([df1, df2], ignore_index=True)
        run_time = time.time() - start_time
        total_run_times.append(run_time)
        spark.catalog.clearCache()
    return total_run_times

def test_merge_with_cache(n=5, size=10**5):
    np.random.seed(44)
    total_run_times = []
    for i in range(n):
        data = np.random.rand(size, 2)
        data2 = np.random.rand(size, 2)
        df1 = ps.DataFrame(data, columns=['col1', 'col2'])
        df2 = ps.DataFrame(data2, columns=['col3', 'col4'])
        cached_df1 = df1.spark.cache()
        cached_df2 = df2.spark.cache()
        # Only the ps.concat call itself is timed.
        start_time = time.time()
        merged_df = ps.concat([cached_df1, cached_df2], ignore_index=True)
        run_time = time.time() - start_time
        total_run_times.append(run_time)
        spark.catalog.clearCache()
    return total_run_times
Here are the printouts from when I ran these two test functions:
total_run_times_without_cache = test_merge_without_cache(n=50, size=10**6)
np.mean(total_run_times_without_cache)
0.12456250190734863
total_run_times_with_cache = test_merge_with_cache(n=50, size=10**6)
np.mean(total_run_times_with_cache)
0.07876112937927246
This isn't the largest difference in speed so it's possible this is just noise and the cache is, in fact, not being used (but I did run this benchmark several times and the merge operation with cache was consistently faster). Someone with a better understanding of pyspark might be able to better explain what you're observing, but hopefully this answer helps a bit.
Here is a plot of the execution time between merge with and without cache:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(y=total_run_times_without_cache, name='without cache'))
fig.add_trace(go.Scatter(y=total_run_times_with_cache, name='with cache'))
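One caveat, stated as an assumption rather than something verified here: pandas-on-Spark builds its plans lazily, so a timer wrapped around ps.concat alone may mostly measure plan construction rather than execution. A minimal sketch of a harness that times a callable expected to force an action is shown below. `time_runs` and `fake_action` are hypothetical names; `fake_action` is a stand-in workload so the sketch runs without a Spark session.

```python
import time
from statistics import mean


def time_runs(action, n=5):
    """Call `action` n times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        action()  # should force execution, not just build a plan
        durations.append(time.perf_counter() - start)
    return durations


def fake_action():
    # Stand-in workload so the sketch runs without Spark.
    return sum(i * i for i in range(10_000))


durations = time_runs(fake_action, n=3)
print(len(durations), mean(durations) >= 0)
```

With Spark available, the callable passed to `time_runs` could build the concatenated DataFrame and then call something that materializes it, such as to_pandas() on the result, so that the cached and uncached runs are compared on actual execution.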
I am doing one-hot encoding in a pipeline and calling the fit method on it.
I have a DataFrame with both categorical and numerical columns, so I one-hot encode the categorical columns using string indexers.
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ['IncomeDetails','B2C','Gender','Occupation','POA_Status']
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
label_stringIdx = StringIndexer(inputCol = 'target', outputCol = 'label')
stages += [label_stringIdx]
#new_col_array.remove("client_id")
numericCols = new_col_array
numericCols.append('age')
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(new_df1)
new_df1 = pipelineModel.transform(new_df1)
selectedCols = ['label', 'features'] + cols
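For readers unfamiliar with what each StringIndexer/encoder pair in the loop above computes, here is a plain-Python sketch of the idea. It assumes the default settings (labels indexed by descending frequency, last category dropped from the one-hot vector). `string_index`, `one_hot`, and the sample `genders` list are hypothetical and only for illustration; Spark's encoder produces sparse vectors rather than lists.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here, which is a choice made for this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Return a dense 0/1 list; with drop_last the final label is all zeros."""
    size = num_labels - 1 if drop_last else num_labels
    return [1.0 if i == index else 0.0 for i in range(size)]


genders = ["F", "M", "F", "F", "M", "U"]  # made-up sample values
mapping = string_index(genders)
encoded = {label: one_hot(i, len(mapping)) for label, i in mapping.items()}
print(mapping)  # {'F': 0, 'M': 1, 'U': 2}
print(encoded)  # {'F': [1.0, 0.0], 'M': [0.0, 1.0], 'U': [0.0, 0.0]}
```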
I am getting this error:
Py4JJavaError: An error occurred while calling o2053.fit.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(client_id#*****, 200)
+- *(4) HashAggregate(keys=[client_id#*****], functions=[], output=[client_id#*****])
+- Exchange hashpartitioning(client_id#*****, 200)
+- *(3) HashAggregate(keys=[client_id#*****], functions=[], output=[client_id#*****])
+- *(3) HashAggregate(keys=[client_id#*****, event_name#27993], functions=[], output=[client_id#27980])
+- Exchange hashpartitioning(client_id#*****, event_name#27993, 200)
+- *(2) HashAggregate(keys=[client_id#*****, event_name#27993], functions=[], output=[client_id#*****, event_name#27993])
+- *(2) Project [client_id#*****, event_name#27993]
+- *(2) BroadcastHashJoin [client_id#*****], [Party_Code#*****], LeftSemi, BuildRight, false
:- *(2) Project [client_id#*****, event_name#27993]
: +- *(2) Filter isnotnull(client_id#*****)
: +- *(2) FileScan orc dbo.dp_clickstream_[client_id#*****,event_name#27993,dt#28010] Batched: true, Format: ORC, Location: **PrunedInMemoryFileIndex**[s3n://processed/db-dbo-..., PartitionCount: 6, PartitionFilters: [isnotnull(dt#28010), (cast(dt#28010 as timestamp) >= 1610409600000000), (cast(dt#28010 as timest..., PushedFilters: [IsNotNull(client_id)], ReadSchema: struct<client_id:string,event_name:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
My Spark version is 2.4.3
Suppose we have three DataFrames: df1 and df2, which have the same schema, and df3. All three have one field in common.
Also consider that df1 and df2 have around 42 million records each and df3 has around 100k records.
Which is optimal in Spark:
Union df1 and df2, then join with df3?
Join df1 with df3, join df2 with df3, then union these two dataframes?
In all honesty, with these volumes it does not really matter.
Looking at the .explain() output for both approaches, there is not much in it.
A broadcast join is evident in both cases. In addition, union does not cause a shuffle, at least as far as your question implies, i.e. there are no other transformations that might cause one.
That is to say, performance is / should be equal. See below for a demonstration of the points discussed, using simulated DataFrames. Mathematically there is not much in it to decide otherwise.
Approach 1
import org.apache.spark.sql.functions.{sha1, rand, col}
val randomDF1 = (spark.range(1, 42000000)
.withColumn("random_value", rand(seed=10).cast("string"))
.withColumn("hash", sha1($"random_value"))
.drop("random_value")
).toDF("id", "hash")
val randomDF2 = (spark.range(1, 42000000)
.withColumn("random_value", rand(seed=10).cast("string"))
.withColumn("hash", sha1($"random_value"))
.drop("random_value")
).toDF("id", "hash")
val randomDF3 = (spark.range(1, 100000)
.withColumn("random_value", rand(seed=10).cast("string"))
.withColumn("hash", sha1($"random_value"))
.drop("random_value")
).toDF("id", "hash")
val u = randomDF1.union(randomDF2)
val u2 = u.join(randomDF3, "id").explain()
== Physical Plan ==
*(4) Project [id#25284L, hash#25296, hash#25326]
+- *(4) BroadcastHashJoin [id#25284L], [id#25314L], Inner, BuildRight
:- Union
: :- *(1) Project [id#25284L, sha1(cast(random_value#25286 as binary)) AS hash#25296]
: : +- *(1) Project [id#25284L, cast(rand(10) as string) AS random_value#25286]
: : +- *(1) Range (1, 42000000, step=1, splits=2)
: +- *(2) Project [id#25299L, sha1(cast(random_value#25301 as binary)) AS hash#25311]
: +- *(2) Project [id#25299L, cast(rand(10) as string) AS random_value#25301]
: +- *(2) Range (1, 42000000, step=1, splits=2)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#13264]
+- *(3) Project [id#25314L, sha1(cast(random_value#25316 as binary)) AS hash#25326]
+- *(3) Project [id#25314L, cast(rand(10) as string) AS random_value#25316]
+- *(3) Range (1, 100000, step=1, splits=2)
Approach 2
import org.apache.spark.sql.functions.{sha1, rand, col}
val randomDF1 = (spark.range(1, 42000000)
.withColumn("random_value", rand(seed=10).cast("string"))
.withColumn("hash", sha1($"random_value"))
.drop("random_value")
).toDF("id", "hash")
val randomDF2 = (spark.range(1, 42000000)
.withColumn("random_value", rand(seed=10).cast("string"))
.withColumn("hash", sha1($"random_value"))
.drop("random_value")
).toDF("id", "hash")
val randomDF3 = (spark.range(1, 100000)
.withColumn("random_value", rand(seed=10).cast("string"))
.withColumn("hash", sha1($"random_value"))
.drop("random_value")
).toDF("id", "hash")
val u1 = randomDF1.join(randomDF3, "id")
val u2 = randomDF2.join(randomDF3, "id")
val u3 = u1.union(u2).explain()
== Physical Plan ==
Union
:- *(2) Project [id#25335L, hash#25347, hash#25377]
: +- *(2) BroadcastHashJoin [id#25335L], [id#25365L], Inner, BuildRight
: :- *(2) Project [id#25335L, sha1(cast(random_value#25337 as binary)) AS hash#25347]
: : +- *(2) Project [id#25335L, cast(rand(10) as string) AS random_value#25337]
: : +- *(2) Range (1, 42000000, step=1, splits=2)
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#13409]
: +- *(1) Project [id#25365L, sha1(cast(random_value#25367 as binary)) AS hash#25377]
: +- *(1) Project [id#25365L, cast(rand(10) as string) AS random_value#25367]
: +- *(1) Range (1, 100000, step=1, splits=2)
+- *(4) Project [id#25350L, hash#25362, hash#25377]
+- *(4) BroadcastHashJoin [id#25350L], [id#25365L], Inner, BuildRight
:- *(4) Project [id#25350L, sha1(cast(random_value#25352 as binary)) AS hash#25362]
: +- *(4) Project [id#25350L, cast(rand(10) as string) AS random_value#25352]
: +- *(4) Range (1, 42000000, step=1, splits=2)
+- ReusedExchange [id#25365L, hash#25377], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#13409]
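As a sanity check on why the two approaches are interchangeable in terms of results: an inner join distributes over a union that keeps duplicates, so both orderings produce the same rows. The sketch below shows this with plain Python lists. `inner_join` and the tiny sample data are hypothetical stand-ins, not Spark code, and say nothing about performance.

```python
from collections import Counter


def inner_join(left, right):
    """Inner join two lists of (id, value) pairs on id, keeping duplicates."""
    by_id = {}
    for key, value in right:
        by_id.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in by_id.get(key, [])]


# Made-up miniature versions of df1, df2 (same schema) and the small df3.
df1 = [(1, "a"), (2, "b"), (3, "c")]
df2 = [(2, "x"), (3, "y"), (4, "z")]
df3 = [(2, "small"), (3, "small")]

union_then_join = inner_join(df1 + df2, df3)                   # Approach 1
join_then_union = inner_join(df1, df3) + inner_join(df2, df3)  # Approach 2

# Compare as multisets, since row order is not part of the result.
same = Counter(union_then_join) == Counter(join_then_union)
print(same)  # True
```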
Consider the following pyspark code
def transformed_data(spark):
    df = spark.read.json('data.json')
    df = expensive_transformation(df) # (A)
    return df

def final_view(spark):
    df1 = transformed_data(spark)
    df = transformed_data(spark)
    df1 = foo_transform(df1)
    df = bar_transform(df)
    return df.join(df1)
My question is: are the operations marked (A) in transformed_data optimized in final_view, so that they are only performed once?
Note that this code is not equivalent to
df1 = transformed_data(spark)
df = df1
df1 = foo_transform(df1)
df = bar_transform(df)
df.join(df1)
(at least from Python's point of view, in which id(df1) == id(df) in this case).
The broader question is: what does spark consider when optimizing two equal DAGs: whether the DAGs (as defined by their edges and nodes) are equal, or whether their object ids (df = df1) are equal?
Kinda. It relies on Spark having enough information to infer a dependence.
For instance, I replicated your example as described:
from pyspark.sql.functions import hash
def f(spark, filename):
    df=spark.read.csv(filename)
    df2=df.select(hash('_c1').alias('hashc2'))
    df3=df2.select(hash('hashc2').alias('hashc3'))
    df4=df3.select(hash('hashc3').alias('hashc4'))
    return df4
filename = 'some-valid-file.csv'
df_a = f(spark, filename)
df_b = f(spark, filename)
assert df_a != df_b
df_joined = df_a.join(df_b, df_a.hashc4==df_b.hashc4, how='left')
If I explain this resulting dataframe using df_joined.explain(extended=True), I see the following four plans:
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hash(hash(_c1#33, 42), 42), 42) AS hashc4#42]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Physical Plan ==
SortMergeJoin [hashc4#20], [hashc4#42], LeftOuter
:- *(2) Sort [hashc4#20 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#20, 200)
: +- *(1) Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- *(1) FileScan csv [_c1#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file: some-valid-file.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c1:string>
+- *(4) Sort [hashc4#42 ASC NULLS FIRST], false, 0
+- ReusedExchange [hashc4#42], Exchange hashpartitioning(hashc4#20, 200)
The physical plan above only reads the CSV once and re-uses all the computation, since Spark detects that the two FileScans are identical (i.e. Spark knows that they are not independent).
Now consider if I replace the read.csv with hand-crafted independent, yet identical RDDs.
from pyspark.sql.functions import hash
def g(spark):
    df=spark.createDataFrame([('a', 'a'), ('b', 'b'), ('c', 'c')], ["_c1", "_c2"])
    df2=df.select(hash('_c1').alias('hashc2'))
    df3=df2.select(hash('hashc2').alias('hashc3'))
    df4=df3.select(hash('hashc3').alias('hashc4'))
    return df4
df_c = g(spark)
df_d = g(spark)
df_joined = df_c.join(df_d, df_c.hashc4==df_d.hashc4, how='left')
In this case, Spark's physical plan scans two different RDDs. Here's the output of running df_joined.explain(extended=True) to confirm.
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- LogicalRDD [_c1#10, _c2#11], false
== Physical Plan ==
SortMergeJoin [hashc4#8], [hashc4#18], LeftOuter
:- *(2) Sort [hashc4#8 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#8, 200)
: +- *(1) Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- Scan ExistingRDD[_c1#0,_c2#1]
+- *(4) Sort [hashc4#18 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(hashc4#18, 200)
+- *(3) Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- Scan ExistingRDD[_c1#10,_c2#11]
This isn't really PySpark-specific behaviour.
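To make the distinction in the question concrete (object identity versus structural equality), here is a toy model in plain Python. It is not Spark's implementation: `FileScan`, `RddScan`, `Project`, and `build` are hypothetical classes invented for this sketch. The assumption it encodes is the one suggested by the two plans above: file-based sources can be recognized as the same by their description, while independently created RDD sources are treated as distinct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileScan:
    """Source identified by its path: two scans of the same file compare equal."""
    path: str


@dataclass(frozen=True, eq=False)
class RddScan:
    """Source compared by object identity: every instance is distinct."""
    rows: tuple


@dataclass(frozen=True)
class Project:
    """A projection over a child plan, compared structurally."""
    expr: str
    child: object


def build(source):
    # Same chain of expressions every time; only the source differs.
    return Project("hash(hash(hash(_c1)))", source)


file_a = build(FileScan("some-valid-file.csv"))
file_b = build(FileScan("some-valid-file.csv"))
rows = (("a", "a"), ("b", "b"), ("c", "c"))
rdd_a = build(RddScan(rows))
rdd_b = build(RddScan(rows))

print(file_a is file_b, file_a == file_b)  # False True
print(rdd_a is rdd_b, rdd_a == rdd_b)      # False False
```

In this model the Python object ids never match, yet the file-based plans still compare equal, which mirrors the ReusedExchange seen in the first physical plan and its absence in the second.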
Environment:
OS: Windows 7
Spark: version 2.1.0
Scala: 2.11.8
Java: 1.8
Spark Shell REPL
scala> val info = Seq((10, "A"), (100, "B")).toDF("id", "type")
info: org.apache.spark.sql.DataFrame = [id: int, type: string]
scala> val statC = Seq((1)).toDF("sid").withColumn("stype", lit("A"))
statC: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> val statD = Seq((2)).toDF("sid").withColumn("stype", lit("B"))
statD: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> info.join(statC.union(statD), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10| A| 1| A|
+---+----+---+-----+
scala> info.join(statD.union(statC), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
+---+----+---+-----+
statC and statD generate the column stype with withColumn. The REPL output above shows that statC.union(statD) and statD.union(statC) make the join result differ.
Here are the physical plans of the two joins:
scala> info.join(statC.union(statD), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0)], [cast(sid#420 as double)], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter ((isnotnull(_2#340) && ((A <=> _2#340) || (B <=> _2#340))) && (_2#340 = A))
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double)))
+- Union
:- LocalTableScan [sid#420, stype#423]
+- LocalTableScan [sid#430, stype#433]
scala> info.join(statD.union(statC), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0)], [cast(sid#430 as double)], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter ((isnotnull(_2#340) && ((B <=> _2#340) || (A <=> _2#340))) && (_2#340 = B))
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double)))
+- Union
:- LocalTableScan [sid#430, stype#433]
+- LocalTableScan [sid#420, stype#423]
The explain output shows that the union order of statC and statD changes the Filter condition under the BroadcastHashJoin:
With statC.union(statD), the filter condition is:
Filter ((isnotnull(_2#340) && ((A <=> _2#340) || (B <=> _2#340))) && (_2#340 = A))
With statD.union(statC), the filter condition is:
Filter ((isnotnull(_2#340) && ((B <=> _2#340) || (A <=> _2#340))) && (_2#340 = B))
But when the two unioned DataFrames are generated without withColumn, the union order has no effect on the join result.
scala> val info = Seq((10, "A"), (100, "B")).toDF("id", "type")
info: org.apache.spark.sql.DataFrame = [id: int, type: string]
scala> val statA = Seq((1, "A")).toDF("sid", "stype")
statA: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> val statB = Seq((2, "B")).toDF("sid", "stype")
statB: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> info.join(statA.union(statB), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10| A| 1| A|
+---+----+---+-----+
scala> info.join(statB.union(statA), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10| A| 1| A|
+---+----+---+-----+
scala> info.join(statA.union(statB), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0), type#343], [cast(sid#352 as double), stype#353], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter isnotnull(_2#340)
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double), input[1, string, true]))
+- Union
:- *Project [_1#349 AS sid#352, _2#350 AS stype#353]
: +- *Filter isnotnull(_2#350)
: +- LocalTableScan [_1#349, _2#350]
+- *Project [_1#359 AS sid#362, _2#360 AS stype#363]
+- *Filter isnotnull(_2#360)
+- LocalTableScan [_1#359, _2#360]
scala> info.join(statB.union(statA), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0), type#343], [cast(sid#362 as double), stype#363], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter isnotnull(_2#340)
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double), input[1, string, true]))
+- Union
:- *Project [_1#359 AS sid#362, _2#360 AS stype#363]
: +- *Filter isnotnull(_2#360)
: +- LocalTableScan [_1#359, _2#360]
+- *Project [_1#349 AS sid#352, _2#350 AS stype#353]
+- *Filter isnotnull(_2#350)
+- LocalTableScan [_1#349, _2#350]
The explain output shows that for statA/statB, both id/sid and type/stype are included in the BroadcastHashJoin keys, but for statC/statD, only id and sid are included.
Why does the join have different semantics when the union order is changed on DataFrames that are generated with withColumn?
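For reference, the result that ordinary join semantics would give for this data can be worked out in plain Python. This sketch does not explain the Spark 2.1.0 behaviour; it only pins down what both union orders should produce. `join_info` and the list names are hypothetical, and the rows are copied from the REPL session above.

```python
info = [(10, "A"), (100, "B")]
stat_c = [(1, "A")]
stat_d = [(2, "B")]


def join_info(stats):
    """Inner join of info with stats on id / 10 == sid and type == stype."""
    return [
        (id_, type_, sid, stype)
        for id_, type_ in info
        for sid, stype in stats
        if id_ / 10 == sid and type_ == stype
    ]


c_then_d = join_info(stat_c + stat_d)  # statC.union(statD)
d_then_c = join_info(stat_d + stat_c)  # statD.union(statC)

print(c_then_d)  # [(10, 'A', 1, 'A')]
print(d_then_c)  # [(10, 'A', 1, 'A')]
```

Both orders yield the single row (10, A, 1, A), which matches the statA/statB output above and suggests the empty result for statD.union(statC) is the anomalous one.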