I am trying to precompute partitions for some SparkSQL queries. If I compute and persist the partitions, Spark uses them. If I save the partitioned data to Parquet and reload it later, the partition information is gone and Spark recomputes it.
The actual data is large enough that significant time is spent partitioning, but the code below demonstrates the problem sufficiently. test2() is currently the only version I can get to work, but I would like to jump-start the actual processing, which is what test3() is attempting to do.
Does anyone know what I'm doing wrong? Or is this something that Spark simply can't do?
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# NOTE: Need to have python in PATH, SPARK_HOME set to location of spark, HADOOP_HOME set to location of winutils
if __name__ == "__main__":
    sc = SparkContext(appName="PythonPartitionBug")
    sql_text = "select foo, bar from customer c, orders o where foo < 300 and c.custkey=o.custkey"

    def setup():
        sqlContext = SQLContext(sc)
        fields1 = [StructField(name, IntegerType()) for name in ['custkey', 'foo']]
        data1 = [(1, 110), (2, 210), (3, 310), (4, 410), (5, 510)]
        df1 = sqlContext.createDataFrame(data1, StructType(fields1))
        df1.persist()
        fields2 = [StructField(name, IntegerType()) for name in ['orderkey', 'custkey', 'bar']]
        data2 = [(1, 1, 10), (2, 1, 20), (3, 2, 30), (4, 3, 40), (5, 4, 50)]
        df2 = sqlContext.createDataFrame(data2, StructType(fields2))
        df2.persist()
        return sqlContext, df1, df2

    def test1():
        # Without repartition the final plan includes hashpartitioning
        # == Physical Plan ==
        # Project [foo#1,bar#14]
        # +- SortMergeJoin [custkey#0], [custkey#13]
        #    :- Sort [custkey#0 ASC], false, 0
        #    :  +- TungstenExchange hashpartitioning(custkey#0,200), None
        #    :     +- Filter (foo#1 < 300)
        #    :        +- InMemoryColumnarTableScan [custkey#0,foo#1], [(foo#1 < 300)], InMemoryRelation [custkey#0,foo#1], true, 10000, StorageLevel(false, true, false, false, 1), ConvertToUnsafe, None
        #    +- Sort [custkey#13 ASC], false, 0
        #       +- TungstenExchange hashpartitioning(custkey#13,200), None
        #          +- InMemoryColumnarTableScan [bar#14,custkey#13], InMemoryRelation [orderkey#12,custkey#13,bar#14], true, 10000, StorageLevel(false, true, false, false, 1), ConvertToUnsafe, None
        sqlContext, df1, df2 = setup()
        df1.registerTempTable("customer")
        df2.registerTempTable("orders")
        df3 = sqlContext.sql(sql_text)
        df3.collect()
        df3.explain(True)

    def test2():
        # With repartition the final plan does not include hashpartitioning
        # == Physical Plan ==
        # Project [foo#56,bar#69]
        # +- SortMergeJoin [custkey#55], [custkey#68]
        #    :- Sort [custkey#55 ASC], false, 0
        #    :  +- Filter (foo#56 < 300)
        #    :     +- InMemoryColumnarTableScan [custkey#55,foo#56], [(foo#56 < 300)], InMemoryRelation [custkey#55,foo#56], true, 10000, StorageLevel(false, true, false, false, 1), TungstenExchange hashpartitioning(custkey#55,4), None, None
        #    +- Sort [custkey#68 ASC], false, 0
        #       +- InMemoryColumnarTableScan [bar#69,custkey#68], InMemoryRelation [orderkey#67,custkey#68,bar#69], true, 10000, StorageLevel(false, true, false, false, 1), TungstenExchange hashpartitioning(custkey#68,4), None, None
        sqlContext, df1, df2 = setup()
        df1a = df1.repartition(4, 'custkey').persist()
        df1a.registerTempTable("customer")
        df2a = df2.repartition(4, 'custkey').persist()
        df2a.registerTempTable("orders")
        df3 = sqlContext.sql(sql_text)
        df3.collect()
        df3.explain(True)

    def test3():
        # After round-tripping the partitioned data, the partitioning is lost and Spark repartitions
        # == Physical Plan ==
        # Project [foo#223,bar#284]
        # +- SortMergeJoin [custkey#222], [custkey#283]
        #    :- Sort [custkey#222 ASC], false, 0
        #    :  +- TungstenExchange hashpartitioning(custkey#222,200), None
        #    :     +- Filter (foo#223 < 300)
        #    :        +- InMemoryColumnarTableScan [custkey#222,foo#223], [(foo#223 < 300)], InMemoryRelation [custkey#222,foo#223], true, 10000, StorageLevel(false, true, false, false, 1), Scan ParquetRelation[custkey#222,foo#223] InputPaths: file:/E:/.../df1.parquet, None
        #    +- Sort [custkey#283 ASC], false, 0
        #       +- TungstenExchange hashpartitioning(custkey#283,200), None
        #          +- InMemoryColumnarTableScan [bar#284,custkey#283], InMemoryRelation [orderkey#282,custkey#283,bar#284], true, 10000, StorageLevel(false, true, false, false, 1), Scan ParquetRelation[orderkey#282,custkey#283,bar#284] InputPaths: file:/E:/.../df2.parquet, None
        sqlContext, df1, df2 = setup()
        df1a = df1.repartition(4, 'custkey').persist()
        df1a.write.parquet("df1.parquet", mode='overwrite')
        df1a = sqlContext.read.parquet("df1.parquet")
        df1a.persist()
        df1a.registerTempTable("customer")
        df2a = df2.repartition(4, 'custkey').persist()
        df2a.write.parquet("df2.parquet", mode='overwrite')
        df2a = sqlContext.read.parquet("df2.parquet")
        df2a.persist()
        df2a.registerTempTable("orders")
        df3 = sqlContext.sql(sql_text)
        df3.collect()
        df3.explain(True)

    test1()
    test2()
    test3()
    sc.stop()
You're doing nothing wrong, but you can't achieve what you're trying to achieve with Spark: the partitioner is necessarily lost when writing to disk. Why? Because Spark doesn't have its own file format; it relies on existing formats (e.g. Parquet, ORC or text files), none of which are even aware of the Partitioner (which is internal to Spark), so they can't persist that information. The data is properly partitioned on disk, but Spark has no way of knowing which partitioner was used when it loads the data back, so it has no choice but to repartition.
The reason test2() doesn't reveal this is that you reuse the same DataFrame instances, which do store the partitioning information (in memory).
A better solution would be to use persist(StorageLevel.MEMORY_AND_DISK), which will spill the RDD/DataFrame partitions to the worker's local disk if they are evicted from memory. In that case, rebuilding a partition only requires pulling data from the worker's local disk, which is relatively fast.
Related
Since we upgraded to pyspark 3.3.0 for our job, we have issues with cached ps.DataFrames that are then concatenated using pyspark pandas: ps.concat([df1, df2]).
The issue is that the concatenated DataFrame does not use the cached data but re-reads the source data, which in our case causes an authentication issue against the source.
This was not the behavior we had with pyspark 3.2.3.
The minimal code below reproduces the issue.
import pyspark.pandas as ps
import pyspark
from pyspark.sql import SparkSession
import sys
import os
os.environ["PYSPARK_PYTHON"] = sys.executable
spark = SparkSession.builder.appName('bug-pyspark3.3').getOrCreate()
df1 = ps.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, columns=['col1', 'col2'])
df2 = ps.DataFrame(data={'col3': [5, 6]}, columns=['col3'])
cached_df1 = df1.spark.cache()
cached_df2 = df2.spark.cache()
cached_df1.count()
cached_df2.count()
merged_df = ps.concat([cached_df1,cached_df2], ignore_index=True)
merged_df.head()
merged_df.spark.explain()
Output of explain() on pyspark 3.2.3:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [(cast(_we0#1300 as bigint) - 1) AS __index_level_0__#1298L, col1#1291L, col2#1292L, col3#1293L]
+- Window [row_number() windowspecdefinition(_w0#1299L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _we0#1300], [_w0#1299L ASC NULLS FIRST]
+- Sort [_w0#1299L ASC NULLS FIRST], false, 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=356]
+- Project [col1#1291L, col2#1292L, col3#1293L, monotonically_increasing_id() AS _w0#1299L]
+- Union
:- Project [col1#941L AS col1#1291L, col2#942L AS col2#1292L, null AS col3#1293L]
: +- InMemoryTableScan [col1#941L, col2#942L]
: +- InMemoryRelation [__index_level_0__#940L, col1#941L, col2#942L, __natural_order__#946L], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(1) Project [__index_level_0__#940L, col1#941L, col2#942L, monotonically_increasing_id() AS __natural_order__#946L]
: +- *(1) Scan ExistingRDD[__index_level_0__#940L,col1#941L,col2#942L]
+- Project [null AS col1#1403L, null AS col2#1404L, col3#952L]
+- InMemoryTableScan [col3#952L]
+- InMemoryRelation [__index_level_0__#951L, col3#952L, __natural_order__#955L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [__index_level_0__#951L, col3#952L, monotonically_increasing_id() AS __natural_order__#955L]
+- *(1) Scan ExistingRDD[__index_level_0__#951L,col3#952L]
We can see that the cache is used in the planned execution (InMemoryTableScan).
Output of explain() on pyspark 3.3.0:
== Physical Plan ==
AttachDistributedSequence[__index_level_0__#771L, col1#762L, col2#763L, col3#764L] Index: __index_level_0__#771L
+- Union
:- *(1) Project [col1#412L AS col1#762L, col2#413L AS col2#763L, null AS col3#764L]
: +- *(1) Scan ExistingRDD[__index_level_0__#411L,col1#412L,col2#413L]
+- *(2) Project [null AS col1#804L, null AS col2#805L, col3#423L]
+- *(2) Scan ExistingRDD[__index_level_0__#422L,col3#423L]
We can see that on this version of pyspark the Union is performed by scanning the data (Scan ExistingRDD) instead of an InMemoryTableScan.
Is this difference normal? Is there any way to "force" the concat to use the cached DataFrames?
I cannot explain the difference in the planned execution output between pyspark 3.2.3 and 3.3.0, but I believe that despite this difference the cache is being used. I ran some benchmarks with and without caching using an example very similar to yours, and the average time for a merge operation to be performed is shorter when we cache the DataFrames.
import time
import numpy as np
import pyspark.pandas as ps

def test_merge_without_cache(n=5, size=10**5):
    np.random.seed(44)
    total_run_times = []
    for i in range(n):
        data = np.random.rand(size, 2)
        data2 = np.random.rand(size, 2)
        df1 = ps.DataFrame(data, columns=['col1', 'col2'])
        df2 = ps.DataFrame(data2, columns=['col3', 'col4'])
        start_time = time.time()
        merged_df = ps.concat([df1, df2], ignore_index=True)
        run_time = time.time() - start_time
        total_run_times.append(run_time)
        spark.catalog.clearCache()
    return total_run_times

def test_merge_with_cache(n=5, size=10**5):
    np.random.seed(44)
    total_run_times = []
    for i in range(n):
        data = np.random.rand(size, 2)
        data2 = np.random.rand(size, 2)
        df1 = ps.DataFrame(data, columns=['col1', 'col2'])
        df2 = ps.DataFrame(data2, columns=['col3', 'col4'])
        cached_df1 = df1.spark.cache()
        cached_df2 = df2.spark.cache()
        start_time = time.time()
        merged_df = ps.concat([cached_df1, cached_df2], ignore_index=True)
        run_time = time.time() - start_time
        total_run_times.append(run_time)
        spark.catalog.clearCache()
    return total_run_times
Here are the printouts from when I ran these two test functions:
total_run_times_without_cache = test_merge_without_cache(n=50, size=10**6)
np.mean(total_run_times_without_cache)
0.12456250190734863
total_run_times_with_cache = test_merge_with_cache(n=50, size=10**6)
np.mean(total_run_times_with_cache)
0.07876112937927246
This isn't the largest difference in speed, so it's possible this is just noise and the cache is, in fact, not being used (though I ran this benchmark several times and the merge operation with cache was consistently faster). Someone with a better understanding of pyspark might be able to better explain what you're observing, but hopefully this answer helps a bit.
Here is a plot of the execution times for the merge with and without cache:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(y=total_run_times_without_cache, name='without cache'))
fig.add_trace(go.Scatter(y=total_run_times_with_cache, name='with cache'))
fig.show()
What is the meaning of ExternalRDDScan in the DAG?
I couldn't find an explanation for it anywhere.
Based on the source, ExternalRDDScan is a representation of converting an existing RDD of arbitrary objects into a dataset of InternalRows, i.e. creating a DataFrame. Let's verify that our understanding is correct:
scala> import spark.implicits._
import spark.implicits._
scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> rdd.toDF().explain()
== Physical Plan ==
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]
ExternalRDD is a logical representation of a DataFrame/Dataset (though not in all cases) in the query execution plan, i.e. in the DAG created by Spark.
ExternalRDDs are created:
when you create a DataFrame from an RDD (e.g. using createDataFrame() or toDF())
when you create a Dataset from an RDD (e.g. using createDataset() or toDS())
At runtime, when the ExternalRDD is to be loaded into memory, a scan operation is performed, which is represented by ExternalRDDScan (internally the scan strategy resolves to ExternalRDDScanExec). Look at the example below:
scala> val sampleRDD = sc.parallelize(Seq(1,2,3,4,5))
sampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> sampleRDD.toDF.queryExecution
res0: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
SerializeFromObject [input[0, int, false] AS value#2]
+- ExternalRDD [obj#1]
== Analyzed Logical Plan ==
value: int
SerializeFromObject [input[0, int, false] AS value#2]
+- ExternalRDD [obj#1]
== Optimized Logical Plan ==
SerializeFromObject [input[0, int, false] AS value#2]
+- ExternalRDD [obj#1]
== Physical Plan ==
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan[obj#1]
You can see that in the query execution plan, the DataFrame object is
represented by ExternalRDD and the physical plan contains a scan
operation which is resolved to ExternalRDDScan (ExternalRDDScanExec)
during its execution.
The same holds true for a spark Dataset as well.
scala> val sampleRDD = sc.parallelize(Seq(1,2,3,4,5))
sampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> sampleRDD.toDS.queryExecution.logical
res9: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
SerializeFromObject [input[0, int, false] AS value#23]
+- ExternalRDD [obj#22]
scala> spark.createDataset(sampleRDD).queryExecution.logical
res18: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
SerializeFromObject [input[0, int, false] AS value#39]
+- ExternalRDD [obj#38]
The above examples were run on Spark version 2.4.2.
Reference: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-LogicalPlan-ExternalRDD.html
I want to join data twice as below:
rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val'])
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
res1 = rdd1.join(rdd2, on=[rdd1['idx'] == rdd2['key1']])
res2 = res1.join(rdd1, on=[res1['key2'] == rdd1['idx']])
res2.show()
Then I get this error:
pyspark.sql.utils.AnalysisException: u'Cartesian joins could be
prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'
But I think this is not a cross join.
UPDATE:
res2.explain()
== Physical Plan ==
CartesianProduct
:- *SortMergeJoin [idx#0L, idx#0L], [key1#5L, key2#6L], Inner
: :- *Sort [idx#0L ASC, idx#0L ASC], false, 0
: : +- Exchange hashpartitioning(idx#0L, idx#0L, 200)
: : +- *Filter isnotnull(idx#0L)
: : +- Scan ExistingRDD[idx#0L,val#1]
: +- *Sort [key1#5L ASC, key2#6L ASC], false, 0
: +- Exchange hashpartitioning(key1#5L, key2#6L, 200)
: +- *Filter ((isnotnull(key2#6L) && (key2#6L = key1#5L)) && isnotnull(key1#5L))
: +- Scan ExistingRDD[key1#5L,key2#6L,val#7L]
+- Scan ExistingRDD[idx#40L,val#41]
This happens because you join structures sharing the same lineage, which leads to a trivially equal condition:
res2.explain()
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Join Inner, ((idx#204L = key1#209L) && (key2#210L = idx#204L))
:- Filter isnotnull(idx#204L)
: +- LogicalRDD [idx#204L, val#205]
+- Filter ((isnotnull(key2#210L) && (key2#210L = key1#209L)) && isnotnull(key1#209L))
+- LogicalRDD [key1#209L, key2#210L, val#211L]
and
LogicalRDD [idx#235L, val#236]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
In cases like this you should use aliases:
from pyspark.sql.functions import col
rdd1 = spark.createDataFrame(...).alias('rdd1')
rdd2 = spark.createDataFrame(...).alias('rdd2')
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx')).explain()
== Physical Plan ==
*SortMergeJoin [key2#297L], [idx#360L], Inner
:- *Sort [key2#297L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key2#297L, 200)
: +- *SortMergeJoin [idx#290L], [key1#296L], Inner
: :- *Sort [idx#290L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(idx#290L, 200)
: : +- *Filter isnotnull(idx#290L)
: : +- Scan ExistingRDD[idx#290L,val#291]
: +- *Sort [key1#296L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key1#296L, 200)
: +- *Filter (isnotnull(key2#297L) && isnotnull(key1#296L))
: +- Scan ExistingRDD[key1#296L,key2#297L,val#298L]
+- *Sort [idx#360L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idx#360L, 200)
+- *Filter isnotnull(idx#360L)
+- Scan ExistingRDD[idx#360L,val#361]
For details see SPARK-6459.
I was also successful when I persisted the DataFrame before the second join.
Something like:
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).persist()
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))
Persisting did not work for me.
I overcame it with aliases on the DataFrames:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
Consider the following pyspark code
def transformed_data(spark):
    df = spark.read.json('data.json')
    df = expensive_transformation(df)  # (A)
    return df

def final_view(spark):
    df1 = transformed_data(spark)
    df = transformed_data(spark)
    df1 = foo_transform(df1)
    df = bar_transform(df)
    return df.join(df1)
My question is: is the operation marked (A) in transformed_data optimized in final_view, so that it is only performed once?
Note that this code is not equivalent to
df1 = transformed_data(spark)
df = df1
df1 = foo_transform(df1)
df = bar_transform(df)
df.join(df1)
(at least from Python's point of view, where id(df1) == id(df) in this case).
The broader question is: what does Spark consider when optimizing two equal DAGs: whether the DAGs (as defined by their edges and nodes) are equal, or whether their object ids (df = df1) are equal?
Kinda. It relies on Spark having enough information to infer a dependence.
For instance, I replicated your example as described:
from pyspark.sql.functions import hash
def f(spark, filename):
    df = spark.read.csv(filename)
    df2 = df.select(hash('_c1').alias('hashc2'))
    df3 = df2.select(hash('hashc2').alias('hashc3'))
    df4 = df3.select(hash('hashc3').alias('hashc4'))
    return df4
filename = 'some-valid-file.csv'
df_a = f(spark, filename)
df_b = f(spark, filename)
assert df_a != df_b
df_joined = df_a.join(df_b, df_a.hashc4==df_b.hashc4, how='left')
If I explain this resulting dataframe using df_joined.explain(extended=True), I see the following four plans:
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hashc3#18, 42) AS hashc4#20]
: +- Project [hash(hashc2#16, 42) AS hashc3#18]
: +- Project [hash(_c1#11, 42) AS hashc2#16]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hashc3#40, 42) AS hashc4#42]
+- Project [hash(hashc2#38, 42) AS hashc3#40]
+- Project [hash(_c1#33, 42) AS hashc2#38]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#20 = hashc4#42)
:- Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- Relation[_c0#10,_c1#11,_c2#12] csv
+- Project [hash(hash(hash(_c1#33, 42), 42), 42) AS hashc4#42]
+- Relation[_c0#32,_c1#33,_c2#34] csv
== Physical Plan ==
SortMergeJoin [hashc4#20], [hashc4#42], LeftOuter
:- *(2) Sort [hashc4#20 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#20, 200)
: +- *(1) Project [hash(hash(hash(_c1#11, 42), 42), 42) AS hashc4#20]
: +- *(1) FileScan csv [_c1#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file: some-valid-file.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c1:string>
+- *(4) Sort [hashc4#42 ASC NULLS FIRST], false, 0
+- ReusedExchange [hashc4#42], Exchange hashpartitioning(hashc4#20, 200)
The physical plan above only reads the CSV once and reuses all the computation, since Spark detects that the two FileScans are identical (i.e. Spark knows that they are not independent).
Now consider if I replace the read.csv with hand-crafted independent, yet identical RDDs.
from pyspark.sql.functions import hash
def g(spark):
    df = spark.createDataFrame([('a', 'a'), ('b', 'b'), ('c', 'c')], ["_c1", "_c2"])
    df2 = df.select(hash('_c1').alias('hashc2'))
    df3 = df2.select(hash('hashc2').alias('hashc3'))
    df4 = df3.select(hash('hashc3').alias('hashc4'))
    return df4
df_c = g(spark)
df_d = g(spark)
df_joined = df_c.join(df_d, df_c.hashc4==df_d.hashc4, how='left')
In this case, Spark's physical plan scans two different RDDs. Here's the output of running df_joined.explain(extended=True) to confirm.
== Parsed Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Analyzed Logical Plan ==
hashc4: int, hashc4: int
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hashc3#6, 42) AS hashc4#8]
: +- Project [hash(hashc2#4, 42) AS hashc3#6]
: +- Project [hash(_c1#0, 42) AS hashc2#4]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hashc3#16, 42) AS hashc4#18]
+- Project [hash(hashc2#14, 42) AS hashc3#16]
+- Project [hash(_c1#10, 42) AS hashc2#14]
+- LogicalRDD [_c1#10, _c2#11], false
== Optimized Logical Plan ==
Join LeftOuter, (hashc4#8 = hashc4#18)
:- Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- LogicalRDD [_c1#0, _c2#1], false
+- Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- LogicalRDD [_c1#10, _c2#11], false
== Physical Plan ==
SortMergeJoin [hashc4#8], [hashc4#18], LeftOuter
:- *(2) Sort [hashc4#8 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(hashc4#8, 200)
: +- *(1) Project [hash(hash(hash(_c1#0, 42), 42), 42) AS hashc4#8]
: +- Scan ExistingRDD[_c1#0,_c2#1]
+- *(4) Sort [hashc4#18 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(hashc4#18, 200)
+- *(3) Project [hash(hash(hash(_c1#10, 42), 42), 42) AS hashc4#18]
+- Scan ExistingRDD[_c1#10,_c2#11]
This isn't really PySpark-specific behaviour.