Create an intermediate calculated column or expand the definition

Create an intermediate calculated column or expand the definition - apache-spark

Is there material difference (based on how Spark is implemented) between:
tempColumn = commonColumnExpression
df = (
df.withColumn('tempColumn', tempColumn)
df.withColumn('newColumn1', col('existingColumn') + col('tempColumn')
df.withColumn('newColumn2', col('existingColumn') - col('tempColumn')
)
and
tempColumnDef = commonColumnExpression
df = (
df.withColumn('tempColumn', tempColumn)
df.withColumn('newColumn1', col('existingColumn') + tempColumnDef)
df.withColumn('newColumn2', col('existingColumn') - tempColumnDef)
)

No, there's no difference. You can check it through df.explain(). explain shows you what was the actual sequence of steps Spark decided to take (physical plan) after interpreting your code. In your case, we can see that both physical plans are identical (internal column IDs don't matter).
from pyspark.sql.functions import col, lit
df = spark.createDataFrame([(5,),], ['existingColumn'])
commonColumnExpression = lit(2)
tempColumn = commonColumnExpression
df1 = (
df.withColumn('tempColumn', tempColumn)
.withColumn('newColumn1', col('existingColumn') + col('tempColumn'))
.withColumn('newColumn2', col('existingColumn') - col('tempColumn'))
)
tempColumnDef = commonColumnExpression
df2 = (
df.withColumn('tempColumn', tempColumn)
.withColumn('newColumn1', col('existingColumn') + tempColumnDef)
.withColumn('newColumn2', col('existingColumn') - tempColumnDef)
)
df1.explain()
# == Physical Plan ==
# *(1) Project [existingColumn#7L, 2 AS tempColumn#9, (existingColumn#7L + 2) AS newColumn1#12L, (existingColumn#7L - 2) AS newColumn2#16L]
# +- *(1) Scan ExistingRDD[existingColumn#7L]
df2.explain()
# == Physical Plan ==
# *(1) Project [existingColumn#7L, 2 AS tempColumn#21, (existingColumn#7L + 2) AS newColumn1#24L, (existingColumn#7L - 2) AS newColumn2#28L]
# +- *(1) Scan ExistingRDD[existingColumn#7L]

Related

PySpark 3.3.0 is not using cached DataFrame when performing a concat with Pandas API

Since we upgraded to pyspark 3.3.0 for our job we have issues with cached ps.Dataframe that are then concat using pyspark pandas : ps.concat([df1,df2])
This issue is that the concatenated data frame is not using the cached data but is re-reading the source data. Which in our case is causing an Authentication issue as source.
This was not the behavior we had with pyspark 3.2.3.
This minimal code is able to show the issue.
import pyspark.pandas as ps
import pyspark
from pyspark.sql import SparkSession
import sys
import os
os.environ["PYSPARK_PYTHON"] = sys.executable
spark = SparkSession.builder.appName('bug-pyspark3.3').getOrCreate()
df1 = ps.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, columns=['col1', 'col2'])
df2 = ps.DataFrame(data={'col3': [5, 6]}, columns=['col3'])
cached_df1 = df1.spark.cache()
cached_df2 = df2.spark.cache()
cached_df1.count()
cached_df2.count()
merged_df = ps.concat([cached_df1,cached_df2], ignore_index=True)
merged_df.head()
merged_df.spark.explain()
Output of the explain() on pyspark 3.2.3 :
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [(cast(_we0#1300 as bigint) - 1) AS __index_level_0__#1298L, col1#1291L, col2#1292L, col3#1293L]
+- Window [row_number() windowspecdefinition(_w0#1299L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _we0#1300], [_w0#1299L ASC NULLS FIRST]
+- Sort [_w0#1299L ASC NULLS FIRST], false, 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=356]
+- Project [col1#1291L, col2#1292L, col3#1293L, monotonically_increasing_id() AS _w0#1299L]
+- Union
:- Project [col1#941L AS col1#1291L, col2#942L AS col2#1292L, null AS col3#1293L]
: +- InMemoryTableScan [col1#941L, col2#942L]
: +- InMemoryRelation [__index_level_0__#940L, col1#941L, col2#942L, __natural_order__#946L], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(1) Project [__index_level_0__#940L, col1#941L, col2#942L, monotonically_increasing_id() AS __natural_order__#946L]
: +- *(1) Scan ExistingRDD[__index_level_0__#940L,col1#941L,col2#942L]
+- Project [null AS col1#1403L, null AS col2#1404L, col3#952L]
+- InMemoryTableScan [col3#952L]
+- InMemoryRelation [__index_level_0__#951L, col3#952L, __natural_order__#955L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [__index_level_0__#951L, col3#952L, monotonically_increasing_id() AS __natural_order__#955L]
+- *(1) Scan ExistingRDD[__index_level_0__#951L,col3#952L]
We can see that the cache is used in the planned execution (InMemoryTableScan).
Output of the explain() on pyspark 3.3.0 :
== Physical Plan ==
AttachDistributedSequence[__index_level_0__#771L, col1#762L, col2#763L, col3#764L] Index: __index_level_0__#771L
+- Union
:- *(1) Project [col1#412L AS col1#762L, col2#413L AS col2#763L, null AS col3#764L]
: +- *(1) Scan ExistingRDD[__index_level_0__#411L,col1#412L,col2#413L]
+- *(2) Project [null AS col1#804L, null AS col2#805L, col3#423L]
+- *(2) Scan ExistingRDD[__index_level_0__#422L,col3#423L]
We can see on this version of pyspark that the Union is performed by doing a Scan of data instead of performing an InMemoryTableScan
Is this difference normal ? Is there any way to "force" the concat to use the cached dataframes ?

I cannot explain the difference in the planned execution output between pyspark 3.2.3 and 3.3.0, but I believe that despite this difference the cache is being used. I ran some benchmarks with and without caching using an example very similar to yours, and the average time for a merge operation to be performed is shorter when we cache the DataFrames.
def test_merge_without_cache(n=5, size=10**5):
np.random.seed(44)
total_run_times = []
for i in range(n):
data = np.random.rand(size,2)
data2 = np.random.rand(size,2)
df1 = ps.DataFrame(data, columns=['col1','col2'])
df2 = ps.DataFrame(data2, columns=['col3','col4'])
start_time = time.time()
merged_df = ps.concat([df1,df2], ignore_index=True)
run_time = time.time() - start_time
total_run_times.append(run_time)
spark.catalog.clearCache()
return total_run_times
def test_merge_with_cache(n=5, size=10**5):
np.random.seed(44)
total_run_times = []
for i in range(n):
data = np.random.rand(size,2)
data2 = np.random.rand(size,2)
df1 = ps.DataFrame(data, columns=['col1','col2'])
df2 = ps.DataFrame(data2, columns=['col3','col4'])
cached_df1 = df1.spark.cache()
cached_df2 = df2.spark.cache()
start_time = time.time()
merged_df = ps.concat([cached_df1,cached_df2], ignore_index=True)
run_time = time.time() - start_time
total_run_times.append(run_time)
spark.catalog.clearCache()
return total_run_times
Here are the printouts from when I ran these two test functions:
total_run_times_without_cache = test_merge_without_cache(n=50, size=10**6)
np.mean(total_run_times_without_cache)
0.12456250190734863
total_run_times_with_cache = test_merge_with_cache(n=50, size=10**6)
np.mean(total_run_times_with_cache)
0.07876112937927246
This isn't the largest difference in speed so it's possible this is just noise and the cache is, in fact, not being used (but I did run this benchmark several times and the merge operation with cache was consistently faster). Someone with a better understanding of pyspark might be able to better explain what you're observing, but hopefully this answer helps a bit.
Here is a plot of the execution time between merge with and without cache:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(y=total_run_times_without_cache, name='without cache'))
fig.add_trace(go.Scatter(y=total_run_times_with_cache, name='with cache'))

Does spark re-calculates deterministic repeated expressions?

If I use the same deterministic expression twice or more in a query, will spark know to optimize and not-recalculate it?
I saw this question before, but the answer is to check the plan, and I don't really understand the plan to answer my question
Given this dataframe:
df = spark.createDataFrame([{'word': 'hello'}, {'word': 'goodbye'}])
+-------+
| word|
+-------+
| hello|
|goodbye|
+-------+
Let's say I want to add a column, that concatenate -world to the word column, but only if the result is hello-world
concatenated = F.concat_ws('-', 'word', F.lit('world'))
df.withColumn('result', F.when(concatenated == F.lit('hello-world'), concatenated))
The plan is:
== Parsed Logical Plan ==
'Project [word#5678, CASE WHEN (concat_ws(-, 'word, world) = hello-world) THEN concat_ws(-, 'word, world) END AS result#5680]
+- LogicalRDD [word#5678], false
== Analyzed Logical Plan ==
word: string, result: string
Project [word#5678, CASE WHEN (concat_ws(-, word#5678, world) = hello-world) THEN concat_ws(-, word#5678, world) END AS result#5680]
+- LogicalRDD [word#5678], false
== Optimized Logical Plan ==
Project [word#5678, CASE WHEN (concat_ws(-, word#5678, world) = hello-world) THEN concat_ws(-, word#5678, world) END AS result#5680]
+- LogicalRDD [word#5678], false
== Physical Plan ==
*(1) Project [word#5678, CASE WHEN (concat_ws(-, word#5678, world) = hello-world) THEN concat_ws(-, word#5678, world) END AS result#5680]
+- *(1) Scan ExistingRDD[word#5678]
So I can't really figure out if concat_ws(-, word#5678, world) gets recalculated
Another example that is much more complex
Add another column, that has all of the numbers doubled, where the doubled number is > 3, but only if the size of the resulting list is over 2
df = spark.createDataFrame([{'numbers': [1,2,3,4]}, {'numbers': [1,2]}])
filterd_list = F.filter(
F.transform('numbers', lambda x: x * 2),
lambda j: j > 3
)
df.withColumn('abc',
F.when(
F.size(
filterd_list
) >= 3,
filterd_list
)
).explain(True)
== Parsed Logical Plan ==
'Project [numbers#5648, CASE WHEN (size(filter(transform('numbers, lambdafunction((lambda 'x_52 * 2), lambda 'x_52, false)), lambdafunction((lambda 'x_53 > 3), lambda 'x_53, false)), true) >= 3) THEN filter(transform('numbers, lambdafunction((lambda 'x_52 * 2), lambda 'x_52, false)), lambdafunction((lambda 'x_53 > 3), lambda 'x_53, false)) END AS abc#5650]
+- LogicalRDD [numbers#5648], false
== Analyzed Logical Plan ==
numbers: array<bigint>, abc: array<bigint>
Project [numbers#5648, CASE WHEN (size(filter(transform(numbers#5648, lambdafunction((lambda x_52#5651L * cast(2 as bigint)), lambda x_52#5651L, false)), lambdafunction((lambda x_53#5653L > cast(3 as bigint)), lambda x_53#5653L, false)), true) >= 3) THEN filter(transform(numbers#5648, lambdafunction((lambda x_52#5652L * cast(2 as bigint)), lambda x_52#5652L, false)), lambdafunction((lambda x_53#5654L > cast(3 as bigint)), lambda x_53#5654L, false)) END AS abc#5650]
+- LogicalRDD [numbers#5648], false
== Optimized Logical Plan ==
Project [numbers#5648, CASE WHEN (size(filter(transform(numbers#5648, lambdafunction((lambda x_52#5651L * 2), lambda x_52#5651L, false)), lambdafunction((lambda x_53#5653L > 3), lambda x_53#5653L, false)), true) >= 3) THEN filter(transform(numbers#5648, lambdafunction((lambda x_52#5652L * 2), lambda x_52#5652L, false)), lambdafunction((lambda x_53#5654L > 3), lambda x_53#5654L, false)) END AS abc#5650]
+- LogicalRDD [numbers#5648], false
== Physical Plan ==
*(1) Project [numbers#5648, CASE WHEN (size(filter(transform(numbers#5648, lambdafunction((lambda x_52#5651L * 2), lambda x_52#5651L, false)), lambdafunction((lambda x_53#5653L > 3), lambda x_53#5653L, false)), true) >= 3) THEN filter(transform(numbers#5648, lambdafunction((lambda x_52#5652L * 2), lambda x_52#5652L, false)), lambdafunction((lambda x_53#5654L > 3), lambda x_53#5654L, false)) END AS abc#5650]
+- *(1) Scan ExistingRDD[numbers#5648]
These are just some made-up examples, but I come across this a lot dealing with structs and lists, sometimes with several more repeated expressions.
If the answer is the it does get re-calculated, is the way to overcome this is to use several withColumn expressions, dropping the intermediate ones in the end?

Convert date to ISO week date in Spark

Having dates in one column, how to create a column containing ISO week date?
ISO week date is composed of year, week number and weekday.
year is not the same as the year obtained using year function.
week number is the easy part - it can be obtained using weekofyear.
weekday should return 1 for Monday and 7 for Sunday, while Spark's dayofweek cannot do it.
Example dataframe:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
('1977-12-31',),
('1978-01-01',),
('1978-01-02',),
('1978-12-31',),
('1979-01-01',),
('1979-12-30',),
('1979-12-31',),
('1980-01-01',)],
['my_date']
).select(F.col('my_date').cast('date'))
df.show()
#+----------+
#| my_date|
#+----------+
#|1977-12-31|
#|1978-01-01|
#|1978-01-02|
#|1978-12-31|
#|1979-01-01|
#|1979-12-30|
#|1979-12-31|
#|1980-01-01|
#+----------+
Desired result:
+----------+-------------+
| my_date|iso_week_date|
+----------+-------------+
|1977-12-31| 1977-W52-6|
|1978-01-01| 1977-W52-7|
|1978-01-02| 1978-W01-1|
|1978-12-31| 1978-W52-7|
|1979-01-01| 1979-W01-1|
|1979-12-30| 1979-W52-7|
|1979-12-31| 1980-W01-1|
|1980-01-01| 1980-W01-2|
+----------+-------------+

Spark SQL extract makes this much easier.
iso_year = F.expr("EXTRACT(YEAROFWEEK FROM my_date)")
iso_weekday = F.expr("EXTRACT(DAYOFWEEK_ISO FROM my_date)")
So, building off of the other answers with the use of concat_ws:
import pyspark.sql.functions as F
df.withColumn(
'iso_week_date',
F.concat_ws(
"-",
F.expr("EXTRACT(YEAROFWEEK FROM my_date)"),
F.lpad(F.weekofyear('my_date'), 3, "W0"),
F.expr("EXTRACT(DAYOFWEEK_ISO FROM my_date)")
)
).show()
#+----------+-------------+
#| my_date|iso_week_date|
#+----------+-------------+
#|1977-12-31| 1977-W52-6|
#|1978-01-01| 1977-W52-7|
#|1978-01-02| 1978-W01-1|
#|1978-12-31| 1978-W52-7|
#|1979-01-01| 1979-W01-1|
#|1979-12-30| 1979-W52-7|
#|1979-12-31| 1980-W01-1|
#|1980-01-01| 1980-W01-2|
#+----------+-------------+

Your solution is already nice, maybe you could shorten it by simplifying the calculations:
iso_weekday = (dayofweek(my_date) + 5)%7 + 1
iso_year= year(date_add(my_date, 4 - iso_weekday))
Which gives you:
import pyspark.sql.functions as F
df.withColumn(
'iso_week_date',
F.concat_ws(
"-",
F.year(F.expr("date_add(my_date, 4 - (dayofweek(my_date) + 5) % 7 + 1)")),
F.lpad(F.weekofyear('my_date'), 3, "W0"),
(F.dayofweek('my_date') + 5) % 7 + 1
)
).show()
#+----------+-------------+
#| my_date|iso_week_date|
#+----------+-------------+
#|1977-12-31| 1977-W52-6|
#|1978-01-01| 1977-W52-7|
#|1978-01-02| 1978-W01-1|
#|1978-12-31| 1978-W52-7|
#|1979-01-01| 1979-W01-1|
#|1979-12-30| 1979-W52-7|
#|1979-12-31| 1980-W01-1|
#|1980-01-01| 1980-W01-2|
#+----------+-------------+

First, one could create rules for columns for year and weekday. Then, concatenate them using concat_ws and lpad.
week_from_prev_year = (F.month('my_date') == 1) & (F.weekofyear('my_date') > 9)
week_from_next_year = (F.month('my_date') == 12) & (F.weekofyear('my_date') == 1)
iso_year = F.when(week_from_prev_year, F.year('my_date') - 1) \
.when(week_from_next_year, F.year('my_date') + 1) \
.otherwise(F.year('my_date'))
iso_weekday = F.when(F.dayofweek('my_date') != 1, F.dayofweek('my_date')-1).otherwise(7)
iso_week_date = F.concat_ws('-', iso_year, F.lpad(F.weekofyear('my_date'), 3, 'W0'), iso_weekday)
df2 = df.withColumn('iso_week_date', iso_week_date)
df2.show()
#+----------+-------------+
#| my_date|iso_week_date|
#+----------+-------------+
#|1977-12-31| 1977-W52-6|
#|1978-01-01| 1977-W52-7|
#|1978-01-02| 1978-W01-1|
#|1978-12-31| 1978-W52-7|
#|1979-01-01| 1979-W01-1|
#|1979-12-30| 1979-W52-7|
#|1979-12-31| 1980-W01-1|
#|1980-01-01| 1980-W01-2|
#+----------+-------------+

Error TreeNodeException: execute, tree in PipelineModel.transform using Pyspark

So I am doing one-shot encoding in a pipeline and doing the fit method on it.
I have a data frame that has categorical as well as numerical columns, so I have one hot encoded categorical columns using string indexers.
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ['IncomeDetails','B2C','Gender','Occupation','POA_Status']
stages = []
for categoricalCol in categoricalColumns:
stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
stages += [stringIndexer, encoder]
label_stringIdx = StringIndexer(inputCol = 'target', outputCol = 'label')
stages += [label_stringIdx]
#new_col_array.remove("client_id")
numericCols = new_col_array
numericCols.append('age')
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(new_df1)
new_df1 = pipelineModel.transform(new_df1)
selectedCols = ['label', 'features'] + cols
I am getting this error :
Py4JJavaError: An error occurred while calling o2053.fit.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(client_id#*****, 200)
+- *(4) HashAggregate(keys=[client_id#*****], functions=[], output=[client_id#*****])
+- Exchange hashpartitioning(client_id#*****, 200)
+- *(3) HashAggregate(keys=[client_id#*****], functions=[], output=[client_id#*****])
+- *(3) HashAggregate(keys=[client_id#*****, event_name#27993], functions=[], output=[client_id#27980])
+- Exchange hashpartitioning(client_id#*****, event_name#27993, 200)
+- *(2) HashAggregate(keys=[client_id#*****, event_name#27993], functions=[], output=[client_id#*****, event_name#27993])
+- *(2) Project [client_id#*****, event_name#27993]
+- *(2) BroadcastHashJoin [client_id#*****], [Party_Code#*****], LeftSemi, BuildRight, false
:- *(2) Project [client_id#*****, event_name#27993]
: +- *(2) Filter isnotnull(client_id#*****)
: +- *(2) FileScan orc dbo.dp_clickstream_[client_id#*****,event_name#27993,dt#28010] Batched: true, Format: ORC, Location: **PrunedInMemoryFileIndex**[s3n://processed/db-dbo-..., PartitionCount: 6, PartitionFilters: [isnotnull(dt#28010), (cast(dt#28010 as timestamp) >= 1610409600000000), (cast(dt#28010 as timest..., PushedFilters: [IsNotNull(client_id)], ReadSchema: struct<client_id:string,event_name:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
My Spark version is 2.4.3

How to make Spark use partition information from Parquet files?

I am trying to precompute partitions for some SparkSql queries. If I compute and persist the the partitions, Spark uses them. If I save the partitioned data to Parquet and reload it later, the partition information is gone and Spark will recompute it.
The actual data is large enough that significant time is spent partitioning. The code below demonstrates the problems sufficiently though. Test2() is currently the only thing that I can get to work, but I would like to jumpstart the actual processing, which is what test3() is attempting to do.
Anyone know what I'm doing wrong? ..or if this is something that Spark can do?
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# NOTE: Need to have python in PATH, SPARK_HOME set to location of spark, HADOOP_HOME set to location of winutils
if __name__ == "__main__":
sc = SparkContext(appName="PythonPartitionBug")
sql_text = "select foo, bar from customer c, orders o where foo < 300 and c.custkey=o.custkey"
def setup():
sqlContext = SQLContext(sc)
fields1 = [StructField(name, IntegerType()) for name in ['custkey', 'foo']]
data1 = [(1, 110), (2, 210), (3, 310), (4, 410), (5, 510)]
df1 = sqlContext.createDataFrame(data1, StructType(fields1))
df1.persist()
fields2 = [StructField(name, IntegerType()) for name in ['orderkey', 'custkey', 'bar']]
data2 = [(1, 1, 10), (2, 1, 20), (3, 2, 30), (4, 3, 40), (5, 4, 50)]
df2 = sqlContext.createDataFrame(data2, StructType(fields2))
df2.persist()
return sqlContext, df1, df2
def test1():
# Without repartition the final plan includes hashpartitioning
# == Physical Plan ==
# Project [foo#1,bar#14]
# +- SortMergeJoin [custkey#0], [custkey#13]
# :- Sort [custkey#0 ASC], false, 0
# : +- TungstenExchange hashpartitioning(custkey#0,200), None
# : +- Filter (foo#1 < 300)
# : +- InMemoryColumnarTableScan [custkey#0,foo#1], [(foo#1 < 300)], InMemoryRelation [custkey#0,foo#1], true, 10000, StorageLevel(false, true, false, false, 1), ConvertToUnsafe, None
# +- Sort [custkey#13 ASC], false, 0
# +- TungstenExchange hashpartitioning(custkey#13,200), None
# +- InMemoryColumnarTableScan [bar#14,custkey#13], InMemoryRelation [orderkey#12,custkey#13,bar#14], true, 10000, StorageLevel(false, true, false, false, 1), ConvertToUnsafe, None
sqlContext, df1, df2 = setup()
df1.registerTempTable("customer")
df2.registerTempTable("orders")
df3 = sqlContext.sql(sql_text)
df3.collect()
df3.explain(True)
def test2():
# With repartition the final plan does not include hashpartitioning
# == Physical Plan ==
# Project [foo#56,bar#69]
# +- SortMergeJoin [custkey#55], [custkey#68]
# :- Sort [custkey#55 ASC], false, 0
# : +- Filter (foo#56 < 300)
# : +- InMemoryColumnarTableScan [custkey#55,foo#56], [(foo#56 < 300)], InMemoryRelation [custkey#55,foo#56], true, 10000, StorageLevel(false, true, false, false, 1), TungstenExchange hashpartitioning(custkey#55,4), None, None
# +- Sort [custkey#68 ASC], false, 0
# +- InMemoryColumnarTableScan [bar#69,custkey#68], InMemoryRelation [orderkey#67,custkey#68,bar#69], true, 10000, StorageLevel(false, true, false, false, 1), TungstenExchange hashpartitioning(custkey#68,4), None, None
sqlContext, df1, df2 = setup()
df1a = df1.repartition(4, 'custkey').persist()
df1a.registerTempTable("customer")
df2a = df2.repartition(4, 'custkey').persist()
df2a.registerTempTable("orders")
df3 = sqlContext.sql(sql_text)
df3.collect()
df3.explain(True)
def test3():
# After round tripping the partitioned data, the partitioning is lost and spark repartitions
# == Physical Plan ==
# Project [foo#223,bar#284]
# +- SortMergeJoin [custkey#222], [custkey#283]
# :- Sort [custkey#222 ASC], false, 0
# : +- TungstenExchange hashpartitioning(custkey#222,200), None
# : +- Filter (foo#223 < 300)
# : +- InMemoryColumnarTableScan [custkey#222,foo#223], [(foo#223 < 300)], InMemoryRelation [custkey#222,foo#223], true, 10000, StorageLevel(false, true, false, false, 1), Scan ParquetRelation[custkey#222,foo#223] InputPaths: file:/E:/.../df1.parquet, None
# +- Sort [custkey#283 ASC], false, 0
# +- TungstenExchange hashpartitioning(custkey#283,200), None
# +- InMemoryColumnarTableScan [bar#284,custkey#283], InMemoryRelation [orderkey#282,custkey#283,bar#284], true, 10000, StorageLevel(false, true, false, false, 1), Scan ParquetRelation[orderkey#282,custkey#283,bar#284] InputPaths: file:/E:/.../df2.parquet, None
sqlContext, df1, df2 = setup()
df1a = df1.repartition(4, 'custkey').persist()
df1a.write.parquet("df1.parquet", mode='overwrite')
df1a = sqlContext.read.parquet("df1.parquet")
df1a.persist()
df1a.registerTempTable("customer")
df2a = df2.repartition(4, 'custkey').persist()
df2a.write.parquet("df2.parquet", mode='overwrite')
df2a = sqlContext.read.parquet("df2.parquet")
df2a.persist()
df2a.registerTempTable("orders")
df3 = sqlContext.sql(sql_text)
df3.collect()
df3.explain(True)
test1()
test2()
test3()
sc.stop()

You're doing nothing wrong - but you can't achieve what you're trying to achieve with Spark: the partitioner used to save a file is necessarily lost when writing to disk. Why? Because Spark doesn't have its own file format, it relies on existing formats (e.g. Parquet, ORC or text files), none of which are even aware of the Partitioner (which is internal to Spark) so they can't persist that information. The data is properly partitioned on disk, but Spark has no way of knowing what partitioner was used when it loads from disk, so it has no choice but to re-partition.
The reason test2() doesn't reveal this is that you reuse the same DataFrame instances, which do store the partitioning information (in memory).

A better solution would be to use persist(StorageLevel.MEMORY_AND_DISK_ONLY) which will spill the RDD/DF partitions to the Worker's local disk if they're evicted from memory. In this case, rebuilding a partition only requires pulling data from the Worker's local disk which is relatively fast.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Create an intermediate calculated column or expand the definition - apache-spark

Related

PySpark 3.3.0 is not using cached DataFrame when performing a concat with Pandas API

Does spark re-calculates deterministic repeated expressions?

Convert date to ISO week date in Spark

Error TreeNodeException: execute, tree in PipelineModel.transform using Pyspark

How to make Spark use partition information from Parquet files?

Categories

Resources