In Spark, is there any way to unpersist a DataFrame/RDD in the middle of the execution plan? - apache-spark

Given the following series of events:
df1 = read
df2 = df1.action
df3 = df1.action
df2a = df2.action
df2b = df2.action
df3a = df3.action
df3b = df3.action
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()
The data forks twice, so that df1 will be read 4 times. I therefore want to persist the data. From what I understand this is the way to do so:
df1 = read
df1.persist()
df2 = df1.action
df3 = df1.action
df2.persist()
df3.persist()
df2a = df2.action
df2b = df2.action
df3a = df3.action
df3b = df3.action
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()
df1.unpersist()
df2.unpersist()
df3.unpersist()
However, this keeps all three in memory at once, which isn't storage efficient, considering I no longer need df1 persisted once df2 and df3 are both created. I'd like to order it more like this:
df1 = read
df1.persist()
df2 = df1.action
df3 = df1.action
df1.unpersist()
df2.persist()
df3.persist()
df2a = df2.action
df2b = df2.action
df2.unpersist()
df3a = df3.action
df3b = df3.action
df3.unpersist()
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()
However, this just leads to the data not being persisted at all, because I need to trigger an action before unpersisting. Is there any way to accomplish what I'm looking for (unpersisting intermediate DataFrames in the middle of the execution plan)?

This is not possible, but the steps can be rearranged slightly for the better.
Transformations only build the DAG without executing anything; the actual persistence happens when execution is triggered by an action. If a cached parent RDD is unpersisted, then all of its cached child RDDs are also unpersisted. It's a design choice that favours correctness and consistency of the data, and it is the reason your data is not being persisted at all.
Slightly improving your steps,
df1 = read
df1.persist()
df2 = df1.action # after this df1 will be persisted
df3 = df1.action # this will be faster as df1 is cached
df2.persist()
df3.persist()
# perform 1 action on df2 and df3 each to trigger their caching
df2a = df2.action
df3a = df3.action
df2b = df2.action # this will be faster as df2 is cached
df3b = df3.action # this will be faster as df3 is cached
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()
df1.unpersist() # this, along with its dependents, will get unpersisted
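For concreteness, a runnable PySpark sketch of that ordering; the parquet path, the flag/key columns, and the filter/select steps are hypothetical stand-ins for the ".action" placeholders above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("/data/input")      # hypothetical source
df1.persist()

df2 = df1.filter(df1.flag == 1)              # stand-in transformation
df3 = df1.filter(df1.flag == 0)              # stand-in transformation
df2.persist()
df3.persist()

df2.count()   # first action: materializes df1's cache and df2's cache
df3.count()   # df1 is already cached, so this is cheaper; also caches df3

df2a = df2.select("key")                     # further stand-in steps
df2b = df2.select("key")
df3a = df3.select("key")
df3b = df3.select("key")

df4 = df2a.union(df2b).union(df3a).union(df3b)
df4.collect()

df1.unpersist()   # explicit unpersists; harmless even if dependents cascade as described above
df2.unpersist()
df3.unpersist()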
Related References:
https://github.com/apache/spark/pull/17097
https://issues.apache.org/jira/browse/SPARK-21579

Related

Performance benefit of separate actions in a series of transformations

Consider a series of heavy transformations:
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(...)
val df3 = df2.join(...)
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(.., ..)))
df5.count
Is there any performance benefit to adding an action after each transformation, instead of taking a single action after the last transformation?
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(...)
df2.count // does it help to ease the performance impact of df5.count?
val df3 = df2.join(...)
df3.count // does it help to ease the performance impact of df5.count?
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
df4.count // does it help to ease the performance impact of df5.count?
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(.., ..)))
df5.count
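No answer is reproduced here, but note that a bare count() does not cache anything by itself; persistence only happens when cache()/persist() is called and an action then runs, consistent with the answer to the main question above. A hypothetical PySpark illustration (df1, df_other, and the columns are made up):
from pyspark.sql import functions as F

df2 = df1.groupBy("key").agg(F.sum("amount").alias("total"))
df2.count()                      # runs a job, then the result is discarded
df3 = df2.join(df_other, "key")
df3.count()                      # df2 was not cached by the previous count, so Spark is free to recompute it

df2.cache()
df2.count()                      # now df2 is materialized in memory
df3 = df2.join(df_other, "key")
df3.count()                      # reuses the cached df2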

Is Spark caching required on the last common part of 2 actions?

My code:
df1 = sql_context.sql("select * from table1") #should I cache here?
df2 = sql_context.sql("select * from table2") #should I cache here?
df1 = df1.where(df1.id == '5')
df1 = df1.where(df1.city == 'NY')
joined_df = df1.join(df2, on = "key") # should I cache here?
output_df = joined_df.where(joined_df.x == 5)
joined_df.write.format("csv").save(path1)
output_df.write.format("csv").save(path2)
So, I have 2 actions in the code; both of them filter df1 and join the data with df2.
Where is the right place to use cache() in this code?
Should I cache df1 and df2, because they will be used in both of the actions?
Or should I cache only joined_df, which is the last common part between these 2 actions?
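One way to apply the logic from the first answer above is to cache only the last common ancestor and let the first write materialize it; a minimal sketch, reusing the names from the question:
joined_df = df1.join(df2, on="key")
joined_df.cache()                            # last step shared by both outputs

joined_df.write.format("csv").save(path1)    # first action fills the cache
output_df = joined_df.where(joined_df.x == 5)
output_df.write.format("csv").save(path2)    # reuses the cached join

joined_df.unpersist()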

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of Spark DataFrame transformations that gives an out-of-memory error and produces a messed-up SQL query plan, while a seemingly equivalent implementation runs successfully.
%python
import pandas as pd
diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successful logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.show()
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.show()
Logically, there should not be such a computational difference (more than double the time and memory used).
Can anyone help me understand this?
DAG of successful logic: (image)
DAG of failed logic: (image)
I'm not sure what your use case for this code is, but the two pieces of code are not logically the same. In the second version you join the result of the previous iteration back to itself on each pass; in the first version you always join against a 'copy' of the original df. If your key column is not unique, the second piece of code will 'explode' your dataframe far more than the first.
To make this clearer, here is a simple example with a non-unique key value, taking your second piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.count()
>>> 17
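The arithmetic behind those counts: with two rows sharing key 1, the self-join squares the number of duplicated rows on each pass (2 → 4 → 16 → 256, plus the single key-3 row gives 257), while joining the fixed zdf only doubles it (2 → 4 → 8 → 16, plus 1 gives 17). You can also see the difference in the query plans; the failed version's plan shows the join feeding back into itself:
df.explain()       # physical plan only
df.explain(True)   # parsed, analyzed, optimized and physical plans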

Pandas: Join multiple data frame on the same keys

I need to join 5 data frames using the same key. I created several temporary data frames while doing the joins. The code below works fine, but I am wondering whether there is a more elegant way to achieve this goal. Thanks!
df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')
df3 = pd.read_pickle('df3.pkl')
df4 = pd.read_pickle('df4.pkl')
df5 = pd.read_pickle('df5.pkl')
tmp_1 = pd.merge(df1, df2, how='outer', on=['id', 'week'])
tmp_2 = pd.merge(tmp_1, df3, how='outer', on=['id', 'week'])
tmp_3 = pd.merge(tmp_2, df4, how='outer', on=['id', 'week'])
result_df = pd.merge(tmp_3, df5, how='outer', on=['id', 'week'])
Use pd.concat after setting the index
dfs = [df1, df2, df3, df4, df5]
cols = ['id', 'week']
df = pd.concat([d.set_index(cols) for d in dfs], axis=1).reset_index()
Include file reading
from glob import glob
def rp(f):
    return pd.read_pickle(f).set_index(['id', 'week'])
df = pd.concat([rp(f) for f in glob('df[1-5].pkl')], axis=1).reset_index()
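If you'd rather keep merge semantics (for example, to control suffixes) instead of concatenating on an index, functools.reduce collapses the chain without naming the temporaries; this is just an alternative sketch, not part of the answer above:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df4, df5]
result_df = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['id', 'week']), dfs)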

Drop previous pandas tables after they are merged into one

I want to merge two dataframes together and then delete the first one to create space in RAM.
df1 = pd.read_csv(filepath, index_col=False)
df2 = pd.read_csv(filepath, index_col=False)
df3 = pd.read_csv(filepath, index_col=False)
df4 = pd.read_csv(filepath, index_col=False)
result = df1.merge(df2, on='column1', how='left', left_index=True, copy=False)
result2 = result.merge(df3, on='column1', how='left', left_index=True, copy=False)
Ideally what I would like to do after this is delete all of df1, df2, df3 and have the result2 dataframe left.
It's better NOT to produce unnecessary DFs:
import glob

file_list = glob.glob('/path/to/file_mask*.csv')
df = pd.read_csv(file_list[0], index_col=False)
for f in file_list[1:]:
    df = df.merge(pd.read_csv(f, index_col=False), on='column1', how='left')
PS: IMO you can't (or at least shouldn't) mix the on and left_index parameters. Maybe you meant right_on and left_index; that would be OK.
Just use del
del df1, df2, df3, df4
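A note on what this does: del only removes the names from the namespace; the underlying memory is freed once no other references remain, and gc.collect() can be called afterwards to clean up any lingering reference cycles, for example:
import gc

gc.collect()   # after the del above, prompt Python's garbage collector to release cyclic garbage now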
