Reusing pyspark cache and unpersist in for loop - apache-spark

I have a lot of data that I'm trying to take out in chunks - let's say 3 chunks - and not have it all cached in memory at once. However, I'd like to save it (action) all at the same time afterwards.
This is the current simplified strategy:
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df1 = df.filter('a')
    df2 = df.filter('b')
    final_output_1 = final_output_1.join(df1)
    final_output_2 = final_output_2.join(df2)
    df.unpersist()
final_output_1.write.saveAsTable()
final_output_2.write.saveAsTable()
So, first question: would unpersist() not work here, since there hasn't been an action yet on df?
Second question: how does df.cache() work here when I'm reusing the df variable in the for loop? I know it's immutable, so it would make a copy, but would the unpersist() actually clear that memory?

Caching is used in Spark when you want to reuse a dataframe again and again,
for example: mapping tables.
Once you cache the df, you need an action operation to physically move the data to memory, as Spark is based on lazy execution.
In your case
df.cache()
will not work as expected, as you are not performing an action after it.
For the cache to work you need to run df.count(), df.show(), or any other action so that the data is actually moved to memory; otherwise your data won't be moved to memory and you will not get any advantage, and so the df.unpersist() is also redundant.
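For illustration, here is a minimal sketch of the loop from the question with an action added right after the cache, following the suggestion above (the join key 'id' is a placeholder, since the original snippet does not show one):
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df.count()  # action: runs the query and actually populates the cache
    df1 = df.filter('a')
    df2 = df.filter('b')
    final_output_1 = final_output_1.join(df1, on='id')  # 'id' is a placeholder join key
    final_output_2 = final_output_2.join(df2, on='id')
    df.unpersist()  # drops this iteration's cached blocks before the next query is cached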
First Question:
No, your df.cache() and df.unpersist() will not work, as no data was cached to begin with, so there is nothing to unpersist.
Second Question:
Yes, you can use the same variable name, and if an action is performed the data will get cached; after your operations, df.unpersist() will unpersist the data in each loop.
So the previous DF has no connection to the DF in the next loop iteration.
As you said, they are immutable, and since you are assigning a new query to the same variable in each loop, it acts as a new DF (not related to the previous one).
Based on your code, I don't think you need caching at all, as you are only performing one operation on each df.
Refer to When to cache a DataFrame? and If I cache a Spark Dataframe and then overwrite the reference, will the original data frame still be cached?

Related

Need to release the memory used by unused spark dataframes

I am not caching or persisting the Spark dataframe. If I have to do many additional things in the same session, aggregating and modifying the content of the dataframe as part of the process, then when and how would the initial dataframe be released from memory?
Example:
I load a dataframe DF1 with 10 million records. Then I do some transformation on the dataframe, which creates a new dataframe DF2. Then there are a series of 10 steps I do on DF2. All through this, I do not need DF1 anymore. How can I be sure that DF1 no longer exists in memory and is not hampering performance? Is there any approach by which I can directly remove DF1 from memory? Or does DF1 get automatically removed based on a Least Recently Used (LRU) approach?
That's not how Spark works. Dataframes are lazy: the only things stored in memory are the structures and the list of transformations you have applied to your dataframes. The data is not stored in memory (unless you cache a dataframe and apply an action).
Therefore, I do not see any problem in your scenario.
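To make that concrete, here is a minimal sketch (the input path and column name are made up for illustration): only a dataframe you explicitly cache occupies executor memory, and you can release it yourself rather than waiting for LRU eviction.
DF1 = spark.read.parquet("/data/source")   # hypothetical input path
DF2 = DF1.groupBy("key").count()           # transformation only: nothing is computed yet

# At this point neither DF1 nor DF2 holds any data in memory; both are just query plans.
# Memory is consumed only if you opt in:
DF1.cache()
DF1.count()       # action materializes the cache

# ... the further steps on DF2 ...

DF1.unpersist()   # explicitly frees DF1's cached blocks instead of relying on LRU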
Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer to this question. Note there are MANY possible correct answers for how to optimize RAM usage, and which one applies will depend on the work being done!
First, write the dataframe to DBFS, something like this:
spark.createDataFrame(data=[('A', 0)], schema=['LETTERS', 'NUMBERS']) \
    .repartition("LETTERS") \
    .write.partitionBy("LETTERS") \
    .parquet(f"/{tmpdir}", mode="overwrite")
Now,
df = spark.read.parquet(f"/{tmpdir}")
Assuming you don't set up any caching on the above df, then each time Spark finds a reference to df it will read the parquet files in parallel and compute whatever is specified.
Note that the above solution will minimize RAM usage, but may require more CPU on every read. Also, it has the cost of writing the data out to parquet in the first place.
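A minimal usage sketch of that pattern (reusing the tmpdir and the LETTERS column from the snippet above): every action triggers a fresh parallel read of the parquet files, so no executor RAM stays tied up in between.
df = spark.read.parquet(f"/{tmpdir}")

# Each action below re-reads the parquet files; nothing is cached between them.
total = df.count()
only_a = df.filter(df.LETTERS == 'A').collect()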

Does spark automatically un-cache and delete unused dataframes?

I have the following strategy to change a dataframe df.
df = T1(df)
df.cache()
df = T2(df)
df.cache()
.
.
.
df = Tn(df)
df.cache()
Here T1, T2, ..., Tn are n transformations that return Spark dataframes. Repeated caching is used because df has to pass through a lot of transformations and is used multiple times in between; without caching, the lazy evaluation of the transformations might make using df in between very slow. What I am worried about is that the n dataframes that are cached one by one will gradually consume the RAM. I read that Spark automatically un-caches "least recently used" items. Based on this I have the following queries -
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
Based on the answer to the above two queries, is the above strategy correct?
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Results are cached on Spark executors. A single executor runs multiple tasks and could have multiple caches in its memory at a given point in time. The caches within a single executor are ranked based on when they were last asked for: a cache just used in some computation always has rank 1, and the others are pushed down. Eventually, when the available space is full, the cache with the last rank is dropped to make space for a new cache.
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
A dataframe is an execution expression, and unless an action is called, no computation is materialised. Moreover, everything is cleared once the executor is done with the computation for that task. Only when a dataframe is cached (before calling an action) are the results kept aside in executor memory for further use. And these cached results are cleared based on LRU.
Based on the answer to the above two queries, is the above strategy correct?
In your example the transformations are done in sequence and the reference to the previous dataframe is not used further (so it's unclear why you are caching at all). If multiple executions are handled by the same executor, it is possible that some cached results are dropped, and when they are asked for again they will be re-computed.
N.B. - Nothing is executed unless a Spark action is called. Transformations are chained and optimised by the Spark engine when an action is called.
In my experience working with Spark, and from the communication I have had with Cloudera, we should unpersist/uncache the data; if we do not, the job will start to slow down, and the problem becomes more severe in the case of a streaming job.
I have nothing to support my answer, but
read here and here for details.
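Along those lines, here is a minimal sketch of the T1 ... Tn chain from the question that keeps at most one cached dataframe alive at a time by unpersisting the previous one once the next is materialized (note that on Spark versions before 2.4, unpersisting a parent can also drop dependent caches, as discussed further down):
prev = None
for T in [T1, T2, T3]:        # ... up to Tn, as in the question
    df = T(df)
    df.cache()
    df.count()                # action: materializes this stage's cache
    if prev is not None:
        prev.unpersist()      # free the previous stage's cached blocks
    prev = df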

Cache() in Pyspark Dataframe

I have a dataframe and I need to apply several transformations to it. I thought of performing all of them on the same dataframe. So if I need to use cache, should I cache the dataframe after every operation performed on it?
df = df.selectExpr("*", "explode(area)").select("*", "col.*").drop(*['col', 'area'])
df.cache()
df = df.withColumn('full_name', f.concat(f.col('first_name'), f.lit(' '), f.col('last_name'))).drop('first_name', 'last_name')
df.cache()
df = df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop('date', 'cleaned_map')
df.cache()
df = df.filter(df.date_type.isNotNull())
df.show()
Should I add it like this, or is caching once enough?
Also, I want to know: if I use multiple dataframes instead of one for the above code, should I include cache at every transformation? Thanks a lot!
The answer is simple: when you do df = df.cache() or df.cache(), both refer to the same underlying RDD at the granular level. Once you perform any further operation, it creates a new RDD, which evidently will not be cached, so it's up to you which DF/RDD you want to cache(). Also, try to avoid unnecessary caching, as the data will be persisted in memory.
Below is the source code for cache() from the Spark documentation:
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY_SER)
    return self
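Following that advice, a minimal sketch that builds the whole chain from the question and caches only once, on the final dataframe that is actually reused by more than one action (it assumes the same input df and columns as the question, and that f is pyspark.sql.functions):
from pyspark.sql import functions as f

df = (df.selectExpr("*", "explode(area)")
        .select("*", "col.*")
        .drop('col', 'area')
        .withColumn('full_name', f.concat(f.col('first_name'), f.lit(' '), f.col('last_name')))
        .drop('first_name', 'last_name')
        .withColumn('cleaned_map', f.regexp_replace('date', '[^0-9T]', ''))
        .withColumn('date_type', f.to_date('cleaned_map', 'ddMMyyyy'))
        .drop('date', 'cleaned_map')
        .filter(f.col('date_type').isNotNull()))

df.cache()   # cache once: only this final DF is reused

df.show()    # first action materializes the cache
df.count()   # further actions read from the cache instead of recomputing the chain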

Why does Dataset.unpersist cascade to all dependent cached Datasets?

I am using Spark 2.3.2. For my use case, I'm caching a first dataframe and then a second dataframe.
Trying to replicate the same:
scala> val df = spark.range(1, 1000000).withColumn("rand", (rand * 100).cast("int")).cache
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, rand: int]
scala> df.count
res0: Long = 999999
scala> val aggDf = df.groupBy("rand").agg(count("id") as "count").cache
aggDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [rand: int, count: bigint]
scala> aggDf.count
res1: Long = 100
As you can see in the image below (not reproduced here), there are two RDDs for each dataframe.
Now, when I unpersist my first dataframe, Spark unpersists both:
df.unpersist()
Trying to understand this weird behaviour: why is Spark unpersisting both dataframes instead of just the first?
Am I missing something?
Quoting SPARK-21478 Unpersist a DF also unpersists related DFs:
This is by design. We do not want to use the invalid cached data.
The current cache design need to ensure the query correctness. If you want to keep the cached data, even if the data is stale. You need to materialize it by saving it as a table.
That however has been changed in 2.4.0 in SPARK-24596 Non-cascading Cache Invalidation:
When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.
However, in other cases, like when user simply want to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation.
Since you're using 2.3.2, you have to follow the recommendation and save to a table, or upgrade to 2.4.0.
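For example, a rough PySpark equivalent of that workaround on 2.3.x (the same steps as the Scala shell session above, but with the aggregate materialized as a table before the parent cache is dropped; the table name is a placeholder):
from pyspark.sql import functions as F

df = spark.range(1, 1000000).withColumn("rand", (F.rand() * 100).cast("int")).cache()
df.count()                                   # materialize the first cache

agg_df = df.groupBy("rand").agg(F.count("id").alias("count"))
agg_df.write.saveAsTable("agg_results")      # placeholder table name: persists the aggregate

df.unpersist()                               # on 2.3.x this cascades to dependent caches,
                                             # but the saved table is unaffected
agg_df = spark.table("agg_results")          # read the aggregate back from the table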

How far will Spark RDD cache go?

Say I have three RDD transformation functions called on rdd1:
def rdd2 = rdd1.f1
def rdd3 = rdd2.f2
def rdd4 = rdd3.f3
Now I want to cache rdd4, so I call rdd4.cache().
My question:
Will only the result from the action on rdd4 be cached or will every RDD above rdd4 be cached? Say I want to cache both rdd3 and rdd4, do I need to cache them separately?
The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it only keeps the results of that one in memory. So, yes, you do need to cache them separately, but keep in mind you only need to cache an RDD if you are going to use it more than once, for example:
rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")
If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices made regarding how RDDs work.
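Concretely, if both rdd3 and rdd4 are going to be reused, each needs its own cache() call. A small PySpark sketch with made-up pair-RDD data (f1/f2/f3 are not defined in the question, so simple mapValues steps stand in for them):
rdd1 = sc.parallelize([("key1", 1), ("key2", 2), ("key3", 3)])

rdd3 = rdd1.mapValues(lambda v: v * 10)   # stands in for f1 and f2
rdd4 = rdd3.mapValues(lambda v: v + 1)    # stands in for f3

rdd3.cache()   # cached independently: reused below
rdd4.cache()   # caching rdd4 alone would not keep rdd3's results in memory

v1 = rdd4.lookup("key1")   # evaluates rdd4 and populates both caches along the way
v2 = rdd3.lookup("key2")   # served from rdd3's own cache, not recomputed from rdd1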
