Cache() in Pyspark Dataframe - apache-spark

I have a dataframe and I need to apply several transformations to it. I planned to perform all of them on the same dataframe. If I need to use cache, should I cache the dataframe after every transformation?
from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, to_date

df = df.selectExpr("*", "explode(area)").select("*", "col.*").drop("col", "area")
df.cache()
df = df.withColumn("full_name", f.concat(f.col("first_name"), f.lit(" "), f.col("last_name"))).drop("first_name", "last_name")
df.cache()
df = df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop("date", "cleaned_map")
df.cache()
df = df.filter(df.date_type.isNotNull())
df.show()
Should I add cache() like this, or is caching once enough?
I would also like to know: if I use multiple dataframes instead of one for the above code, should I call cache() at every transformation? Thanks a lot!

The answer is simple: whether you write df = df.cache() or df.cache(), both refer to the same underlying RDD. Once you perform any further operation, it creates a new RDD, which evidently will not be cached. So it is up to you which DF/RDD you want to cache(). Also, try to avoid unnecessary caching, as the data will be persisted in memory.
Below is the source code for cache() from the Spark documentation:
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY_SER)
    return self
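To make the point above concrete, here is a minimal sketch of caching only the final DataFrame of a chain. The helper name apply_then_cache and the idea of passing each transformation as a function are assumptions for illustration and do not come from the question. It relies only on cache() returning the DataFrame it was called on.

```python
from functools import reduce


def apply_then_cache(df, steps):
    """Apply each transformation in order, then cache only the final result.

    `steps` is a list of functions that each take a DataFrame and return a
    new one (hypothetical helper, not part of the Spark API).
    """
    final_df = reduce(lambda acc, step: step(acc), steps, df)
    # cache() only marks the DataFrame; the data is stored when the first
    # action (show, count, write, ...) runs on it.
    return final_df.cache()


# Hypothetical usage, assuming explode_area, add_full_name and parse_date
# wrap the three transformations from the question:
# result = apply_then_cache(df, [explode_area, add_full_name, parse_date])
# result.show()    # first action fills the cache
# result.count()   # later actions can reuse the cached data
```

The intermediate DataFrames are never reused, so caching them would only occupy memory; caching pays off only when the same DataFrame feeds more than one action.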

Related

Spark caching - when to cache after foreachbatch (Spark Streaming)

I'm currently reading from a Kafka topic using Spark Streaming. Then, in foreachBatch(df), I do some transformations. I first filter the batch df by an id (df_filtered; I can do this filter n times), then create a dataframe from that filtered df (new_df_filtered, because the data arrives as a JSON message and I want to convert it to a normal column structure by providing the schema), and finally write it to 2 sinks.
Here's a sample of the code:
def sink_process(self, df: DataFrame, current_ids: list):
    df.repartition(int(os.environ.get("SPARK_REPARTITION_NUMBER")))
    df.cache()
    for id in current_ids:
        df_filtered = self.df_filter_by_id(df, id)  # this returns the new dataframe with the schema. Uses a .where and then a .createDataFrame
        first_row = df_filtered.take(1)  # making sure that this filter action returns any data
        if first_row:
            df_filtered.cache()
            self.sink_process(df_filtered, id)
            df_filtered.unpersist()
    df.unpersist()
My question is where I should cache this data for optimal performance. Right now I cache the batch before applying any transformations, which I have come to realise is not really doing anything at that point, as it is only cached when the first action occurs. So following this logic, I'm only really caching this df when I reach that .take, right? But at that point, I'm also caching the filtered df. The idea behind caching the batch data before the filter was that if I had a lot of different ids, I wouldn't be fetching the data every time I did the filter, but I might have gotten this all wrong.
Can anyone please help clarify what the best approach would be? Maybe only caching df_filtered, which is going to be used for the different sinks?
Thanks

Need to release the memory used by unused spark dataframes

I am not caching or persisting the spark dataframe. If I have to do many additional things in the same session by aggregating and modifying content of the dataframe as part of the process then when and how would the initial dataframe be released from memory?
Example:
I load a dataframe DF1 with 10 million records. Then I do some transformation on the dataframe which creates a new dataframe DF2. Then there is a series of 10 steps I do on DF2. Throughout this, I do not need DF1 anymore. How can I be sure that DF1 no longer exists in memory and is not hampering performance? Is there any approach by which I can directly remove DF1 from memory? Or does DF1 get removed automatically based on a Least Recently Used (LRU) approach?
That's not how Spark works. Dataframes are lazy: the only things stored in memory are the structures and the list of transformations you have applied to your dataframes. The data is not stored in memory (unless you cache it and apply an action).
Therefore, I do not see any problem in your question.
Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer to this question. Note that there are MANY possible correct answers on how to optimize RAM usage, depending on the work being done!
First, write the dataframe to DBFS, something like this:
spark.createDataFrame(data=[('A',0)],schema=['LETTERS','NUMBERS'])\
    .repartition("LETTERS")\
    .write.partitionBy("LETTERS")\
    .parquet(f"/{tmpdir}",mode="overwrite")
Now,
df = spark.read.parquet(f"/{tmpdir}")
Assuming you don't set up any caching on the above df, each time Spark finds a reference to df it will read the dataframe in parallel and compute whatever is specified.
Note the above solution will minimize RAM usage, but may require more CPU on every read. Also, the above solution will have a cost of writing to parquet.
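The write-then-read pattern above can be wrapped in a small helper. This is only a sketch: the function name materialize_to_parquet is hypothetical, and it assumes that path is a scratch location that df does not itself read from.

```python
def materialize_to_parquet(spark, df, path):
    """Write df to Parquet at `path` and return a DataFrame that reads it back.

    Hypothetical helper: `spark` is a SparkSession, `df` a DataFrame, and
    `path` a scratch location that `df` does not itself read from.
    """
    df.write.mode("overwrite").parquet(path)
    # The returned DataFrame starts its lineage at the Parquet files,
    # so the transformations that produced df are not re-run later.
    return spark.read.parquet(path)


# Hypothetical usage:
# df2 = materialize_to_parquet(spark, df2, "/tmp/df2_snapshot")
```

The trade-off is the same as described above: less memory pressure in exchange for the cost of one write and a read on every later action.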

Reusing pyspark cache and unpersist in for loop

I have a lot of data that I'm trying to take out in chunks - let's say 3 chunks - and not have it all cached in memory at once. However, I'd like to save it (action) all at the same time afterwards.
This is the current simplified strategy:
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df1 = df.filter('a')
    df2 = df.filter('b')
    final_output_1 = final_output_1.join(df1)
    final_output_2 = final_output_2.join(df2)
    df.unpersist()
final_output_1.write.saveAsTable()
final_output_2.write.saveAsTable()
So, first question: would unpersist() not work here, since there hasn't been an action on df yet?
Second question: how does df.cache() work here when I'm reusing the df variable in the for loop? I know it's immutable, so it would make a copy, but would unpersist() actually clear that memory?
Caching is used in Spark when you want to reuse a dataframe again and again,
for example: mapping tables.
Once you cache the df, you need an action to physically move the data into memory, because Spark is based on lazy execution.
In your case
df.cache()
will not work as expected, as you are not performing an action after it.
For the cache to work you need to run df.count(), df.show() or any other action so that the data is moved into memory; otherwise the data won't be cached and you will get no advantage, and so the df.unpersist() is also redundant.
First question:
No, your df.cache() and df.unpersist() will not have any effect: no action runs between them, so no data was cached to begin with and there is nothing to unpersist.
Second question:
Yes, you can reuse the same variable name. If an action is performed, the data will get cached, and after your operations df.unpersist() will unpersist the data in each iteration.
So the DF from one iteration has no connection to the DF in the next iteration.
As you said, they are immutable, and since you assign a new query to the same variable in each iteration, it acts as a new DF (not related to the previous DF).
Based on your code, I don't think you need caching, as you are only performing one operation.
Refer to "When to cache a DataFrame?" and "If I cache a Spark Dataframe and then overwrite the reference, will the original data frame still be cached?"
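One possible restructuring, assuming it is acceptable to write each chunk's results inside the loop rather than all at once at the end: run the actions while the chunk is still cached, then release it. The helper name process_chunk and the table names in the usage comment are hypothetical.

```python
def process_chunk(df, consumers):
    """Cache df, run every consumer on it, then release the cache.

    Each consumer is a function that takes the DataFrame and should end in
    an action (for example a write), so the cached data is actually used.
    """
    df.cache()
    try:
        return [consume(df) for consume in consumers]
    finally:
        # Runs even if a consumer raises, so the cache is not left behind.
        df.unpersist()


# Hypothetical usage; "out_1" and "out_2" are made-up table names and the
# filter expressions are the placeholders from the question:
# for query in [query1, query2, query3]:
#     process_chunk(spark.sql(query), [
#         lambda d: d.filter("a").write.mode("append").saveAsTable("out_1"),
#         lambda d: d.filter("b").write.mode("append").saveAsTable("out_2"),
#     ])
```

Here the first write fills the cache and the second write reads from it, which is the reuse that makes caching worthwhile.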

Drop spark dataframe from cache

I am using Spark 1.3.0 with python api. While transforming huge dataframes, I cache many DFs for faster execution;
df1.cache()
df2.cache()
Once use of certain dataframe is over and is no longer needed how can I drop DF from memory (or un-cache it??)?
For example, df1 is used through out the code while df2 is utilized for few transformations and after that, it is never needed. I want to forcefully drop df2 to release more memory space.
just do the following:
df1.unpersist()
df2.unpersist()
Spark automatically monitors cache usage on each node and drops out
old data partitions in a least-recently-used (LRU) fashion. If you
would like to manually remove an RDD instead of waiting for it to fall
out of the cache, use the RDD.unpersist() method.
If the dataframe is registered as a table for SQL operations, like
df.createGlobalTempView(tableName)  # or some other way, depending on the Spark version
then the cache can be dropped with the following commands. Of course, Spark also does this automatically.
Spark >= 2.x
Here spark is an object of SparkSession
Drop a specific table/df from cache
spark.catalog.uncacheTable(tableName)
Drop all tables/dfs from cache
spark.catalog.clearCache()
Spark <= 1.6.x
Drop a specific table/df from cache
sqlContext.uncacheTable(tableName)
Drop all tables/dfs from cache
sqlContext.clearCache()
If you need to block during removal => df2.unpersist(blocking=True)
Non-blocking removal => df2.unpersist() (non-blocking by default in Spark 2.x and later)
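Here is a simple utility context manager that takes care of that for you:
import contextlib

@contextlib.contextmanager
def cached(df):
    # Mark df for caching and hand it to the with-block
    df_cached = df.cache()
    try:
        yield df_cached
    finally:
        # Always release the cache, even if the block raises
        df_cached.unpersist()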

How far will Spark RDD cache go?

Say I have three RDD transformation function called on rdd1:
def rdd2 = rdd1.f1
def rdd3 = rdd2.f2
def rdd4 = rdd3.f3
Now I want to cache rdd4, so I call rdd4.cache().
My question:
Will only the result from the action on rdd4 be cached or will every RDD above rdd4 be cached? Say I want to cache both rdd3 and rdd4, do I need to cache them separately?
The whole idea of cache is that spark is not keeping the results in memory unless you tell it to. So if you cache the last RDD in the chain it only keeps the results of that one in memory. So, yes, you do need to cache them separately, but keep in mind you only need to cache an RDD if you are going to use it more than once, for example:
rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")
If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices made regarding how RDDs work.
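The behaviour described above can be illustrated with a toy model in plain Python. This is not Spark code: LazyNode is a made-up class that only mimics lazy lineage and a per-node cache flag, to show that caching the last node does not cache its ancestors.

```python
class LazyNode:
    """Toy stand-in for an RDD: recomputes from its parent unless cached."""

    def __init__(self, compute, parent=None):
        self._compute = compute
        self._parent = parent
        self._cache_requested = False
        self._stored = None
        self.evaluations = 0  # how many times this node was actually computed

    def map(self, fn):
        # Transformations are lazy: they only build a new node.
        return LazyNode(lambda data: [fn(x) for x in data], parent=self)

    def cache(self):
        self._cache_requested = True  # nothing is stored until an action runs
        return self

    def collect(self):
        # "Action": walk up the lineage unless this node has a stored result.
        if self._stored is not None:
            return self._stored
        source = self._parent.collect() if self._parent else None
        self.evaluations += 1
        result = self._compute(source)
        if self._cache_requested:
            self._stored = result
        return result


rdd1 = LazyNode(lambda _: [1, 2, 3])
rdd2 = rdd1.map(lambda x: x + 1)
rdd3 = rdd2.map(lambda x: x * 2)
rdd4 = rdd3.map(lambda x: x - 1).cache()

rdd4.collect()   # computes rdd1..rdd4 and stores only rdd4's result
rdd4.collect()   # served from rdd4's stored result
rdd3.collect()   # rdd3 was not cached, so rdd1..rdd3 are computed again
```

In this model, rdd4.evaluations stays at 1 while rdd3.evaluations reaches 2, which mirrors why rdd3 would need its own cache() call if it is reused.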

Resources