Why does Dataset.unpersist cascade to all dependent cached Datasets? - apache-spark

I am using Spark 2.3.2. For my use case, I cache a first dataframe and then a second dataframe derived from it.
The snippet below reproduces the behaviour:
scala> val df = spark.range(1, 1000000).withColumn("rand", (rand * 100).cast("int")).cache
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, rand: int]
scala> df.count
res0: Long = 999999
scala> val aggDf = df.groupBy("rand").agg(count("id") as "count").cache
aggDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [rand: int, count: bigint]
scala> aggDf.count
res1: Long = 100
As you can see in the Spark UI's Storage tab, there are two cached RDDs, one for each dataframe.
Now, when I unpersist the first dataframe, Spark unpersists both:
df.unpersist()
I am trying to understand this behaviour: why does Spark unpersist both dataframes instead of only the first?
Am I missing something?

Quoting SPARK-21478 Unpersist a DF also unpersists related DFs:
This is by design. We do not want to use the invalid cached data.
The current cache design needs to ensure query correctness. If you want to keep the cached data even if the data is stale, you need to materialize it by saving it as a table.
That, however, has changed in 2.4.0 with SPARK-24596 Non-cascading Cache Invalidation:
When invalidating a cache, we invalidate other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.
However, in other cases, like when a user simply wants to drop a cache to free up memory, we do not need to invalidate dependent caches, since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: non-cascading cache invalidation.
Since you're using 2.3.2 you have to follow the recommendation to save to a table or upgrade to 2.4.0.
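For example, on 2.3.2 a minimal sketch of the save-to-a-table workaround could look like this (shown in PySpark-style syntax; the Scala calls are the same, and agg_counts is just an illustrative table name):
# Materialize the aggregation so it no longer depends on df's cache
aggDf.write.mode("overwrite").saveAsTable("agg_counts")
aggFromTable = spark.table("agg_counts")   # read back the materialized copy
df.unpersist()                             # dropping df's cache no longer drops the aggregated data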

Related

Reusing pyspark cache and unpersist in for loop

I have a lot of data that I'm trying to take out in chunks - let's say 3 chunks - and not have it all cached in memory at once. However, I'd like to save it (action) all at the same time afterwards.
This is the current simplified strategy:
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df1 = df.filter('a')
    df2 = df.filter('b')
    final_output_1 = final_output_1.join(df1)
    final_output_2 = final_output_2.join(df2)
    df.unpersist()
final_output_1.write.saveAsTable()
final_output_2.write.saveAsTable()
So, first question: would unpersist() not work here, since there hasn't been an action yet on df?
Second question: how does df.cache() work here when I'm reusing the df variable in the for loop? I know it's immutable, so it would make a copy, but would the unpersist() actually clear that memory?
Caching is used in Spark when you want to reuse a dataframe again and again, for example with mapping tables.
Once you cache the df, you need an action to physically move the data into memory, because Spark is based on lazy execution.
In your case
df.cache()
will not work as expected, because you are not performing an action after it.
For the cache to take effect you need to run df.count(), df.show(), or some other action so the data is actually moved to memory; otherwise the data is never cached, you get no advantage, and the df.unpersist() is likewise redundant.
First question:
No, your df.cache() and df.unpersist() will not do anything here, because no data was cached to begin with, so there is nothing to unpersist.
Second question:
Yes, you can use the same variable name; if an action is performed, the data gets cached, and after your operations df.unpersist() will unpersist the data in each loop iteration.
So the previous DF has no connection to the DF in the next iteration. As you said, dataframes are immutable, and since you assign a new query to the same variable in each iteration, it acts as a new DF (unrelated to the previous one).
Based on your code, I don't think you need caching at all, since you only perform one pass over each df; see the sketch below for a rearrangement that does make the cache pay off if you want it.
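If you do want the cached chunks to be reused by the final writes, one hedged rearrangement is to materialize each cache with an action and only unpersist after the writes. Note it keeps all three chunks cached at once, trading memory for avoided recomputation; the table names are placeholders:
cached_chunks = []
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df.count()                                   # action: materializes the cache now
    final_output_1 = final_output_1.join(df.filter('a'))
    final_output_2 = final_output_2.join(df.filter('b'))
    cached_chunks.append(df)                     # keep a handle to unpersist later

final_output_1.write.saveAsTable("output_1")     # placeholder table names
final_output_2.write.saveAsTable("output_2")

for df in cached_chunks:
    df.unpersist()                               # drop the caches only after the writes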
Refer to When to cache a DataFrame? and If I cache a Spark Dataframe and then overwrite the reference, will the original data frame still be cached?

Cache() in Pyspark Dataframe

I have a dataframe and I need to apply several transformations to it. I thought of performing them all on the same dataframe. If I need to use cache, should I cache the dataframe after every transformation performed on it, as below?
from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, to_date

df = df.selectExpr("*", "explode(area)").select("*", "col.*").drop(*['col', 'area'])
df.cache()
df = df.withColumn('full_name', f.concat(f.col('first_name'), f.lit(' '), f.col('last_name'))).drop('first_name', 'last_name')
df.cache()
df = df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop('date', 'cleaned_map')
df.cache()
df = df.filter(df.date_type.isNotNull())
df.show()
Should I add it like this, or is caching once enough?
I also want to know: if I use multiple dataframes instead of one for the above code, should I include cache at every transformation? Thanks a lot!
The answer is simple: whether you do df = df.cache() or df.cache(), the cache is tied to one dataframe (and its underlying RDD). Once you apply another transformation, that produces a new dataframe, which evidently will not be cached, so it is up to you which DF/RDD you want to cache(). Also, try to avoid unnecessary caching, since the cached data occupies memory.
Below is the source code for cache() from the PySpark source:
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY_SER)
    return self
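For the code in the question, a rough sketch of the cache-once option is to chain the transformations and cache only the final dataframe, right before it is actually reused. This assumes you will run more actions on df after the show(); if df.show() is the only action, even this single cache buys you nothing:
from pyspark.sql import functions as f

df = (df.selectExpr("*", "explode(area)")
        .select("*", "col.*")
        .drop('col', 'area')
        .withColumn('full_name', f.concat(f.col('first_name'), f.lit(' '), f.col('last_name')))
        .drop('first_name', 'last_name')
        .withColumn("cleaned_map", f.regexp_replace("date", "[^0-9T]", ""))
        .withColumn("date_type", f.to_date("cleaned_map", "ddMMyyyy"))
        .drop('date', 'cleaned_map')
        .filter(f.col('date_type').isNotNull()))

df.cache()   # cache once, right before the dataframe is actually reused
df.show()    # the first action materializes the cache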

Drop spark dataframe from cache

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution:
df1.cache()
df2.cache()
Once a certain dataframe is no longer needed, how can I drop it from memory (i.e. un-cache it)?
For example, df1 is used throughout the code, while df2 is only needed for a few transformations and never used after that. I want to forcefully drop df2 to release more memory.
Just do the following:
df1.unpersist()
df2.unpersist()
Spark automatically monitors cache usage on each node and drops out
old data partitions in a least-recently-used (LRU) fashion. If you
would like to manually remove an RDD instead of waiting for it to fall
out of the cache, use the RDD.unpersist() method.
If the dataframe is registered as a table for SQL operations, like
df.createGlobalTempView(tableName)  # or some other way, depending on the Spark version
then the cache can be dropped with the following commands; of course Spark also does this automatically.
Spark >= 2.x
Here spark is a SparkSession object.
Drop a specific table/df from cache
spark.catalog.uncacheTable(tableName)
Drop all tables/dfs from cache
spark.catalog.clearCache()
Spark <= 1.6.x
Drop a specific table/df from cache
sqlContext.uncacheTable(tableName)
Drop all tables/dfs from cache
sqlContext.clearCache()
If you need to block during removal => df2.unpersist(True) (blocking unpersist in the Python API)
Non-blocking removal => df2.unpersist()
Here is a simple utility context manager that takes care of that for you:
import contextlib

@contextlib.contextmanager
def cached(df):
    df_cached = df.cache()
    try:
        yield df_cached
    finally:
        df_cached.unpersist()
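A small usage sketch (query and the table name are placeholders; every action inside the with block hits the cache, and leaving the block unpersists the dataframe even if an exception is raised):
with cached(spark.sql(query)) as df:
    print(df.count())                            # action runs against the cached data
    df.filter('a').write.saveAsTable("chunk_a")  # placeholder table name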

Spark SQL createDataFrame() raising OutOfMemory exception

Does it create the whole DataFrame in memory?
How do I create a large DataFrame (more than 1 million rows) and persist it for later queries?
To persist it for later queries:
val sc: SparkContext = ...
val hc = new HiveContext(sc)
val df: DataFrame = myCreateDataFrameCode()
  .coalesce(8)
  .persist(StorageLevel.MEMORY_ONLY_SER)
df.show()
This will coalesce the DataFrame to 8 partitions before persisting it with serialization. Not sure I can say what number of partitions is best, perhaps even "1". Check StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which will persist to both memory and disk.
In answer to the first question, yes I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting 'OutOfMemory', that's probably the key roadblock. You don't say how you're creating it. Perhaps there's some workaround, like creating and persisting it in smaller pieces, persisting to memory_and_disk with serialization, and then combining the pieces.
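If memory remains the bottleneck, here is a hedged PySpark-style sketch of the memory-and-disk suggestion (my_create_dataframe_code is a placeholder for however the DataFrame is actually built):
from pyspark import StorageLevel

df = my_create_dataframe_code()   # placeholder: build the large DataFrame here
df = df.coalesce(8).persist(StorageLevel.MEMORY_AND_DISK)   # spill what doesn't fit in memory to disk
df.count()                        # materializes the persisted data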

repartition() is not affecting RDD partition size

I am trying to change the number of partitions of an RDD using the repartition() method. The method call succeeds, but when I explicitly check via the rdd.partitions.size property, I get back the same number of partitions it originally had:
scala> rdd.partitions.size
res56: Int = 50
scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27
At this stage I perform some action like rdd.take(1), just to force evaluation in case that matters. Then I check the partition count again:
scala> rdd.partitions.size
res58: Int = 50
As one can see, it's not changing. Can someone answer why?
First, it does matter that you run an action as repartition is indeed lazy. Second, repartition returns a new RDD with the partitioning changed, so you must use the returned RDD or else you are still working off of the old partitioning. Finally, when shrinking your partitions, you should use coalesce, as that will not reshuffle the data. It will instead keep data on the number of nodes and pull in the remaining orphans.
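A minimal sketch of that fix (shown in PySpark; the Scala API behaves the same way):
repartitioned = rdd.repartition(10)       # repartition returns a NEW RDD
repartitioned.count()                     # any action forces evaluation
print(repartitioned.getNumPartitions())   # 10
print(rdd.getNumPartitions())             # still 50 -- the original RDD is unchanged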
