How far will Spark RDD cache go? - apache-spark

Say I have three RDD transformation functions called on rdd1:
def rdd2 = rdd1.f1
def rdd3 = rdd2.f2
def rdd4 = rdd3.f3
Now I want to cache rdd4, so I call rdd4.cache().
My question:
Will only the result from the action on rdd4 be cached or will every RDD above rdd4 be cached? Say I want to cache both rdd3 and rdd4, do I need to cache them separately?

The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it only keeps the results of that one in memory. So yes, you do need to cache them separately, but keep in mind that you only need to cache an RDD if you are going to use it more than once, for example:
rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")
If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices they made regarding how RDDs work.
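For example, here is a minimal sketch of caching two RDDs in the chain separately (assuming a spark-shell session where sc is available; the transformations are placeholders, since f1/f2/f3 are not specified in the question):
import org.apache.spark.rdd.RDD

val rdd1: RDD[(String, Int)] = sc.parallelize(Seq(("key1", 1), ("key2", 2), ("key1", 3)))
val rdd3 = rdd1.mapValues(_ * 10)        // stands in for f1 and f2
val rdd4 = rdd3.reduceByKey(_ + _)       // stands in for f3

rdd3.cache()   // caching rdd4 alone would not keep rdd3 in memory
rdd4.cache()

val v1 = rdd4.lookup("key1")   // first action materialises both caches
val v2 = rdd4.lookup("key2")   // served from the cached rdd4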

Related

Reusing pyspark cache and unpersist in for loop

I have a lot of data that I'm trying to take out in chunks - let's say 3 chunks - and not have it all cached in memory at once. However, I'd like to save it (action) all at the same time afterwards.
This is the current simplified strategy:
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df1 = df.filter('a')
    df2 = df.filter('b')
    final_output_1 = final_output_1.join(df1)
    final_output_2 = final_output_2.join(df2)
    df.unpersist()

final_output_1.write.saveAsTable()
final_output_2.write.saveAsTable()
So, first question: would unpersist() not work here, since there hasn't been an action yet on df?
Second question: how does df.cache() work here when I'm reusing the df variable in the for loop? I know it's immutable, so it would make a copy, but would the unpersist() actually clear that memory?
Caching is used in Spark when you want to reuse a dataframe again and again,
for example: mapping tables.
Once you cache the df, you need an action operation to physically move the data to memory, as Spark is based on lazy execution.
In your case
df.cache()
will not work as expected, because you are not performing an action after it.
For the cache to work you need to run df.count(), df.show(), or any other action for the data to be moved to memory; otherwise your data won't be moved to memory and you will not get any advantage. And so the df.unpersist() is also redundant.
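A minimal sketch of that point (written in Scala here, to match the other answers on this page; the PySpark method names are the same, and spark.range stands in for spark.sql(query)):
val df = spark.range(0, 1000).toDF("id")   // stand-in for spark.sql(query)
df.cache()       // only marks df for caching; nothing is stored yet
df.count()       // first action: the data is actually moved to memory now
df.count()       // this second action is served from the cache
df.unpersist()   // there is now cached data to release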
First Question:
No, your df.cache() and df.unpersist() will not work, as no data was cached to begin with, so there is nothing to unpersist.
Second Question:
Yes, you can use the same variable name; if an action is performed the data will get cached, and after your operations df.unpersist() will unpersist the data in each loop.
So the previous DF has no connection to the DF in the next loop iteration.
As you said, they are immutable, and since you are assigning a new query to the same variable in each loop, it acts as a new DF (not related to the previous one).
Based on your code, I don't think you need caching, as you are only performing one operation.
Refer to When to cache a DataFrame? and If I cache a Spark Dataframe and then overwrite the reference, will the original data frame still be cached?

Does spark cache rdds automatically after shuffle?

I tried the following code, and rdd2 is computed only once. Does Spark cache all shuffled RDDs automatically?
I noticed that the DataFrame shuffle result is not auto-cached.
val rdd2: RDD[(String, Int)] = spark.sparkContext.parallelize(Array("jan"))
  .map(x => {
    println("---")
    (x, 1)
  }).reduceByKey(_ + _)
rdd2.count()
rdd2.count()
Some data (like intermediate shuffle data) is persisted automatically.
The official documentation has the following sentence:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
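For example, a minimal sketch of following that recommendation for this kind of pipeline (the input data here is just a placeholder):
import org.apache.spark.storage.StorageLevel

val counts = spark.sparkContext.parallelize(Array("jan", "feb", "jan"))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_ONLY)   // explicit persist of the shuffled result

counts.count()     // computes the result and fills the cache
counts.collect()   // reuses the cached RDD, not just the retained shuffle files
counts.unpersist()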

Cache() in Pyspark Dataframe

I have a dataframe and I need to apply several transformations to it. I thought of performing all of them on the same dataframe. So if I need to use cache, should I cache the dataframe after every operation performed on it?
df=df.selectExpr("*","explode(area)").select("*","col.*").drop(*['col','area'])
df.cache()
df=df.withColumn('full_name',f.concat(f.col('first_name'),f.lit(' '),f.col('last_name'))).drop('first_name','last_name')
df.cache()
df=df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop('date','cleaned_map')
df.cache()
df=df.filter(df.date_type.isNotNull())
df.show()
Should I add it like this, or is caching once enough?
Also, I want to know: if I use multiple dataframes instead of one for the above code, should I include cache at every transformation? Thanks a lot!
The answer is simple: when you do df = df.cache() or df.cache(), both refer to the same underlying RDD at the granular level. Once you perform any further operation, it creates a new RDD, which is evidently not cached, so it's up to you which DF/RDD you want to cache(). Also, try to avoid unnecessary caching, as the data will be persisted in memory.
Below is the source code for cache() from the Spark documentation:
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY_SER)
    return self
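Applied to the question, here is a minimal sketch (in Scala; the PySpark calls have the same names, and the columns are made up) of caching only the final DataFrame that is actually reused:
import org.apache.spark.sql.functions.col

// Each transformation returns a new DataFrame, so caching an intermediate
// result does not cache the later ones. Cache the one you will reuse.
val base    = spark.range(0, 100).toDF("id")          // stand-in for the real input
val cleaned = base.withColumn("doubled", col("id") * 2)
  .filter(col("doubled") > 10)

cleaned.cache()      // cache once, on the DataFrame that is reused below
cleaned.count()      // an action materialises the cache
cleaned.show()       // served from the cached data
cleaned.unpersist()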

Is it efficient to cache a dataframe for a single Action Spark application in which that dataframe is referenced more than once?

I am a little confused about the caching mechanism of Spark.
Let's say I have a Spark application with only one action at the end of multiple transformations. Suppose I have a dataframe A and I apply 2-3 transformations on it, creating multiple dataframes, which eventually help create a last dataframe that is going to be saved to disk.
example:
val A=spark.read() // large size
val B=A.map()
val C=A.map()
.
.
.
val D=B.join(C)
D.save()
So do I need to cache dataframe A for performance enhancement?
Thanks in advance.
I think thebluephantom's answer is right.
I had faced the same situation as you until today, and I also only found answers saying that Spark's cache() does not work on a single query.
My Spark job executing a single query also did not seem to be caching.
Because of that, I doubted his answer as well.
But he showed evidence (the green box) that the cache is working even when he executes a single query.
So I tested the 3 cases below with a dataframe (not an RDD), and the results suggest he is right.
The execution plan also changes (it becomes simpler and uses InMemoryRelation; please see below).
1. without cache
2. with cache
3. with cache, and unpersist called before the action
Case 1: without cache
example
val A = spark.read.format().load()
val B = A.where(cond1).select(columns1)
val C = A.where(cond2).select(columns2)
val D = B.join(C)
D.save()
DAG for my case
My actual job is a bit more complicated than this example.
The DAG is messy even though there is nothing complicated in the execution.
And you can see the scan occurs 4 times.
Case 2: with cache
example
val A = spark.read.format().load().cache() // cache will be working
val B = A.where(cond1).select(columns1)
val C = A.where(cond2).select(columns2)
val D = B.join(C)
D.save()
This will cache A, even for a single query.
You can see in the DAG that InMemoryTableScan is read twice.
DAG for my case
Case 3: with cache, and unpersist before the action
val A = spark.read.format().load().cache()
val B = A.where(cond1).select(columns1)
val C = A.where(cond2).select(columns2)
/* I thought A will not be needed anymore */
A.unpersist()
val D = B.join(C)
D.save()
This code will not cache the A dataframe, because its cache flag was unset before the action (D.save()) started.
So this results in exactly the same behaviour as case 1 (without cache).
The important thing is that unpersist() must be written after the action (after D.save()).
But when I asked some people in my company, many of them used it like case 3 and didn't know about this.
I think that's why many people misunderstand and believe cache does not work on a single query.
cache and unpersist should be used like below:
val A = spark.read.format().load().cache()
val B = A.where(cond1).select(columns1)
val C = A.where(cond2).select(columns2)
val D = B.join(C)
D.save()
/* unpersist must be after action */
A.unpersist()
This gives exactly the same result as case 2 (with cache), but unpersist comes after D.save().
So, I suggest trying cache as in thebluephantom's answer.
If I have presented anything incorrect, please let me know.
Thanks to thebluephantom for solving my problem.
Yes, you are correct.
You should cache A, as it is used as input for both B and C. The DAG visualization will show the extent of reuse (or of going back to the source, in this case). If you have a noisy cluster, some spilling to disk could occur.
See also the top answer here: (Why) do we need to call cache or persist on a RDD
However, I was looking for skipped stages, silly me. But something else shows up, as per below.
The following code is akin to your own:
val aa = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt") //.cache
val a = aa.flatMap(x => x.split(" ")).map(_.trim)
val b = a.map(x => (x, 1))
val c = a.map(x => (x, 2))
val d = b.join(c)
d.count
Looking at UI with .cache
and without .cache
QED: so, .cache has a benefit. It would not make sense otherwise. Also, 2 reads could lead to different results in some cases.

Does Spark require persistence for a single action?

I have a workflow like below:
rdd1 = sc.textFile(input);
rdd2 = rdd1.filter(filterfunc1);
rdd3 = rdd1.filter(filterfunc2);
rdd4 = rdd2.map(maptrans1);
rdd5 = rdd3.map(maptrans2);
rdd6 = rdd4.union(rdd5);
rdd6.foreach(some transformation);
1. Do I need to persist rdd1? Or is it not required, since there is only one action at rdd6, which will create only one job, and in a single job there is no need to persist?
2. Also, what if the transformation on rdd2 is reduceByKey instead of map? Is it again the same thing, with no need to persist since it is a single job?
You only need to persist if you plan to reuse the RDD in more than one action. Within a single action, Spark does a good job of deciding when to recalculate and when to reuse.
You can see the DAG in the UI to make sure rdd1 is only read once from the file.
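Besides the UI, a quick way to inspect this is RDD.toDebugString, which prints the lineage. A minimal sketch with placeholder data and functions (the question's filterfunc1 etc. are not shown here):
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))   // stand-in for sc.textFile(input)
val rdd2 = rdd1.filter(_ % 2 == 0)
val rdd3 = rdd1.filter(_ % 2 == 1)
val rdd4 = rdd2.map(_ * 10)
val rdd5 = rdd3.map(_ * 100)
val rdd6 = rdd4.union(rdd5)

println(rdd6.toDebugString)   // prints the lineage; rdd1 appears as the parent of both branches
rdd6.foreach(println)         // the single action that runs the whole job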
