What is the difference between cache and persist? - apache-spark

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

With cache(), you use only the default storage level:
MEMORY_ONLY for RDD
MEMORY_AND_DISK for Dataset
With persist(), you can specify which storage level you want for both RDD and Dataset.
From the official docs:
You can mark an RDD to be persisted using the persist() or cache() methods on it.
Each persisted RDD can be stored using a different storage level.
The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
Use persist() if you want to assign a storage level other than:
MEMORY_ONLY for the RDD
or MEMORY_AND_DISK for a Dataset
See the official documentation for guidance on which storage level to choose.
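A minimal sketch of the difference, assuming a running SparkSession named spark (the names rdd, ds and rdd2 are only illustrative):

import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 100)
rdd.cache()                                    // same as rdd.persist(StorageLevel.MEMORY_ONLY)

val ds = spark.range(100)
ds.cache()                                     // for a Dataset this defaults to MEMORY_AND_DISK

val rdd2 = spark.sparkContext.parallelize(1 to 100)
rdd2.persist(StorageLevel.MEMORY_AND_DISK_SER) // persist() lets you choose the level explicitly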

The difference between the cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
With persist(), however, we can save the intermediate results in any of five storage levels:
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
/** Persist this RDD with the default storage level (MEMORY_ONLY). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (MEMORY_ONLY). */
def cache(): this.type = persist()

See more details here.
Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are thus kept in memory (the default) or in more solid storage like disk, and/or replicated.
RDDs can be cached using cache operation. They can also be persisted using persist operation.
persist, cache
These functions can be used to adjust the storage level of a RDD.
When freeing up memory, Spark will use the storage level identifier to
decide which partitions should be kept. The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY).
Warning: Once the storage level has been changed, it cannot be changed again!
Warning: Cache judiciously... see "(Why) do we need to call cache or persist on a RDD".
Just because you can cache a RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.
It should go without saying that if you only read a dataset once there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen from the Spark Shell.
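For example, from the shell you can list what is currently persisted with SparkContext.getPersistentRDDs (a sketch, assuming an active SparkContext sc); the Storage tab of the web UI shows the cached sizes as well:

sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id persisted at level: ${rdd.getStorageLevel.description}")
}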
Listing Variants:
def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]
See the example below:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
Note:
Due to the very small and purely syntactic difference between caching and persistence of RDDs, the two terms are often used interchangeably.
Caching can improve the performance of your application to a great extent.
In general, it is recommended to use persist with a specific storage level to have more control over caching behavior, while cache can be used as a quick and convenient way to cache data in memory.
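A hedged sketch of that recommendation (assuming an existing SparkSession spark; the path events.json is hypothetical):

import org.apache.spark.storage.StorageLevel

val events = spark.read.json("events.json")      // hypothetical input
events.persist(StorageLevel.MEMORY_AND_DISK_SER) // explicit control over the storage level
events.count()                                   // an action materializes the cache
// ... reuse events across several jobs ...
events.unpersist()                               // release the storage when done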

There is no difference. From RDD.scala:
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()

Spark gives five types of storage level:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
cache() will use MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<type>).
By default persist() will
store the data in the JVM heap as unserialized objects.
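As a sketch, if memory is tight you might trade CPU for space by choosing a serialized level instead of the default (assuming a SparkContext sc):

import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)
numbers.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: smaller in memory, but must be deserialized on access
numbers.count()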

cache() and persist() are both used to improve the performance of Spark computations. These methods help save intermediate results so they can be reused in subsequent stages.
The only difference between cache() and persist() is that cache() saves the intermediate results in memory only (the default storage level), while with persist() we can save the intermediate results in any of five storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY).

For the impatient:
Same
Without passing an argument, persist() and cache() are the same, with the default settings:
when RDD: MEMORY_ONLY
when Dataset: MEMORY_AND_DISK
Difference:
Unlike cache(), persist() allows you to pass an argument to specify the storage level:
persist(MEMORY_ONLY)
persist(MEMORY_ONLY_SER)
persist(MEMORY_AND_DISK)
persist(MEMORY_AND_DISK_SER)
persist(DISK_ONLY)
Voilà!

Related


Is it possible to set the default storage level in Spark?

In Spark it is possible to explicitly set the storage level for RDDs and Dataframes, but is it possible to change the default storage level? If so, how can it be achieved? If not, why is that not a possibility?
There are similar questions asked here and there, but the answers only say that the solution is to explicitly set the storage level, without further explanation.
I would suggest taking a look at CacheManager.scala#cacheQuery(..). The method definition and doc look as below:
/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  ...
}
As you can observe here, Spark internally doesn't use any configuration to fetch the default storage level; its default value is hardcoded in the source itself.
Since there is no configuration available to override the default behaviour, the only remaining option is to pass the storage level when persisting the DataFrame/RDD.
Please check the following:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
Using persist() you can choose among various storage levels to store persisted RDDs in Apache Spark; the persistence levels available in Spark 3.0 are below:
-MEMORY_ONLY: Data is stored directly as objects and stored only in memory.
-MEMORY_ONLY_SER: Data is serialized as a compact byte array representation and stored only in memory. To use it, it has to be deserialized at a cost.
-MEMORY_AND_DISK: Data is stored directly as objects in memory, but if there’s insufficient memory the rest is serialized and stored on disk.
-DISK_ONLY: Data is serialized and stored on disk.
-OFF_HEAP: Data is stored off-heap.
-MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but data is serialized when stored in memory. (Data is always serialized when stored on disk.)
For an RDD the default storage level for the persist API is MEMORY_ONLY, and for a Dataset it is MEMORY_AND_DISK.
For example, you can persist your data like this:
val persistedRdd = rdd.persist(StorageLevel.OFF_HEAP)
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
For more information you can visit: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/storage/StorageLevel.html
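Since there is no configuration key for this, one pattern (only a sketch; Caching and persistWithDefault are hypothetical names, not Spark APIs) is to centralize your preferred level in a small helper:

import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

object Caching {
  // the "default" level your project agrees on, kept in one place
  val defaultLevel: StorageLevel = StorageLevel.MEMORY_ONLY_SER
  def persistWithDefault[T](ds: Dataset[T]): Dataset[T] = ds.persist(defaultLevel)
}

// usage: val cached = Caching.persistWithDefault(spark.range(100))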

Different default persist for RDD and Dataset

I was trying to find a good answer for why the default persist for RDD is MEMORY_ONLY whereas for Dataset it is MEMORY_AND_DISK. But I couldn't find it.
Does anyone know why the default persistence levels are different?
Simply because MEMORY_ONLY is rarely useful - it is not that common in practice to have enough memory to store all required data, so you often have to evict some of the blocks or cache data only partially.
Compared to that, MEMORY_AND_DISK evicts data to disk, so no cached block is lost.
The exact reason behind choosing MEMORY_AND_DISK as a default caching mode is explained by, SPARK-3824 (Spark SQL should cache in MEMORY_AND_DISK by default):
Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default.
For an RDD the default storage level for the persist API is MEMORY_ONLY, and for a Dataset it is MEMORY_AND_DISK.
Please check the following:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
As mentioned by user6910411, "Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core." That is, the Dataset/DataFrame APIs use column buffers to store the column data types and column details of the raw data, so if the cached data does not fit into memory, Spark will not cache the remaining partitions and will recompute them whenever needed. In the Dataset/DataFrame case the recomputation cost is higher than for an RDD because of this columnar structure, so the default persist option was changed to MEMORY_AND_DISK: blocks that do not fit into memory spill to disk and are retrieved from disk when needed, rather than being recomputed the next time.
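A quick way to see the two defaults side by side (a sketch, assuming a SparkSession spark):

val rdd = spark.sparkContext.parallelize(1 to 10)
rdd.cache()
println(rdd.getStorageLevel)   // MEMORY_ONLY for an RDD

val ds = spark.range(10)
ds.cache()
println(ds.storageLevel)       // MEMORY_AND_DISK for a Dataset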

Storing intermediate data in Spark when there are 100s of operations in an application

An RDD is inherently fault-tolerant due to its lineage. But if an application has hundreds of operations, it would be difficult to reconstruct the data by going back through all of those operations. Is there a way to store the intermediate data?
I understand that there are the persist()/cache() options to hold the RDDs. But are they good enough to hold the intermediate data? Would checkpointing be an option at all? Also, is there a way to specify the storage level when checkpointing an RDD (like MEMORY or DISK, etc.)?
While cache() and persist() are generic, checkpointing is specific to streaming.
caching - caching might happen in memory or on disk
rdd.cache()
persist - you can choose where you want to persist your data, either in memory or on disk
rdd.persist(StorageLevel.MEMORY_AND_DISK) // or another storage level
checkpoint - you need to specify a directory where you need to save your data (in reliable storage like HDFS/S3)
val ssc = new StreamingContext(...) // new context
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
There is a significant difference between cache/persist and checkpoint.
Cache/persist materializes the RDD and keeps it in memory and / or disk. But the lineage of RDD (that is, seq of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the cached RDDs are lost, they can be regenerated.
However, checkpoint saves the RDD to an HDFS file AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault tolerant by replication).
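A minimal sketch of RDD checkpointing as described above (assumes a SparkContext sc; the directory and input paths are hypothetical):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // must be reliable storage such as HDFS

val base = sc.textFile("hdfs:///data/input.txt") // hypothetical input
val transformed = base.map(_.toUpperCase)

transformed.checkpoint()                         // mark for checkpointing; lineage is truncated afterwards
transformed.count()                              // an action triggers the checkpoint write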
http://apache-spark-user-list.1001560.n3.nabble.com/checkpoint-and-not-running-out-of-disk-space-td1525.html
(Why) do we need to call cache or persist on a RDD

Is MEMORY_AND_DISK always better that DISK_ONLY when persisting an RDD to disk?

Using Apache Spark why would I choose to persist an RDD using storage level DISK_ONLY rather than using MEMORY_AND_DISK or MEMORY_AND_DISK_SER ?
Is there any use case where using DISK_ONLY would give better performance than MEMORY_AND_DISK or MEMORY_AND_DISK_SER?
Simple example - you may have one relatively large RDD, rdd1, and one smaller RDD, rdd2. You want to store both of them.
If you apply persist with MEMORY_AND_DISK on both, then both of them may be spilled to disk, resulting in slower reads.
But you may take a different approach - you may store rdd1 with DISK_ONLY. It may just so happen that thanks to this move you can keep rdd2 entirely in memory with cache() and read it faster.
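A sketch of that scenario (the datasets and paths are made up; assumes a SparkContext sc):

import org.apache.spark.storage.StorageLevel

val rdd1 = sc.textFile("hdfs:///data/huge-logs/*") // relatively large RDD
val rdd2 = sc.parallelize(1 to 1000)               // small RDD

rdd1.persist(StorageLevel.DISK_ONLY) // keep the big one out of executor memory
rdd2.cache()                         // so the small one can stay fully in memory (MEMORY_ONLY)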

Resources