In Spark we have cache and persist, both used to save an RDD.
As per my understanding, cache and persist/MEMORY_AND_DISK both perform the same action for DataFrames.
If this is the case, why should I prefer using cache at all? I can always use persist [with different parameters] and ignore cache.
Could you please let me know when to use cache, or whether my understanding is wrong.
There is no profound difference between cache and persist. For a DataFrame/Dataset, calling cache() is strictly equivalent to calling persist() without an argument, which defaults to the MEMORY_AND_DISK storage level.
Here is the source code of the cache() method:
/**
 * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
 *
 * @group basic
 * @since 1.6.0
 */
def cache(): this.type = persist()
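To see the equivalence in practice, here is a minimal sketch for a spark-shell session (the DataFrame built from spark.range is hypothetical, and Dataset.storageLevel is used only to inspect the level that was applied):

import org.apache.spark.storage.StorageLevel

val df = spark.range(10).toDF("id")        // hypothetical DataFrame, just for illustration

df.cache()                                 // same as df.persist() with no argument
df.storageLevel                            // StorageLevel(disk, memory, deserialized, 1 replicas)
df.unpersist()

df.persist(StorageLevel.MEMORY_AND_DISK)   // the explicit form of the same thing
df.storageLevel                            // identical result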
Related
In Spark it is possible to explicitly set the storage level for RDDs and Dataframes, but is it possible to change the default storage level? If so, how can it be achieved? If not, why is that not a possibility?
There are similar questions asked here and there, but the answers only say that the solution is to explicitly set the storage level, without further explanation.
I would suggest taking a look at CacheManager.scala#cacheQuery(..). The method definition and doc look as below:
/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  ...
}
If you observe here, Spark does not read the default storage level from any configuration; the default value is hardcoded in the source itself. Since there is no configuration available to override this default behaviour, the only remaining option is to pass the storage level explicitly while persisting the DataFrame/RDD.
Please also check:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
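Because the default cannot be changed through configuration, a common workaround is to centralise your preferred level in a small helper and call it instead of cache(). A minimal sketch, assuming you want MEMORY_ONLY_SER as your project-wide default (the object and method names here are made up for illustration):

import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

object CacheDefaults {
  // Project-wide "default" level; pick whatever suits your workload.
  val level: StorageLevel = StorageLevel.MEMORY_ONLY_SER

  // Persist any Dataset/DataFrame with that level.
  def persistWithDefault[T](ds: Dataset[T]): Dataset[T] = ds.persist(level)
}

// Usage: CacheDefaults.persistWithDefault(df) instead of df.cache()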
Using persist() you can choose among various storage levels to store persisted RDDs in Apache Spark; the persistence levels in Spark 3.0 are listed below:
- MEMORY_ONLY: Data is stored directly as objects and stored only in memory.
- MEMORY_ONLY_SER: Data is serialized as compact byte array representation and stored only in memory. To use it, it has to be deserialized at a cost.
- MEMORY_AND_DISK: Data is stored directly as objects in memory, but if there’s insufficient memory the rest is serialized and stored on disk.
- DISK_ONLY: Data is serialized and stored on disk.
- OFF_HEAP: Data is stored off-heap.
- MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but data is serialized when stored in memory. (Data is always serialized when stored on disk.)
For an RDD the default storage level for the persist API is MEMORY_ONLY, and for a Dataset it is MEMORY_AND_DISK.
For example, you can persist your data like this:
import org.apache.spark.storage.StorageLevel

val rdd2 = rdd.persist(StorageLevel.OFF_HEAP)
val df2  = df.persist(StorageLevel.MEMORY_ONLY_SER)
For more information you can visit: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/storage/StorageLevel.html
Why did Spark add the cache() method to its library (i.e. rdd.py) even though it internally calls self.persist(StorageLevel.MEMORY_ONLY), as shown below?
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY)
    return self
cache is a convenience method to cache a DataFrame. persist is the more general method, which takes a storage level as a parameter and persists the DataFrame accordingly.
The default storage levels for cache and persist are the same and, as you mentioned, duplicated. You can use either.
In the Scala implementation, cache calls persist: def cache(): this.type = persist(). This tells me that persist is the real implementation and cache is syntactic sugar.
I have a Kafka broker with JSON data from my IoT applications. I connect to this server from a Spark Streaming application in order to do some processing.
I'd like to save in memory (RAM) some specific fields of my json data which I believe I could achieve using cache() and persist() operators.
Next time when I receive a new JSON data in the Spark Streaming application, I check in memory (RAM) if there are fields in common that I can retrieve. And if yes, I do some simple computations and I finally update the values of fields I saved in memory (RAM).
Thus, I would like to know if what I previously described is possible. If yes, do I have to use cache() or persist()? And how can I retrieve my fields from memory?
It's possible with cache / persist which uses memory or disk for the data in Spark applications (not necessarily for Spark Streaming applications only -- it's a more general use of caching in Spark).
But...in Spark Streaming you've got special support for such use cases which are called stateful computations. See Spark Streaming Programming Guide to explore what's possible.
I think for your use case the mapWithState operator is exactly what you're after.
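A rough sketch of what that could look like, assuming the Kafka JSON has already been parsed into a DStream of (fieldName, value) pairs; the socket source, checkpoint path, and update logic below are placeholders for illustration only (mapWithState requires checkpointing to be enabled):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("stateful-fields").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/tmp/streaming-checkpoint")          // mapWithState needs a checkpoint dir

// Placeholder source: swap in your Kafka + JSON parsing; the result is assumed
// to be a DStream[(String, Double)] keyed by field name.
val fieldReadings = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(parts => (parts(0), parts(1).toDouble))

// Keep one running value per field in memory and update it on every new reading.
val updateField = (field: String, reading: Option[Double], state: State[Double]) => {
  val updated = state.getOption.getOrElse(0.0) + reading.getOrElse(0.0)
  state.update(updated)
  (field, updated)
}

fieldReadings.mapWithState(StateSpec.function(updateField)).print()

ssc.start()
ssc.awaitTermination()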
Spark does not work that way. Please think it through in a distributed way.
For the first part, keeping data in RAM: you can use either cache() or persist(), as by default they both keep the data in the memory of the workers.
You can verify this from Apache Spark Code.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
As far as I understand your use case, you need the updateStateByKey operation to implement your second use case!
For more on Windowing see here.
In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
With cache(), you use only the default storage level:
MEMORY_ONLY for RDD
MEMORY_AND_DISK for Dataset
With persist(), you can specify which storage level you want for both RDD and Dataset.
From the official docs:
You can mark an RDD to be persisted using the persist() or cache() methods on it.
each persisted RDD can be stored using a different storage level
The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
Use persist() if you want to assign a storage level other than:
MEMORY_ONLY to the RDD
or MEMORY_AND_DISK for Dataset
Interesting link for the official documentation: which storage level to choose
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
But with persist(), we can save the intermediate results at any of 5 storage levels:
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
see more details here...
Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are thus kept in memory (the default) or in more solid storage like disk, and/or replicated.
RDDs can be cached using cache operation. They can also be persisted using persist operation.
#persist, cache
These functions can be used to adjust the storage level of a RDD. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameter-less variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY).
Warning: Once the storage level has been changed, it cannot be changed again!
Warning -Cache judiciously... see ((Why) do we need to call cache or persist on a RDD)
Just because you can cache a RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.
It should go without saying that if you only read a dataset once there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen from the Spark shell.
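A small sketch of that advice in spark-shell terms (the input path and filter are made up; sc.getPersistentRDDs is used only to peek at what is currently cached):

val logs   = sc.textFile("/tmp/logs.txt")               // hypothetical input
val errors = logs.filter(_.contains("ERROR")).persist() // cache only what is reused

errors.count()   // first action materialises the cache
errors.take(10)  // second action reuses it

// Inspect what is cached right now and at which level
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id -> ${rdd.getStorageLevel.description}")
}

errors.unpersist() // release the memory once it's no longer needed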
Listing Variants...
def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]
See the example below:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
Note:
Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably.
See more visually here... (diagrams: “Persist in memory and disk” and “Cache”).
Caching can improve the performance of your application to a great extent.
In general, it is recommended to use persist with a specific storage level to have more control over caching behavior, while cache can be used as a quick and convenient way to cache data in memory.
There is no difference. From RDD.scala.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
Spark gives 5 types of storage level:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
cache() will use MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<*type*>).
By default persist() will store the data in the JVM heap as unserialized objects.
cache() and persist() are both used to improve the performance of Spark computations. These methods help save intermediate results so they can be reused in subsequent stages.
The only difference between cache() and persist() is that with cache() we can save intermediate results in memory only, while with persist() we can save the intermediate results at any of the 5 storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY).
For the impatient:
Same
Without passing an argument, persist() and cache() are the same, with default settings:
when RDD: MEMORY_ONLY
when Dataset: MEMORY_AND_DISK
Difference:
Unlike cache(), persist() allows you to pass an argument inside the brackets, in order to specify the level:
persist(MEMORY_ONLY)
persist(MEMORY_ONLY_SER)
persist(MEMORY_AND_DISK)
persist(MEMORY_AND_DISK_SER)
persist(DISK_ONLY)
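(For completeness: in actual code these constants come from org.apache.spark.storage.StorageLevel, so a call looks like the minimal sketch below, where df stands for any existing DataFrame/Dataset.)

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK_SER)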
Voilà!