Is there config for StorageLevel in Spark? [duplicate] - apache-spark

In Spark it is possible to explicitly set the storage level for RDDs and Dataframes, but is it possible to change the default storage level? If so, how can it be achieved? If not, why is that not a possibility?
There are similar questions here and there, but the answers only state that the solution is to set the storage level explicitly, without further explanation.

I would suggest taking a look at CacheManager.scala#cacheQuery(..). The method definition and doc comment look as below:
/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  ...
}
As you can see, Spark does not read the default storage level from any configuration; the default value is hardcoded in the source itself. Since there is no configuration to override this behaviour, the only remaining option is to pass the storage level explicitly when persisting the DataFrame/RDD.
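Since the default is hardcoded, the practical workaround is to pass the level explicitly at every call site. Below is a minimal sketch; the spark session, the DataFrames and the view name are made up for illustration, and the Catalog.cacheTable overload that takes an explicit StorageLevel is available in recent Spark versions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("explicit-storage-level").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Override the hardcoded MEMORY_AND_DISK default for this Dataset only.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()   // an action triggers the actual caching

// The catalog API also accepts an explicit level when caching a registered view.
val other = Seq((3, "c")).toDF("id", "value")
other.createOrReplaceTempView("other_view")
spark.catalog.cacheTable("other_view", StorageLevel.DISK_ONLY)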

Please check the following:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
Using persist() you can choose among several storage levels for persisted RDDs in Apache Spark; the persistence levels in Spark 3.0 are listed below:
- MEMORY_ONLY: Data is stored directly as objects and stored only in memory.
- MEMORY_ONLY_SER: Data is serialized as a compact byte array representation and stored only in memory. To use it, it has to be deserialized at a cost.
- MEMORY_AND_DISK: Data is stored directly as objects in memory, but if there’s insufficient memory the rest is serialized and stored on disk.
- DISK_ONLY: Data is serialized and stored on disk.
- OFF_HEAP: Data is stored off-heap.
- MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but data is serialized when stored in memory. (Data is always serialized when stored on disk.)
For RDDs the default storage level of the persist API is MEMORY_ONLY, and for Datasets it is MEMORY_AND_DISK.
For example, you can persist your data like this (persist returns the same RDD/Dataset marked as persistent, so assign it to a new name or call it in place):
import org.apache.spark.storage.StorageLevel

val persistedRdd = rdd.persist(StorageLevel.OFF_HEAP)
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
For more information you can visit: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/storage/StorageLevel.html

Related

Is it possible to set the default storage level in Spark?


Different default persist for RDD and Dataset

I was trying to find a good answer for why the default persist for RDD is MEMORY_ONLY whereas for Dataset it is MEMORY_AND_DISK. But I couldn't find it.
Does anyone know why the default persistence levels are different?
Simply because MEMORY_ONLY is rarely useful: in practice it is not that common to have enough memory to store all required data, so you often have to evict some of the blocks or cache the data only partially.
Compared to that, MEMORY_AND_DISK spills data that doesn't fit in memory to disk, so no cached block is lost.
The exact reason behind choosing MEMORY_AND_DISK as the default caching mode is explained in SPARK-3824 (Spark SQL should cache in MEMORY_AND_DISK by default):
Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default.
For RDDs the default storage level of the persist API is MEMORY_ONLY, and for Datasets it is MEMORY_AND_DISK.
Please check the following:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
As mentioned by @user6910411, "Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core." In other words, the Dataset/DataFrame APIs keep cached data in column buffers that store the column data types and column details of the raw data, so if the cached data does not fit into memory, Spark will not cache the remaining partitions and will recompute them whenever they are needed. Because of this columnar structure, the recomputation cost for a Dataset/DataFrame is higher than for an RDD, so the default persist option was changed to MEMORY_AND_DISK: blocks that do not fit into memory spill to disk and are retrieved from disk when needed, rather than being recomputed the next time.
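A quick way to see the two defaults side by side, as a sketch (assuming a local spark session; the variable names are just for illustration — RDDs expose getStorageLevel and Datasets expose storageLevel):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("default-levels").getOrCreate()
import spark.implicits._

// RDD: cache() is persist(StorageLevel.MEMORY_ONLY)
val rdd = spark.sparkContext.parallelize(1 to 10)
rdd.cache()
println(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)        // true

// Dataset: cache() goes through CacheManager.cacheQuery, defaulting to MEMORY_AND_DISK
val ds = (1 to 10).toDS()
ds.cache()
println(ds.storageLevel == StorageLevel.MEMORY_AND_DISK)        // true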

How does Spark recover the data from a failed node?

Suppose we have an RDD which is used multiple times. So, to avoid computing it again and again, we persisted this RDD using the rdd.persist() method.
When we persist this RDD, the nodes computing it will store their partitions.
Now suppose the node containing a persisted partition of this RDD fails: what will happen? How will Spark recover the lost data? Is there any replication mechanism, or some other mechanism?
When you call rdd.persist, the RDD does not materialize its content; it does so when you perform an action on it, following the same lazy-evaluation principle.
An RDD knows the partitions on which it should operate and the DAG associated with it, and with that DAG it is perfectly capable of recreating a materialized partition.
So, when a node fails, the driver spawns another executor on some other node and provides it, in a closure, the data partition it was supposed to work on and the DAG associated with it. With this information the new executor can recompute the data and materialize it.
In the meantime, the cached RDD will not have all of its data in memory; the data of the lost node has to be recomputed from the lineage, so it will take a little more time.
As for replication: yes, Spark supports in-memory replication. You need to use a replicated storage level such as StorageLevel.MEMORY_AND_DISK_2 when you persist:
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
This stores two copies of each cached partition, on two different nodes.
I think the best way I was able to understand how Spark is resilient was when someone told me that I should not think of RDDs as big, distributed arrays of data.
Instead, I should picture them as containers holding instructions on what steps to take to transform data from the data source, one step at a time, until a result is produced.
Now if you really care about losing data when persisting, then you can specify that you want to replicate your cached data.
For this, you need to select a replicated storage level. So instead of the normal levels:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
You can specify that you want your persisted data replicated:
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
So if the node fails, you will not have to recompute the data.
Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
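As a sketch of the replication options mentioned above (the RDD names below are illustrative), you can either pick one of the built-in _2 levels or build a custom level with the StorageLevel factory, whose arguments are (useDisk, useMemory, useOffHeap, deserialized, replication):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate()

// Built-in replicated level: each cached partition is stored on two nodes.
val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_2)
rdd.count()                                   // an action materializes the cache
println(rdd.getStorageLevel.replication)      // 2

// A custom level built with the factory, here with three replicas.
val threeCopies = StorageLevel(true, true, false, true, 3)   // useDisk, useMemory, useOffHeap, deserialized, replication
val replicated3 = sc.parallelize(1 to 1000).persist(threeCopies)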

How to access cached data in Spark Streaming application?

I have a Kafka broker with JSON data from my IoT applications. I connect to this server from a Spark Streaming application in order to do some processing.
I'd like to save in memory (RAM) some specific fields of my json data which I believe I could achieve using cache() and persist() operators.
The next time I receive new JSON data in the Spark Streaming application, I check in memory (RAM) if there are fields in common that I can retrieve. If yes, I do some simple computations and finally update the values of the fields I saved in memory (RAM).
Thus, I would like to know if what I previously described is possible. If yes, do I have to use cache() or persist()? And how can I retrieve my fields from memory?
It's possible with cache / persist which uses memory or disk for the data in Spark applications (not necessarily for Spark Streaming applications only -- it's a more general use of caching in Spark).
But...in Spark Streaming you've got special support for such use cases which are called stateful computations. See Spark Streaming Programming Guide to explore what's possible.
I think for your use case mapWithState operator is exactly what you're after.
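A minimal sketch of what mapWithState could look like for this use case, assuming the Kafka JSON has already been parsed into a DStream of (deviceId, reading) pairs (the names, types and the running-sum state are made up for illustration):

import org.apache.spark.streaming.{State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// `parsed` stands for the already-parsed stream, e.g. DStream[(String, Double)].
def withRunningSum(parsed: DStream[(String, Double)], ssc: StreamingContext): DStream[(String, Double)] = {
  ssc.checkpoint("/tmp/streaming-checkpoint")   // mapWithState requires a checkpoint directory

  // Per-key state kept across batches: a running sum of the readings for each device.
  val mappingFunc = (deviceId: String, reading: Option[Double], state: State[Double]) => {
    val sum = reading.getOrElse(0.0) + state.getOption.getOrElse(0.0)
    state.update(sum)
    (deviceId, sum)                             // emitted downstream for every incoming record
  }

  parsed.mapWithState(StateSpec.function(mappingFunc))
}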
Spark does not work that way; please think it through in a distributed way.
For the first part, keeping data in RAM: you can use either cache() or persist(), as by default both keep the data in the memory of the workers.
You can verify this from Apache Spark Code.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
As far as I understand your use case, you need the updateStateByKey operation to implement your second use case!
For more on Windowing see here.
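For completeness, a hedged sketch of updateStateByKey for the same kind of per-key running total (again assuming an already-parsed DStream[(String, Double)] and a StreamingContext with a checkpoint directory set; all names are illustrative):

import org.apache.spark.streaming.dstream.DStream

def runningTotals(parsed: DStream[(String, Double)]): DStream[(String, Double)] = {
  // For each key: combine the new readings of this batch with the previous state.
  val updateFunc: (Seq[Double], Option[Double]) => Option[Double] =
    (newReadings, current) => Some(current.getOrElse(0.0) + newReadings.sum)

  parsed.updateStateByKey(updateFunc)
}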

What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
With cache(), you use only the default storage level:
MEMORY_ONLY for RDD
MEMORY_AND_DISK for Dataset
With persist(), you can specify the storage level you want, for both RDD and Dataset.
From the official docs:
You can mark an RDD to be persisted using the persist() or cache() methods on it.
each persisted RDD can be stored using a different storage level
The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
Use persist() if you want to assign a storage level other than:
MEMORY_ONLY for the RDD
or MEMORY_AND_DISK for the Dataset
An interesting part of the official documentation: which storage level to choose.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
But with persist() we can save the intermediate results at any of 5 storage levels:
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
see more details here...
Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are thus kept in memory (the default) or in more solid storage like disk, and/or replicated.
RDDs can be cached using cache operation. They can also be persisted using persist operation.
persist, cache
These functions can be used to adjust the storage level of an RDD. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY).
Warning: Once the storage level has been changed, it cannot be changed again!
Warning: cache judiciously... see ((Why) do we need to call cache or persist on an RDD).
Just because you can cache an RDD in memory doesn't mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.
It should go without saying that if you only read a dataset once there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen from the Spark Shell.
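Besides the Spark Shell and the Spark UI, you can also inspect what is currently cached from code. A small sketch (the RDD name and contents are illustrative; getPersistentRDDs and StorageLevel.description are part of the Scala API):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate()

val words = sc.parallelize(Seq("Gnu", "Cat", "Rat")).setName("words").persist(StorageLevel.MEMORY_ONLY)
words.count()   // an action materializes the cache

// List everything currently marked as persistent: id, name and storage level.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"$id  ${rdd.name}  ${rdd.getStorageLevel.description}")
}

words.unpersist()   // free the memory once the data is no longer reused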
Listing Variants...
def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]
See the example below:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
(The five fields printed are useDisk, useMemory, useOffHeap, deserialized and replication, so after cache() the level is MEMORY_ONLY.)
Note :
Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably.
Caching can improve the performance of your application to a great extent.
In general, it is recommended to use persist with a specific storage level to have more control over caching behavior, while cache can be used as a quick and convenient way to cache data in memory.
There is no difference. From RDD.scala:
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
Spark provides 5 types of storage levels:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
cache() will use MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<*type*>).
By default persist() will store the data in the JVM heap as unserialized objects.
cache() and persist() are both used to improve the performance of Spark computations. These methods help save intermediate results so they can be reused in subsequent stages.
The only difference between cache() and persist() is that cache() saves the intermediate results in memory only, while persist() lets you save them at any of 5 storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY).
For the impatient:
Same:
Without arguments, persist() and cache() are the same, with the default settings:
for an RDD: MEMORY_ONLY
for a Dataset: MEMORY_AND_DISK
Difference:
Unlike cache(), persist() allows you to pass an argument to specify the storage level:
persist(MEMORY_ONLY)
persist(MEMORY_ONLY_SER)
persist(MEMORY_AND_DISK)
persist(MEMORY_AND_DISK_SER)
persist(DISK_ONLY)
Voilà!
