Different default persist for RDD and Dataset - apache-spark

I was trying to find a good answer for why the default persist for RDD is MEMORY_ONLY whereas for Dataset it is MEMORY_AND_DISK. But I couldn't find it.
Does anyone know why the default persistence levels are different?

Simply because MEMORY_ONLY is rarely useful - it is not that common in practice to have enough memory to store all required data, so you often have to evict some of the blocks or cache data only partially.
Compared to that, MEMORY_AND_DISK evicts data to disk, so no cached block is lost.
The exact reason behind choosing MEMORY_AND_DISK as the default caching mode is explained in SPARK-3824 (Spark SQL should cache in MEMORY_AND_DISK by default):
Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default.

For an RDD the default storage level of the persist API is MEMORY_ONLY, and for a Dataset it is MEMORY_AND_DISK.
Please see the following:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
As mentioned by @user6910411: "Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core." That is, the Dataset/DataFrame APIs use column buffers to store the column data types and column details for the raw data. If the data does not fit in memory while caching, the remaining partitions are not cached and are recomputed whenever they are needed. Because of this columnar structure, recomputation costs more for a Dataset/DataFrame than for an RDD. So the default persist option was changed to MEMORY_AND_DISK: blocks that do not fit in memory spill to disk and are read back from disk when needed, rather than being recomputed.
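A quick way to confirm the two defaults is to ask Spark directly. The following is a minimal sketch, assuming Spark 2.1 or later on the classpath (Dataset.storageLevel is not available before that) and a local master. The object and method names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object DefaultStorageLevels {
  /** Returns (level used by RDD.cache(), level used by Dataset.cache()). */
  def defaults(spark: SparkSession): (StorageLevel, StorageLevel) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000).cache()
    val ds  = spark.range(1000).cache()
    // cache() is lazy, but the requested level is recorded immediately.
    val levels = (rdd.getStorageLevel, ds.storageLevel)
    rdd.unpersist()
    ds.unpersist()
    levels
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("default-storage-levels")
      .getOrCreate()

    val (rddLevel, dsLevel) = defaults(spark)
    println(s"RDD default:     ${rddLevel.description}")
    println(s"Dataset default: ${dsLevel.description}")

    // An explicit level overrides the Dataset default when disk spill is not wanted.
    val memOnly = spark.range(500).persist(StorageLevel.MEMORY_ONLY)
    println(s"Dataset explicit: ${memOnly.storageLevel.description}")

    spark.stop()
  }
}
```

The explicit persist call uses a different range on purpose: asking to cache a plan that is already cached keeps the level it was first cached with.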


How does Spark handle out-of-memory errors when cached (MEMORY_ONLY persistence) data does not fit in memory?

I'm new to Spark and I am not able to find a clear answer to this question: what happens when cached data does not fit in memory?
In many places I found that if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
For example: let's say 500 partitions are created and 200 of them did not get cached; then we have to recompute those 200 partitions by re-evaluating the RDD.
If that is the case then an OOM error should never occur, but it does. What is the reason?
A detailed explanation is highly appreciated. Thanks in advance.
There are different ways you can persist your DataFrame in Spark.
1)Persist (MEMORY_ONLY)
When you persist a DataFrame with MEMORY_ONLY, it is cached in Spark's storage memory as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level for RDDs and can sometimes cause OOM when the RDD is too big to fit in memory (it can also occur after a recomputation attempt).
To answer your question:
If that is the case then OOM error should never occur but it does. What is the reason?
Even after recomputation, those partitions still need to fit in memory. If no space is available, GC will try to free some memory so the allocation can proceed; if that does not succeed, the job fails with OOM.
2)Persist (MEMORY_AND_DISK)
When you persist a DataFrame with MEMORY_AND_DISK, it is cached in Spark's storage memory as deserialized Java objects; if memory is not available in the heap, it is spilled to disk. To tackle memory issues, Spark spills part or all of the data to disk. (Note: make sure to have enough disk space on the nodes, otherwise no-disk-space errors will pop up.)
3)Persist (MEMORY_ONLY_SER)
When you persist a DataFrame with MEMORY_ONLY_SER, it is cached in Spark's storage memory as serialized Java objects (one byte array per partition). This is generally more space-efficient than MEMORY_ONLY, but it is more CPU-intensive because the data has to be serialized and deserialized (the general suggestion here is to use Kryo for serialization). It still faces OOM issues similar to MEMORY_ONLY.
4)Persist (MEMORY_AND_DISK_SER)
This is similar to MEMORY_ONLY_SER, with one difference: when no heap space is available, it spills the serialized RDD partitions to disk, the same as MEMORY_AND_DISK. You can use this option when you have a tight constraint on disk space and want to reduce IO traffic.
5)Persist (DISK_ONLY)
In this case, heap memory is not used. RDDs are persisted to disk. Make sure to have enough disk space; this option has a huge IO overhead. Don't use it for DataFrames that are used repeatedly.
6)Persist (MEMORY_ONLY_2, MEMORY_AND_DISK_2)
These are similar to the above-mentioned MEMORY_ONLY and MEMORY_AND_DISK. The only difference is that these options replicate each partition on two cluster nodes, just to be on the safe side. Use these options when you are using spot instances.
7)Persist (OFF_HEAP)
Off-heap memory generally contains thread stacks, Spark container application code, network IO buffers, and other OS application buffers. With the above option you can also use this part of RAM for caching your RDD.
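The differences between the levels listed above come down to a handful of flags on StorageLevel. A small sketch that prints them, assuming only spark-core on the classpath (no running cluster is needed); the object name and the describe helper are made up for illustration:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object StorageLevelFlags {
  // Name -> level pairs for the levels discussed above.
  val levels: Seq[(String, StorageLevel)] = Seq(
    "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
    "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
    "MEMORY_ONLY_SER"     -> StorageLevel.MEMORY_ONLY_SER,
    "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
    "DISK_ONLY"           -> StorageLevel.DISK_ONLY,
    "MEMORY_ONLY_2"       -> StorageLevel.MEMORY_ONLY_2,
    "MEMORY_AND_DISK_2"   -> StorageLevel.MEMORY_AND_DISK_2,
    "OFF_HEAP"            -> StorageLevel.OFF_HEAP
  )

  /** One line per level: where the data may live, in what form, and how many copies. */
  def describe(name: String, l: StorageLevel): String =
    s"$name: memory=${l.useMemory}, disk=${l.useDisk}, offHeap=${l.useOffHeap}, " +
      s"deserialized=${l.deserialized}, replicas=${l.replication}"

  def main(args: Array[String]): Unit =
    levels.foreach { case (name, level) => println(describe(name, level)) }
}
```

Any of these can be passed to persist, for example df.persist(StorageLevel.MEMORY_AND_DISK_SER).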

Why is spark MEMORY_AND_DISK slower than MEMORY_ONLY?

I have a pretty typical RDD scenario where I gather some data, persist it, and then use the persisted RDD multiple times for various transforms. Persisting speeds things up by an order of magnitude, so persisting is definitely warranted.
But I'm surprised at the relative speed of the different methods of persisting. If I persist using MEMORY_AND_DISK, each subsequent use of the persisted RDD takes about 10% longer than if I use MEMORY_ONLY. Why is that? I would have expected them to have the same speed if the data fits in memory, and I expected MEMORY_AND_DISK to be faster if some partitions don't fit in memory. Why do my timings consistently not show that to be true?
Your CPU typically accesses memory at around 10 GB/s, whereas an SSD delivers around 600 MB/s.
The partitions that don't fit into memory when MEMORY_ONLY is chosen are recomputed using the parent RDD's partitioning. If you have no wide dependency, that should be OK.
It is impossible to tell without the context, but there are at least two cases where MEMORY_AND_DISK behaves differently from MEMORY_ONLY:
Data is larger than available memory - with MEMORY_AND_DISK, partitions that don't fit in memory will be stored on disk.
Partitions have been evicted from memory - with MEMORY_AND_DISK they are stored on disk; with MEMORY_ONLY they are lost and have to be recomputed, and eviction might trigger a large GC sweep.
Finally, you have to remember that the _DISK levels can benefit from different levels of hardware and software caching, so some blocks might be accessed at a speed comparable to main memory.

How does Spark evict cached partitions?

I'm running Spark 2.0 in stand-alone mode, and I'm the only one submitting jobs in my cluster.
Suppose I have an RDD with 100 partitions and only 10 partitions in total would fit in memory at a time.
Let's also assume that allotted execution memory is enough and will not interfere with storage memory.
Suppose I iterate over the data in that RDD.
rdd.persist() // MEMORY_ONLY
for (_ <- 0 until 10) {
  // ... run an action over rdd ...
}
For each iteration, will the first 10 partitions that are persisted always be in memory until rdd.unpersist()?
As far as I know, Spark uses an LRU (Least Recently Used) eviction strategy for RDD partitions by default. They are working on adding new strategies.
This strategy removes the element that was least recently used. The last-used timestamp is updated when an element is put into the cache or retrieved from the cache.
I suppose you will always have 10 partitions in memory, but which ones are kept in memory and which get evicted depends on their use. According to the Apache Spark FAQ:
Likewise, cached datasets that do not fit in memory are either spilled
to disk or recomputed on the fly when needed, as determined by the
RDD's storage level.
Thus, it depends on your configuration whether other partitions are spilled to disk or recomputed on the fly. Recomputation is the default for RDDs, which is not always the most efficient option. You can set a dataset's storage level to MEMORY_AND_DISK to avoid this.
I think I found the answer, so I'm going to answer my own question.
The eviction policy seems to be in the MemoryStore class. Here's the source code.
It seems that entries are not evicted to make room for entries from the same RDD.
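To make that rule concrete, here is a toy model in plain Scala. It is not Spark's MemoryStore; it only assumes the behaviour described in the answers above (least-recently-used order, and no eviction of a block to make room for another block of the same RDD). Every block has size 1, and all names are hypothetical.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.mutable.ArrayBuffer

/** Toy model of an LRU block store. NOT Spark's MemoryStore. */
final class ToyMemoryStore(capacity: Int) {
  // Access-ordered map: iteration starts at the least recently used entry.
  private val entries =
    new JLinkedHashMap[(Int, Int), java.lang.Boolean](32, 0.75f, true)

  /** Cache lookup; a hit marks the block as most recently used. */
  def get(rddId: Int, part: Int): Boolean = entries.get((rddId, part)) != null

  /** Tries to cache a block; returns false if no room could be made. */
  def put(rddId: Int, part: Int): Boolean = {
    val hasRoom = entries.size < capacity || evictOneNotFrom(rddId)
    if (hasRoom) entries.put((rddId, part), java.lang.Boolean.TRUE)
    hasRoom
  }

  // Evicts the least recently used block that belongs to a different RDD.
  private def evictOneNotFrom(rddId: Int): Boolean = {
    val it = entries.keySet.iterator
    var victim: Option[(Int, Int)] = None
    while (victim.isEmpty && it.hasNext) {
      val key = it.next()
      if (key._1 != rddId) victim = Some(key)
    }
    victim.foreach(k => entries.remove(k))
    victim.isDefined
  }

  /** Partitions of the given RDD currently held, in ascending order. */
  def cachedPartitions(rddId: Int): List[Int] = {
    val out = ArrayBuffer.empty[Int]
    val it = entries.keySet.iterator
    while (it.hasNext) {
      val (r, p) = it.next()
      if (r == rddId) out += p
    }
    out.sorted.toList
  }
}

object ToyMemoryStoreDemo {
  def main(args: Array[String]): Unit = {
    val store = new ToyMemoryStore(capacity = 10)
    // 10 passes over an RDD (id 1) with 100 partitions.
    for (_ <- 0 until 10; p <- 0 until 100) {
      // A task looks for its partition first and tries to cache it on a miss.
      if (!store.get(1, p)) store.put(1, p)
    }
    println(store.cachedPartitions(1))
  }
}
```

Under this model the first 10 partitions stay cached across all passes and the other 90 are never admitted, which matches the reading of the source given above. A block from a different RDD would still push out the least recently used of them.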

Storing intermediate data in Spark when there are 100s of operations in an application

An RDD is inherently fault-tolerant due to its lineage. But if an application has 100s of operations, it becomes expensive to reconstruct the data by replaying all those operations. Is there a way to store the intermediate data?
I understand that there are the options of persist()/cache() to hold the RDDs. But are they good enough to hold the intermediate data? Would checkpointing be an option at all? Also, is there a way to specify the level of storage when checkpointing an RDD (like MEMORY or DISK)?
While cache() and persist() are generic, checkpointing is mostly associated with streaming (although a plain RDD can also be checkpointed with SparkContext.setCheckpointDir and rdd.checkpoint()).
caching - caching might happen on memory or disk
persist - you can give option where you want to persist your data either in memory or disk
rdd.persist(storage level)
checkpoint - you need to specify a directory where you need to save your data (in reliable storage like HDFS/S3)
val ssc = new StreamingContext(...) // new context
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
There is a significant difference between cache/persist and checkpoint.
Cache/persist materializes the RDD and keeps it in memory and / or disk. But the lineage of RDD (that is, seq of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the cached RDDs are lost, they can be regenerated.
However, checkpoint saves the RDD to an HDFS file AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault tolerant by replication).
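The difference can be seen on a plain RDD, outside streaming. The following is only a sketch: it assumes Spark on the classpath, a local master, and a temporary local directory as the checkpoint location (on a real cluster this should be reliable storage such as HDFS). The object and method names are made up. Note that rdd.checkpoint() takes no storage level: the data always goes to the checkpoint directory, and persisting first just avoids computing the RDD twice.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object CheckpointVsPersist {
  /** Returns (persisted.isCheckpointed, checkpointed.isCheckpointed). */
  def run(spark: SparkSession, checkpointDir: String): (Boolean, Boolean) = {
    val sc = spark.sparkContext
    sc.setCheckpointDir(checkpointDir)

    // A chain of narrow transformations standing in for a long lineage.
    val base = sc.parallelize(1 to 1000, 4)
    val deep = (1 to 50).foldLeft(base)((rdd, _) => rdd.map(_ + 1))

    // Persist only: data is kept, lineage is kept as well.
    val persisted = deep.map(identity).persist(StorageLevel.MEMORY_AND_DISK)

    // Persist + checkpoint: lineage is cut once the checkpoint is written.
    val checkpointed = deep.map(identity).persist(StorageLevel.MEMORY_AND_DISK)
    checkpointed.checkpoint() // must be requested before the first action

    persisted.count()
    checkpointed.count() // the checkpoint is written as part of this job

    println(persisted.toDebugString)    // still lists the chain of map steps
    println(checkpointed.toDebugString) // starts from the checkpointed data

    (persisted.isCheckpointed, checkpointed.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("checkpoint-vs-persist")
      .getOrCreate()
    val dir = Files.createTempDirectory("spark-checkpoint").toString
    println(run(spark, dir))
    spark.stop()
  }
}
```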
(Why) do we need to call cache or persist on a RDD

What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read at multiple places that using cache() on a RDD will cause it to be stored in memory but I haven't so far found clear guidelines or rules of thumb on "How to determine the max size of data" that one could cram into memory? What happens if the amount of data that I am calling "cache" on, exceeds the memory ? Will it cause my job to fail or will it still complete with a noticeable impact on Cluster performance?
As it is clearly stated in the official documentation with MEMORY_ONLY persistence (equivalent to cache):
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
Even if the data fits into memory, it can be evicted when new data comes in. In practice caching is more a hint than a contract. You cannot depend on caching taking place, but you don't have to: the job completes whether or not caching succeeds.
Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:
If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
See also:
(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?
