How does Spark handle an out-of-memory error when cached (MEMORY_ONLY persistence) data does not fit in memory? - apache-spark

I'm new to Spark and I can't find a clear answer to: what happens when cached data does not fit in memory?
In many places I've read that if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
For example: let's say 500 partitions are created and 200 of them could not be cached; those 200 partitions then have to be recomputed by re-evaluating the RDD each time they're needed.
If that is the case, then an OOM error should never occur, but it does. What is the reason?
A detailed explanation is highly appreciated. Thanks in advance.

There are different ways you can persist your DataFrame in Spark.
1) Persist (MEMORY_ONLY)
When you persist a DataFrame with MEMORY_ONLY, it is cached in the executors' storage memory region as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level for RDDs and can sometimes cause an OOM error when the RDD is too big and cannot fit in memory (it can also occur during the recomputation effort).
To answer your question:
If that is the case, then an OOM error should never occur, but it does. What is the reason?
Even after recomputation, those partitions still need to fit in memory while they are being processed. If no space is available, the GC will try to free some memory and retry the allocation; if that does not succeed, the task fails with an OOM error.
2) Persist (MEMORY_AND_DISK)
When you persist a DataFrame with MEMORY_AND_DISK, it is cached in the storage memory region as deserialized Java objects, and if there is not enough heap memory, the remaining partitions are spilled to disk. So, to tackle memory pressure, Spark spills part of the data (or all of it) to disk. (Note: make sure there is enough disk space on the nodes, otherwise no-disk-space errors will pop up.)
3) Persist (MEMORY_ONLY_SER)
When you persist a DataFrame with MEMORY_ONLY_SER, it is cached in the storage memory region as serialized Java objects (one byte array per partition). This is generally more space-efficient than MEMORY_ONLY, but it is more CPU-intensive because serialization is involved (the general suggestion here is to use Kryo for serialization). It still faces OOM issues similar to MEMORY_ONLY.
4) Persist (MEMORY_AND_DISK_SER)
This is similar to MEMORY_ONLY_SER, with one difference: when no heap space is available, the serialized partitions are spilled to disk, the same as with MEMORY_AND_DISK. Use this option when you have a tight constraint on disk space and want to reduce IO traffic.
5) Persist (DISK_ONLY)
In this case, heap memory is not used for caching; the RDD's partitions are persisted to disk. Make sure there is enough disk space, and note that this option has a huge IO overhead. Don't use it for DataFrames that are used repeatedly.
6) Persist (MEMORY_ONLY_2 or MEMORY_AND_DISK_2)
These are similar to MEMORY_ONLY and MEMORY_AND_DISK above; the only difference is that these options replicate each partition on two cluster nodes, just to be on the safe side. Use them, for example, when you are running on spot instances.
7) Persist (OFF_HEAP)
Off-heap memory generally holds thread stacks, the Spark container application code, network IO buffers, and other OS application buffers. With this option you can also utilize that part of RAM, outside the JVM heap, for caching your RDD.
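For reference, a minimal Scala sketch of how these levels are selected in code; it assumes an existing DataFrame named df and uses the constants from org.apache.spark.storage.StorageLevel.
import org.apache.spark.storage.StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)       // deserialized in memory; partitions that don't fit are recomputed
df.count()                                 // persist is lazy, an action materializes the cache
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that don't fit in memory are spilled to local disk
df.count()
df.unpersist()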

Related

Why is spark MEMORY_AND_DISK slower than MEMORY_ONLY?

I have a pretty typical RDD scenario where I gather some data, persist it, and then use the persisted RDD multiple times for various transforms. Persisting speeds things up by an order of magnitude, so persisting is definitely warranted.
But I'm surprised at the relative speed of the different methods of persisting. If I persist using MEMORY_AND_DISK, each subsequent use of the persisted RDD takes about 10% longer than if I use MEMORY_ONLY. Why is that? I would have expected them to have the same speed if the data fits in memory, and I expected MEMORY_AND_DISK to be faster if some partitions don't fit in memory. Why do my timings consistently not show that to be true?
Your CPU typically accesses memory at around 10 GB/s, whereas access to an SSD is closer to 600 MB/s.
The partitions that don't fit into memory when MEMORY_ONLY is chosen are recomputed using the parent RDD's partitioning. If you have no wide dependencies, that should be fine.
It is impossible to tell without the context, but there are at least two cases where MEMORY_AND_DISK has an advantage:
Data is larger than available memory - with MEMORY_AND_DISK, partitions that don't fit in memory are stored on disk.
Partitions have been evicted from memory - with MEMORY_AND_DISK they are stored on disk; with MEMORY_ONLY they are lost and have to be recomputed, and eviction might trigger a large GC sweep.
Finally, you have to remember that _DISK levels can use different layers of hardware and software caching, so different blocks might be accessed at a speed comparable to main memory.
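As an illustration only, here is a rough Scala sketch of how you might time the two levels yourself; the time helper and the RDD named rdd are assumptions, and real numbers depend on data size, eviction, and disk speed.
import org.apache.spark.storage.StorageLevel
// Hypothetical helper: measures wall-clock time of a block and prints it.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}
rdd.persist(StorageLevel.MEMORY_ONLY)
time("MEMORY_ONLY, first pass")(rdd.count())     // first action materializes the cache
time("MEMORY_ONLY, cached")(rdd.count())
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)
time("MEMORY_AND_DISK, first pass")(rdd.count())
time("MEMORY_AND_DISK, cached")(rdd.count())
rdd.unpersist()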

Where is df.cache() stored?

I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:
df.cache() //df is a large dataframe (200GB)
And which has better performance: SQL cacheTable or cache()? My understanding is that one of them is lazy and the other is eager.
df.cache() calls the persist() method, which stores the data with the MEMORY_AND_DISK storage level, but you can change the storage level.
The persist() method calls
sparkSession.sharedState.cacheManager.cacheQuery()
and when you see the code for cacheTable it also calls the same
sparkSession.sharedState.cacheManager.cacheQuery()
That means both are the same and are lazily evaluated (only materialized once an action is performed), except that the persist method can store with the storage level provided. These are the available storage levels:
NONE
DISK_ONLY
DISK_ONLY_2
MEMORY_ONLY
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_ONLY_SER_2
MEMORY_AND_DISK
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
MEMORY_AND_DISK_SER_2
OFF_HEAP
You can also use the SQL CACHE TABLE statement, which is not lazily evaluated and stores the whole table in memory, which may also lead to OOM.
Summary: cache(), persist(), and cacheTable() are lazily evaluated and need an action to be performed to take effect, whereas SQL CACHE TABLE is eager.
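To make the difference concrete, here is a small Scala sketch contrasting the lazy and eager paths; the SparkSession named spark, the DataFrame df, and the view name "events" are assumptions.
df.cache()                                       // lazy: nothing is cached yet
df.count()                                       // the first action actually materializes the cache
df.unpersist()
df.createOrReplaceTempView("events")             // hypothetical view name
spark.catalog.cacheTable("events")               // also lazy; goes through the same cacheManager.cacheQuery()
spark.sql("SELECT count(*) FROM events").show()  // an action materializes this cache too
spark.sql("UNCACHE TABLE events")
spark.sql("CACHE TABLE events")                  // the SQL statement is eager: the table is scanned and cached immediately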
See here for details!
You can choose as per your requirement!
Hope this helps!
Just adding my 25 cents.
A Spark DataFrame cache() loads the data into executor memory.
It will not load it into driver memory, which is what's desired.
Here's a snapshot of about 50% of the data loaded after a df.cache().count() I just ran.
cache() persists to memory and disk, as delineated by koiralo, and is also lazily evaluated.
cacheTable() stores on disk and is resilient to node failures for this reason.
Credit: https://forums.databricks.com/answers/63/view.html
The cache (or persist) method marks the DataFrame for caching in memory (or disk, if necessary, as the other answer says), but this happens only once an action is performed on the DataFrame, and only in a lazy fashion, i.e., if you ultimately read only 100 rows, only those 100 rows are cached. Creating a temporary table and using cacheTable is eager in the sense that it will cache the entire table immediately. Which is more performant depends on your situation. One thing that I've done with ordinary DataFrame cache is to immediately call .count() right after, forcing the DataFrame to be cached, and obviating the need to register a temp table and such.
Spark Memory. This is the memory pool managed by Apache Spark. Its size can be calculated as (“Java Heap” – “Reserved Memory”) * spark.memory.fraction, and with Spark 1.6.0 defaults it gives us (“Java Heap” – 300MB) * 0.75. For example, with a 4GB heap this pool would be 2847MB in size. This whole pool is split into 2 regions – Storage Memory and Execution Memory – and the boundary between them is set by the spark.memory.storageFraction parameter, which defaults to 0.5. The advantage of this new memory management scheme is that this boundary is not static, and in case of memory pressure the boundary would be moved, i.e. one region would grow by borrowing space from the other. I will discuss “moving” this boundary a bit later; for now, let’s focus on how this memory is being used:
1. Storage Memory. This pool is used both for storing Apache Spark cached data and as temporary space for “unrolling” serialized data. All the “broadcast” variables are also stored there as cached blocks. In case you’re curious, here’s the code of unroll. As you can see, it does not require that enough memory for the unrolled block be available – if there is not enough memory to fit the whole unrolled partition, it puts it directly on the drive if the desired persistence level allows this. As for “broadcast”, all the broadcast variables are stored in the cache with the MEMORY_AND_DISK persistence level.
2. Execution Memory. This pool is used for storing the objects required during the execution of Spark tasks. For example, it is used to store shuffle intermediate buffer on the Map side in memory, also it is used to store hash table for hash aggregation step. This pool also supports spilling on disk if not enough memory is available, but the blocks from this pool cannot be forcefully evicted by other threads (tasks).
Ok, so now let’s focus on the moving boundary between Storage Memory and Execution Memory. Due to the nature of Execution Memory, you cannot forcefully evict blocks from this pool, because this is the data used in intermediate computations, and the process requiring this memory would simply fail if the block it refers to cannot be found. It is not so for Storage Memory – it is just a cache of blocks stored in RAM, and if we evict a block from there we can simply update the block metadata to reflect the fact that the block was evicted to HDD (or simply removed), and when trying to access this block Spark will read it from HDD (or recalculate it, in case your persistence level does not allow spilling to HDD).
So, we can forcefully evict a block from Storage Memory, but we cannot do so from Execution Memory. When can the Execution Memory pool borrow some space from Storage Memory? It happens when either:
There is free space available in Storage Memory pool, i.e. cached blocks don’t use all the memory available there. Then it just reduces the Storage Memory pool size, increasing the Execution Memory pool.
The Storage Memory pool size exceeds the initial Storage Memory region size and all of that space is utilized. This situation causes forceful eviction of blocks from the Storage Memory pool until it shrinks back to its initial size.
In turn, Storage Memory pool can borrow some space from Execution Memory pool only if there is some free space in Execution Memory pool available.
Initial Storage Memory region size, as you might remember, is calculated as “Spark Memory” * spark.memory.storageFraction = (“Java Heap” – “Reserved Memory”) * spark.memory.fraction * spark.memory.storageFraction. With default values, this is equal to (“Java Heap” – 300MB) * 0.75 * 0.5 = (“Java Heap” – 300MB) * 0.375. For 4GB heap this would result in 1423.5MB of RAM in initial Storage Memory region.
Reference: https://0x0fff.com/spark-memory-management/
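A quick back-of-the-envelope check of those defaults in Scala, assuming the quoted Spark 1.6-era values (300 MB reserved, spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5):
val javaHeapMb       = 4096.0                            // 4 GB executor heap
val reservedMb       = 300.0                             // reserved memory
val sparkMemoryMb    = (javaHeapMb - reservedMb) * 0.75  // unified Spark memory: ~2847 MB
val initialStorageMb = sparkMemoryMb * 0.5               // initial Storage Memory region: ~1423.5 MB
println(f"Spark memory: $sparkMemoryMb%.1f MB, initial storage region: $initialStorageMb%.1f MB")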

How does spark behave without enough memory (RAM) to create RDD

When I do sc.textFile("abc.txt")
Spark creates RDD in RAM (memory).
So does the cluster collective memory should be greater than size of the file “abc.txt”?
My worker nodes have disk space, so could I use disk space while reading the text file to create the RDD? If so, how do I do it?
How to work on big data which doesn’t fit into memory?
When I do sc.textFile("abc.txt") Spark creates RDD in RAM (memory).
The above point is not quite true. In Spark, there are transformations and there are actions. sc.textFile("abc.txt") is a transformation, and it does not load the data straight away; nothing is read until you trigger an action, e.g. count().
To give you a collective answer to all your questions, I would urge you to understand how Spark execution works. There are logical and physical plans. As part of the physical plan, Spark does a cost calculation (calculating the available resources across the cluster(s)) before it starts the jobs. If you understand these, you will get a clear idea about all your questions.
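A minimal sketch of that point, assuming an existing SparkContext named sc: sc.textFile only records the lineage, and nothing is read until an action runs.
val lines = sc.textFile("abc.txt")      // transformation: no data is loaded yet
val upper = lines.map(_.toUpperCase)    // still lazy
val n = upper.count()                   // action: only now does Spark read the file, streaming through it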
Your first assumption is incorrect:
Spark creates RDD in RAM (memory).
Spark doesn't create RDDs "in-memory". It uses memory but it is not limited to in-memory data processing. So:
So does the cluster collective memory should be greater than size of the file “abc.txt”?
No
My worker nodes have disk space so could I use disk space while reading texfile to create RDD? If so how to do it?
No special steps are required.
How to work on big data which doesn’t fit into memory?
See above.

How does Spark evict cached partitions?

I'm running Spark 2.0 in stand-alone mode, and I'm the only one submitting jobs in my cluster.
Suppose I have an RDD with 100 partitions and only 10 partitions in total would fit in memory at a time.
Let's also assume that allotted execution memory is enough and will not interfere with storage memory.
Suppose I iterate over the data in that RDD.
rdd.persist() // MEMORY_ONLY
for (_ <- 0 until 10) {
  rdd.map(...).reduce(...)
}
rdd.unpersist()
For each iteration, will the first 10 partitions that are persisted always be in memory until rdd.unpersist()?
As far as I know, Spark uses an LRU (Least Recently Used) eviction strategy for RDD partitions by default. They are working on adding new strategies:
https://issues.apache.org/jira/browse/SPARK-14289
This strategy removes the element that was least recently used. The last-used timestamp is updated when an element is put into the cache or retrieved from it.
I suppose you will always have 10 partitions in memory, but which ones are stored in memory and which get evicted depends on their use. According to the Apache Spark FAQ:
Likewise, cached datasets that do not fit in memory are either spilled
to disk or recomputed on the fly when needed, as determined by the
RDD's storage level.
Thus, it depends on your configuration whether other partitions are spilled to disk or recomputed on the fly. Recomputation is the default, which is not always the most efficient option. You can set a dataset's storage level to MEMORY_AND_DISK to avoid this.
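For example, here is a self-contained Scala sketch of that suggestion; the RDD and the map/reduce job are made up for illustration.
import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(1 to 1000000, 100)  // hypothetical RDD with 100 partitions
rdd.persist(StorageLevel.MEMORY_AND_DISK)    // partitions evicted from memory are spilled to disk, not recomputed
for (_ <- 0 until 10) {
  rdd.map(_ * 2).reduce(_ + _)
}
rdd.unpersist()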
I think I found the answer, so I'm going to answer my own question.
The eviction policy seems to be in the MemoryStore class. Here's the source code.
It seems that entries are not evicted to make room for entries belonging to the same RDD.

What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read in multiple places that using cache() on an RDD will cause it to be stored in memory, but so far I haven't found clear guidelines or rules of thumb on how to determine the maximum amount of data one could cram into memory. What happens if the amount of data that I am calling "cache" on exceeds the memory? Will it cause my job to fail, or will it still complete with a noticeable impact on cluster performance?
Thanks!
As is clearly stated in the official documentation for MEMORY_ONLY persistence (equivalent to cache):
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
Even if the data fits into memory, it can be evicted if new data comes in. In practice, caching is more a hint than a contract: you cannot depend on caching taking place, but you don't have to rely on it even when it succeeds.
Note:
Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:
If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
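If you want to verify what was actually requested, here is a small sketch assuming an existing RDD named rdd and DataFrame named df; note that the reported level is only the request, not a guarantee that the blocks stay in memory.
import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.MEMORY_ONLY)
println(rdd.getStorageLevel)   // prints the requested level for the RDD
df.cache()                     // the Dataset default is MEMORY_AND_DISK
println(df.storageLevel)       // available on Dataset since Spark 2.1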
See also:
(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?
