How does Spark handle partial cache/persist results?

If you cache a very large dataset that cannot all be stored in memory or on disk, how does Spark handle the partial cache? How does it know which data needs to be recomputed when you go to use that DataFrame again?
Example:
Read a 100 GB dataset into memory as df1
Compute a new DataFrame df2 based on df1
Cache df2
If Spark can only fit 50 GB of cache for df2, what happens when you go to reuse df2 for the next steps? How would Spark know which data it doesn't need to recompute and which it does? Will it need to re-read the data that it couldn't persist?
UPDATE
What happens if you have 5 GB of memory and 5 GB of disk and try to cache a 20 GB dataset? What happens to the other 10 GB of data that can't be cached, and how does Spark know which data it needs to recompute and which it doesn't?

Spark has this default option for DataFrames and Datasets:
MEMORY_AND_DISK – This is the default behavior of the DataFrame or Dataset. At this storage level, the DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, it stores some of the excess partitions on local disk and reads the data back from local disk when it is required. This is slower, as there is I/O involved.
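As a hedged sketch of the above (the input path, filter column, and app name are illustrative assumptions, not from the original question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-demo").getOrCreate()

// Hypothetical large input; the path is illustrative only.
val df1 = spark.read.parquet("/data/large_dataset")
val df2 = df1.filter(col("value") > 0)

df2.cache()                                   // DataFrame default: MEMORY_AND_DISK
// explicit equivalent:
// df2.persist(StorageLevel.MEMORY_AND_DISK)

df2.count()                                   // an action materializes the cache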
However, to be more specific: Spark's unit of processing is a partition = 1 task, so the discussion is really about a partition or partitions fitting into memory and/or local disk.
If a partition of the DataFrame doesn't fit in memory or on disk when using StorageLevel.MEMORY_AND_DISK, then the OS will fail, aka kill, the Executor / Worker. Eviction of partitions of other DataFrames may occur, but not of your own DataFrame. The .cache is either successful or not; there is no re-reading in this case.
I base this on the fact that partition eviction does not occur for partitions belonging to the same underlying RDD. This is not well documented, but see: How does Spark evict cached partitions?. In the end, partitions of other RDDs may be evicted and recomputed, but you also need enough local disk as well as memory.
A good read is: https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/

Related

How does Spark read data behind the scenes?

I am slightly confused as to how Spark reads data from S3, for example. Let's say there is 100 GB of data to be read from S3 and the Spark cluster has a total memory of 30 GB. Will Spark read all 100 GB of the data once an action is triggered, store the maximum number of partitions in memory, and spill the rest to disk? Or will it read only the partitions that it can store in memory, process them, and then read the rest of the data? Any link to some documentation would be highly appreciated.
There is a question in the Spark FAQ about this:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
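A minimal sketch of this lazy, partition-wise reading (the bucket name is a placeholder, assuming a SparkSession named spark is already built):

// Nothing is read here; Spark only records the logical plan.
val df = spark.read.parquet("s3a://my-bucket/100gb-dataset/")

// The read happens once an action runs, one task per partition;
// partitions are processed as cores free up, not all loaded at once.
println(s"rows: ${df.count()}")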

How does Spark handle more data than its memory capacity?

Say my Spark cluster has 100 GB of memory, and during the Spark computing process more data (new dataframes, caches) with a size of 200 GB is generated. In this case, will Spark store some of this data on disk, or will it just OOM?
Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads in data in partitions - the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task.
If you apply no transformation but only do, for instance, a count, Spark will still read in the data in partitions, but it will not store any data in your cluster, and if you do the count again it will read in all the data once more. To avoid reading in data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. On cache of an RDD (which is the same as persist(StorageLevel.MEMORY_ONLY); for DataFrames the default is MEMORY_AND_DISK) it will store all partitions in memory - if it doesn't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and the rest will be put on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
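To make the cache-versus-persist point above concrete, a sketch (assuming a running SparkContext named sc):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

rdd.cache()                                   // for RDDs this means MEMORY_ONLY
// or instead, to spill overflow to disk rather than OOM:
// rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()                                   // action materializes the cache
println(rdd.getStorageLevel)                  // the level actually assigned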
In Apache Spark, if the data does not fit into memory then Spark simply persists that data to disk. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
The persist method in Apache Spark provides several storage levels to persist the data: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala), MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, and OFF_HEAP. The OFF_HEAP storage level is experimental.
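The flags behind each level can be inspected on the StorageLevel object itself, which can help when logging cache decisions; a small sketch:

import org.apache.spark.storage.StorageLevel

val level = StorageLevel.MEMORY_AND_DISK_SER
println(s"useMemory=${level.useMemory}, useDisk=${level.useDisk}, " +
  s"deserialized=${level.deserialized}, replication=${level.replication}")
// An existing RDD would take it via rdd.persist(level).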

Apache Spark ---- how Spark reads large partitions from the source when there is not enough memory

Suppose my data source contains data in 5 partitions, each partition 10 GB in size, so the total data size is 50 GB. My doubt here is: when my Spark cluster doesn't have 50 GB of main memory, how does Spark handle out-of-memory exceptions, and what is the best practice to avoid these scenarios in Spark?
50 GB is data that can fit in memory, and you probably don't need Spark for this kind of data - it would run slower than other solutions.
Also, depending on the job and data format, a lot of the time not all the data needs to be read into memory (e.g. reading just the needed columns from a columnar storage format like Parquet).
Generally speaking - when the data can't fit in memory, Spark will write temporary files to disk. You may need to tune the job to use more, smaller partitions so each individual partition fits in memory; see Spark Memory Tuning (a sketch follows below).
Arnon
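A hedged sketch of that tuning (the partition count and split size are arbitrary illustrations, assuming an existing DataFrame df and SparkSession spark):

// Raise the partition count so each partition is smaller and more
// likely to fit in executor memory.
val tuned = df.repartition(400)

// For file sources, the input split size can instead be capped at
// read time, before the DataFrame is created:
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)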

Can Spark store part of a single RDD partition in memory and part on disk?

Per the title: can Spark store part of a single RDD/Dataset/DataFrame partition in memory and part on disk? In other words, assuming the persistence level supports it, if a partition is too large to store in memory, can it be held partly in memory and partly on disk?
My use case is that I want to write out very large Parquet files, and Spark's write behavior is to write out a file for each partition.
I'm afraid that's not possible in Spark. The memory and disk options use the partition as the smallest unit of data.
According to the official documentation, if the MEMORY_AND_DISK storage level is used, partitions that do not fit in memory are saved to disk:
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_AND_DISK_SER has similar behavior but stores the RDD as serialized Java objects (one byte array per partition).
Perhaps you have some way to reduce the size of the partitions instead; I think that could help.
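For the Parquet use case in the question, a hedged sketch of shrinking partitions before the write (the count and output path are illustrative, assuming an existing DataFrame df):

// One Parquet file is written per partition, so raising the partition
// count yields more, smaller files whose partitions each fit in memory.
df.repartition(200)
  .write
  .mode("overwrite")
  .parquet("/output/large_table")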

How does Spark evict cached partitions?

I'm running Spark 2.0 in stand-alone mode, and I'm the only one submitting jobs in my cluster.
Suppose I have an RDD with 100 partitions and only 10 partitions in total would fit in memory at a time.
Let's also assume that allotted execution memory is enough and will not interfere with storage memory.
Suppose I iterate over the data in that RDD.
rdd.persist() // MEMORY_ONLY
for (_ <- 0 until 10) {
  rdd.map(...).reduce(...)
}
rdd.unpersist()
For each iteration, will the first 10 partitions that are persisted always be in memory until rdd.unpersist()?
For now, what I know is that Spark uses an LRU (Least Recently Used) eviction strategy for RDD partitions by default. They are working on adding new strategies:
https://issues.apache.org/jira/browse/SPARK-14289
This strategy removes the element that was least recently used; the last-used timestamp is updated when an element is put into the cache or retrieved from the cache.
I suppose you will always have 10 partitions in your memory, but which are stored in memory and which get evicted depends on their use. According to the Apache Spark FAQ:
Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Thus, it depends on your configuration whether other partitions are spilled to disk or recomputed on the fly. Recomputation is the default, which is not always the most efficient option. You can set a dataset's storage level to MEMORY_AND_DISK to avoid this.
I think I found the answer, so I'm going to answer my own question.
The eviction policy seems to be in the MemoryStore class. Here's the source code.
It seems that entries are not evicted to make room for entries belonging to the same RDD.
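As a loose illustration of that rule (a simplified sketch, not Spark's actual MemoryStore code), an LRU eviction pass that skips blocks of the RDD being stored might look like this:

import scala.collection.mutable.ArrayBuffer

case class Block(rddId: Int, partition: Int, sizeBytes: Long)

// Pick victims in LRU order, but never a block of the incoming RDD;
// if not enough space can be freed, the store attempt simply fails.
def selectBlocksToEvict(
    lruOrdered: Seq[Block],   // least recently used first
    incomingRddId: Int,
    bytesNeeded: Long): Seq[Block] = {
  val selected = ArrayBuffer.empty[Block]
  var freed = 0L
  for (b <- lruOrdered if freed < bytesNeeded && b.rddId != incomingRddId) {
    selected += b
    freed += b.sizeBytes
  }
  selected.toSeq
}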
