Spark - How to move persisted data from disk to cache? - apache-spark

I run into a pretty simple but frustrating pattern when doing ad-hoc data analysis:
You have rdd1 cached in memory, then you cache rdd2 in memory, which evicts rdd1 to disk because of memory constraints. If you were to unpersist rdd2, is there any way to tell Spark to move rdd1 back to memory?

I don't think it is possible to instruct Spark to bring rdd1 back into memory.
However, the next time rdd1 is accessed it will be loaded back into memory, provided you are using the MEMORY_AND_DISK persistence level (otherwise it gets recomputed).
If you are looking to reduce the space taken in memory, consider MEMORY_AND_DISK_SER, which serializes your objects before storing them.

Related

How does Spark handle partial cache/persist results?

If you cache a very large dataset that cannot all be stored in memory or on disk, how does Spark handle the partial cache? How does it know which data needs to be recomputed when you use that dataframe again?
Example:
Read 100 GB dataset into memory df1
Compute new dataframe df2 based on df1
cache df2
If Spark can only fit 50 GB of cache for df2, what happens when you reuse df2 in the next steps? How would Spark know which data it needs to recompute and which it doesn't? Will it need to re-read the data that it couldn't persist?
UPDATE
What happens if you have 5 GB of memory and 5 GB of disk and try to cache a 20 GB dataset? What happens to the other 10 GB of data that can't be cached, and how does Spark know which data it needs to recompute and which it doesn't?
Spark has this default option for DF and DS:
MEMORY_AND_DISK – This is the default behavior of the DataFrame or
Dataset. At this storage level, the DataFrame is stored in JVM
memory as deserialized objects. When the required storage is greater
than the available memory, it stores some of the excess partitions on
local disk and reads them back from local disk when they are required. It is slower as
there is I/O involved.
However, to be more specific:
Spark's unit of processing is a partition = 1 task. So the discussion
is more about partition or partitions fitting into memory and/or local
disk.
If a partition of the DF doesn't fit in memory and disk when using
StorageLevel.MEMORY_AND_DISK, then the OS will fail, aka kill, the
Executor / Worker. Eviction of partitions belonging to other DFs may occur, but
not for your own DF: the .cache is either successful or not, and there is no re-reading in this case.
I base this on the fact that partition eviction does not occur between partitions belonging to the same underlying RDD. This is not well documented, but see: How does Spark evict cached partitions?. In the end, other RDDs' partitions may be evicted and re-computed, but you also need enough local disk as well as memory.
A good read is: https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/

When will Spark clean the cached RDDs automatically?

RDDs that have been cached using the rdd.cache() method from the Scala shell are stored in memory.
That means they consume part of the RAM available to the Spark process itself.
Having said that, if RAM is limited and more and more RDDs are cached, when will Spark automatically free the memory occupied by the RDD cache?
Spark will clean cached RDDs and Datasets / DataFrames:
When it is explicitly asked to by calling RDD.unpersist (How to uncache RDD?) / Dataset.unpersist methods or Catalog.clearCache.
In regular intervals, by the cache cleaner:
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
When the corresponding distributed data structure is garbage collected.
Spark will automatically unpersist/clean the RDD or DataFrame if the RDD is no longer used. To check whether an RDD is cached, look at the Storage tab in the Spark UI and inspect the memory details.
From the shell, we can use rdd.unpersist() or sqlContext.uncacheTable("sparktable")
to remove the RDD or tables from memory. Spark is built around lazy evaluation: unless and until you invoke an action, it does not load or process any data into the RDD or DataFrame.
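The least-recently-used policy described in the quote can be illustrated with a toy model in plain Python (this is a sketch of the policy only, not Spark's actual block manager):

```python
from collections import OrderedDict

class LruBlockStore:
    """Toy model of LRU cache eviction (illustration only, not Spark code)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block id -> partition data

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)               # newest = most recently used
        while len(self.blocks) > self.capacity:
            evicted, _ = self.blocks.popitem(last=False)  # drop the LRU block
            print(f"evicted {evicted}")

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)           # touching a block refreshes it
            return self.blocks[block_id]
        return None  # evicted: Spark would recompute it or read it from disk here

store = LruBlockStore(capacity=2)
store.put("rdd1_part0", [1, 2])
store.put("rdd1_part1", [3, 4])
store.get("rdd1_part0")          # refresh part0, so part1 is now least recently used
store.put("rdd2_part0", [5, 6])  # evicts rdd1_part1, not rdd1_part0
print(sorted(store.blocks))      # ['rdd1_part0', 'rdd2_part0']
```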

Is Spark RDD cached on worker node or driver node (or both)?

Can anyone please correct my understanding of persisting in Spark?
If we have performed a cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed initially.
Meaning, if there is a cluster of 100 nodes and the RDD is computed in partitions on the first and second nodes, then if we cached this RDD, Spark is going to cache its value only on the first or second worker nodes.
So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes.
Am I correct?
(OR)
Or is it that the RDD value is persisted in driver memory and not on the nodes?
Change this:
then Spark is going to cache its value only in first or second worker nodes.
to this:
then Spark is going to cache its value only in first and second worker nodes.
...and yes, correct!
Spark tries to minimize memory usage (and we love it for that!), so it won't make any unnecessary memory loads. It evaluates every statement lazily, i.e. it won't do any actual work on a transformation; it will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, send the data over the network, do the computation, collect the result back to the driver, for example).
You see, we don't want to cache everything unless we really can (that is, the memory capacity allows for it; yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is really common when we handle big data) and it really makes sense, i.e. the cached RDD is going to be used again and again (so caching will speed up the execution of our job).
That's why you want to unpersist() your RDD, when you no longer need it...! :)
For example, in one of my jobs I had requested 100 executors, yet the Executors tab displayed 101, i.e. 100 slaves/workers and one master/driver.
RDD.cache is a lazy operation: it does nothing until you call an action like count. Once you call the action, the operation uses the cache, simply taking the data from the cache to do the work.
RDD.cache persists the RDD with the default storage level (MEMORY_ONLY).
Spark RDD API
2. Is it something that the RDD value is persisted in driver memory and not on nodes?
An RDD can be persisted to disk as well as to memory. See the Spark documentation for all the options:
Spark Rdd Persist
# no actual caching at the end of this statement (cache() itself is lazy)
rdd1 = spark.read.json('myfile.json').rdd.map(lambda row: myfunc(row)).cache()
# again, no actual caching yet, because Spark is lazy and won't evaluate anything
# until an action is called
rdd2 = rdd1.map(mysecondfunc)
# caching is done during this action. The result of rdd1 will be cached in the
# memory of each worker node
n = rdd1.count()
So to answer your question
If we have performed a cache() on an RDD its value is cached only on those nodes where actually RDD was computed initially
The only possibility of caching something is on worker nodes, and not on driver nodes.
The cache function can only be applied to an RDD (refer), and since an RDD only exists in the worker nodes' memory (Resilient Distributed Datasets!), its results are cached in the respective worker nodes' memory. Once you apply an operation like count, which brings the result back to the driver, it's not really an RDD anymore; it's merely the result of a computation done on the RDD by the worker nodes in their respective memories.
Since cache in the above example was called on rdd1, which is still on multiple worker nodes, the caching only happens in the worker nodes' memory.
In the above example, when you run another map-reduce op on rdd1, it won't read the JSON again, because it was cached.
FYI, I am using the word memory based on the assumption that the caching level is set to MEMORY_ONLY. Of course, if that level is changed, Spark will cache to memory or storage based on the setting.
Here is an excellent answer on caching
(Why) do we need to call cache or persist on a RDD
Basically, caching stores the RDD in the memory/disk (based on the persistence level set) of that node, so that when this RDD is needed again it does not have to recompute its lineage (lineage - the set of prior transformations executed to reach the current state).
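The lineage idea can be sketched with a toy model in plain Python (ToyRDD, read_source, and the counter are all made up for illustration; this is not Spark code):

```python
calls = {"reads": 0}

class ToyRDD:
    """Toy lineage model: each node remembers how to rebuild itself."""
    def __init__(self, compute):
        self._compute = compute   # closure over the parent: the lineage
        self._cache = None
        self._cache_enabled = False

    def map(self, f):
        # a transformation only records work; nothing runs yet
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):            # the "action"
        if self._cache is not None:
            return self._cache    # cache hit: the lineage is not re-run
        data = self._compute()
        if self._cache_enabled:
            self._cache = data
        return data

def read_source():
    calls["reads"] += 1           # count how often the source is re-read
    return [1, 2, 3]

base = ToyRDD(read_source)
doubled = base.map(lambda x: x * 2).cache()
doubled.collect()                 # first action: runs the whole lineage (1 read)
doubled.collect()                 # second action: served from cache (no extra read)

tripled = base.map(lambda x: x * 3)  # no cache(): lineage re-runs on every action
tripled.collect()
tripled.collect()
print(calls["reads"])             # -> 3 (1 for doubled, 2 for tripled)
```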

Storing intermediate data in Spark when there are 100s of operations in an application

An RDD is inherently fault-tolerant due to its lineage, but if an application has hundreds of operations it would be difficult to reconstruct it by going through all those operations. Is there a way to store the intermediate data?
I understand that there are the persist()/cache() options to hold the RDDs, but are they good enough to hold intermediate data? Would checkpointing be an option at all? Also, is there a way to specify the storage level when checkpointing an RDD (like MEMORY or DISK, etc.)?
While cache() and persist() are generic, checkpointing is most commonly associated with streaming, although plain RDDs can be checkpointed as well.
caching - caching might happen in memory or on disk
rdd.cache()
persist - you can choose where to persist your data, in memory or on disk
rdd.persist(storageLevel)
checkpoint - you need to specify a directory where the data will be saved (in reliable storage such as HDFS/S3)
val ssc = new StreamingContext(...) // new context
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
There is a significant difference between cache/persist and checkpoint.
Cache/persist materializes the RDD and keeps it in memory and / or disk. But the lineage of RDD (that is, seq of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the cached RDDs are lost, they can be regenerated.
However, checkpoint saves the RDD to an HDFS file AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault-tolerant through replication).
http://apache-spark-user-list.1001560.n3.nabble.com/checkpoint-and-not-running-out-of-disk-space-td1525.html
(Why) do we need to call cache or persist on a RDD

What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read in multiple places that using cache() on an RDD will cause it to be stored in memory, but so far I haven't found clear guidelines or rules of thumb on how to determine the maximum size of data one can cram into memory. What happens if the amount of data I call cache on exceeds the memory? Will it cause my job to fail, or will it still complete with a noticeable impact on cluster performance?
Thanks!
As it is clearly stated in the official documentation with MEMORY_ONLY persistence (equivalent to cache):
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
Even if data fits into memory, it can be evicted when new data comes in. In practice, caching is more a hint than a contract: you cannot depend on caching taking place, but nothing breaks if it doesn't succeed either.
Note:
Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:
If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
See also:
(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?
