Where does Spark actually persist RDDs on disk? - apache-spark

I am using persist with different storage levels, but I found no difference in performance between MEMORY_ONLY and DISK_ONLY.
I think there might be something wrong with my code... Where can I find the persisted RDDs on disk so that I can make sure they were actually persisted?

As per the doc:
spark.local.dir (by default /tmp)
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
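To check on a worker node whether blocks were really written, a minimal sketch along these lines may help. It rests on assumptions that are not stated in the doc quoted above and may vary by Spark version and cluster manager: that persisted RDD blocks are files named rdd_<rddId>_<partitionIndex>, and that they sit somewhere below a blockmgr-* directory inside the scratch space. The helper names scratch_dirs and find_rdd_blocks are hypothetical, and the sketch only looks at the two environment variables and the /tmp default, not at a spark.local.dir value set in the Spark configuration.

```python
import os
import re

# Assumption: persisted RDD block files are named rdd_<rddId>_<partitionIndex>.
RDD_BLOCK_PATTERN = re.compile(r"^rdd_(\d+)_(\d+)$")


def scratch_dirs(env=None):
    """Return candidate scratch directories, cluster-manager overrides first."""
    env = os.environ if env is None else env
    for var in ("SPARK_LOCAL_DIRS", "LOCAL_DIRS"):
        value = env.get(var)
        if value:
            # Both variables may hold a comma-separated list of directories.
            return [d.strip() for d in value.split(",") if d.strip()]
    return ["/tmp"]  # documented default of spark.local.dir


def find_rdd_blocks(roots):
    """Walk the scratch directories and collect files that look like RDD blocks."""
    found = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            # Assumption: block files live below a blockmgr-* directory.
            if "blockmgr-" not in dirpath:
                continue
            for name in filenames:
                match = RDD_BLOCK_PATTERN.match(name)
                if match:
                    rdd_id, partition = int(match.group(1)), int(match.group(2))
                    found.append((rdd_id, partition, os.path.join(dirpath, name)))
    return sorted(found)
```

Running find_rdd_blocks(scratch_dirs()) on each worker while the application is still alive should list one entry per partition persisted on that node. An empty result after a DISK_ONLY persist plus an action would suggest either that the persist never happened or that the scratch space is somewhere else.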

Two possible reasons for your observation:
RDDs are persisted lazily, so for persist() to take effect you must call an action (e.g. count()) on the RDD after calling persist().
Even if the persist() did happen, the data may never actually be read back from disk: the write call returns as soon as the data is in the OS buffer cache, so when you read it right after writing it, the OS simply returns the cached data.
So, did the persist actually happen?
Did you clear the Linux buffer cache on each node after persisting the RDD as DISK_ONLY, and before operating on it and measuring performance?
What I suggest you do is:
Persist the RDD as DISK_ONLY and invoke an action (e.g. count()) to make the persist happen.
Pause the application for a few seconds and, during this period, clear the buffer cache on all worker nodes (as root):
sync && echo 3 > /proc/sys/vm/drop_caches
Resume your procedure and measure the performance of the persisted RDD.
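The measuring procedure above can be sketched with two small helpers. This is only an illustration: drop_page_cache and timed are hypothetical names, the control file path is the one from the shell command above, and on a real node the write needs root privileges.

```python
import os
import time


def drop_page_cache(control_file="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the Linux kernel to drop its caches.

    Equivalent to `sync && echo 3 > /proc/sys/vm/drop_caches`; needs root
    when pointed at the real control file.
    """
    os.sync()
    with open(control_file, "w") as f:
        f.write("3\n")


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start
```

With a PySpark RDD the sequence would be roughly: persist as DISK_ONLY, call count() once to materialize, run drop_page_cache() on every worker node (not on the driver), then compare timed(rdd.count) against the same measurement for a MEMORY_ONLY run.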


RDD in Spark: where and how are they stored?

I've always heard that Spark is 100x faster than classic MapReduce frameworks like Hadoop. But recently I read that this is only true if RDDs are cached, which I thought always happened but which instead requires an explicit call to cache().
I would like to understand how all produced RDDs are stored throughout the work. Suppose we have this workflow:
I read a file -> I get the RDD_ONE
I use the map on the RDD_ONE -> I get the RDD_TWO
I use any other transformation on the RDD_TWO
If I don't use cache() or persist(), is every RDD stored in memory, in cache or on disk (local file system or HDFS)?
If RDD_THREE depends on RDD_TWO, and this in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, does Spark have to recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE?
In Spark there are two types of operations: transformations and actions. A transformation on a dataframe returns another dataframe, and an action on a dataframe returns a value.
Transformations are lazy, so when a transformation is performed Spark adds it to the DAG and executes it only when an action is called.
Suppose you read a file into a dataframe, then perform a filter, a join, an aggregate, and then a count. The count operation, which is an action, actually kicks off all the previous transformations.
If we call another action (like show), the whole chain of operations is executed again, which can be time-consuming. So, if we don't want to run the whole set of operations again and again, we can cache the dataframe.
A few pointers to consider when caching:
Cache only when the resulting dataframe is produced by significant transformations. If Spark can regenerate the dataframe in a few seconds, caching is not required.
Cache when the dataframe is used for multiple actions. If there are only 1-2 actions on the dataframe, it is not worth keeping it in memory.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
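The recompute-versus-cache behaviour can be illustrated with a toy model in plain Python. This is not Spark's API or implementation: ToyRDD and read_source are made-up names, and the model ignores partitions, executors and storage levels. It only shows that a transformation records a lineage step, an action walks the lineage, and cache() is a flag that takes effect on the first action.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores how to compute its data, not the data."""

    def __init__(self, compute, parent=None):
        self._compute = compute          # function: parent data (or None) -> list
        self._parent = parent
        self._cache_requested = False
        self._cached = None

    def map(self, fn):
        # Transformation: only records a new lineage step, runs nothing.
        return ToyRDD(lambda data: [fn(x) for x in data], parent=self)

    def cache(self):
        # Lazy, like persist(): only sets a flag.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        parent_data = self._parent._materialize() if self._parent else None
        data = self._compute(parent_data)
        if self._cache_requested:
            self._cached = data
        return data

    def count(self):
        # Action: triggers evaluation of the whole lineage.
        return len(self._materialize())


reads = []  # one entry per read of the (pretend) input file


def read_source(_):
    reads.append("read")
    return [1, 2, 3, 4]


rdd_one = ToyRDD(read_source)
rdd_two = rdd_one.map(lambda x: x * 2)
rdd_two.count()
rdd_two.count()
uncached_reads = len(reads)      # the source is read once per action

reads.clear()
rdd_three = rdd_one.map(lambda x: x + 1).cache()
rdd_three.count()
rdd_three.count()
cached_reads = len(reads)        # the source is read for the first action only
```

In this toy run the uncached chain reads the source twice and the cached chain reads it once, which mirrors why a second action on an uncached RDD repeats the whole lineage.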
To answer your questions:
Q1: If I don't use cache() or persist(), is every RDD stored in memory, in cache or on disk (local file system or HDFS)? Ans: Assume the data sits on the worker nodes as HDFS blocks. When you create an RDD for the file with
val rdd=sc.textFile("<HDFS Path>")
nothing is read yet. When an action runs, the underlying HDFS blocks on each node are loaded into that node's RAM as partitions (in Spark, blocks of HDFS data are called partitions once loaded into memory). Without cache() or persist(), these partitions are not kept around once the action has finished.
Q2: If RDD_THREE depends on RDD_TWO, and this in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, does Spark have to recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE? Ans: Yes, because without cache() the intermediate results are not kept in the executors' memory.

Differences between persist(DISK_ONLY) vs manually saving to HDFS and reading back

This answer clearly explains RDD persist() and cache() and the need for it - (Why) do we need to call cache or persist on a RDD
So, I understand that calling someRdd.persist(DISK_ONLY) is lazy, but someRdd.saveAsTextFile("path") is eager.
But other than this (also disregarding the cleanup of text file stored in HDFS manually), is there any other difference (performance or otherwise) between using persist to cache the rdd to disk versus manually writing and reading from disk?
Is there a reason to prefer one over the other?
More Context: I came across code which manually writes to HDFS and reads it back in our production application. I've just started learning Spark and was wondering if this can be replaced with persist(DISK_ONLY). Note that the saved rdd in HDFS is deleted before every new run of the job and this stored data is not used for anything else between the runs.
There are at least these differences:
Writing to HDFS incurs the overhead of HDFS replication, while cached data is written locally on the executor (plus a second replica if DISK_ONLY_2 is chosen).
Writing to HDFS is persistent, while cached data might get lost if/when an executor is killed for any reason. And you already mentioned the benefit of writing to HDFS when the entire application goes down.
Caching does not change the partitioning, but reading from HDFS might/will result in different partitioning than the original written DataFrame/RDD. For example, small partitions (files) will be aggregated and large files will be split.
I usually prefer to cache small/medium data sets that are expensive to evaluate, and write larger data sets to HDFS.

How does Spark recover the data from a failed node?

Suppose we have an RDD which is used multiple times. To avoid repeating the computation again and again, we persist this RDD using the rdd.persist() method.
When we persist this RDD, the nodes computing it will store their partitions.
Now suppose a node holding a persisted partition of the RDD fails. What will happen? How will Spark recover the lost data? Is there any replication mechanism? Or some other mechanism?
When you call rdd.persist, the RDD doesn't materialize its content; that happens when you perform an action on the RDD. It follows the same lazy evaluation principle.
Now, an RDD knows the partitions on which it should operate and the DAG associated with it. With the DAG it is perfectly capable of recreating a materialized partition.
So, when a node fails, the driver starts another executor on some other node and provides it, in a closure, with the data partition it is supposed to work on and the associated DAG. With this information the executor can recompute the data and materialize it.
In the meantime, the cached RDD won't have all of its data in memory: the data of the lost node has to be fetched from disk and recomputed, so it will take a little more time.
As for replication: yes, Spark supports in-memory replication. You need to set StorageLevel.MEMORY_AND_DISK_2 when you persist.
This ensures each partition is stored on two nodes.
I think the best way I was able to understand how Spark is resilient was when someone told me that I should not think of RDDs as big, distributed arrays of data.
Instead I should picture them as containers holding instructions on which steps to take to convert data from the data source, one step at a time, until a result is produced.
Now if you really care about losing data when persisting, then you can specify that you want to replicate your cached data.
For this, you need to select a storage level. So instead of the levels you would normally use:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
you can specify that you want your persisted data replicated:
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
So if the node fails, you will not have to recompute the data.
Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
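The effect of the _2 levels on recovery can be illustrated with a toy placement model in plain Python. This is an assumption-laden sketch, not how Spark chooses replica locations: the round-robin placement, the node names and both function names are hypothetical. It only shows which partitions would have to be recomputed from lineage after a single node is lost.

```python
def place_partitions(num_partitions, nodes, replication=1):
    """Assign each partition to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); placement is a toy policy.
    """
    placement = {}
    for p in range(num_partitions):
        placement[p] = [nodes[(p + r) % len(nodes)] for r in range(replication)]
    return placement


def partitions_to_recompute(placement, failed_node):
    """Partitions whose every cached copy lived on the failed node."""
    return [p for p, holders in placement.items()
            if all(h == failed_node for h in holders)]


nodes = ["n1", "n2", "n3"]
single = place_partitions(6, nodes, replication=1)   # like MEMORY_ONLY
double = place_partitions(6, nodes, replication=2)   # like MEMORY_ONLY_2

lost_without_replica = partitions_to_recompute(single, "n1")
lost_with_replica = partitions_to_recompute(double, "n1")
```

In this toy run, losing n1 forces a recompute of the partitions that lived only there when replication is 1, while with replication 2 every partition still has a surviving copy.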

Should you always unpersist earlier cached rdds after you cache an rdd that appears later in the same lineage graph?

I have an rdd that I cache after loading data from s3, since I don't want to have to re-pull from s3 if I lose an executor. I then make a bunch of transformations on that rdd, and then cache again.
At this point, is there any reason to leave the first cached rdd in the cache? Will all later stages just pull from the more recently cached transformation if I don't use the earlier rdd again?
I don't want to have to re-pull from s3 if I lose an executor.
Default caching variants don't protect you from executor loss. Spark provides replicated cache options (MEMORY_ONLY_SER_2, MEMORY_AND_DISK_SER_2, DISK_ONLY_2) which add some protection in case of node failure, but they're more expensive than the non-replicated variants.
is there any reason to leave the first cached rdd in the cache?
If the second one has been materialized, then there is no reason to keep the first one, but the LRU cleaner should be able to handle this case without your help if necessary.
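The LRU idea can be sketched with a toy block store in plain Python. This is a simplified model, not Spark's memory store: ToyBlockStore and the block names are hypothetical, capacity is counted in blocks rather than bytes, and any extra eviction rules Spark applies are ignored.

```python
from collections import OrderedDict


class ToyBlockStore:
    """Fixed-capacity store that evicts the least recently used block first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()   # oldest access first, newest last
        self.evicted = []

    def put(self, block_id, data):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)
        self._blocks[block_id] = data
        while len(self._blocks) > self.capacity:
            old_id, _ = self._blocks.popitem(last=False)
            self.evicted.append(old_id)

    def get(self, block_id):
        if block_id not in self._blocks:
            return None  # in Spark, a miss would mean recomputing from lineage
        self._blocks.move_to_end(block_id)
        return self._blocks[block_id]


store = ToyBlockStore(capacity=2)
store.put("first_rdd_0", "rows pulled from s3")
store.put("second_rdd_0", "transformed rows")
store.get("second_rdd_0")                 # only the later RDD is used again
store.put("third_rdd_0", "more rows")     # store is full, so one block must go
```

Here the block of the first RDD is the one that gets evicted, because nothing touched it after the second RDD was materialized. An explicit unpersist() on the first RDD frees the space immediately instead of waiting for memory pressure.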

Is Spark RDD cached on worker node or driver node (or both)?

Can anyone please correct my understanding of persisting in Spark?
If we have performed a cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed initially.
Meaning, suppose there is a cluster of 100 nodes and the RDD is computed in partitions on the first and second nodes. If we cache this RDD, then Spark is going to cache its value only in first or second worker nodes.
So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes.
Am I correct?
Is it something that the RDD value is persisted in driver memory and not on nodes ?
Change this:
then Spark is going to cache its value only in first or second worker nodes.
to this:
then Spark is going to cache its value only in first and second worker nodes.
and... yes, correct!
Spark tries to minimize memory usage (and we love it for that!), so it won't load anything into memory unnecessarily. It evaluates every statement lazily, i.e. it does no actual work on a transformation; it waits for an action, which leaves Spark no choice but to do the actual work (read the file, send the data over the network, do the computation, collect the result back to the driver, for example).
You see, we don't want to cache everything unless we really can, that is, unless the memory capacity allows for it (yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is really common when we handle big data) and it really makes sense, i.e. the cached RDD is going to be used again and again (so caching it will speed up the execution of our job).
That's why you want to unpersist() your RDD when you no longer need it! :)
In one of my jobs, I had requested 100 executors, but the Executors tab displayed 101, i.e. 100 slaves/workers and one master/driver.
RDD.cache is a lazy operation: it does nothing until you call an action like count. Once that action has run, later operations will use the cache; they just take the data from the cache and do their work.
RDD.cache persists the RDD with the default storage level (memory only).
2. Is it something that the RDD value is persisted in driver memory and not on nodes?
An RDD can be persisted to disk as well as to memory. See the Spark documentation on RDD persistence for all the options.
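# assumes `spark` is a SparkSession and `myfunc` is your own row function
# no actual caching at the end of this statement
rdd1 = spark.read.json('myfile.json').rdd.map(lambda row: myfunc(row)).cache()
# again, no actual caching yet, because Spark is lazy and won't evaluate anything until
# a reduction op runs (`mysecondfunc` is a placeholder for a second transformation)
rdd2 = rdd1.map(mysecondfunc)
# caching is done on this reduce operation. Result of rdd1 will be cached in the memory of each worker node
n = rdd2.count()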
So to answer your question
If we have performed a cache() on an RDD its value is cached only on those nodes where actually RDD was computed initially
The only possibility of caching something is on worker nodes, and not on driver nodes.
The cache function is applied to an RDD, and since an RDD's data only exists in the worker nodes' memory (Resilient Distributed Datasets!), its results are cached in the respective worker nodes' memory. Once you apply an operation like count, which brings a result back to the driver, that result is not an RDD anymore; it's merely the outcome of a computation the worker nodes did on the RDD in their respective memories.
Since cache in the above example was called on rdd1, which is still spread across multiple worker nodes, the caching only happens in the worker nodes' memory.
In the above example, when you do some map-reduce op on rdd1 again, it won't read the JSON again, because it was cached.
FYI, I am using the word memory based on the assumption that the caching level is set to MEMORY_ONLY. Of course, if that level is changed to others, Spark will cache to either memory or storage based on the setting
Here is an excellent answer on caching
(Why) do we need to call cache or persist on a RDD
Basically, caching stores the RDD's partitions in the memory or on the disk (depending on the persistence level set) of the node that computed them, so that when this RDD is used again Spark does not need to recompute its lineage (lineage: the set of prior transformations executed to reach the current state).
