Why does Spark shuffle store intermediate data on disk? - apache-spark

Why does Spark store intermediate data on disk during a shuffle? I am trying to understand why it cannot keep it in memory. What are the challenges of writing it to memory?
Is any work being done to write it to memory?

Spark stores intermediate data on disk from a shuffle operation as part of its "under-the-hood" optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage of that graph if the RDD is already materialized as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted.
The source of this answer is the O'Reilly book Learning Spark by Karau, Konwinski, Wendell & Zaharia. Chapter 8: Tuning and Debugging Spark. Section: Components of Execution: Jobs, Tasks, and Stages.

Related

Why is Spark's read from file after a stage so fast?

Spark materializes its results on disk after a shuffle. While running an experiment, I saw that a Spark task read 65 MB of materialized data in 1 ms (some tasks were even shown to read it in 0 ms :)). My question is: how can Spark read data from the HDD so fast? Is it actually reading this data from a file or from memory?
The answer by @zero323 on this Stack Overflow post states: "To disk are written shuffle files. It doesn't mean that data after the shuffle is not kept in memory." But I couldn't find any official Spark source saying that Spark keeps shuffle output in memory, which would be preferred when it is read by the next task.
Is the Spark task reading shuffle output from disk or from memory? (If from memory, I would be thankful if someone could point to an official source.)
Spark shuffle outputs are written to disk. You can find this in the Spark documentation, under the Performance Impact section:
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space.

How does Spark recover the data from a failed node?

Suppose we have an RDD that is used multiple times. To avoid recomputing it again and again, we persisted it using the rdd.persist() method.
When we persist this RDD, the nodes computing it store their partitions.
Now suppose the node containing a persisted partition of the RDD fails; what happens then? How will Spark recover the lost data? Is there any replication mechanism, or some other mechanism?
When you call rdd.persist, the RDD doesn't materialize its content. It does so when you perform an action on the RDD; it follows the same lazy-evaluation principle.
An RDD knows the partitions it should operate on and the DAG associated with it. With the DAG it is perfectly capable of recreating the materialized partitions.
So, when a node fails, the driver spawns another executor on some other node and hands it, in a closure, the data partitions it is supposed to work on and the DAG associated with them. With this information it can recompute the data and materialize it.
In the meantime, the cached RDD won't have all of its data in memory; the partitions from the lost node have to be recomputed, so things will take a little more time.
On replication: yes, Spark supports in-memory replication. You need to use StorageLevel.MEMORY_AND_DISK_2 when you persist:
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
This ensures the data is replicated on two nodes.
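A minimal sketch of that pattern, assuming an existing SparkContext sc and a hypothetical RDD of pairs:

import org.apache.spark.storage.StorageLevel

// Hypothetical data set; any RDD works the same way.
val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

// Keep each partition (spilling to disk if it doesn't fit in memory) on two
// different nodes, so losing one node does not force a recompute from lineage.
pairs.persist(StorageLevel.MEMORY_AND_DISK_2)

pairs.count() // the first action materializes (and replicates) the partitions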
I think the best way I was able to understand how Spark is resilient was when someone told me that I should not think of RDDs as big, distributed arrays of data.
Instead, I should picture them as containers holding instructions on what steps to take to transform data from the data source, one step at a time, until a result is produced.
Now, if you really care about losing data when persisting, you can specify that you want your cached data replicated.
For this, you need to select a storage level. So instead of normally using one of these:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
You can specify that you want your persisted data replicated:
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
So if the node fails, you will not have to recompute the data.
Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
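For example, a short sketch (assuming an existing SparkContext sc and a hypothetical input path) that picks a replicated level and verifies it was applied:

import org.apache.spark.storage.StorageLevel

val lengths = sc.textFile("hdfs:///data/input.txt") // hypothetical path
  .map(_.length)

lengths.persist(StorageLevel.MEMORY_ONLY_2) // two in-memory copies, on different nodes

println(lengths.getStorageLevel) // reports the level that was set for this RDD

lengths.count() // the first action actually caches (and replicates) the data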

Misunderstanding of Spark RDD fault tolerance

Many say:
Spark does not replicate data in HDFS.
Spark arranges the operations in a DAG and builds the RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph.
So there is no need for data replication, as RDDs can be recalculated from the lineage graph.
And my question is:
If a node fails, Spark will only recompute the RDD partitions lost on this node, but where does the data source needed in the recomputation process come from? Do you mean its parent RDD is still there when the node fails? What if the RDD that lost some partitions didn't have a parent RDD (e.g. the RDD came from a Spark Streaming receiver)?
What if we lose something part way through computation?
Rely on the key insight from MR! Determinism provides safe recompute.
Track 'lineage' of each RDD. Can recompute from parents if needed.
Interesting: only need to record tiny state to do recompute.
Need parent pointer, function applied, and a few other bits.
Log 10 KB per transform rather than re-output 1 TB -> 2 TB
Source
A child RDD carries metadata that describes how to calculate it from its parent RDD. Read more in What is RDD dependency in Spark?
If a node fails, Spark will only recompute the RDD partitions lost on this node, but where does the data source needed in the recomputation process come from? Do you mean its parent RDD is still there when the node fails?
The core idea is that you can use the lineage to recover lost RDDs because RDDs are
built from another RDD or
built from data in stable storage.
(source: RDD paper, beginning of section 2.1)
If some RDD is lost, you can just go back in the lineage until you reach some RDD or the initial data record that is still available.
The data in stable storage is replicated across multiple nodes, therefore unlikely to be lost.
From what I've read about streaming receivers, the received data seems to be saved to stable storage as well, so it behaves just like any other data source.
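You can look at that lineage yourself. A small sketch, assuming an existing SparkContext sc and a hypothetical input file: toDebugString prints the chain of RDDs Spark would walk back through to recompute a lost partition.

// Stable storage -> map -> shuffle (reduceByKey): a short lineage.
val words  = sc.textFile("hdfs:///data/words.txt") // hypothetical input path
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// One line per RDD in the lineage; indentation marks shuffle boundaries.
println(counts.toDebugString)

// The parent relationships are also available programmatically:
counts.dependencies.foreach(dep => println(dep.rdd))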

How to use RDD checkpointing to share datasets across Spark applications?

I have a Spark application and checkpoint the RDD in the code; a simple snippet follows (it is very simple, just to illustrate my question):
@Test
def testCheckpoint1(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data)
  // sc is initialized in the setup
  sc.setCheckpointDir(Utils.getOutputDir())
  rdd.checkpoint()
  rdd.collect()
}
Once the RDD is checkpointed on the file system, I would like to write another Spark application that picks up the data checkpointed by the code above
and uses it as an RDD, as the starting point of this second application.
ReliableCheckpointRDD is exactly the RDD that does this work, but that class is private to Spark.
So, since ReliableCheckpointRDD is private, it looks like Spark doesn't recommend using it outside of Spark.
I would like to ask if there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation, so that you have something already pre-computed in case your Spark application fails and stops.
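The scaladoc's recommendation matters in practice: writing the checkpoint runs a separate job after the action, so without a prior persist the RDD is computed twice. A minimal sketch of the recommended pattern, assuming an existing SparkContext sc and hypothetical paths:

// The checkpoint directory should be on reliable storage (HDFS on a real cluster).
sc.setCheckpointDir("hdfs:///checkpoints/my-app") // hypothetical path

val counts = sc.textFile("hdfs:///data/events") // hypothetical input
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

counts.cache()      // keep it in memory so the checkpoint write reuses it
counts.checkpoint() // must be called before the first action on this RDD

counts.count()      // runs the job, then writes the checkpoint files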
Note that RDD checkpointing is very similar to RDD caching, but caching keeps the partial datasets private to a single Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in the form of RDD checkpoints, but why would you even want to do that when you could save the partial results through the "official" interface, using JSON, Parquet, CSV, etc.?
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.
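For comparison, a sketch of that "official" route with hypothetical paths, assuming a SparkSession named spark: the first application writes the intermediate result, the second one starts from it.

import spark.implicits._

// Application 1: save the intermediate dataset in a self-describing format.
val intermediate = spark.read.json("hdfs:///data/raw") // hypothetical input
  .groupBy("userId")
  .count()
intermediate.write.mode("overwrite").parquet("hdfs:///shared/user-counts") // hypothetical output

// Application 2 (a different job entirely) picks it up as its starting point:
val reused = spark.read.parquet("hdfs:///shared/user-counts")
reused.filter($"count" > 10).show()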

Does Spark write intermediate shuffle outputs to disk

I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:
Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
As I understand it, there are different persistence levels, for example the default MEMORY_ONLY, which means the intermediate result will never be persisted to disk.
When and why will a shuffle persist something on disk? How can that be reused by further computations?
When
It happens when an operation that requires a shuffle is evaluated for the first time (by an action), and it cannot be disabled.
Why
This is an optimization. Shuffling is one of the most expensive things that happen in Spark.
How can that be reused by further computations?
It is automatically reused by any subsequent action executed on the same RDD.
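A small sketch of that reuse, assuming an existing SparkContext sc: both actions below need the shuffled output of reduceByKey, but only the first one performs the shuffle; the second reads the shuffle files already on disk, which shows up as skipped stages in the Spark UI.

val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _) // the shuffle happens here

counts.count()   // first action: the map stage and the shuffle are executed
counts.collect() // second action: the map stage is skipped; the existing
                 // shuffle output on disk is read instead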
