Spark checkpointing behaviour - apache-spark

Does Spark use checkpoints when we start a new job? Let's say we used a checkpoint to write some RDD to a disk. Will the said RDD be recalculated or loaded from the disk during a new job?

In addition to the points given by @maxime G...
Spark does not checkpoint by default; we need to set it up explicitly.
Checkpointing is actually a feature of Spark Core (that Spark SQL uses
for distributed computations) that allows a driver to be restarted on
failure with previously computed state of a distributed computation
described as an RDD
Spark offers two varieties of checkpointing.
Reliable checkpointing: uses reliable data storage such as Hadoop HDFS or S3. You can enable it simply by doing
sparkContext.setCheckpointDir("(hdfs:// or s3://)tmp/checkpoint/")
then dataframe.checkpoint(eager = true)
and Non-reliable checkpointing: local checkpointing writes checkpoint files to executor storage (i.e. node-local disk storage). Because of the executor lifecycle it is considered unreliable, and it does not promise that the data will still be available if the job terminates abruptly.
sparkContext.setCheckpointDir("/tmp/checkpoint/").
dataframe.localCheckpoint(eager = true)
(Be careful with local checkpointing when cluster autoscaling is enabled: an executor that is scaled away takes its locally checkpointed blocks with it.)
Note:
Checkpointing can be eager or lazy per eager flag of the checkpoint operator. Eager checkpointing is the default checkpointing and happens immediately when requested. Lazy checkpointing does not and will only happen when an action is executed.
An eager checkpoint creates an immediate stage barrier, whereas a lazy one waits for an action to be executed and keeps all previous transformations in the lineage until then.
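A minimal sketch of both varieties and the eager flag, assuming a running SparkSession named spark and placeholder paths:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
val df = spark.range(0, 1000000).toDF("id")

// Reliable checkpointing: files go to a fault-tolerant file system.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")   // placeholder path
val reliable = df.checkpoint(eager = true)      // materialised immediately

// Local checkpointing: blocks stay on executor storage; faster but not fault-tolerant.
val local = df.localCheckpoint(eager = false)   // lazy: written on the next action
local.count()                                   // this action triggers the lazy checkpoint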

At the start of the job, if an RDD is present in your checkpoint location, it will be loaded.
That also means that if you change your code you should be careful about checkpointing, because an RDD produced by the old code may be loaded by the new code, and that can cause conflicts.

Related

What is the difference between spark checkpoint and local checkpoint?

What is the difference between a Spark checkpoint and a local checkpoint? When making a local checkpoint I see this in the Spark UI:
It shows that the local checkpoint is saved in memory.
Local checkpoint stores your data in executors storage (as shown in your screenshot).
It is useful for truncating the lineage graph of an RDD; however, in case of node failure you will lose the data and need to recompute it (depending on your application, you may have to pay a high price).
'Standard' checkpoint stores your data in a reliable file system (like hdfs). It is more expensive to perform but you will not need to recompute the data even in case of failures. Of course, it truncates the lineage graph.
Truncating a long lineage graph avoids stack overflow exceptions and is particularly useful in iterative algorithms.
local checkpointing writes data in executors storage
regular checkpointing writes data in HDFS
local checkpointing is faster than classic checkpointing, but regular checkpointing is safer in that it leverages HDFS reliability (e.g. data block replication).
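A short sketch, assuming an existing SparkContext named sc and a placeholder checkpoint path, showing how either variety truncates the lineage (visible through toDebugString):

// Build an RDD with a few transformations so it has a non-trivial lineage.
val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 4 == 0)
println(rdd.toDebugString)                       // full lineage: filter <- map <- parallelize

sc.setCheckpointDir("hdfs:///tmp/checkpoint/")   // placeholder path
rdd.checkpoint()                                 // reliable: marked, written on the next action
rdd.count()                                      // the action materialises the checkpoint
println(rdd.toDebugString)                       // lineage now starts from the checkpoint data

// The local variant needs no checkpoint dir, but the blocks live on executors only.
val local = sc.parallelize(1 to 100).map(_ * 2).localCheckpoint()
local.count()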

How to use RDD checkpointing to share datasets across Spark applications?

I have a Spark application and checkpoint the RDD in the code; a simple snippet follows (it is deliberately simple, just to illustrate my question):
@Test
def testCheckpoint1(): Unit = {
  // sc is initialized in the test setup
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data)
  sc.setCheckpointDir(Utils.getOutputDir())
  rdd.checkpoint()   // mark the RDD for checkpointing
  rdd.collect()      // the action materialises the checkpoint files
}
Once the RDD is checkpointed on the file system, I would like to write another Spark application that picks up the data checkpointed by the code above and uses it as the starting RDD of this second application.
ReliableCheckpointRDD is exactly the RDD that does this work, but that class is private to Spark.
Since ReliableCheckpointRDD is private, it looks like Spark doesn't recommend using it outside Spark.
I would like to know whether there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation, so that you have something already pre-computed in case your Spark application fails and stops.
Note that RDD checkpointing is very similar to RDD caching but caching would make the partial datasets private to some Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in the form of RDD checkpointing, but why would you even want to do that when you could save the partial results using the "official" interface with formats such as JSON, Parquet or CSV?
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.
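A hedged sketch of that recommended approach: the first application persists the intermediate result with the public DataFrame writer, and the second one simply reads it back (the paths and the word-count dataset are placeholders):

import spark.implicits._   // assumes a SparkSession named spark

// Application 1: compute and persist the intermediate dataset.
val intermediate = spark.read.textFile("hdfs:///input/words.txt")   // placeholder input
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()
intermediate.write.mode("overwrite").parquet("hdfs:///shared/word-counts")

// Application 2 (a separate Spark application): start from the saved dataset.
val shared = spark.read.parquet("hdfs:///shared/word-counts")
shared.show()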

How can I understand check point recorvery when using Kafka Direct InputDstream and stateful stream transformation?

On yarn-cluster I use a Kafka direct stream as input (e.g. the batch time is 15 s) and want to aggregate the input messages per userId.
So I use a stateful streaming API such as updateStateByKey or mapWithState. But from the API source I see that mapWithState's default checkpoint duration is batchDuration * 10 (150 s in my case), while with the Kafka direct stream the partition offsets are checkpointed at every batch (15 s). Actually, every DStream can set a different checkpoint duration.
So, my question is:
When the streaming app crashes and I restart it, the Kafka offsets and the state stream RDDs are checkpointed asynchronously (at different intervals); in this case, how can I avoid data loss? Or do I misunderstand the checkpoint mechanism?
How can I avoid data loss?
Stateful streams such as mapWithState or updateStateByKey require you to provide a checkpoint directory because that's part of how they operate: they store the intermediate state regularly so that it can be recovered after a crash.
Other than that, each DStream in the chain is free to request checkpointing as well, question is "do you really need to checkpoint other streams"?
If an application crashes, Spark takes all the state RDDs stored in the checkpoint and brings them back to memory, so your data there is as good as it was the last time Spark checkpointed it. One thing to keep in mind: if you change your application code, you cannot recover state from the checkpoint and you'll have to delete it. This means that if, for instance, you need to do a version upgrade, all data previously stored in the state will be gone unless you save it yourself in a manner that allows versioning.
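A minimal recovery sketch under those constraints: the Kafka source is replaced by a socket stream to keep it short, the path is a placeholder, and StreamingContext.getOrCreate is what rebuilds the DStream graph, offsets and state RDDs from the checkpoint after a restart:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-demo")
  val ssc = new StreamingContext(conf, Seconds(15))
  ssc.checkpoint(checkpointDir)                          // required by stateful operators

  // Replace this with your Kafka direct stream; any input DStream behaves the same way.
  val userIds = ssc.socketTextStream("localhost", 9999)
  val counts = userIds.map(id => (id, 1))
    .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0)))
  counts.print()
  ssc
}

// On restart the whole context is recovered from the checkpoint, not rebuilt from this code.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()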

Why does Spark save Map phase output to local disk?

I'm trying to understand the Spark shuffle process in depth. When I started reading I came across the following point:
Spark writes the Map task(ShuffleMapTask) output directly to disk on completion.
I would like to understand the following w.r.t to Hadoop MapReduce.
If both MapReduce and Spark write data to the local disk, how is the Spark shuffle process different from Hadoop MapReduce's?
Since data is represented as RDDs in Spark, why don't these outputs remain in the executors' memory?
How are the outputs of the map tasks in Hadoop MapReduce and Spark different?
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary. What is written to disk are shuffle files.
It doesn't mean that data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why write to a file system at all? There are at least two interleaved reasons:
memory is a valuable resource and in-memory caching in Spark is ephemeral. Old data can be evicted from cache when needed.
shuffle is an expensive process we want to avoid if it is not necessary. It makes more sense to store shuffle data in a manner that keeps it persistent during the lifetime of a given context.
Shuffle itself, apart from the ongoing low level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach with all its limitations.
How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations that don't require shuffles are squashed together into a single task. Since these operate on standard Scala Iterators, operations on individual elements can be pipelined.
Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data that is written to disk or shuffled by combining transformations, caching in memory and providing transformation-aware worker preferences, it is subject to the same limitations as any other distributed framework.
If both MapReduce and Spark write data to the local disk, how is the Spark shuffle process different from Hadoop MapReduce's?
When you execute a Spark application, the very first thing that happens is starting the SparkContext, which becomes the home of multiple interconnected services, with DAGScheduler, TaskScheduler and SchedulerBackend among the most important ones.
DAGScheduler is the main orchestrator and is responsible for transforming a RDD lineage graph (i.e. a directed acyclic graph of RDDs) into stages. While doing it, DAGScheduler traverses the parent dependencies of the final RDD and creates a ResultStage with parent ShuffleMapStages.
A ResultStage is (mostly) the last stage with ShuffleMapStages being its parents. I said mostly because I think I may have seen that you can "schedule" a ShuffleMapStage.
This is the very early and first optimization Spark applies to your Spark jobs (that together create a Spark application) - execution pipelining where multiple transformations are wired together to create a single stage (because their inter-dependencies are narrow). That's what makes Spark faster than Hadoop MapReduce since two or more transformations can get executed one by one with no data shuffling possibly all in memory.
A single stage extends until it hits a ShuffleDependency (aka a wide dependency).
There are RDD transformations that will cause shuffling (due to creating a ShuffleDependency). That's the moment where Spark is very much like Hadoop's MapReduce since it will save partial shuffle outputs to...local disks on executors.
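A small sketch of where that boundary falls, assuming an existing SparkContext named sc and a placeholder input path: the narrow transformations below are pipelined into one stage, and reduceByKey introduces the ShuffleDependency that ends it.

val counts = sc.textFile("hdfs:///input/words.txt")   // placeholder input
  .flatMap(_.split("\\s+"))                           // narrow: pipelined
  .map(word => (word, 1))                             // narrow: same stage
  .reduceByKey(_ + _)                                 // wide: ShuffleDependency, new stage

println(counts.toDebugString)  // the indentation shows the shuffle boundary
counts.collect()               // the web UI shows two stages for this job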
When a Spark application starts it requests executors from a cluster manager (there are three supported: Spark Standalone, Apache Mesos and Hadoop YARN). This is what SchedulerBackend is for -- to manage communication between your Spark application and cluster resources.
(Let's assume you are not using External Shuffle Manager)
Executors host their own local BlockManagers that are responsible for managing RDD blocks that are kept on local hard drive (possibly in memory and replicated too). You can control RDD block persistence using cache and persist operators and StorageLevels. You can use Storage and Executors tabs in web UI to track blocks with their location and size.
The difference between Spark storing data locally (on executors) and Hadoop MapReduce is that:
The partial results (after computing ShuffleMapStages) are saved to local hard drives, not to HDFS, which is a distributed file system where saves are very expensive.
Only some files are saved to the local hard drive (after operations have been pipelined), which does not happen in Hadoop MapReduce, which saves all map outputs to HDFS.
Let me answer the following item:
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
That's the trickiest part of the Spark execution plan and heavily depends on how wide the shuffling is. If you work only with local data (multiple executors on a single machine) you will see no data traffic since the data is already in place.
If the data shuffle is required, executors will send data between each other and that will increase the traffic.
Data Exchange Between Nodes in Spark Application
Just to elaborate on the traffic between nodes in a Spark application.
Broadcast variables are the means of sending data from the driver to executors.
Accumulators are the means of sending data from executors to the driver.
Operators like collect will pull all the remote blocks from executors to the driver.
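A short sketch of those three data paths, assuming an existing SparkContext named sc; the lookup table and counter name are made up for illustration:

// Driver -> executors: broadcast a small lookup table once instead of shipping it per task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Executors -> driver: accumulate a counter while the tasks run.
val missing = sc.longAccumulator("missing-keys")

val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
  lookup.value.get(key) match {
    case Some(v) => v
    case None    => missing.add(1); -1
  }
}

// Executors -> driver: collect pulls all the remote blocks back to the driver.
println(resolved.collect().toList)
println(s"keys not found: ${missing.value}")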

How to enable lineage-based fault tolerance for Spark-Tachyon integration?

I am trying to implement RDD/DataFrame sharing using Tachyon. It is my understanding that with HDFS as the underFS, writes are asynchronous (with replication to HDFS happening behind the scenes) and should therefore be faster, but in my testing I see that Tachyon with HDFS underFS is 2-6 times slower at writing.
From this Tachyon paper I see that:
"We made [lineage-based fault tolerance] configurable in our Spark and MapReduce integration"
How do you enable Spark to use lineage-based fault tolerance in Tachyon?
Note: I am using the Spark Dataframe method, df.write.parquet, and the RDD method, rdd.saveAsObjectFile, to save my Dataframes/RDDs to Tachyon.
You should set tachyon.user.lineage.enabled to true and adjust other lineage settings according to your preferences. Some of the most interesting settings (from the Master Configuration docs):
tachyon.master.lineage.checkpoint.interval.ms - The interval (in milliseconds) between Tachyon's checkpoint scheduling.
tachyon.master.lineage.checkpoint.class - The class name of the checkpoint strategy for lineage output files. The default strategy is to checkpoint the latest completed lineage, i.e. the lineage whose output files are completed.
tachyon.master.lineage.recompute.interval.ms - The interval (in milliseconds) between Tachyon's recompute executions (every 10 minutes by default). The executor scans all the lost files tracked by lineage and re-executes the corresponding jobs.
See Lineage API docs for more details.
