How to enable lineage-based fault tolerance for Spark-Tachyon integration? - apache-spark

I am trying to implement RDD/DataFrame sharing using Tachyon. It is my understanding that with HDFS as the underFS, writes are asynchronous (with replication to HDFS happening behind the scenes) and therefore should be faster, but in my testing I see that Tachyon with HDFS underFS is 2-6 times slower at writing.
From this Tachyon paper I see that:
"We made [lineage-based fault tolerance] configurable in our Spark and MapReduce integration"
How do you enable Spark to use lineage-based fault tolerance in Tachyon?
Note: I am using the Spark Dataframe method, df.write.parquet, and the RDD method, rdd.saveAsObjectFile, to save my Dataframes/RDDs to Tachyon.
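For reference, a minimal sketch of what those writes look like (df and rdd are the DataFrame/RDD mentioned above; the tachyon:// host, port, and paths are placeholders):
val outputPath = "tachyon://tachyon-master:19998/shared/events"  // placeholder host/port/path
df.write.parquet(outputPath + ".parquet")   // DataFrame write path
rdd.saveAsObjectFile(outputPath + "-rdd")   // RDD write path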

You should set tachyon.user.lineage.enabled to true and adjust other lineage settings according to your preferences. Some of the most interesting settings (from the Master Configuration docs):
tachyon.master.lineage.checkpoint.interval.ms - The interval (in milliseconds) between Tachyon's checkpoint scheduling.
tachyon.master.lineage.checkpoint.class - The class name of the checkpoint strategy for lineage output files. The default strategy is to checkpoint the latest completed lineage, i.e. the lineage whose output files are completed.
tachyon.master.lineage.recompute.interval.ms - The interval (in milliseconds) between Tachyon's recompute executions. The executor scans all the lost files tracked by lineage and re-executes the corresponding jobs; the default is every 10 minutes.
See Lineage API docs for more details.
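As a rough sketch, the relevant properties might look like the following (illustrative values; where you set them, e.g. a Tachyon site properties file or -D JVM options in tachyon-env.sh, depends on your Tachyon version, and the client-side flag applies to the Spark driver/executors while the master-side ones apply to the Tachyon master):
# Client side (Spark driver/executors): enable lineage in the Tachyon client
tachyon.user.lineage.enabled=true
# Master side: checkpoint scheduling and recompute intervals (illustrative values;
# the recompute interval default mentioned above is 10 minutes)
tachyon.master.lineage.checkpoint.interval.ms=300000
tachyon.master.lineage.recompute.interval.ms=600000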

Related

Spark checkpointing behaviour

Does Spark use checkpoints when we start a new job? Let's say we used a checkpoint to write some RDD to disk. Will that RDD be recalculated or loaded from disk during a new job?
In addition to the points given by @maxime G...
Spark does not enable checkpointing by default; we need to set it explicitly.
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD.
Spark offers two varieties of checkpointing.
Reliable checkpointing: uses reliable data storage like Hadoop HDFS or S3, and you can achieve it by simply doing
sparkContext.setCheckpointDir("(hdfs:// or s3://)tmp/checkpoint/")
followed by dataframe.checkpoint(eager = true)
Non-reliable checkpointing: local checkpointing uses executor storage (i.e. node-local disk storage) to write checkpoint files. Because of the executor lifecycle it is considered unreliable, and it does not promise that the data will be available if the job terminates abruptly.
sparkContext.setCheckpointDir("/tmp/checkpoint/")
dataframe.localCheckpoint(eager = true)
(Be careful when you are using local checkpointing while cluster autoscaling is enabled.)
Note:
Checkpointing can be eager or lazy, per the eager flag of the checkpoint operator. Eager checkpointing is the default and happens immediately when requested. Lazy checkpointing does not, and will only happen when an action is executed.
An eager checkpoint creates an immediate stage barrier, while a lazy one waits for an action to happen and remembers all previous transformations until then.
At the start of a job, if an RDD is present in your checkpoint location, it will be loaded.
That also means that if you change your code, you should be careful about checkpointing, because an RDD checkpointed with the old code will be loaded by the new code, and that can cause conflicts.
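For illustration, a minimal sketch of eager vs. lazy checkpointing (assuming an existing SparkSession spark and DataFrame df; the HDFS path is a placeholder):
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
// Eager (the default): the data is materialized immediately when checkpoint() is called.
val eagerDf = df.checkpoint()
// Lazy: nothing is written until an action runs on the returned DataFrame.
val lazyDf = df.checkpoint(eager = false)
lazyDf.count()   // the checkpoint actually happens here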

What is the difference between spark checkpoint and local checkpoint?

What is the difference between spark checkpoint and local checkpoint? When making a local checkpoint I see this in the Spark UI:
It shows that the local checkpoint is saved in memory.
A local checkpoint stores your data in executor storage (as shown in your screenshot).
It is useful for truncating the lineage graph of an RDD; however, in case of a node failure you will lose the data and will need to recompute it (depending on your application, you may have to pay a high price).
A 'standard' checkpoint stores your data in a reliable file system (like HDFS). It is more expensive to perform, but you will not need to recompute the data even in case of failure. Of course, it also truncates the lineage graph.
Truncating a long lineage graph avoids stack overflow exceptions and is particularly useful in iterative algorithms.
Local checkpointing writes data to executor storage.
Regular checkpointing writes data to HDFS.
Local checkpointing is faster than classic checkpointing, but regular checkpointing is safer in that it leverages HDFS reliability (e.g. data block replication).
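One way to observe the lineage truncation described above (a sketch assuming an existing SparkContext sc; the checkpoint directory is a placeholder):
sc.setCheckpointDir("hdfs:///tmp/checkpoint/")   // placeholder directory
val rdd = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)
println(rdd.toDebugString)   // full lineage: filter <- map <- parallelize
rdd.checkpoint()             // mark the RDD for reliable checkpointing
rdd.count()                  // the first action triggers the checkpoint
println(rdd.toDebugString)   // the lineage is now truncated at the checkpointed RDD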

spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues like
Could not compute split, block input-0-1464774108087 not found
and I was wondering if there is a way to re-process those batches on the side without messing with the currently running application. Just in general, it does not have to be the same exact exception.
Thanks in advance
Pradeep
This may happen in cases where your data ingestion rate into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that when it is low on memory Spark can spill data to disk. This should prevent your error.
Also, I don't think this error means that any data was lost while processing, but rather that the input block which was added by your block manager timed out before processing started.
Check the similar question on the Spark user list.
Edit:
Data is not lost; it was just not present where the task expected it to be. As per the Spark docs:
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
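To apply the MEMORY_AND_DISK_SER suggestion to a receiver-based input stream, you can pass the storage level when creating the stream; a sketch assuming an existing SparkContext sc (host, port, and batch interval are placeholders):
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Received blocks may spill to disk instead of being dropped when memory is tight.
val lines = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)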

Why does Spark save Map phase output to local disk?

I'm trying to understand the Spark shuffle process in depth. When I started reading I came across the following point:
Spark writes the Map task(ShuffleMapTask) output directly to disk on completion.
I would like to understand the following with respect to Hadoop MapReduce:
If both MapReduce and Spark write the data to the local disk, then how is the Spark shuffle process different from Hadoop MapReduce?
Since data is represented as RDDs in Spark, why don't these outputs remain in the executors' memory?
How is the output of the map tasks from Hadoop MapReduce and Spark different?
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary; what gets written to disk are the shuffle files.
That doesn't mean that data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why write to a file system at all? There are at least two interleaved reasons:
memory is a valuable resource, and in-memory caching in Spark is ephemeral; old data can be evicted from the cache when needed.
shuffle is an expensive process we want to avoid if it's not necessary; it makes more sense to store shuffle data in a manner which keeps it persistent for the lifetime of a given context.
The shuffle itself, apart from the ongoing low-level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach with all its limitations.
How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations which don't require shuffles are squashed together into a single task. Since these operate on standard Scala Iterators, operations on individual elements can be pipelined.
Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data which is written to disk or shuffled by combining transformations, caching in memory and providing transformation-aware worker preferences, it is subject to the same limitations as any other distributed framework.
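A small sketch of how shuffle files avoid re-computation across downstream actions (assuming an existing SparkContext sc):
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))
val counts = pairs.reduceByKey(_ + _)   // wide dependency: shuffle files are written
counts.count()     // the first action runs the shuffle map stage
counts.collect()   // a second action reuses the shuffle files; the map stage appears as "skipped" in the web UI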
If both MapReduce and Spark write the data to the local disk, then how is the Spark shuffle process different from Hadoop MapReduce?
When you execute a Spark application, the very first thing that happens is starting the SparkContext, which becomes the home of multiple interconnected services, with DAGScheduler, TaskScheduler and SchedulerBackend being among the most important ones.
DAGScheduler is the main orchestrator and is responsible for transforming an RDD lineage graph (i.e. a directed acyclic graph of RDDs) into stages. While doing so, DAGScheduler traverses the parent dependencies of the final RDD and creates a ResultStage with parent ShuffleMapStages.
A ResultStage is (mostly) the last stage with ShuffleMapStages being its parents. I said mostly because I think I may have seen that you can "schedule" a ShuffleMapStage.
This is the very first optimization Spark applies to your Spark jobs (which together create a Spark application): execution pipelining, where multiple transformations are wired together into a single stage (because their inter-dependencies are narrow). That's what makes Spark faster than Hadoop MapReduce, since two or more transformations can be executed one after another with no data shuffling, possibly all in memory.
A single stage extends until it hits a ShuffleDependency (a.k.a. a wide dependency).
There are RDD transformations that will cause shuffling (due to creating a ShuffleDependency). That's the moment where Spark is very much like Hadoop's MapReduce, since it will save partial shuffle outputs to... local disks on the executors.
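A sketch of where the stage boundary falls (assuming an existing SparkContext sc; the input path is a placeholder):
val wordCounts = sc.textFile("hdfs:///data/input")   // placeholder path
  .map(_.toLowerCase)          // narrow dependency: pipelined into the same stage
  .filter(_.nonEmpty)          // narrow dependency: still the same stage
  .map(line => (line, 1))
  .reduceByKey(_ + _)          // ShuffleDependency: a new stage starts here
println(wordCounts.toDebugString)   // the indentation marks the shuffle boundary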
When a Spark application starts it requests executors from a cluster manager (there are three supported: Spark Standalone, Apache Mesos and Hadoop YARN). This is what SchedulerBackend is for -- to manage communication between your Spark application and cluster resources.
(Let's assume you are not using the External Shuffle Service.)
Executors host their own local BlockManagers that are responsible for managing RDD blocks kept on the local hard drive (possibly in memory and replicated too). You can control RDD block persistence using the cache and persist operators and StorageLevels. You can use the Storage and Executors tabs in the web UI to track blocks with their location and size.
The difference between Spark storing data locally (on executors) and Hadoop MapReduce is that:
The partial results (after computing ShuffleMapStages) are saved on local hard drives, not on HDFS, which is a distributed file system where saves are very expensive.
Only some files are saved to the local hard drive (after operations have been pipelined), which does not happen in Hadoop MapReduce, which saves all map outputs to HDFS.
Let me answer the following item:
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
That's the trickiest part of the Spark execution plan, and it heavily depends on how wide the shuffle is. If you work only with local data (multiple executors on a single machine) you will see no network traffic since the data is already in place.
If a data shuffle is required, executors will send data to each other, and that will increase the traffic.
Data Exchange Between Nodes in Spark Application
Just to elaborate on the traffic between nodes in a Spark application.
Broadcast variables are the means of sending data from the driver to executors.
Accumulators are the means of sending data from executors to the driver.
Operators like collect will pull all the remote blocks from executors to the driver.
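A minimal sketch of those three data paths (assuming Spark 2.x and an existing SparkContext sc):
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))   // driver -> executors
val misses = sc.longAccumulator("misses")            // executors -> driver
val result = sc.parallelize(Seq(1, 2, 3)).map { i =>
  if (!lookup.value.contains(i)) misses.add(1)
  lookup.value.getOrElse(i, "unknown")
}.collect()                                          // remote blocks -> driver
println(s"misses = ${misses.value}, result = ${result.mkString(",")}")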

Spark and RDD partitioning

In Spark we can load data directly from HDFS, and the number of partitions of the RDD will be equal to the number of partitions (blocks) of the file. HDFS is known for keeping duplicate chunks of files, so the question is how Spark deals with this and how RDD partitioning is governed.
Correct me if I went wrong in asking the question.
You want to bring the computation to the data, so depending on where the task will be performed (which physical node keeps the persistent data), you will use the closest available replica (same rack, etc.) or perform the scheduling based on where the data is available. This part is handled by the YARN scheduler.
As you can check in the Spark user guide, there are some configuration settings regarding data locality that you can tune (extracted from the Spark 1.6 configuration guide, http://spark.apache.org/docs/latest/configuration.html):
spark.locality.wait
default: 3s
How long to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait will be used to step through multiple locality levels (process-local, node-local, rack-local and then any). It is also possible to customize the waiting time for each level by setting spark.locality.wait.node, etc. You should increase this setting if your tasks are long and you see poor locality, but the default usually works well.
spark.locality.wait.node
default: spark.locality.wait
Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).
spark.locality.wait.process
default: spark.locality.wait
Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.
spark.locality.wait.rack
default: spark.locality.wait
Customize the locality wait for rack locality.
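These settings can be set on the SparkConf (or in spark-defaults.conf); a sketch with illustrative values:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-example")
  .set("spark.locality.wait", "10s")      // illustrative: wait longer for data-local task slots
  .set("spark.locality.wait.node", "0")   // illustrative: skip node locality and go straight to rack locality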
