Can Spark identify the checkpoint directory automatically? - apache-spark

I've been learning Spark recently and got confused about checkpointing.
I have learned that a checkpoint can store an RDD in a local or HDFS directory, and that it truncates the RDD's lineage. But how can I get the right checkpoint file in another driver program? Can Spark find the path automatically?
For example, I checkpointed an RDD in the first driver program and want to reuse it in a second driver program, but the second driver program doesn't know the path of the checkpoint file. Is it possible to reuse the checkpoint file?
I wrote a checkpoint demo, shown below. I checkpoint the "sum" RDD and collect it afterwards.
val ds = spark.read.option("delimiter", ",").csv("/Users/lulijun/git/spark_study/src/main/resources/sparktest.csv")
  .toDF("dt", "org", "pay", "per", "ord", "origin")
val filtered = ds.filter($"dt" > "20171026")
val groupby = filtered.groupBy("dt")
val sum = groupby.agg(("ord", "sum"), ("pay", "max"))
sum.count()
sum.checkpoint()
sum.collect()
But I found that in the Spark job triggered by the "collect" action, the RDD never reads the checkpoint. Is it because the "sum" RDD already exists in memory? I'm confused about the method "computeOrReadCheckpoint": when will it read the checkpoint?
/**
 * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
 */
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
By the way, what's the main difference between RDD checkpointing and checkpointing in Spark Streaming?
Any help would be appreciated.
Thanks!

Checkpointing in batch mode is used only to cut the lineage. It is not designed for sharing data between different applications. Checkpoint data is used when a single RDD is involved in multiple actions. In other words, it is not applicable in your scenario. To share data between applications you should write it to reliable distributed storage.
Checkpointing in streaming is used to provide fault tolerance in case of application failure. Once the application is restarted it can reuse the checkpoints to restore data and/or metadata. As in batch mode, it is not designed for data sharing.
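To illustrate, here is a rough sketch of the scenario batch checkpointing is meant for (the input path, checkpoint directory, and transformations are hypothetical): a single RDD reused by several actions inside one application, with the lineage cut after the first materialization.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-reuse").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/checkpoints") // hypothetical checkpoint directory

val expensive = sc.textFile("/data/events.log") // hypothetical input
  .map(_.split(","))
  .filter(_.length > 2)

expensive.cache()      // recommended so materializing the checkpoint does not recompute everything
expensive.checkpoint() // must be called before any job has run on this RDD
expensive.count()      // first action materializes the checkpoint

// Later actions in the SAME application read the checkpointed data
// instead of recomputing from the source file.
expensive.map(_(1)).distinct().count()
expensive.take(10)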

Related

Spark Streaming checkpoint: data checkpointing control

I'm confused about the Spark Streaming checkpoint, please help me, thanks!
There are two types of checkpointing (metadata and data checkpointing), and the guide says that data checkpointing is used with stateful transformations. I'm very confused about this. If I don't use stateful transformations, does Spark still write data checkpoint content?
Can I control the checkpoint position in the code?
Can I control which RDDs get written to the data checkpoint in streaming, like in a batch Spark job?
Can I use foreachRDD { rdd => rdd.checkpoint() } in streaming?
If I don't use rdd.checkpoint(), what is the default behavior of Spark? Which RDDs get written to HDFS?
You can find an excellent guide at this link.
No, there is no need to checkpoint data, because there is no intermediate data you need to keep in the case of a stateless computation.
I don't think you need to checkpoint any RDD after the computation in streaming. RDD checkpointing is designed to address the lineage issue; the streaming checkpoint is all about streaming reliability and failure recovery.
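For the stateful case, here is a minimal sketch (the socket source, port, and checkpoint path are hypothetical) of when data checkpointing actually kicks in: only the updateStateByKey stream forces Spark to periodically write the generated state RDDs to the checkpoint directory.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stateful-checkpoint-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint") // required once a stateful transformation is used

val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
val pairs = lines.flatMap(_.split(" ")).map((_, 1))

// Stateful transformation: the state RDDs produced here are the ones
// periodically saved by data checkpointing.
val runningCounts = pairs.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
}
runningCounts.print()

ssc.start()
ssc.awaitTermination()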

How to use RDD checkpointing to share datasets across Spark applications?

I have a Spark application and checkpoint an RDD in the code. A simple code snippet is as follows (it is very simple, just to illustrate my question):
@Test
def testCheckpoint1(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data)
  // sc is initialized in the setup
  sc.setCheckpointDir(Utils.getOutputDir())
  rdd.checkpoint()
  rdd.collect()
}
Once the RDD is checkpointed on the file system, I would like to write another Spark application that picks up the data checkpointed by the code above and uses it as an RDD, as the starting point of this second application.
ReliableCheckpointRDD is exactly the RDD that does the work, but that RDD is private to Spark.
So, since ReliableCheckpointRDD is private, it looks like Spark doesn't recommend using ReliableCheckpointRDD outside Spark.
I would like to ask if there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation so you've got something already pre-computed in case your Spark application fails and stops.
Note that RDD checkpointing is very similar to RDD caching, but caching makes the partial datasets private to a single Spark application.
Let's read Spark Streaming's Checkpointing (which in some way extends the concept of RDD checkpointing, making it closer to your need to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in the form of RDD checkpointing, but why would you even want to do that when you can save the partial results using the "official" interface with JSON, Parquet, CSV, etc.?
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.
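As a rough sketch of that "official" route (the paths and the partialResult DataFrame are hypothetical), the first application writes the intermediate result to a standard format and the second one simply reads it back:
// Application 1: save the partial result to reliable storage.
partialResult.write.mode("overwrite").parquet("hdfs:///shared/partial-result")

// Application 2 (a separate driver program): pick it up as the starting point.
val reused = spark.read.parquet("hdfs:///shared/partial-result")
reused.count()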

Using RDD.checkpoint to recover rdd in case of application crash

I am writing a Spark (not Streaming) application that has many iterations. I would like to checkpoint my RDD on every Nth iteration, so that if my application crashes I can rerun it from the last checkpoint. All the references I found for this use case seem to be for Spark Streaming apps, where a full checkpoint of the entire program can easily be saved by one application run and then read (getOrCreate) by another.
How can I read a checkpointed RDD in regular Spark?
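For context, here is a rough sketch of the periodic-checkpoint pattern the question describes within a single run (the update step, interval N, and paths are hypothetical); note that reading such a checkpoint back from a new run would go through ReliableCheckpointRDD, which is private to Spark, as the previous question points out:
sc.setCheckpointDir("hdfs:///tmp/iter-checkpoints") // hypothetical directory

var current = sc.parallelize(1 to 1000000).map(_.toDouble)
val N = 10 // hypothetical checkpoint interval

for (i <- 1 to 100) {
  current = current.map(x => x * 0.99 + 1.0) // hypothetical per-iteration update
  if (i % N == 0) {
    current.checkpoint() // truncate the growing lineage
    current.count()      // force materialization of the checkpoint
  }
}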

How to config checkpoint to redeploy spark streaming application?

I'm using Spark Streaming to count unique users. I use updateStateByKey, so I need to configure a checkpoint directory. I also load the data from the checkpoint when starting the application, as in the example in the docs:
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...) // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory) // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
Here is the question: if my code is changed and I re-deploy it, will the checkpoint be loaded no matter how much the code has changed? Or do I need to use my own logic to persist my data and load it in the next run?
If I use my own logic to save and load the DStream, then when the application restarts after a failure, won't the data be loaded both from the checkpoint directory and from my own database?
The checkpoint itself includes your metadata, RDDs, DAG and even your logic. If you change your logic and try to run it from the last checkpoint, you are very likely to hit an exception.
If you want to use your own logic to save your data somewhere as a checkpoint, you might need to implement a Spark action that pushes your checkpoint data to whatever database you use; in the next run, load that data as an initial RDD (in case you are using the updateStateByKey API) and continue your logic.
I asked this question on the Spark mailing list and got an answer, which I've analyzed on my blog. I'll post the summary here:
The way is to use both checkpointing and our own data loading mechanism, but to load our data only as the initialRDD of updateStateByKey. That way, in both situations, the data will be neither lost nor duplicated:
When we change the code and redeploy the Spark application, we shut down the old Spark application gracefully and clean up the checkpoint data, so the only loaded data is the data we saved ourselves.
When the Spark application fails and restarts, it loads the data from the checkpoint. But the DAG is saved as well, so it will not load our own data as the initialRDD again. So the only loaded data is the checkpointed data.
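A minimal sketch of that pattern, assuming a hypothetical socket source, state format, and HDFS paths; the previously saved state is passed only through the initialRDD overload of updateStateByKey, so a checkpoint-based restart does not load it a second time:
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

val checkpointDir = "hdfs:///app/checkpoint" // hypothetical

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("unique-users")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)

  // State we saved ourselves before the previous (graceful) shutdown.
  val initialState: RDD[(String, Int)] = ssc.sparkContext
    .textFile("hdfs:///app/saved-state") // hypothetical path
    .map { line => val Array(user, count) = line.split(","); (user, count.toInt) }

  val events = ssc.socketTextStream("localhost", 9999).map((_, 1)) // hypothetical source

  val state = events.updateStateByKey[Int](
    (values: Seq[Int], old: Option[Int]) => Some(values.sum + old.getOrElse(0)),
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialState)

  // Also persist our own copy each batch so a later redeploy can start from it.
  state.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
    rdd.saveAsTextFile(s"hdfs:///app/saved-state-${time.milliseconds}")
  }
  ssc
}

val context = StreamingContext.getOrCreate(checkpointDir, createContext _)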

OFF_HEAP RDD was removed automatically by Tachyon after the Spark job was done

I run a Spark application that uses StorageLevel.OFF_HEAP to persist an RDD (my Tachyon and Spark are both in local mode).
like this:
val lines = sc.textFile("FILE_PATH/test-lines-1")
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).persist(StorageLevel.OFF_HEAP)
val counts = words.reduceByKey(_ + _)
counts.collect.foreach(println)
...
sc.stop
When the persist is done, I can see my OFF_HEAP files at localhost:19999 (Tachyon's web UI), which is what I expected.
But after the Spark application is over (sc.stop, while Tachyon keeps running), my blocks (the OFF_HEAP RDD) were removed, and I can no longer find my files at localhost:19999. This is not what I want.
I think these files belong to Tachyon (not Spark) after the persist() call, so they should not be removed.
So, who deleted my files, and when?
Is this the normal behavior?
You are looking for
saveAs[Text|Parquet|NewHadoopAPI]File()
This is the real "persistent" method you need.
In contrast,
persist()
is used for intermediate storage of RDDs: when the Spark process ends they will be removed. Here is the relevant source code comment:
Set this RDD's storage level to persist its values across operations after the first time
it is computed.
The important phrase is "across operations", that is, as part of processing only.
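A tiny sketch of the difference, with a hypothetical output path: the saveAsTextFile output survives sc.stop and can be read by a later application, while persisted blocks do not.
val counts = sc.textFile("FILE_PATH/test-lines-1")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("/data/word-counts") // durable output (hypothetical path)
sc.stop()

// A later application can simply reload the saved result:
// val reloaded = sc.textFile("/data/word-counts")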
