Using RDD.checkpoint to recover rdd in case of application crash - apache-spark

I am writing a Spark (not Streaming) application that has many iterations. I would like to checkpoint my RDD on every Nth iteration so that if my application crashes I can rerun it from the last checkpoint. All the references I found for this use case seem to be for Spark Streaming apps, where a full checkpoint of the entire program can easily be saved by one application run and then read (getOrCreate) by another.
How can I read a checkpointed RDD in regular Spark?
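For reference, this is roughly the loop I have in mind (a sketch only; the checkpoint directory, input, and per-iteration update are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the intended pattern: checkpoint the RDD every Nth iteration.
val sc = new SparkContext(new SparkConf().setAppName("iterative-job"))
sc.setCheckpointDir("hdfs:///tmp/my-app-checkpoints") // placeholder path

var rdd = sc.parallelize(1 to 1000000) // stand-in for the real input RDD
val n = 10 // checkpoint every Nth iteration

for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1) // stand-in for the real per-iteration transformation
  if (i % n == 0) {
    rdd.checkpoint() // marks the RDD for checkpointing; lineage is cut once materialized
    rdd.count()      // an action forces the checkpoint files to be written
  }
}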

Related

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from a Kafka stream using micro-batching (readStream), processing them, and writing the results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing micro-batches of 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc., but possibly a different processing algorithm), Spark finishes the previously running query and creates a new one with the same ID (so it starts from the very same offset on which the previous job "finished").
How do I tell Spark that the second job is different from the first one, so there is no need to restore from the checkpoint (i.e. the intended behaviour is to create a completely new streaming query not connected to the previous one, and to keep the previous one running)?
You can achieve independence of the two streaming queries by setting the checkpointLocation option in their respective writeStream calls. You should not set the checkpoint location centrally in the SparkSession.
That way, they can run independently and will not interfere with each other.
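For example, a rough sketch of that approach (broker addresses, topic names, and paths are placeholders): each query gets its own checkpointLocation in its writeStream call, so the two queries track their offsets independently.

import org.apache.spark.sql.SparkSession

// Sketch only: two independent queries over the same Kafka source topic,
// each with its own checkpointLocation (brokers, topics, and paths are placeholders).
val spark = SparkSession.builder.appName("independent-queries").getOrCreate()

val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()

val queryA = source.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic-a")
  .option("checkpointLocation", "/checkpoints/query-a") // per-query checkpoint
  .start()

val queryB = source.selectExpr("UPPER(CAST(value AS STRING)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic-b")
  .option("checkpointLocation", "/checkpoints/query-b") // different location, so a completely new query
  .start()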

How to remove old Spark Streaming data?

We have a Spark Streaming process which reads data from Kafka, converts it, and writes it to HDFS.
We also have another Spark process which runs Spark SQL queries against the streaming results created by the first process.
The first process writes its checkpoints to HDFS, in these directories:
/commits
/metadata
/offsets
/sources
This process also creates a /_spark_metadata directory inside the directory configured for the streaming output.
We did not find a way to remove the streaming data that is no longer needed.
If we just delete the files produced by the streaming job, the first process fails with an error, and so does the second. If we delete the /_spark_metadata directory, the querying process starts working again, but it runs slowly, and the first process keeps failing until we also delete the directory with its metadata.
How do we remove old Spark Streaming data correctly?

Can Spark identify the checkpoint directory automatically?

I have been learning Spark recently and got confused about checkpointing.
I have learned that a checkpoint can store an RDD in a local or HDFS directory, and that it truncates the RDD's lineage. But how can I get the right checkpoint file in another driver program? Can Spark get the path automatically?
For example, I checkpointed an RDD in the first driver program and want to reuse it in a second driver program, but the second driver program does not know the path of the checkpoint file. Is it possible to reuse the checkpoint file?
I wrote a checkpoint demo, shown below. I checkpoint the "sum" RDD and collect it afterwards.
val ds = spark.read.option("delimiter", ",").csv("/Users/lulijun/git/spark_study/src/main/resources/sparktest.csv")
.toDF("dt", "org", "pay", "per", "ord", "origin")
val filtered = ds.filter($"dt" > "20171026")
val groupby = filtered.groupBy("dt")
val sum = groupby.agg(("ord", "sum"), ("pay", "max"))
sum.count()
sum.checkpoint()
sum.collect()
But I found that in the Spark job triggered by the "collect" action, the RDD never reads the checkpoint. Is it because the "sum" RDD already exists in memory? I'm confused about the method "computeOrReadCheckpoint"; when will it read the checkpoint?
/**
 * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
 */
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
By the way, what's the main difference between an RDD checkpoint and checkpointing in Spark Streaming?
Any help would be appreciated.
Thanks!
Checkpointing in batch mode is used only to cut the lineage. It is not designed for sharing data between different applications. Checkpoint data is used when a single RDD is needed in multiple actions. In other words, it is not applicable in your scenario. To share data between applications, you should write it to reliable distributed storage.
Checkpointing in streaming is used to provide fault tolerance in case of application failure. Once the application is restarted, it can reuse the checkpoints to restore data and/or metadata. As with batch mode, it is not designed for data sharing.
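For example, a minimal sketch of that approach with a placeholder HDFS path: the first application writes the aggregated result to distributed storage, and the second application simply reads it back.

// Application 1: write the "sum" DataFrame from the demo above to reliable storage (path is a placeholder).
sum.write.mode("overwrite").parquet("hdfs:///shared/sum-parquet")

// Application 2 (a separate driver program): read it back without needing any checkpoint files.
val sumFromStorage = spark.read.parquet("hdfs:///shared/sum-parquet")
sumFromStorage.collect()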

Spark Streaming shuffle data size on disk keeps increasing

I have a basic Spark Streaming app that reads logs from Kafka, joins them with a batch RDD, and saves the result to Cassandra:
val users = sparkStmContext.cassandraTable[User](usersKeyspace, usersTable)
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    sparkStmContext, Map(("zookeeper.connect", zkConnectionString), ("group.id", groupId)), Map(TOPIC -> 1), StorageLevel.MEMORY_ONLY_SER)
  .transform(rdd => StreamTransformations.joinLogsWithUsers(rdd, users)) // simple join of the users batch RDD with the stream RDD
  .saveToCassandra("users", "user_logs")
When running this, though, I notice that some of the shuffle data in "spark.local.dir" is not being removed, and the directory size keeps growing until I run out of disk space.
Spark is supposed to take care of deleting this data (see last comment from TD).
I set the "spark.cleaner.ttl" but this had no effect.
Is this happening because I create the users RDD outside of the stream and using lazy loading to reload it every streaming window (as the user table can be updated by another process) and hence then it never goes out of scope?
Is there a better pattern to use for this use case as any docs I have seen have taken the same approach as me?

OFF_HEAP RDD was removed automatically by Tachyon after the Spark job was done

I run a Spark application that uses StorageLevel.OFF_HEAP to persist an RDD (my Tachyon and Spark are both in local mode), like this:
val lines = sc.textFile("FILE_PATH/test-lines-1")
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).persist(StorageLevel.OFF_HEAP)
val counts = words.reduceByKey(_ + _)
counts.collect.foreach(println)
...
sc.stop
When the persist is done, I can see my OFF_HEAP files at localhost:19999 (Tachyon's web UI), which is what I expected.
But after the Spark application is over (sc.stop, while Tachyon keeps running), my blocks (the OFF_HEAP RDD) are removed, and I cannot find my files at localhost:19999. This is not what I want.
I think these files belong to Tachyon (not Spark) after the persist() call, so they should not be removed.
So who deleted my files, and when?
Is this the normal behaviour?
You are looking for
saveAs[Text|Parquet|NewHadoopAPI]File()
This is the real "persistent" method you need.
Instead,
persist()
is used for intermediate storage of RDDs: when the Spark process ends, they will be removed. Here is the relevant comment from the source code:
Set this RDD's storage level to persist its values across operations after the first time
it is computed.
The important phrase is across operations - that is, as part of processing (only).
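Applied to the example above, that would look roughly like this (the output path is a placeholder; an HDFS path or a Tachyon URI would both do):

// Sketch only: write the word counts to durable storage so they outlive the Spark application.
counts.saveAsTextFile("hdfs:///output/word-counts") // placeholder output path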
