How to configure checkpointing to redeploy a Spark Streaming application? - apache-spark

I'm using Spark Streaming to count unique users. I use updateStateByKey, so I need to configure a checkpoint directory. I also load the data from the checkpoint when starting the application, as in the example in the docs:
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
Here is the question: if my code is changed and I re-deploy it, will the checkpoint still be loaded no matter how much the code has changed? Or do I need my own logic to persist my data and load it in the next run?
If I use my own logic to save and load the DStream, then when the application restarts after a failure, won't the data be loaded both from the checkpoint directory and from my own database?

The checkpoint itself includes your metadata, RDDs, the DAG and even your logic. If you change your logic and try to run from the last checkpoint, you are very likely to hit an exception.
If you want to use your own logic to save your data somewhere as a checkpoint, you need to implement a Spark action that pushes your checkpoint data to whatever database you use; in the next run, load that data as the initial RDD (if you are using the updateStateByKey API) and continue your logic.
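A minimal sketch of the save side, assuming the state is a DStream of (userId, count) pairs and using a hypothetical saveToDatabase helper (both are assumptions, not part of the original answer):

// stateDStream: DStream[(String, Long)] produced by updateStateByKey
stateDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // saveToDatabase is a placeholder for whatever sink you use (JDBC, HBase, Redis, ...)
    partition.foreach { case (userId, count) => saveToDatabase(userId, count) }
  }
}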

I asked this question on the Spark mailing list and got an answer; I've analyzed it on my blog. Here is the summary:
The way is to use both checkpointing and our own data-loading mechanism, loading our saved data as the initialRDD of updateStateByKey. In both situations the data is then neither lost nor duplicated:
When we change the code and redeploy the Spark application, we shut down the old application gracefully and clean up the checkpoint data, so the only loaded data is the data we saved ourselves.
When the Spark application fails and restarts, it loads the data from the checkpoint. Since the DAG is restored as well, it will not load our own data as the initialRDD again. So the only loaded data is the checkpointed data.
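A minimal sketch of the load side, assuming the counts were previously saved as (userId, count) pairs and using a hypothetical loadFromDatabase helper that returns Seq[(String, Long)]:

import org.apache.spark.HashPartitioner

// pairs: DStream[(String, Long)] of new (userId, count) events from the input stream
// build the initial state from our own store
val initialRDD = ssc.sparkContext.parallelize(loadFromDatabase())

val updateFunc = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

// this overload of updateStateByKey accepts an initial RDD to seed the state
val stateDStream = pairs.updateStateByKey[Long](
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialRDD)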

Related

Is it possible to achieve a dynamic batch size in Spark Streaming?

To keep the code simple, I'm fine with restarting the Spark Streaming application to pick up a new batch size, but I need to keep the previous progress (it is acceptable to lose the batch that was being processed).
If I use checkpointing in Spark Streaming, the configuration can't be changed when the application restarts from the checkpoint.
So I want to implement this by modifying the source code, but I don't know where to start. I'd appreciate some guidance and an idea of how difficult it would be.
Since you are talking about the batch size, I'm assuming you are asking about Spark Streaming and not Structured Streaming.
There is a way to set the batch interval programmatically; see the StreamingContext documentation.
The StreamingContext constructor accepts a Duration object, which defines the batch interval.
You can hardcode the batch interval in the code, but then you have to rebuild the jar every time you want to change it. Instead, read it from a config file, so you don't need to rebuild the code every time.
Note: you have to set this property in the application's own config file, not in Spark's config file.
You can change the batch interval in the config and restart the application; this will not cause any problems with respect to checkpointing.
val sparkConf: SparkConf = new SparkConf()
  .setAppName("app-name")
  .setMaster("app-master")

// config is loaded from the application's own config file
val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(config.getInt("batch-interval")))
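A minimal sketch of where that config object could come from, assuming the Typesafe Config library and a key named batch-interval (both the library choice and the key name are assumptions):

import com.typesafe.config.{Config, ConfigFactory}

// application.conf on the classpath (or passed via -Dconfig.file=...) might contain:
//   batch-interval = 15
val config: Config = ConfigFactory.load()

With this, changing batch-interval in the config file and restarting the application is enough; no rebuild is needed.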
Cheers!!

Can Spark identify the checkpoint directory automatically?

I've been learning Spark recently and got confused about checkpointing.
I have learned that a checkpoint can store an RDD in a local or HDFS directory, and that it truncates the lineage of the RDD. But how can I get the right checkpoint file in another driver program? Can Spark figure out the path automatically?
For example, I checkpointed an RDD in the first driver program and want to reuse it in a second driver program, but the second driver program doesn't know the path of the checkpoint file. Is it possible to reuse the checkpoint file?
I wrote a demo about checkpointing, shown below. I checkpoint the "sum" RDD and collect it afterwards.
val ds = spark.read.option("delimiter", ",").csv("/Users/lulijun/git/spark_study/src/main/resources/sparktest.csv")
  .toDF("dt", "org", "pay", "per", "ord", "origin")

val filtered = ds.filter($"dt" > "20171026")
val groupby = filtered.groupBy("dt")
val sum = groupby.agg(("ord", "sum"), ("pay", "max"))
sum.count()
sum.checkpoint()
sum.collect()
But I found that in the Spark job triggered by the "collect" action, the RDD never reads the checkpoint. Is it because the "sum" RDD already exists in memory? I'm confused about the method "computeOrReadCheckpoint": when will it read the checkpoint?
/**
 * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
 */
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
By the way, what's the main difference between RDD checkpointing and checkpointing in Spark Streaming?
Any help would be appreciated.
Thanks!
Checkpointing in batch mode is used only to cut the lineage. It is not designed for sharing data between different applications. Checkpoint data is useful when a single RDD is used in multiple actions. In other words, it is not applicable in your scenario. To share data between applications you should write it to reliable distributed storage.
Checkpointing in streaming is used to provide fault tolerance in case of application failure. Once the application is restarted, it can reuse the checkpoints to restore data and/or metadata. Similarly to batch mode, it is not designed for data sharing.
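A minimal sketch of sharing the result between the two driver programs via durable storage instead of checkpoint files, reusing the "sum" DataFrame from the question's demo; the HDFS path is a placeholder:

// first driver program: persist the aggregated result in a durable format
sum.write.mode("overwrite").parquet("hdfs:///shared/sum_result")

// second driver program: read it back, no checkpoint path needed
val reloaded = spark.read.parquet("hdfs:///shared/sum_result")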

Periodically persist the computed result using Spark Streaming?

I am working on a requirement to display a real-time dashboard based on some aggregations computed on the input data.
I have just started to explore Spark/Spark Streaming, and I see that we can compute aggregations in real time in micro-batches with Spark Streaming and feed them to the UI dashboard.
My question is: if at any time after the Spark Streaming job is started it stops or crashes, how will it resume from the position it was last processing when it comes back up? I understand Spark maintains an internal state which we update for every new batch of data we receive. But wouldn't that state be gone when it is restarted?
I feel we may have to periodically persist the running total/result so that Spark can resume processing by fetching it from there when it is restarted. But I'm not sure how to do that with Spark Streaming.
I'm also not sure whether Spark Streaming ensures by default that the data is not lost, as I have just started using it.
If anyone has faced a similar scenario, can you please share your thoughts on how I can address this?
Key points:
enable the write-ahead log for the receiver
enable checkpointing
Details
Enable the WAL: set spark.streaming.receiver.writeAheadLog.enable to true.
Enable checkpointing:
Checkpointing periodically writes your application state to reliable storage, so when your application fails it can recover from the checkpoint files.
To write a checkpoint:
ssc.checkpoint("checkpoint.path")
To read from checkpoint:
def main(args: Array[String]): Unit = {
  val ssc = StreamingContext.getOrCreate("checkpoint_path", () => createContext())
  ssc.start()
  ssc.awaitTermination()
}
In the createContext function, you should create the ssc and set up your own logic. For example:
def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("app.name")
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
    // key point 1 from above: enable the receiver write-ahead log
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")

  // the batch interval must be a Duration; read the number of seconds from your own config if you like
  val ssc = new StreamingContext(conf, Seconds(15))
  ssc.checkpoint("checkpoint.path")
  // your DStream creation and transformations go here
  ssc
}
Here is the documentation about the steps needed to deploy Spark Streaming applications, including recovering from driver/executor failure:
https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#deploying-applications
Spark Streaming acts as a consumer application. In real time, data is pulled from Kafka topics, and you can store the offsets of the consumed data in a data store. The same applies if you are reading data from Twitter streams. You can follow the posts below on storing offsets and recovering if the application crashes or restarts.
http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
https://www.linkedin.com/pulse/achieving-exactly-once-semantics-kafka-application-ishan-kumar
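A minimal sketch of one documented way to track offsets with the spark-streaming-kafka-0-10 direct stream, committing them back to Kafka after each successfully processed batch (the stream variable and the processing step are placeholders):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// stream: the InputDStream returned by KafkaUtils.createDirectStream
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd and write the results somewhere ...
  // commit the offsets only after processing has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}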

How can I understand checkpoint recovery when using the Kafka direct InputDStream and stateful stream transformations?

On a YARN cluster I use a Kafka direct stream as input (with, e.g., a batch interval of 15 s) and want to aggregate the input messages per userId.
So I use a stateful streaming API like updateStateByKey or mapWithState. From the API source I see that mapWithState's default checkpoint duration is batchDuration * 10 (in my case 150 s), while in the Kafka direct stream the partition offsets are checkpointed at every batch (15 s). And every DStream can set a different checkpoint duration.
So, my question is:
When the streaming app crashes and I restart it, the Kafka offsets and the state stream RDDs are out of sync in the checkpoint; in this case how can I avoid data loss? Or do I misunderstand the checkpoint mechanism?
How can I avoid data loss?
Stateful streams such as mapWithState or updateStateByKey require you to provide a checkpoint directory because that's part of how they operate: they store the intermediate state there so it can be recovered after a crash.
Other than that, each DStream in the chain is free to request checkpointing as well; the question is "do you really need to checkpoint the other streams?".
If an application crashes, Spark takes all the state RDDs stored in the checkpoint and brings them back to memory, so your data there is as good as it was the last time Spark checkpointed it. One thing to keep in mind: if you change your application code, you cannot recover the state from the checkpoint and will have to delete it. This means that if, for instance, you need to do a version upgrade, all data previously stored in the state will be gone unless you save it yourself in a manner which allows versioning.
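A minimal sketch of a mapWithState stream with an explicit checkpoint interval; the (userId, count) pair types and the 15-second interval are assumptions:

import org.apache.spark.streaming.{Seconds, State, StateSpec}

// pairs: DStream[(String, Long)] of (userId, count) built from the Kafka direct stream
val spec = StateSpec.function { (userId: String, value: Option[Long], state: State[Long]) =>
  val newTotal = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  state.update(newTotal)
  (userId, newTotal)
}

val stateStream = pairs.mapWithState(spec)
// override the default state checkpoint interval (batchDuration * 10) if desired
stateStream.checkpoint(Seconds(15))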

Why does Spark throw "SparkException: DStream has not been initialized" when restoring from checkpoint?

I am restoring a stream from an HDFS checkpoint (a ConstantInputDStream, for example), but I keep getting SparkException: <X> has not been initialized.
Is there something specific I need to do when restoring from a checkpoint?
I can see that it wants DStream.zeroTime set, but when the stream is restored zeroTime is null. It doesn't get restored, possibly because it is a private member, IDK. I can see that the StreamingContext referenced by the restored stream does have a value for zeroTime.
initialize is a private method and gets called by StreamingContext.graph.start but not by StreamingContext.graph.restart, presumably because it expects zeroTime to have been persisted.
Does someone have an example of a stream that recovers from a checkpoint and has a non-null value for zeroTime?
def createStreamingContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Duration(1000))
  ssc.checkpoint(checkpointDir)
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)

val socketStream = ssc.socketTextStream(...)
socketStream.checkpoint(Seconds(1))
socketStream.foreachRDD(...)
The problem was that I created the dstreams after the StreamingContext had been recreated from the checkpoint, i.e. after StreamingContext.getOrCreate. Creating the dstreams and all transformations should have been done inside createStreamingContext.
The issue was filed as [SPARK-13316] "SparkException: DStream has not been initialized" when restoring StreamingContext from checkpoint and the dstream is created afterwards.
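A minimal sketch of the corrected layout, with the dstream and its transformations created inside the factory function (the host/port and the output action are placeholders):

def createStreamingContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Duration(1000))
  ssc.checkpoint(checkpointDir)

  // create the dstreams and register all transformations and outputs here,
  // before the context is returned, so they can be restored from the checkpoint
  val socketStream = ssc.socketTextStream("localhost", 9999)
  socketStream.checkpoint(Seconds(1))
  socketStream.foreachRDD(rdd => rdd.foreach(println))

  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
ssc.start()
ssc.awaitTermination()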
This exception may also occur when you are trying to use the same checkpoint directory for two different Spark Streaming jobs.
Try using a unique checkpoint directory for each Spark job.
ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.spark.SparkException: org.apache.spark.streaming.dstream.FlatMappedDStream@6c17c0f8 has not been initialized
  at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:313)
  at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
  at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
  at scala.Option.orElse(Option.scala:289)
The above error occurred because I also had another Spark job writing to the same checkpointdir. Even though that other job was not running, the fact that it had written to the checkpointdir meant the new Spark job was not able to configure the StreamingContext.
I deleted the contents of the checkpointdir and resubmitted the Spark job, and the issue was resolved.
Alternatively, to keep things simple, you can just use a separate checkpointdir for each Spark job.
