Periodically persist the computed result using Spark Streaming? - apache-spark

I am working on a requirement to display a real-time dashboard based on some aggregations computed over the input data.
I have just started exploring Spark/Spark Streaming, and I see that we can compute these aggregations in real time in micro-batches with Spark Streaming and feed them to the UI dashboard.
My question is: if the Spark Streaming job is stopped or crashes at any point after it has started, how will it resume from the position it was last processing when it comes back up? I understand Spark maintains an internal state that we update for every new batch of data we receive, but wouldn't that state be gone when it is restarted?
I feel we may have to periodically persist the running total/result so that Spark can resume processing by fetching it from there when it is restarted, but I am not sure how to do that with Spark Streaming.
I am also not sure whether Spark Streaming ensures by default that no data is lost, as I have only just started using it.
If anyone has faced a similar scenario, can you please share your thoughts on how I can address this?

Key points:
enable the write-ahead log (WAL) for the receiver
enable checkpointing
Detail
Enable the WAL: set spark.streaming.receiver.writeAheadLog.enable to true.
Enable checkpointing: checkpointing writes your application state to reliable storage periodically, so that when your application fails it can recover from the checkpoint files.
To enable writing checkpoints, set a checkpoint directory:
ssc.checkpoint("checkpoint.path")
To recover from a checkpoint on startup:
def main(args: Array[String]): Unit = {
  // must point at the same directory that was passed to ssc.checkpoint(...)
  val ssc = StreamingContext.getOrCreate("checkpoint.path", () => createContext())
  ssc.start()
  ssc.awaitTermination()
}
In the createContext function you create the StreamingContext and set up your own logic. For example:
def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("app.name")
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable the WAL, as above
  val streamingInterval = 30L // batch interval in seconds; Seconds(...) expects a number, not a string
  val ssc = new StreamingContext(conf, Seconds(streamingInterval))
  ssc.checkpoint("checkpoint.path")
  // your code here
  ssc
}
Here is the documentation on the steps required to deploy Spark Streaming applications, including recovering from driver/executor failures.
https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#deploying-applications

Spark Streaming acts as a consumer application. In a typical real-time setup the data is pulled from Kafka topics, and you can store the offsets of the consumed data in some data store; the same applies if you are reading from Twitter streams. You can follow the posts below to see how to store the offsets so that the application can resume from them if it crashes or is restarted.
http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
https://www.linkedin.com/pulse/achieving-exactly-once-semantics-kafka-application-ishan-kumar
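As a rough illustration of managing offsets yourself, here is a minimal sketch assuming the spark-streaming-kafka-0-10 direct stream; the broker address, topic name, group id and the idea of persisting offsetRanges to your own store are placeholders of mine, not something from the posts above:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offsets-sketch"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",            // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "dashboard-consumer",        // placeholder group id
  "enable.auto.commit" -> (false: java.lang.Boolean)   // we manage offsets ourselves
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // the direct stream exposes the Kafka offsets that make up this micro-batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd and update the dashboard aggregates / permanent store ...
  // then either save offsetRanges to your own data store, or commit them
  // back to Kafka only after the output has been written successfully:
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()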

Related

Is it possible to achieve a dynamic batch size in Spark Streaming?

To keep the code simple, I am willing to restart the Spark Streaming application in order to use a new batch size, but I need to keep the previous progress (it is acceptable to lose the batch that was being processed).
If I use checkpointing in Spark Streaming, the configuration cannot be changed when the application restarts from the checkpoint.
So I want to implement this by modifying the source code, but I don't know where to start. I would appreciate some guidance on how difficult this would be.
Since you are talking about the batch size, I'm assuming you are asking about Spark Streaming and not Structured Streaming.
There is a way to set the batch interval programmatically; refer to this link for the documentation.
The constructor of StreamingContext accepts a Duration object that defines the batch interval.
You can pass the batch interval by hardcoding it in the code, but that requires rebuilding the jar every time you need to change it. Instead, you can read it from a config file, so you don't have to rebuild the code each time.
Note: you have to set this property in the application's own config file, not in Spark's config file.
You can change the batch interval in the config and restart the application; this will not cause any problems with respect to checkpointing.
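The answer does not say which configuration library is used; as one possible sketch (an assumption on my part), the config object used below could be loaded with Typesafe Config from an application.conf on the classpath:

// application.conf (hypothetical):  batch-interval = 10
import com.typesafe.config.{Config, ConfigFactory}

val config: Config = ConfigFactory.load()   // loads application.conf from the classpath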
val sparkConf: SparkConf = new SparkConf()
  .setAppName("app-name")
  .setMaster("app-master")
val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(config.getInt("batch-interval")))
Cheers!!

Spark Streaming Re-Use Physical Plan

We have a Spark Streaming application which performs a few heavy stateful computations against the incoming stream of data. The state is maintained in some storage (HDFS/Hive/HBase/Cassandra), and at the end of every window the delta change in state is written back using an append-only write strategy.
The issue is that for each and every window the planning phase takes a lot of time; in fact, more than the compute time.
dStream.foreachRDD(rdd => {
  // requires `import spark.implicits._` in scope for toDS()
  val dataset_1 = rdd.toDS()
  val dataset_2 = dataset_1.join(..)
  val dataset_3 = dataset_2
    .map(..)
    .filter(..)
    .join(..)
  // A few more joins & transformations
  val finalDataset = ..
  finalDataset
    .write
    .option("maxRecordsPerFile", 5000)
    .format(save_format)
    .mode("append")
    .insertInto("table_name")
})
Is there a way to re-use the physical plan from the last window and avoid Spark running the planning stages for every window, given that practically nothing changes between windows?
I don't think so, and that's one of the many reasons why you should be using Spark Structured Streaming instead.
Among the features of the underlying streaming engine is the ability to re-use the physical plan of a streaming query.
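For comparison, here is a minimal Structured Streaming sketch of the same shape of pipeline (the rate source, the dim table and its columns are placeholders I made up to keep the example self-contained); the whole query is handed to the engine once, and the engine, rather than user code inside foreachRDD, owns the planning across micro-batches:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// streaming source (the built-in rate source, used only to keep the sketch runnable)
val stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

// static/dimension data to join against (hypothetical)
val dim = Seq((0L, "a"), (1L, "b")).toDF("value", "label")

// join + filter + sink declared once as a single streaming query
val query = stream.join(dim, Seq("value"))
  .filter(col("label") =!= "")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()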

Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata, which is also in HDFS; both are Parquet.
My metadata sometimes gets updated, and I need to join against the freshest, most recent version, which ideally means reading the metadata from HDFS on every streaming micro-batch.
I tried to test this, but unfortunately Spark reads the metadata only once and (supposedly) caches the file listing, even when I set spark.sql.parquet.cacheMetadata=false.
Is there a way to read it on every micro-batch? ForeachWriter is not what I'm looking for.
Here's a code example:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ gets updated by Spark in append mode.
As far as I understand, if the metadata were accessed through the JDBC source with Spark Structured Streaming, Spark would query it on each micro-batch.
As far as I have found, there are two options (see the sketch after them):
Create a temp view and refresh it at an interval:
metadata.createOrReplaceTempView("metadata")
and trigger the refresh from a separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. folders named with timestamps.
Restart the stream at an interval, as Tathagata Das suggested.
This second option is not suitable for me, since my metadata might be refreshed several times per hour.
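Here is a rough sketch of the first option, assuming the stream and spark values from the question's code; the refresh interval and the use of a scheduled executor are my own assumptions:

import java.util.concurrent.{Executors, TimeUnit}

// register the metadata under a name the streaming query can refer to
val metadata = spark.read.parquet("/tmp/metadata/")
metadata.createOrReplaceTempView("metadata")

// periodically invalidate the cached file listing from a background thread
// (every 5 minutes here; pick whatever matches how often the metadata is rewritten)
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = spark.catalog.refreshTable("metadata")
}, 5, 5, TimeUnit.MINUTES)

// join against the view by name so the refreshed data is picked up
val joinedStream = stream.join(spark.table("metadata"), Seq("id"))
joinedStream.writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("console")
  .start()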

Kafka broker (0.10.0 or higher) as DStream source for Spark Streaming in Python

Concretely, I am looking for a replacement or work-around for the KafkaUtils.createStream() API call in pyspark.streaming.kafka, which targets Kafka 0.8.0.
Trying to use this (deprecated) function with Kafka 0.10.0 produces an error. I was thinking about creating a custom receiver, but there isn't any pyspark support for that either, and it seems no fix is in the making.
Here is an outline of the application I'm trying to build. It should create a live (aggregated) dashboard from different production-line resources, which feed into Kafka. At the same time, the processed data will go to permanent storage. The goal is to build an anomaly-detection system on top of this permanent data.
I can work around this problem for the permanent storage by batch-processing the data before sending it off, but that obviously doesn't work for the streaming part.
Below you can find the pseudo code of what the script should look like:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='abc')
sc.setLogLevel('WARN')
ssc = StreamingContext(sc, 2)
## Create Dstream object from Kafka (This is where I'm stuck)
## Transform and create aggregated windows
ssc.start()
## Catch output and send back to Kafka as producer
All advice and solutions are more than welcome.

How to configure checkpointing to redeploy a Spark Streaming application?

I'm using Spark Streaming to count unique users. I use updateStateByKey, so I need to configure a checkpoint directory. I also load the data from the checkpoint when starting the application, as in the example in the docs:
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
Here is the question: if my code is changed and I re-deploy it, will the checkpoint still be loaded no matter how much the code has changed, or do I need to use my own logic to persist my data and load it on the next run?
And if I use my own logic to save and load the DStream, then when the application restarts after a failure, won't the data be loaded both from the checkpoint directory and from my own database?
The checkpoint itself includes your metadata, RDDs, DAG and even your logic. If you change your logic and try to run from the last checkpoint, you are very likely to hit an exception.
If you want to use your own logic to save your data somewhere as a checkpoint, you need to implement a Spark action that pushes your checkpoint data to whatever database you use; on the next run, load that data as an initial RDD (if you are using the updateStateByKey API) and continue your logic.
I asked this question on the Spark mailing list and got an answer; I've analyzed it on my blog. I'll post the summary here:
The way is to use both checkpointing and our own data-loading mechanism, loading our own data as the initialRDD of updateStateByKey. In both situations the data is then neither lost nor duplicated:
When we change the code and redeploy the Spark application, we shut down the old application gracefully and clean up the checkpoint data, so the only data loaded is the data we saved ourselves.
When the Spark application fails and restarts, it loads the data from the checkpoint. Since the DAG (including the initial state) is restored from the checkpoint, it will not load our own data as the initialRDD again, so the only data loaded is the checkpointed data.
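As a rough sketch of this approach (the paths, the socket source, and the loadSavedState helper are hypothetical, not from the answers above), the state we persisted ourselves is fed back in as the initial RDD of updateStateByKey:

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// hypothetical helper: load the state we saved ourselves, e.g. "user,count" lines
def loadSavedState(ssc: StreamingContext): RDD[(String, Long)] =
  ssc.sparkContext.textFile("/path/to/saved/state")
    .map(_.split(","))
    .map(a => (a(0), a(1).toLong))

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("unique-users"), Seconds(30))
  ssc.checkpoint("/path/to/checkpoint")

  val users = ssc.socketTextStream("localhost", 9999)   // hypothetical source of user ids
  val counts = users.map(user => (user, 1L)).updateStateByKey(
    (values: Seq[Long], state: Option[Long]) => Some(state.getOrElse(0L) + values.sum),
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    loadSavedState(ssc))                                 // our own data as the initial state

  counts.print()
  ssc
}

val ssc = StreamingContext.getOrCreate("/path/to/checkpoint", () => createContext())
ssc.start()
ssc.awaitTermination()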
