Spark Streaming checkpoint: data checkpointing control - apache-spark

I'm confused about Spark Streaming checkpointing; please help me, thanks!
There are two types of checkpointing (metadata and data checkpointing), and the guide says data checkpointing is used with stateful transformations. This is what confuses me: if I don't use stateful transformations, does Spark still write data checkpointing content?
Can I control the checkpoint position in code?
Can I control which RDDs are written to the data checkpoint in streaming, as in a batch Spark job?
Can I use foreachRDD { rdd => rdd.checkpoint() } in streaming?
If I don't use rdd.checkpoint(), what is the default behavior of Spark? Which RDDs are written to HDFS?

You can find an excellent guide at this link.
No. There is no need to checkpoint data, because there is no intermediate state you need to recover in a stateless computation.
I don't think you need to checkpoint any RDD after a computation in streaming. RDD checkpointing is designed to address lineage issues; streaming checkpointing is all about streaming reliability and failure recovery.
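To make the distinction concrete, here is a minimal sketch of the streaming-level checkpoint setup from the programming guide (the checkpoint directory, batch interval and socket source are placeholders, not taken from the question); metadata checkpointing always goes to this directory, while data checkpointing of intermediate RDDs only kicks in for stateful transformations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint directory on reliable storage.
val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-demo")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Enables metadata checkpointing; data checkpointing of intermediate RDDs
  // only happens for stateful transformations such as updateStateByKey / mapWithState.
  ssc.checkpoint(checkpointDir)
  // DStream operations must be defined inside this function so they are
  // rebuilt correctly on recovery from the checkpoint.
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.count().print()
  ssc
}

// Recover from an existing checkpoint if there is one, otherwise create a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()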

Related

Does spark.sql.adaptive.enabled work for Spark Structured Streaming?

I work with Apache Spark Structured Streaming. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Since it is built on the Spark SQL engine, does that mean spark.sql.adaptive.enabled works for Spark Structured Streaming?
It's disabled in the Spark code - see StreamExecution:
// Adaptive execution can change num shuffle partitions, disallow
sparkSessionForStream.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")
The reason is that it might cause issues when there is state on the stream (more details in the ticket that added this restriction, SPARK-19873).
If you still want to enable it for Spark Structured Streaming (e.g. if you are sure that it won't cause any harm in your use case), you can do that inside the foreachBatch method, by setting batchDF.sparkSession.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "true") - this overrides the Spark code that disabled it.
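A minimal sketch of that foreachBatch override, assuming a rate source and a Parquet output path that are only placeholders here (the configuration key string is the one behind SQLConf.ADAPTIVE_EXECUTION_ENABLED):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("aqe-in-foreachbatch").getOrCreate()

// Hypothetical streaming source; replace with your own.
val stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

val query = stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Same key as SQLConf.ADAPTIVE_EXECUTION_ENABLED.key; re-enables AQE for this batch.
    batchDF.sparkSession.conf.set("spark.sql.adaptive.enabled", "true")
    // Hypothetical per-batch processing and output path.
    batchDF.groupBy("value").count()
      .write.mode("append").parquet("/tmp/aqe-demo-output")
  }
  .start()

query.awaitTermination()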
No. As stated in the Databricks docs (https://docs.databricks.com/spark/latest/spark-sql/aqe.html), AQE applies only to non-streaming queries, so it does not apply to Structured Streaming.
Think of the statefulness involved and the ideally small per-batch datasets; there are many such restrictions in Spark Structured Streaming.

Why Spark Structured Streaming is ideal for real-time operations?

I want to build a real-time application, but I don't know whether I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications, but it is not clear why...
Can someone explain it?
Spark Streaming works on something we call a micro-batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinctions from Spark Streaming which make it more inclined towards real streaming.
For developers, all they need to worry about is that in Spark Streaming you work with RDDs, but in Spark Structured Streaming you get DataFrames and Datasets.
If you want very low-level (i.e. per-record) operations, go for RDDs (i.e. Spark Streaming); but if your application can be built on DataFrames and query them like SQL in real time, then go for DataFrames (i.e. Spark Structured Streaming).
Eventually RDDs can be converted to DataFrames and vice versa.
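To illustrate that last point, a small sketch of converting back and forth (the column names and sample data are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-df-conversion").getOrCreate()
import spark.implicits._

// Hypothetical sample data as an RDD.
val rdd = spark.sparkContext.parallelize(Seq(("alice", 1), ("bob", 2)))

// RDD -> DataFrame (needs the implicits import above).
val df = rdd.toDF("name", "count")

// DataFrame -> RDD of Rows.
val backToRdd = df.rdd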

Spark Stateful Streaming with DataFrame

Is it possible to use a DataFrame as a State / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only key-value pair data structures (mapWithState, etc.).
My objective is to keep a fixed size FIFO buffer as a StateSpec that gets updated every time new data streams in. I'd like to implement the buffer in Spark DataFrame API, for compatibility with Spark ML.
I'm not entirely sure you can do this with Spark Streaming, but with the newer DataFrame-based Spark Structured Streaming you can express queries that get updated over time, given an incoming stream of data.
You can read more about Spark Structured Streaming in the official documentation.
If you are interested in interoperability with SparkML to deploy a trained model, you may also be interested in this article.
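As a rough idea of what such a continuously updated query looks like, here is a minimal Structured Streaming sketch; the rate source, window size and console sink are placeholder choices, not from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// Hypothetical source producing (timestamp, value) rows.
val input = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

// A windowed aggregation whose result is updated as new data arrives.
val counts = input
  .groupBy(window($"timestamp", "1 minute"))
  .count()

val query = counts.writeStream
  .outputMode("update")   // emit only rows changed by the latest micro-batch
  .format("console")
  .start()

query.awaitTermination()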

How to use RDD checkpointing to share datasets across Spark applications?

I have a Spark application and checkpoint the RDD in the code; a simple code snippet is as follows (it is very simple, just to illustrate my question):
@Test
def testCheckpoint1(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data)
  // sc is initialized in the setup
  sc.setCheckpointDir(Utils.getOutputDir())
  rdd.checkpoint()
  rdd.collect()
}
Once the RDD is checkpointed on the file system, I write another Spark application that would pick up the data checkpointed by the code above
and use it as an RDD starting point in this second application.
ReliableCheckpointRDD is exactly the RDD that does this work, but that RDD is private to Spark.
So, since ReliableCheckpointRDD is private, it looks like Spark does not recommend using ReliableCheckpointRDD outside Spark.
I would ask if there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and save the computed partitions to reliable storage, so you've got something already pre-computed in case your Spark application fails and stops.
Note that RDD checkpointing is very similar to RDD caching but caching would make the partial datasets private to some Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in the form of RDD checkpoints, but why would you even want to do that if you could save the partial results using an "official" interface such as JSON, Parquet or CSV?
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.
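For completeness, a sketch of that "official" route (the paths are placeholders): the first application writes the partial result as Parquet and the second picks it up as its starting point:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("share-datasets").getOrCreate()
import spark.implicits._

// Application 1: materialize the partial result to shared storage.
val words = List("Hello", "World", "Hello", "One", "Two").toDF("word")
words.write.mode("overwrite").parquet("hdfs:///tmp/shared/words")  // hypothetical path

// Application 2: read the dataset back (as a DataFrame, or .rdd if an RDD is needed).
val shared = spark.read.parquet("hdfs:///tmp/shared/words")
val sharedRdd = shared.rdd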

Spark Kafka streaming multiple partition window processing

I am using Spark Kafka streaming with the createDirectStream approach (Direct Approach),
and my requirement is to create a window on the stream, but I have seen this in the Spark docs:
"However, be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window()."
How can I solve this, as my data is getting corrupted?
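If the reason you need the one-to-one mapping is to read Kafka offset ranges, the pattern in the Kafka integration guide is to capture them in the first transformation on the direct stream, before any shuffling operation such as window(). A hedged sketch of that pattern (brokers, topic, group id and window sizes are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc: StreamingContext = ??? // assumed to already exist

// Hypothetical Kafka parameters.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "window-demo"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

// Capture offset ranges on the *first* transformation, while the
// RDD partition <-> Kafka partition mapping still holds.
var offsetRanges = Array.empty[OffsetRange]
val values = stream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map(record => record.value)

// Windowing (a shuffle) is fine afterwards; just don't try to read
// offsets from the windowed RDDs.
values.window(Seconds(60), Seconds(20)).count().print()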
