Exactly once semantics in Spark Streaming Direct Approach - apache-spark

Spark's official documentation says the Direct approach uses Kafka's SimpleConsumer API, which does not use Zookeeper to store offsets; instead, the offsets are stored by Spark's metadata checkpointing. The documentation also says the Direct approach guarantees exactly-once semantics.
When we enable Spark's metadata checkpointing using ssc.checkpoint("directory"), we never specify a checkpoint interval.
Now, for each microbatch, triggered after the microbatch interval, the driver sends the offsets to each task, and each task retrieves the data for its corresponding Kafka partition.
Questions:
Considering that the data retrieved from Kafka for the specified offsets is not persisted in Spark, and only the offsets are stored as part of its metadata checkpointing, doesn't the timing of the checkpointing matter, since it directly influences whether we get exactly-once, at-least-once, or at-most-once semantics? Does the checkpoint happen as soon as the microbatch is triggered and the direct stream retrieves data from Kafka, or does it happen when the microbatch completes?
Also, what do the offsets stored as part of metadata checkpointing signify? Do they specify the offsets already processed or the offsets yet to be processed?

Checkpointing is one of three options for storing offsets (checkpoints, Kafka itself, or your own data store). Checkpointing has several drawbacks and cannot guarantee exactly-once semantics unless your output is idempotent.
The documentation warns you about checkpointing as follows:
So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output.
See this section of the official documentation, which describes the three options in detail.
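For illustration, here is a minimal sketch of the third option ("your own data store") with the direct DStream API, following the pattern from the Kafka integration guide. stream is the direct Kafka stream (KafkaUtils.createDirectStream); jdbcUrl, user, password and the results/offsets tables are assumptions, and the results are collected to the driver only to keep the sketch short:

import java.sql.DriverManager
import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { rdd =>
  // Offset ranges delimiting exactly the data contained in this micro-batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.map(_.value).collect()

  // jdbcUrl, user and password are defined elsewhere (assumptions for this sketch)
  val conn = DriverManager.getConnection(jdbcUrl, user, password)
  try {
    conn.setAutoCommit(false)

    val insertResult = conn.prepareStatement("INSERT INTO results(value) VALUES (?)")
    results.foreach { v => insertResult.setString(1, v); insertResult.addBatch() }
    insertResult.executeBatch()

    val upsertOffset = conn.prepareStatement(
      "REPLACE INTO offsets(topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
    offsetRanges.foreach { o =>
      upsertOffset.setString(1, o.topic)
      upsertOffset.setInt(2, o.partition)
      upsertOffset.setLong(3, o.untilOffset)
      upsertOffset.addBatch()
    }
    upsertOffset.executeBatch()

    // Results and offsets succeed or fail together, so a replay after a crash
    // resumes exactly from the offsets that were committed with the output.
    conn.commit()
  } finally {
    conn.close()
  }
}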

Related

How does Structured Streaming ensure exactly-once writing semantics for file sinks?

I am writing a storage writer for Spark Structured Streaming which will partition the given dataframe and write it to different blob store accounts. The Spark documentation says that it ensures exactly-once semantics for file sinks, but also says that exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
Is the blob store an idempotent sink if I write in parquet format?
Also, how will the behavior change if I am doing streamingDF.writeStream.foreachBatch(... writing the DF here ...).start()? Will it still guarantee exactly-once semantics?
Possible duplicate: How to get Kafka offsets for structured query for manual and reliable offset management?
Update #1: Something like this:
output
  .writeStream
  .foreachBatch { (df: DataFrame, _: Long) =>
    // r is a Random and storagePaths holds the blob store locations (both defined elsewhere)
    val path = storagePaths(r.nextInt(3))
    df.persist()
    df.write.parquet(path)
    df.unpersist()
  }
  .start()
Micro-Batch Stream Processing
I assume that the question is about Micro-Batch Stream Processing (not Continuous Stream Processing).
Exactly-once semantics are guaranteed based on the internal registries of available and committed offsets (for the current stream execution, aka runId), as well as regular checkpoints (to persist processing state across restarts).
exactly once semantics are only possible if the source is re-playable and the sink is idempotent.
It is possible that whatever has already been processed but not recorded properly internally (see below) can be re-processed:
That means that all streaming sources in a streaming query should be replayable, to allow re-requesting data that has already been requested once.
That also means that the sink should be idempotent, so that data which was processed successfully and added to the sink can be added again without harm if a failure happens just before Structured Streaming manages to record the offsets as successfully processed (in the checkpoint).
Internals
Before the available data (by offsets) of any of the streaming sources or readers is processed, MicroBatchExecution commits the offsets to the Write-Ahead Log (WAL) and prints out the following INFO message to the logs:
Committed offsets for batch [currentBatchId]. Metadata [offsetSeqMetadata]
A streaming query (a micro-batch) is executed only when there is new data available (based on offsets) or the last execution requires another micro-batch for state management.
In addBatch phase, MicroBatchExecution requests the one and only Sink or StreamWriteSupport to process the available data.
Once a micro-batch finishes successfully, MicroBatchExecution commits the available offsets to the commits checkpoint, and the offsets are then considered processed.
MicroBatchExecution prints out the following DEBUG message to the logs:
Completed batch [currentBatchId]
When you use foreachBatch, Spark only guarantees that foreachBatch will be called once per batch. But if an exception occurs while foreachBatch is executing, Spark will try to call it again for the same batch. In that case we can get duplicates if we store to multiple storages and the exception occurs part-way through storing.
So you have to handle exceptions during storing yourself to avoid duplication.
In my practice, when I needed to store to multiple storages, I created a custom sink and used the Data Source API V2, which supports commits.
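One hedged alternative for the original foreachBatch question (assuming Spark 2.4+ and the hypothetical storagePaths list from the question's snippet): make each storage write idempotent by keying the output path on the batchId, so a retried batch overwrites its own earlier, possibly partial, output instead of duplicating it:

import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .foreachBatch { (df: DataFrame, batchId: Long) =>
    df.persist()
    // A batchId-scoped directory per storage makes retries of the same batch overwrite
    // their own output rather than append duplicates.
    storagePaths.foreach { base =>
      df.write.mode("overwrite").parquet(s"$base/batchId=$batchId")
    }
    df.unpersist()
  }
  .option("checkpointLocation", "/checkpoints/multi-storage-writer")
  .start()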

Spark structured streaming from Kafka checkpoint and acknowledgement

In my Spark Structured Streaming application, I am reading messages from Kafka, filtering them and then finally persisting them to Cassandra. I am using Spark 2.4.1. From the Structured Streaming documentation:
Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
But I am not sure how Spark actually achieves this. In my case, if the Cassandra cluster is down, leading to failures in the write operation, will the checkpoint for Kafka not record those offsets?
Is the Kafka checkpoint offset based only on successful reads from Kafka, or is the entire operation, including the write, considered for each message?
Spark Structured Streaming is not committing offsets to Kafka as a "normal" Kafka consumer would do.
Spark is managing the offsets internally with a checkpointing mechanism.
Have a look at the first response of following question which gives a good explanation about how the state is managed with checkpoints and commitslog: How to get Kafka offsets for structured query for manual and reliable offset management?
Spark uses multiple log files to ensure fault tolerance.
The ones relevant to your query are the offset log and the commit log.
from the StreamExecution class doc:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
So when Spark reads from Kafka, it writes the offsets to the offsetLog, and only after processing the data and writing it to the sink (in your case Cassandra) does it record the batch in the commitLog.
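For context, here is a minimal sketch of such a Kafka-to-Cassandra query (topic, keyspace, table and paths are assumptions; the Cassandra write goes through foreachBatch and the connector's batch DataFrame API). checkpointLocation is where the offsets/ and commits/ logs described above live:

import org.apache.spark.sql.DataFrame

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .filter("value IS NOT NULL")
  .writeStream
  .option("checkpointLocation", "/checkpoints/kafka-to-cassandra")
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .mode("append")
      .save()
  }
  .start()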

Spark 2.3.1 Structured Streaming state store inner working

I have been going through the documentation of Spark 2.3.1 on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) if so, how, per worker or per core?
It seems like in previous versions of Spark it was per worker, but I have no idea now. I know that it is backed by HDFS, but nothing explains how the in-memory store actually works.
Indeed, is it a distributed in-memory store? I am particularly interested in de-duplication: if data is streamed from, let's say, a large data set, then this needs to be planned for, as the whole "distinct" data set will ultimately be held in memory by the end of processing that data set. Hence one needs to plan the size of the workers or the master depending on how that state store works.
There is only one implementation of the State Store in Structured Streaming, and it is backed by an in-memory HashMap and HDFS.
While the in-memory HashMap is for data storage, HDFS is for fault tolerance.
The HashMap occupies executor memory on the worker, and each HashMap represents the versioned key-value data of an aggregated partition (generated after a stateful operator like deduplication, groupBy, etc.).
But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.
You are correct that there is no such documentation available.
I had to read the code (2.3.1) and wrote an article on how the State Store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/
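One practical note on the sizing concern from the question: for streaming deduplication, the state kept in the store grows without bound unless you define a watermark, which lets Spark evict entries older than the watermark. A minimal sketch (column names are assumptions):

// events is a streaming Dataset with an event-time column and a business key
val deduped = events
  .withWatermark("eventTime", "1 hour")      // upper bound on how late a duplicate may arrive
  .dropDuplicates("eventId", "eventTime")    // state older than the watermark can be dropped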

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on an HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery.
But old docs (like https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/) say that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable. As a solution, there is a practice of storing offsets in external storage that supports transactions, like MySQL or RedshiftDB.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
Previously, it could be done by casting the RDD to HasOffsetRanges:
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
But with the new streaming API, I have a Dataset of InternalRow and I can't find an easy way to fetch the offsets. The Sink API has only the addBatch(batchId: Long, data: DataFrame) method, so how am I supposed to get the offsets for a given batch id?
Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on an HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery.
Correct.
On every trigger, Spark Structured Streaming saves offsets to the offsets directory in the checkpoint location (defined using the checkpointLocation option or the spark.sql.streaming.checkpointLocation Spark property, or randomly assigned), which is supposed to guarantee that offsets are processed at most once. The feature is called Write-Ahead Logs.
The other directory in the checkpoint location is the commits directory for completed streaming batches, with a single file per batch (the file name being the batch id).
Quoting the official documentation in Fault Tolerance Semantics:
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Every time a trigger is executed, StreamExecution checks the directories and "computes" what offsets have been processed already. That gives you at-least-once semantics and exactly-once in total.
But old docs (...) says that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable.
There was a reason why you called them "old", wasn't there?
They refer to the old and (in my opinion) dead Spark Streaming that kept not only offsets but the entire query code, which led to situations where the checkpointing was almost unusable, e.g. when you changed the code.
Those times are over now, and Structured Streaming is more cautious about what is checkpointed and when.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
A solution could be to implement or somehow use MetadataLog interface that is used to deal with offset checkpointing. That could work.
how can I suppose to get an offset for given batch id?
It is not currently possible.
My understanding is that you will not be able to do it as the semantics of streaming are hidden from you. You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly once guarantees.
Quoting Michael Armbrust from his talk at Spark Summit Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark:
you should not have to reason about streaming
and further in the talk (on the next slide):
you should write simple queries & Spark should continuously update the answer
There is a way to get offsets (from any source, Kafka included) using StreamingQueryProgress, which you can intercept using a StreamingQueryListener and the onQueryProgress callback.
onQueryProgress(event: QueryProgressEvent): Unit Called when there is some status update (ingestion rate updated, etc.)
With StreamingQueryProgress you can access the sources property with SourceProgress entries that give you what you want.
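A minimal sketch of such a listener (it just prints; in practice you would persist the offsets). For the Kafka source, endOffset is a JSON string mapping topic-partitions to offsets:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // One SourceProgress per source in the query
    event.progress.sources.foreach { source =>
      println(s"${source.description}: endOffset=${source.endOffset}")
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})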
Relevant Spark DEV mailing list discussion thread is here.
Summary from it:
Spark Streaming will support getting offsets in future versions (> 2.2.0). JIRA ticket to follow - https://issues-test.apache.org/jira/browse/SPARK-18258
For Spark <= 2.2.0, you can get the offsets for a given batch by reading the offsets JSON from the checkpoint directory (the API is not stable, so be cautious):
import org.apache.hadoop.fs.Path
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.execution.streaming.OffsetSeqLog

val checkpointRoot: String = ??? // read 'checkpointLocation' from the custom sink params
val checkpointDir = new Path(new Path(checkpointRoot), "offsets").toUri.toString
val offsetSeqLog = new OffsetSeqLog(sparkSession, checkpointDir)

// One map of topic-partition -> offset per Kafka source in the query
val endOffsets: Seq[Map[TopicPartition, Long]] =
  offsetSeqLog.get(batchId).toSeq.flatMap { offsetSeq =>
    offsetSeq.offsets.flatten.map(offset => JsonUtilsWrapper.jsonToOffsets(offset.json))
  }

/**
 * Hack to access the private JsonUtils API.
 * Put this class into the org.apache.spark.sql.kafka010 package.
 */
object JsonUtilsWrapper {
  def offsetsToJson(partitionOffsets: Map[TopicPartition, Long]): String = {
    JsonUtils.partitionOffsets(partitionOffsets)
  }

  def jsonToOffsets(str: String): Map[TopicPartition, Long] = {
    JsonUtils.partitionOffsets(str)
  }
}
The resulting endOffsets contain the until offset for each topic/partition (one map per Kafka source in the query).
Getting the start offsets is problematic, because you have to read the 'commits' checkpoint directory. But usually you don't care about start offsets, because storing end offsets is enough for a reliable Spark job restart.
Please note that you also have to store the processed batch id in your storage. Spark can re-run a failed batch with the same batch id in some cases, so make sure to initialize your custom sink with the latest processed batch id (read from external storage) and ignore any batch whose id is not greater than latestProcessedBatchId. By the way, the batch id is not unique across queries, so you have to store the batch id for each query separately.
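A minimal sketch of that guard, written with foreachBatch for brevity (the same check applies in a custom Sink's addBatch). loadLastBatchId and saveResultsAndBatchId are hypothetical helpers standing in for your transactional storage:

import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .foreachBatch { (df: DataFrame, batchId: Long) =>
    // Last batch id this query committed, read from external storage (hypothetical helper)
    val lastProcessed = loadLastBatchId(queryName)
    if (batchId > lastProcessed) {
      // Store the data and the new batch id in one atomic transaction (hypothetical helper)
      saveResultsAndBatchId(df, queryName, batchId)
    }
    // Otherwise the batch was already processed; skip it to avoid duplicates
  }
  .start()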
A streaming Dataset with the Kafka source has offset as one of its fields. You can simply select the offsets in your query and save them into a JDBC sink.
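A minimal sketch of that approach (broker and topic are assumptions): the Kafka source exposes topic, partition and offset as regular columns, so every row can carry its own Kafka coordinates to whatever JDBC sink you use:

val withOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  // topic/partition/offset are ordinary columns on the Kafka source
  .selectExpr("topic", "partition", "offset", "CAST(value AS STRING) AS value")
// write withOffsets with your preferred sink (e.g. foreachBatch plus a JDBC batch write),
// storing the offsets alongside the payload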

How can I understand checkpoint recovery when using a Kafka direct InputDStream and stateful stream transformations?

On yarn-cluster I use a Kafka direct stream as input (e.g., the batch time is 15 s), and I want to aggregate the input messages by separate userIds.
So I use a stateful streaming API like updateStateByKey or mapWithState. But from the API source, I see that mapWithState's default checkpoint duration is batchDuration * 10 (in my case 150 s), while in the Kafka direct stream the partition offsets are checkpointed at every batch (15 s). Actually, every DStream can set a different checkpoint duration.
So, my question is:
When the streaming app crashes and I restart it, the Kafka offsets and the state stream RDDs are checkpointed asynchronously; in this case how can I ensure no data is lost? Or do I misunderstand the checkpoint mechanism?
How can I ensure no data is lost?
Stateful streams such as mapWithState or updateStateByKey require you to provide a checkpoint directory because that's part of how they operate: they store the state at every checkpoint interval to be able to recover it upon a crash.
Other than that, each DStream in the chain is free to request checkpointing as well; the question is, "do you really need to checkpoint the other streams?"
If an application crashes, Spark takes all the state RDDs stored inside the checkpoint and brings them back into memory, so your data there is as good as it was the last time Spark checkpointed it. One thing to keep in mind is that if you change your application code, you cannot recover the state from the checkpoint; you'll have to delete it. This means that if, for instance, you need to do a version upgrade, all data that was previously stored in the state will be gone unless you manually save it yourself in a manner which allows versioning.
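For completeness, the standard recovery pattern for this setup is StreamingContext.getOrCreate: on the first run the factory function builds the context (Kafka direct stream, mapWithState pipeline, checkpointing), and after a crash the whole thing, including offsets and state RDDs, is rebuilt from the checkpoint instead. A minimal sketch (directory and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-stateful-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-kafka-app")
  val ssc = new StreamingContext(conf, Seconds(15))
  ssc.checkpoint(checkpointDir)
  // Build the Kafka direct stream and the mapWithState pipeline here
  ssc
}

// Rebuilds the context from the checkpoint if one exists, otherwise calls the factory
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()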
