Specifying checkpoint location when structured streaming the data from kafka topics - apache-spark

I have built a Spark Structured Streaming application which reads data from Kafka topics. I have specified startingOffsets as latest. If there is any failure on the Spark side, from which point/offset will the data continue to be read after restarting? Also, is it a good idea to specify a checkpoint in the write stream, to make sure we resume reading from the point where the application/Spark failed?
Please let me know.

I would advise you to set startingOffsets to earliest and configure a checkpointLocation (HDFS, MinIO, or other storage). Note that setting kafka.group.id will not commit offsets back to Kafka (even in Spark 3+), unless you commit them manually using foreachBatch.
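A minimal sketch of that setup (broker address, topic name and paths below are placeholders): startingOffsets only matters on the very first run, after which the checkpoint decides where to resume.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-checkpointed").getOrCreate()

// On the very first run the query starts at the earliest offsets; on every restart the
// checkpoint decides where to resume, and startingOffsets is ignored.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

// The checkpoint location goes on the writeStream side and must be unique per query.
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/checkpoints/events-console")
  .start()

query.awaitTermination()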

You can use checkpoints, yes, or you can set kafka.group.id (in Spark 3+, at least).
Otherwise, it may start reading again from the end of the topic.
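For completeness, this is how the kafka.group.id option is set (Spark 3+; it assumes an existing SparkSession named spark and the same placeholder broker/topic as above). As the previous answer notes, it does not by itself make Spark resume from Kafka-committed offsets.
// Spark 3+ only: pin the consumer group id. Spark still tracks progress in its own
// checkpoint and will not resume from Kafka-committed offsets on its own.
val dfWithGroupId = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("kafka.group.id", "my-streaming-app")
  .load()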

Related

Structured Streaming startingOffsets and Checkpoint

I am confused about startingOffsets in structured streaming.
In the official docs here, the startingOffsets option is described per query type:
Streaming - is this continuous streaming?
Batch - is this for a query with foreachBatch or triggers? (latest is not allowed)
My workflow also has checkpoints enabled. How does this work together with startingOffsets?
If my workflow crashes and I have startingOffsets set to latest, does Spark check the Kafka offset, the Spark checkpoint offset, or both?
Streaming by default means "micro-batch" in Spark. Depending on the trigger you set, it will check the source for new data at the given frequency. You can use it like this:
val df = spark
  .readStream
  .format("kafka")
  .[...]
For Kafka there is also the experimental continuous trigger that allows processing of the data with quite a low latency. See the section Continuous Processing in the documentation.
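A hedged sketch of a continuous query, assuming placeholder broker and topics; only map-like operations and sinks such as Kafka or the console are supported in this mode.
import org.apache.spark.sql.streaming.Trigger

// Continuous processing (experimental): progress is checkpointed every second and records
// flow with low latency; broker and topics are placeholders.
val continuousQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "events-out")
  .option("checkpointLocation", "/checkpoints/continuous")
  .trigger(Trigger.Continuous("1 second"))
  .start()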
Batch, on the other hand, works like reading a text file (such as a CSV) that you do once. You can do this using:
val df = spark
  .read
  .format("kafka")
  .[...]
Note the difference: readStream for streaming and read for batch processing. In batch mode, startingOffsets cannot be set to latest (it defaults to earliest), and checkpointing does not apply, so in case of a planned or unplanned restart it will always start from the earliest offset again.
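For illustration, the integration guide's per-partition JSON format lets a batch read cover an explicit offset range (topic, broker and offset values below are placeholders; -2 means earliest, -1 means latest):
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", """{"events":{"0":23,"1":-2}}""")
  .option("endingOffsets", """{"events":{"0":50,"1":-1}}""")
  .load()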
Checkpointing in Structured Streaming needs to be set in the writeStream part and needs to be unique for every single query (in case you are running multiple streaming queries from the same source). If you have that checkpoint location set up and you restart your application, Spark will only look into those checkpoint files. Only when the query is started for the very first time does it check the startingOffsets option.
Remember that Structured Streaming never commits any offsets back to Kafka. It only relies on its checkpoint files. See my other answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
In case you plan to run your application, say, once a day it is therefore best to use readStream with checkpointing enabled and the trigger writeStream.trigger(Trigger.Once). A good explanation for this approach is given in a Databricks blog on Running Streaming Jobs Once a Day For 10x Cost Savings.
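A sketch of that run-once pattern, assuming placeholder broker, topic and paths:
import org.apache.spark.sql.streaming.Trigger

// Run-once pattern for a daily job: streaming semantics and checkpointed progress,
// but the query stops once the available data has been processed.
val onceQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/data/events")
  .option("checkpointLocation", "/checkpoints/daily-events")
  .trigger(Trigger.Once())
  .start()

onceQuery.awaitTermination()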

How to figure out Kafka startingOffsets and endingOffsets in a scheduled Spark batch job?

I am trying to read from a Kafka topic in my Spark batch job and publish to another topic. I am not using streaming because it does not fit our use case. According to the Spark docs, the batch job starts reading from the earliest Kafka offsets by default, so when I run the job again, it reads from the earliest again. How do I make sure that the job picks up from the next offset after the one it last read?
According to the Spark Kafka Integration docs, there are options to specify "startingOffsets" and "endingOffsets". But how do I figure them out?
I am using the spark.read.format("kafka") API to read the data from Kafka as a Dataset. But I did not find any option to get the start and end offset range from this Dataset read.

How to implement exactly-once processing when reading from a directory using Spark Structured Streaming?

I'd like to use the concept of stream processing to read files from a local directory and then publish to Apache Kafka. I thought about using Spark Structured Streaming.
How is checkpointing handled if the stream fails after reading 50 lines of a file? Will it start from the 51st line when restarted, or will it read the file from the start again?
Also, will we have any issues if we use checkpointing in Structured Streaming when there is an upgrade or any change in the code?
When the streaming fails after reading 50 lines of a file, will it start from the 51st line when started next time, or will it again read from the start of the file?
Either the entire file is fully processed or none at all. That's how FileFormat works in general in Spark SQL and has little to do with Spark Structured Streaming in particular (since they share the underlying execution infrastructure).
In short, the engine will again read the file from the start.
That is also to say that there is no concept of a single line while processing files in Spark Structured Streaming. You process a streaming DataFrame that is an entire file (or even a couple of files) all at once, and whether you want to process the dataset line by line or in its entirety is up to you, the Spark developer.
Also, will we have any issues if we use checkpointing in Structured Streaming when there is an upgrade or any change in the code?
In theory, you should not. The purpose of the new checkpointing mechanism in Spark Structured Streaming (as compared to the legacy Spark Streaming's) was to allow for restarts and upgrades in a more comfortable way. The checkpointing uses just a little information (usually stored in JSON files) to restart processing from the point of the last successful checkpoint.
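For context, a minimal sketch of the directory-to-Kafka pipeline discussed above (paths, broker and topic are placeholders); each file is tracked as a whole in the checkpoint, so it is either fully processed or reprocessed entirely.
// Text file source: every line becomes a row with a single "value" column.
val lines = spark.readStream.text("/data/incoming")

val fileToKafka = lines.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "file-lines")
  .option("checkpointLocation", "/checkpoints/file-to-kafka")
  .start()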

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka source for Structured Streaming. As I understand it, it relies on an HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery.
But old docs (like https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/) say that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable. As a solution, there is a practice of storing offsets in external storage that supports transactions, like MySQL or RedshiftDB.
If I want to store offsets from the Kafka source in a transactional DB, how can I obtain the offsets from a structured stream batch?
Previously, it could be done by casting the RDD to HasOffsetRanges:
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
But with the new Streaming API, I have a Dataset of InternalRow and I can't find an easy way to fetch offsets. The Sink API only has the addBatch(batchId: Long, data: DataFrame) method, so how am I supposed to get the offsets for a given batch id?
Spark 2.2 introduced a Kafka source for Structured Streaming. As I understand it, it relies on the HDFS checkpoint dir to store offsets and guarantee "exactly-once" message delivery.
Correct.
On every trigger, Spark Structured Streaming saves offsets to the offsets directory in the checkpoint location (defined using the checkpointLocation option or the spark.sql.streaming.checkpointLocation Spark property, or randomly assigned), which is supposed to guarantee that offsets are processed at most once. The feature is called Write-Ahead Logs.
The other directory in the checkpoint location is the commits directory for completed streaming batches, with a single file per batch (the file name being the batch id).
Quoting the official documentation in Fault Tolerance Semantics:
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Every time a trigger is executed, StreamExecution checks the directories and "computes" what offsets have been processed already. That gives you at-least-once semantics and exactly-once in total.
But old docs (...) say that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable.
There was a reason why you called them "old", wasn't there?
They refer to the old and (in my opinion) dead Spark Streaming that kept not only offsets but the entire query code, which led to situations where checkpointing was almost unusable, e.g. when you changed the code.
Those times are over now, and Structured Streaming is more cautious about what is checkpointed and when.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
A solution could be to implement or somehow use the MetadataLog interface that is used to deal with offset checkpointing. That could work.
How am I supposed to get the offsets for a given batch id?
It is not currently possible.
My understanding is that you will not be able to do it, as the semantics of streaming are hidden from you. You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly-once guarantees.
Quoting Michael Armbrust from his talk at Spark Summit Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark:
you should not have to reason about streaming
and further in the talk (on the next slide):
you should write simple queries & Spark should continuously update the answer
There is a way to get offsets (from any source, Kafka included) using StreamingQueryProgress, which you can intercept using a StreamingQueryListener and the onQueryProgress callback.
onQueryProgress(event: QueryProgressEvent): Unit Called when there is some status update (ingestion rate updated, etc.)
With StreamingQueryProgress you can access the sources property with SourceProgress entries that give you what you want.
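A small sketch of that approach, assuming an existing SparkSession named spark; startOffset and endOffset arrive as JSON strings that you could persist instead of printing:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Logs the per-source offset range after every trigger.
class OffsetLoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { source =>
      // For Kafka the offsets look like {"events":{"0":1234}}
      println(s"query=${event.progress.id} source=${source.description} " +
        s"start=${source.startOffset} end=${source.endOffset}")
    }
  }
}

spark.streams.addListener(new OffsetLoggingListener)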
Relevant Spark DEV mailing list discussion thread is here.
Summary from it:
Spark Streaming will support getting offsets in future versions (> 2.2.0). JIRA ticket to follow - https://issues.apache.org/jira/browse/SPARK-18258
For Spark <= 2.2.0, you can get offsets for the given batch by reading a JSON from the checkpoint directory (the API is not stable, so be cautious):
import org.apache.hadoop.fs.Path
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.execution.streaming.OffsetSeqLog

val checkpointRoot = // read 'checkpointLocation' from custom sink params
val checkpointDir = new Path(new Path(checkpointRoot), "offsets").toUri.toString
val offsetSeqLog = new OffsetSeqLog(sparkSession, checkpointDir)

// Offsets are stored as JSON per source; for a single Kafka source take the first entry.
val endOffset: Map[TopicPartition, Long] = offsetSeqLog.get(batchId)
  .map(_.offsets.flatten.map(o => JsonUtilsWrapper.jsonToOffsets(o.json)))
  .getOrElse(Seq.empty)
  .headOption
  .getOrElse(Map.empty)
/**
 * Hack to access a private API.
 * Put this class into the org.apache.spark.sql.kafka010 package.
 */
object JsonUtilsWrapper {
  def offsetsToJson(partitionOffsets: Map[TopicPartition, Long]): String = {
    JsonUtils.partitionOffsets(partitionOffsets)
  }

  def jsonToOffsets(str: String): Map[TopicPartition, Long] = {
    JsonUtils.partitionOffsets(str)
  }
}
This endOffset will contain the end ("until") offset for each topic/partition.
Getting the start offsets is problematic, because you have to read the 'commit' checkpoint dir. But usually you don't care about the start offsets, because storing the end offsets is enough for a reliable Spark job restart.
Please note that you have to store the processed batch id in your storage as well. Spark can re-run a failed batch with the same batch id in some cases, so make sure to initialize a custom Sink with the latest processed batch id (which you should read from external storage) and ignore any batch with id < latestProcessedBatchId. By the way, the batch id is not unique across queries, so you have to store the batch id for each query separately.
A streaming Dataset with the Kafka source has offset as one of its fields. You can simply select all the offsets in the query and save them into a JDBC sink.
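A hedged sketch of that idea using foreachBatch (Spark 2.4+); the broker, topic and JDBC connection details are placeholders:
import org.apache.spark.sql.DataFrame

val toJdbc = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  // topic/partition/offset are regular columns of the Kafka source.
  .selectExpr("topic", "partition", "offset", "CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host/streaming")
      .option("dbtable", "kafka_records")
      .option("user", "spark")
      .option("password", "***")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/checkpoints/kafka-to-jdbc")
  .start()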

How Spark Structured Streaming handles backpressure?

I'm analyzing the backpressure feature in Spark Structured Streaming. Does anyone know the details? Is it possible to tune how incoming records are processed in code?
Thanks
If you mean dynamically changing the size of each internal batch in Structured Streaming, then NO. There are no receiver-based sources in Structured Streaming, so that's totally unnecessary. From another point of view, Structured Streaming cannot do real backpressure because, for example, Spark cannot tell other applications to slow down the speed at which they push data into Kafka.
Generally, Structured Streaming will try to process data as fast as possible by default. There are options in each source that allow you to control the processing rate, such as maxFilesPerTrigger in the file source and maxOffsetsPerTrigger in the Kafka source (see the sketch after the links below). Read the following links for more details:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
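As a sketch of the rate-limiting option mentioned above (broker, topic and the cap value are placeholders):
// Cap how much the Kafka source reads per micro-batch; the limit is split across partitions.
val limited = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", 10000)
  .load()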
Handling backpressure is needed only in push-based mechanisms. Kafka consumers are pull-based: Spark will pull the next batch of records only when the current batch has finished processing and saving. If processing and saving are delayed in Spark, it won't pull a new batch of records, so there is no need for backpressure handling.
maxOffsetsPerTrigger can change the number of records processed per Spark micro-batch, and backpressure.enabled changes the rate of receiving, but that's not the same as backpressure, where you go and tell the source to slow down.
