Spark structured streaming consistency across sinks - apache-spark

I'd like to understand better the consistency model of Spark 2.2 structured streaming in the following case :
one source (Kinesis)
2 queries from this source towards 2 different sinks : one file sink for archive purpose (S3), and another sink for processed data (DB or file, not yet decided)
I'd like to understand if there's any consistency guarantee across sinks, at least under certain circumstances :
Can one of the sink be way ahead of the other ? Or are they consuming data at the same speed on the source (since its the same source) ? Can they be synchronous ?
If I (gracefully) stop the stream application, will the data on the 2 sinks consistent ?
The reason is I'd like to build a Kappa-like processing app, with the ability to suspend/shutdown the streaming part when I want to reprocess some history, and, when I resume the streaming, avoid reprocessing something that has already been processed (as being in the history), or missing some (eg. some data that has not been committed to the archive, and then skipped as already processed when the streaming resume)

One important thing to keep in mind is the 2 sinks will be used from 2 distinct queries, each reading independently from the source. So checkpointing is done per-query.
Whenever you call start on a DataStreamWriter that results in a new query and if you set checkpointLocation each query will have its own checkpointing to track offsets from the sink.
val input = spark.readStream....
val query1 = input.select('colA, 'colB)
.writeStream
.format("parquet")
.option("checkpointLocation", "path/to/checkpoint/dir1")
.start("/path1")
val query2 = input.select('colA, 'colB)
.writeStream
.format("csv")
.option("checkpointLocation", "path/to/checkpoint/dir2")
.start("/path2")
So each query is reading from the source and tracking offsets independently. Which then also means, each query can be at different offsets of the input stream and you can restart either or both without impacting the other.
UPDATE
I wanted to make another suggestion now that Databricks Delta is open sourced. A common pattern I've used is landing data from upstream sources directly into an append-only Delta table. Then, with Structured Streaming, you can efficiently subscribe to the table and process the new records incrementally. Delta's internal transaction log is more efficient than S3 file listings required with the basic file source. This ensures you have a consistent source of data across multiple queries, pulling from S3 vs Kinesis.

What Silvio has written is absolutely correct.
Writing to 2 sinks will start two streaming queries running independently of each other ( effectively 2 streaming applications reading same data 2 times and processing 2 times and checkpointing on their own).
I want to just add that if you want both the queries to stop/pause at the same time in case of restart or failure of any one of the query, there is option of using an api : awaitAnyTermination()
Instead of using :
query.start().awaitTermination()
use :
sparkSession.streams.awaitAnyTermination()
adding excerpts from the api documentation :
/**
* Wait until any of the queries on the associated SQLContext has terminated since the
* creation of the context, or since `resetTerminated()` was called. If any query was terminated
* with an exception, then the exception will be thrown.
*
* If a query has terminated, then subsequent calls to `awaitAnyTermination()` will either
* return immediately (if the query was terminated by `query.stop()`),
* or throw the exception immediately (if the query was terminated with exception). Use
* `resetTerminated()` to clear past terminations and wait for new terminations.
*
* In the case where multiple queries have terminated since `resetTermination()` was called,
* if any query has terminated with exception, then `awaitAnyTermination()` will
* throw any of the exception. For correctly documenting exceptions across multiple queries,
* users need to stop all of them after any of them terminates with exception, and then check the
* `query.exception()` for each query.
*
* #throws StreamingQueryException if any query has terminated with an exception
*
* #since 2.0.0
*/
#throws[StreamingQueryException]
def awaitAnyTermination(): Unit = {
awaitTerminationLock.synchronized {
while (lastTerminatedQuery == null) {
awaitTerminationLock.wait(10)
}
if (lastTerminatedQuery != null && lastTerminatedQuery.exception.nonEmpty) {
throw lastTerminatedQuery.exception.get
}
}
}

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above ? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one in a streaming query (I'm repeating it myself to remember better as I seem to have been caught multiple times while with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not to share data for which Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select(cast('value).as("string")) ...
val df1 = strDf.filter(...) # in "parallel"
val df2 = strDf.filter(...) # in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from Kafka stream using microbatching (readStream), processing them and writing results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing microbatches of size 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc.) but possibly different processing algorithm), Spark finishes the previous running query and creates a new one with the same ID (so it starts from the very same offset on which the previous job "finished").
How to tell Spark that the second job is different from the first one, so there is no need to restore from checkpoint (i.e. intended behaviour is to create a completely new streaming query not connected to previous one, and keep the previous one running)?
You can achieve independence of the two streaming queries by setting the checkpointLocation option in their respective writeStream call. You should not set the checkpoint location centrally in the SparkSession.
That way, they can run independently and will not interfere from each other.

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka's structured streaming source. As I understand, it's relying on HDFS checkpoint directory to store offsets and guarantee an "exactly-once" message delivery.
But old docks (like https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/) says that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable. As a solution, there is a practice to support storing offsets in external storage that supports transactions like MySQL or RedshiftDB.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
Previously, it can be done by casting RDD to HasOffsetRanges:
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
But with new Streaming API, I have an Dataset of InternalRow and I can't find an easy way to fetch offsets. The Sink API has only addBatch(batchId: Long, data: DataFrame) method and how can I suppose to get an offset for given batch id?
Spark 2.2 introduced a Kafka's structured streaming source. As I understand, it's relying on HDFS checkpoint dir to store offsets and guarantee an "exactly-once" message delivery.
Correct.
Every trigger Spark Structured Streaming will save offsets to offset directory in the checkpoint location (defined using checkpointLocation option or spark.sql.streaming.checkpointLocation Spark property or randomly assigned) that is supposed to guarantee that offsets are processed at most once. The feature is called Write Ahead Logs.
The other directory in the checkpoint location is commits directory for completed streaming batches with a single file per batch (with a file name being the batch id).
Quoting the official documentation in Fault Tolerance Semantics:
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Every time a trigger is executed StreamExecution checks the directories and "computes" what offsets have been processed already. That gives you at least once semantics and exactly once in total.
But old docs (...) says that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable.
There was a reason why you called them "old", wasn't there?
They refer to the old and (in my opinion) dead Spark Streaming that kept not only offsets but the entire query code that led to situations where the checkpointing were almost unusable, e.g. when you change the code.
The times are over now and Structured Streaming is more cautious what and when is checkpointed.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
A solution could be to implement or somehow use MetadataLog interface that is used to deal with offset checkpointing. That could work.
how can I suppose to get an offset for given batch id?
It is not currently possible.
My understanding is that you will not be able to do it as the semantics of streaming are hidden from you. You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly once guarantees.
Quoting Michael Armbrust from his talk at Spark Summit Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark:
you should not have to reason about streaming
and further in the talk (on the next slide):
you should write simple queries & Spark should continuously update the answer
There is a way to get offsets (from any source, Kafka including) using StreamingQueryProgress that you can intercept using StreamingQueryListener and onQueryProgress callback.
onQueryProgress(event: QueryProgressEvent): Unit Called when there is some status update (ingestion rate updated, etc.)
With StreamingQueryProgress you can access sources property with SourceProgress that gives you what you want.
Relevant Spark DEV mailing list discussion thread is here.
Summary from it:
Spark Streaming will support getting offsets in future versions (> 2.2.0). JIRA ticket to follow - https://issues-test.apache.org/jira/browse/SPARK-18258
For Spark <= 2.2.0, you can get offsets for the given batch by reading a json from checkpoint directory (the API is not stable, so be cautious):
val checkpointRoot = // read 'checkpointLocation' from custom sink params
val checkpointDir = new Path(new Path(checkpointRoot), "offsets").toUri.toString
val offsetSeqLog = new OffsetSeqLog(sparkSession, checkpointDir)
val endOffset: Map[TopicPartition, Long] = offsetSeqLog.get(batchId).map { endOffset =>
endOffset.offsets.filter(_.isDefined).map { str =>
JsonUtilsWrapper.jsonToOffsets(str.get.json)
}
}
/**
* Hack to access private API
* Put this class into org.apache.spark.sql.kafka010 package
*/
object JsonUtilsWrapper {
def offsetsToJson(partitionOffsets: Map[TopicPartition, Long]): String = {
JsonUtils.partitionOffsets(partitionOffsets)
}
def jsonToOffsets(str: String): Map[TopicPartition, Long] = {
JsonUtils.partitionOffsets(str)
}
}
This endOffset will contain the until offset for each topic/partition.
Getting the start offsets is problematic, cause you have to read the 'commit' checkpoint dir. But usually, you don't care about start offsets, because storing end offsets is enough for reliable Spark job re-start.
Please, note that you have to store the processed batch id in your storage as well. Spark can re-run failed batch with the same batch id in some cases, so make sure to initialize a Custom Sink with latest processed batch id (which you should read from external storage) and ignore any batch with id < latestProcessedBatchId. Btw, batch id is not unique across queries, so you have to store batch id for each query separately.
Streaming Dataset with Kafka source has offset as one of the field. You can simply query for all offsets in query and save them into JDBC Sink

Spark Streaming Replay

I have a Spark Streaming application to analyze events incoming from a Kafka broker. I have rules like below and new rules can be generated by combining existing ones:
If this event type occurs raise an alert.
If this event type occurs more than 3 times in a 5-minute interval, raise an alert.
In parallel, I save every incoming data to Cassandra. What I like to do is run this streaming app for historic data from Cassandra. For example,
<This rule> would have generated <these> alerts for <last week>.
Is there any way to do this in Spark or is it in roadmap? For example, Apache Flink has event time processing. But migrating existing codebase to it seems hard and I'd like to solve this problem with reusing my existing code.
This is fairly straight-forward, with some caveats. First, it helps to understand how this works from the Kafka side.
Kafka manages what are called offsets -- each message in Kafka has an offset relative to its position in a partition. (Partitions are logical divisions of a topic.) The first message in a partition has an offset of 0L, second one is 1L etc. Except that, because of log rollover and possibly topic compaction, 0L isn't always the earliest offset in a partition.
The first thing you are going to have to do is to collect the offsets for all of the partitions you want to read from the beginning. Here's a function that does this:
def getOffsets(consumer: SimpleConsumer, topic: String, partition: Int) : (Long,Long) = {
val time = kafka.api.OffsetRequest.LatestTime
val reqInfo = Map[TopicAndPartition,PartitionOffsetRequestInfo](
(new TopicAndPartition(topic, partition)) -> (new PartitionOffsetRequestInfo(time, 1000))
)
val req = new kafka.javaapi.OffsetRequest(
reqInfo, kafka.api.OffsetRequest.CurrentVersion, "test"
)
val resp = consumer.getOffsetsBefore(req)
val offsets = resp.offsets(topic, partition)
(offsets(offsets.size - 1), offsets(0))
}
You would call it like this:
val (firstOffset,nextOffset) = getOffsets(consumer, "MyTopicName", 0)
For everything you ever wanted to know about retrieving offsets from Kafka, read this. It's cryptic, to say the least. (Let me know when you fully understand the second argument to PartitionOffsetRequestInfo, for example.)
Now that you have firstOffset and lastOffset of the partition you want to look at historically, you then use the fromOffset parameter of createDirectStream, which is of type: fromOffset: Map[TopicAndPartition, Long]. You would set the Long / value to the firstOffset you got from getOffsets().
As for nextOffset -- you can use that to determine in your stream when you move from handling historical data to new data. If msg.offset == nextOffset then you are processing the first non-historical record within the partition.
Now for the caveats, directly from the documentation:
Once a context has been started, no new streaming computations can be
set up or added to it.
Once a context has been stopped, it cannot be
restarted.
Only one StreamingContext can be active in a JVM at the
same time.
stop() on StreamingContext also stops the SparkContext. To
stop only the StreamingContext, set the optional parameter of stop()
called stopSparkContext to false.
A SparkContext can be re-used to
create multiple StreamingContexts, as long as the previous
StreamingContext is stopped (without stopping the SparkContext)
before the next StreamingContext is created.
It's because of these caveats that I grab nextOffset at the same time as firstOffset -- so I can keep the stream up, but change the context from historical to present-time processing.

Spark SQL + Streaming issues

We are trying to implement a use case using Spark Streaming and Spark SQL that allows us to run user-defined rules against some data (See below for how the data is captured and used). The idea is to use SQL to specify the rules and return the results as alerts to the users. Executing the query based on each incoming event batch seems to be very slow. Would appreciate if anyone can suggest a better approach to implementing this use case. Also, would like know if Spark is executing the sql on the driver or workers? Thanks in advance. Given below are the steps we perform in order to achieve this -
1) Load the initial dataset from an external database as a JDBCRDD
JDBCRDD<SomeState> initialRDD = JDBCRDD.create(...);
2) Create an incoming DStream (that captures updates to the initialized data)
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createStream(ssc, flumeAgentHost, flumeAgentPort);
JavaDStream<SomeState> incomingDStream = flumeStream.map(...);
3) Create a Pair DStream using the incoming DStream
JavaPairDStream<Object,SomeState> pairDStream =
incomingDStream.map(...);
4) Create a Stateful DStream from the pair DStream using the initialized RDD as the base state
JavaPairDStream<Object,SomeState> statefulDStream = pairDStream.updateStateByKey(...);
JavaRDD<SomeState> updatedStateRDD = statefulDStream.map(...);
5) Run a user-driven query against the updated state based on the values in the incoming stream
incomingStream.foreachRDD(new Function<JavaRDD<SomeState>,Void>() {
#Override
public Void call(JavaRDD<SomeState> events) throws Exception {
updatedStateRDD.count();
SQLContext sqx = new SQLContext(events.context());
schemaDf = sqx.createDataFrame(updatedStateRDD, SomeState.class);
schemaDf.registerTempTable("TEMP_TABLE");
sqx.sql(SELECT col1 from TEMP_TABLE where <condition1> and <condition2> ...);
//collect the results and process and send alerts
...
}
);
The first step should be to identify which step is taking most of the time.
Please see the Spark Master UI and identify which Step/ Phase is taking most of the time.
There are few best practices + my observations which you can consider: -
Use Singleton SQLContext - See example - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
updateStateByKey can be a memory intensive operation in case of large number of keys. You need to check size of data processed by
updateStateByKey function and also if it fits well in the given
memory.
How is your GC behaving?
Are you really using "initialRDD"? if not then do not load it. In case it is static dataset then cache it.
Check the time taken by your SQL Query too.
Here are few more questions/ areas which can help you
What is the StorageLevel for DStreams?
Size of cluster and configuration of Cluster
version of Spark?
Lastly - ForEachRDD is an Output Operation which executes the given function on the Driver but RDD might actions and those actions are executed on worker nodes.
You may need to read this for better explaination about Output Operations - http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
I too facing the same issue could you please let me know if you have got the solution for the same? Though I have mentioned the detailed use case in below post.
Spark SQL + Window + Streming Issue - Spark SQL query is taking long to execute when running with spark streaming

Resources