Spark Streaming Re-Use Physical Plan - apache-spark

We have a Spark Streaming application that performs a few heavy stateful computations against the incoming stream of data. The state is maintained in external storage (HDFS/Hive/HBase/Cassandra) and at the end of every window the delta change in state is written back using an append-only write strategy.
The issue is that for each and every window the planning phase takes a lot of time; in fact, more than the compute time.
dStream.foreachRDD(rdd => {
  val dataset_1 = rdd.toDS()
  val dataset_2 = dataset_1.join(..)
  val dataset_3 = dataset_2
    .map(..)
    .filter(..)
    .join(..)
  // A few more joins & transformations
  val finalDataset = ..
  finalDataset
    .write
    .option("maxRecordsPerFile", 5000)
    .format(save_format)
    .mode("append")
    .insertInto("table_name")
})
Is there a way to re-use the physical plan from the last window and avoid the planning stages Spark runs every window, given that practically nothing changes between windows?

I don't think so, and that's one of the many reasons why you should be using Spark Structured Streaming instead.
Among the features of its underlying streaming engine is re-using the physical plan of a streaming query.
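For illustration, here is a minimal sketch of the same kind of pipeline expressed as one long-running Structured Streaming query instead of per-window foreachRDD (the Kafka source options, columns, lookup data and checkpoint path are assumptions; toTable requires Spark 3.1+):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stateful-append").getOrCreate()
import spark.implicits._

// Hypothetical static lookup side, standing in for the question's joins
val lookup = Seq(("a", 1), ("b", 2)).toDF("key", "dim")

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
  .option("subscribe", "events")                    // assumed topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

val finalDf = events
  .join(lookup, Seq("key"))
  .filter($"value".isNotNull)

finalDf.writeStream
  .option("maxRecordsPerFile", 5000)
  .option("checkpointLocation", "/checkpoints/stateful-append") // assumed path
  .outputMode("append")
  .toTable("table_name") // table name taken from the question
  .awaitTermination()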

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a Spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome, thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one sink in a streaming query (I'm repeating this to myself to remember it better, as I seem to have been caught out by it multiple times with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reason you mentioned (not sharing data, for which the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0), so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select('value.cast("string")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should create Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
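For completeness, a hedged sketch of how the two branches above could then be started as separate sinks (the console format and checkpoint paths are assumptions):
val q1 = df1.writeStream.format("console").option("checkpointLocation", "/tmp/chk-q1").start()
val q2 = df2.writeStream.format("console").option("checkpointLocation", "/tmp/chk-q2").start()
spark.streams.awaitAnyTermination()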

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on the HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery.
But old docs (like https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/) say that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable. As a solution, there is a practice of storing offsets in external storage that supports transactions, like MySQL or RedshiftDB.
If I want to store offsets from the Kafka source in a transactional DB, how can I obtain the offset from a structured stream batch?
Previously, it could be done by casting the RDD to HasOffsetRanges:
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
But with the new Streaming API, I have a Dataset of InternalRow and I can't find an easy way to fetch offsets. The Sink API has only the addBatch(batchId: Long, data: DataFrame) method, so how am I supposed to get an offset for a given batch id?
Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on the HDFS checkpoint dir to store offsets and guarantee "exactly-once" message delivery.
Correct.
On every trigger, Spark Structured Streaming saves offsets to the offsets directory in the checkpoint location (defined using the checkpointLocation option, the spark.sql.streaming.checkpointLocation Spark property, or randomly assigned), which is supposed to guarantee that offsets are processed at most once. The feature is called Write-Ahead Logs.
The other directory in the checkpoint location is the commits directory for completed streaming batches, with a single file per batch (the file name being the batch id).
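As a small sketch of where those directories come from (the paths and the Parquet sink are assumptions), the offsets/ and commits/ directories are created under whatever checkpointLocation the query is given:
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val query = kafkaDf.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/checkpoints/my-query") // offsets/ and commits/ appear here
  .start()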
Quoting the official documentation in Fault Tolerance Semantics:
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Every time a trigger is executed, StreamExecution checks the directories and "computes" what offsets have been processed already. That gives you at-least-once semantics and exactly-once in total.
But old docs (...) say that Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable.
There was a reason why you called them "old", wasn't there?
They refer to the old and (in my opinion) dead Spark Streaming, which checkpointed not only offsets but the entire query code, leading to situations where the checkpoints were almost unusable, e.g. when you changed the code.
Those times are over now and Structured Streaming is more cautious about what is checkpointed and when.
If I want to store offsets from Kafka source to a transactional DB, how can I obtain offset from a structured stream batch?
A solution could be to implement, or somehow use, the MetadataLog interface that is used to deal with offset checkpointing. That could work.
how am I supposed to get an offset for a given batch id?
It is not currently possible.
My understanding is that you will not be able to do it as the semantics of streaming are hidden from you. You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly once guarantees.
Quoting Michael Armbrust from his talk at Spark Summit Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark:
you should not have to reason about streaming
and further in the talk (on the next slide):
you should write simple queries & Spark should continuously update the answer
There is a way to get offsets (from any source, Kafka included) using StreamingQueryProgress, which you can intercept using a StreamingQueryListener and the onQueryProgress callback.
onQueryProgress(event: QueryProgressEvent): Unit Called when there is some status update (ingestion rate updated, etc.)
With StreamingQueryProgress you can access the sources property with SourceProgress entries that give you what you want.
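A minimal sketch of such a listener (printing is just a placeholder; you could persist the JSON offsets to a transactional store instead):
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Each SourceProgress carries the trigger's start/end offsets as JSON strings
    event.progress.sources.foreach { source =>
      println(s"${source.description}: startOffset=${source.startOffset} endOffset=${source.endOffset}")
    }
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})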
Relevant Spark DEV mailing list discussion thread is here.
Summary from it:
Spark Streaming will support getting offsets in future versions (> 2.2.0). JIRA ticket to follow - https://issues-test.apache.org/jira/browse/SPARK-18258
For Spark <= 2.2.0, you can get offsets for the given batch by reading the JSON from the checkpoint directory (the API is not stable, so be cautious):
import org.apache.hadoop.fs.Path
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.execution.streaming.OffsetSeqLog

val checkpointRoot = ... // read 'checkpointLocation' from the custom sink params
val checkpointDir = new Path(new Path(checkpointRoot), "offsets").toUri.toString
val offsetSeqLog = new OffsetSeqLog(sparkSession, checkpointDir)

// One Map[TopicPartition, Long] per Kafka source in the query
val endOffsets: Seq[Map[TopicPartition, Long]] =
  offsetSeqLog.get(batchId).toSeq.flatMap { offsetSeq =>
    offsetSeq.offsets.flatten.map(offset => JsonUtilsWrapper.jsonToOffsets(offset.json))
  }
import org.apache.kafka.common.TopicPartition

/**
 * Hack to access a private API.
 * Put this object into the org.apache.spark.sql.kafka010 package so it can see JsonUtils.
 */
object JsonUtilsWrapper {
  def offsetsToJson(partitionOffsets: Map[TopicPartition, Long]): String =
    JsonUtils.partitionOffsets(partitionOffsets)

  def jsonToOffsets(str: String): Map[TopicPartition, Long] =
    JsonUtils.partitionOffsets(str)
}
This endOffsets value will contain the end ("until") offset for each topic/partition.
Getting the start offsets is problematic, because you have to read the 'commits' checkpoint dir. But usually you don't care about start offsets, because storing end offsets is enough for a reliable Spark job re-start.
Please note that you have to store the processed batch id in your storage as well. Spark can re-run a failed batch with the same batch id in some cases, so make sure to initialize a custom Sink with the latest processed batch id (which you should read from external storage) and ignore any batch with id < latestProcessedBatchId. By the way, the batch id is not unique across queries, so you have to store the batch id for each query separately.
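A sketch of that guard inside a custom Sink (the Sink interface is an internal Spark API; OffsetStore and its methods are hypothetical stand-ins for your external transactional storage):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical offset store; replace with real transactional storage (e.g. JDBC)
object OffsetStore {
  def readLatestBatchId(queryName: String): Long = ???
  def writeBatch(queryName: String, batchId: Long, data: DataFrame): Unit = ???
}

class IdempotentSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val latestProcessedBatchId = OffsetStore.readLatestBatchId("my-query") // hypothetical helper
    if (batchId < latestProcessedBatchId) return // batch was already handled, skip the replay
    // write the data and its offsets, then persist batchId, ideally in one transaction
    OffsetStore.writeBatch("my-query", batchId, data) // hypothetical helper
  }
}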
A streaming Dataset with a Kafka source has offset as one of its fields. You can simply select all offsets in the query and save them to a JDBC sink.
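A sketch of that idea (topic, connection details and table name are assumptions; foreachBatch requires Spark 2.4+): the Kafka source exposes topic/partition/offset columns, so they can be written out alongside the data.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

kafkaDf
  .selectExpr("topic", "partition", "offset", "CAST(value AS STRING) AS value")
  .writeStream
  .option("checkpointLocation", "/checkpoints/offsets-to-jdbc")
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // Persist the per-record offsets (and/or the data itself) to the transactional store
    batch.select("topic", "partition", "offset")
      .write
      .mode("append")
      .jdbc("jdbc:postgresql://db:5432/offsets", "kafka_offsets", new java.util.Properties())
  }
  .start()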

Functionality and excution of queueStream in SparkStreaming?

What is the functionality of the queueStream function in Spark StreamingContext? According to my understanding it is a queue which queues the incoming DStream. If that is the case, how is it handled in a cluster with many nodes? Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster? How does this queueStream work in a cluster setup?
I have read the explanation below in the Spark Streaming documentation (https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me to understand it.
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
val myQueueRDD = scala.collection.mutable.Queue[RDD[MyObject]]()
val myStream = ssc.queueStream(myQueueRDD)

for (count <- 1 to 100) {
  val randomData = generateData()                    // generate random data
  val rdd = ssc.sparkContext.parallelize(randomData) // create the RDD of the random data
  myQueueRDD += rdd                                  // add the RDD to the queue
}

myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))
How will the above part of the code get executed in the Spark streaming context with respect to partitions on different nodes?
QueueInputDStream is intended for testing. It uses standard scala.collection.mutable.Queue to store RDDs which imitate incoming batches.
Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster?
No. There is only one copy of the queue and all data distribution is handled by RDDs. The compute logic is very simple: a dequeue (oneAtATime set to true) or a union of the current queue (oneAtATime set to false) at each tick. This applies to DStreams in general: each stream is just a sequence of RDDs, and the RDDs provide the data distribution mechanism.
While it still follows the InputDStream API, conceptually it is just a local collection from which you take elements every batchDuration.
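For example (reusing the queue from the question's code), the union behaviour can be requested explicitly:
// With oneAtATime = false every batch interval unions all RDDs currently in the queue
// instead of dequeueing a single one.
val unionStream = ssc.queueStream(myQueueRDD, oneAtATime = false)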

Spark Streaming shuffle data size on disk keeps increasing

I have a basic Spark Streaming app that reads logs from Kafka, joins them with a batch RDD and saves the result to Cassandra.
val users = sparkStmContext.cassandraTable[User](usersKeyspace, usersTable)

KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    sparkStmContext,
    Map(("zookeeper.connect", zkConnectionString), ("group.id", groupId)),
    Map(TOPIC -> 1),
    StorageLevel.MEMORY_ONLY_SER)
  .transform(rdd => StreamTransformations.joinLogsWithUsers(rdd, users)) // simple join of the users batch RDD with the stream RDD
  .saveToCassandra("users", "user_logs")
When running this, though, I notice that some of the shuffle data in "spark.local.dir" is not being removed and the directory size keeps growing until I run out of disk space.
Spark is supposed to take care of deleting this data (see the last comment from TD).
I set "spark.cleaner.ttl" but this had no effect.
Is this happening because I create the users RDD outside of the stream and use lazy loading to reload it every streaming window (as the user table can be updated by another process), and hence it never goes out of scope?
Is there a better pattern to use for this use case, as any docs I have seen take the same approach as me?

Spark Streaming not distributing task to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is my sample code, which demonstrates the process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream             // batch of 500 ms as I would like to have 1 sec latency
val filteredDStream = inDStream.filter         // filtering unwanted tuples
val keyDStream = filteredDStream.map           // converting to a pair DStream
val stateStream = keyDStream.updateStateByKey  // updating state for history
stateStream.checkpoint(Milliseconds(2500))     // to remove long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with the input stream for further processing
val alertStream = withHistory.filter           // decision taken by comparing history state and current tuple data
alertStream.foreach                            // notification to other systems
My problem is that Spark is not distributing this state RDD to multiple nodes and not distributing tasks to the other node, causing high latency in response; my input load is around 100,000 tuples per second.
I have tried the things below but nothing is working:
1) setting spark.locality.wait to 1 sec
2) reducing the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even when it goes beyond the memory limit of the first node (m1) where the driver is also running
3) increasing spark.streaming.concurrentJobs from 1 (default) to 3
4) I have checked in the streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1.
If I run SparkPi 100000 then Spark is able to utilize another node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed is that even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows Memory Serialized 1x Replicated.
Spark will not distribute stream data across the cluster automatically because it tends to make full use of data locality (launching a task where its data lies is better; this is the default configuration). But you can use repartition to distribute the stream data and improve the parallelism. You can refer to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
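For example, a minimal sketch reusing the question's inDStream (8 is an assumed partition count, e.g. 2 nodes x 4 cores):
// Explicitly repartition the received data so downstream stages can run on both nodes
val distributedDStream = inDStream.repartition(8)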
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI, not the master URI.
By default the value of the spark.default.parallelism property is "Local mode", so all the tasks will be executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
