Functionality and execution of queueStream in Spark Streaming? - apache-spark

What is the functionality of the queueStream function in Spark StreamingContext? According to my understanding, it is a queue that queues the incoming DStream. If that is the case, how is it handled in a cluster with many nodes? Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster? How does this queueStream work in a cluster setup?
I have read the explanation below in the [Spark Streaming documentation](https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me understand it.
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
val myQueueRDD = scala.collection.mutable.Queue[RDD[MyObject]]()
val myStream = ssc.queueStream(myQueueRDD)

for (count <- 1 to 100) {
  val randomData = generateData()                    // generate random data
  val rdd = ssc.sparkContext.parallelize(randomData) // create an RDD from the random data
  myQueueRDD += rdd                                  // add the RDD to the queue
}

myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))
How will the above code get executed in the Spark streaming context with respect to partitions on different nodes?

QueueInputDStream is intended for testing. It uses a standard scala.collection.mutable.Queue to store the RDDs that imitate incoming batches.
Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster?
No. There is only one copy of the queue, and all data distribution is handled by the RDDs. The compute logic is very simple: either a dequeue (oneAtATime set to true) or a union of the current queue (oneAtATime set to false) at each tick. This applies to DStreams in general - each stream is just a sequence of RDDs, and the RDDs provide the data distribution mechanism.
While it still follows the InputDStream API, conceptually it is just a local collection from which you take elements every batchDuration.
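To make this concrete, here is a minimal, self-contained sketch (the object name, data values, partition counts, and local master setting are my own choices, not from the question). The queue is a plain driver-side collection; each queued RDD already carries partitions, and those partitions are what get distributed across the cluster:

import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamSketch").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // The queue is a plain local collection that exists only on the driver.
    val rddQueue = Queue[RDD[Int]]()

    // oneAtATime = true: compute() dequeues exactly one RDD per batch interval.
    val stream = ssc.queueStream(rddQueue, oneAtATime = true)

    // Each dequeued RDD already has its own partitions, so this runs one task per partition on the executors.
    stream.foreachRDD { rdd =>
      println(s"batch with ${rdd.getNumPartitions} partitions, sum = ${rdd.sum()}")
    }

    ssc.start()
    for (_ <- 1 to 5) {
      rddQueue += ssc.sparkContext.parallelize(1 to 1000, numSlices = 4) // 4 partitions per queued RDD
      Thread.sleep(1000)
    }
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}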

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one sink in a streaming query (I keep repeating this to myself to remember it better, as I seem to have been caught out multiple times with Spark Structured Streaming, Kafka Streams, and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (to avoid sharing data, for which the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0), so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source DataFrame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select(col("value").cast("string").as("value")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
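For illustration, here is a minimal end-to-end sketch of that layout (the broker address, topic name, filter predicates, and checkpoint paths are placeholder assumptions, not from the question). As noted above, each query that is started with its own sink still runs with its own unique consumer group id:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("ParallelQueriesSketch").getOrCreate()
import spark.implicits._

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .load()

val strDf = df.select(col("value").cast("string").as("value"))

// Two sinks => two streaming queries => two consumers, each with a unique group id.
val ordersQuery = strDf.filter($"value".contains("order"))      // assumed filter
  .writeStream.format("console")
  .option("checkpointLocation", "/tmp/chk-orders")              // assumed path
  .start()

val paymentsQuery = strDf.filter($"value".contains("payment"))  // assumed filter
  .writeStream.format("console")
  .option("checkpointLocation", "/tmp/chk-payments")            // assumed path
  .start()

spark.streams.awaitAnyTermination()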

Kafka Partition+Spark Streaming Context

Scenario - I have 1 topic with 2 partitions that hold different data set collections, say A and B. I am aware that the DStream can consume messages at the partition level and at the topic level.
Query - Can we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and later filter the partition-level data? I am concerned about the performance impact of increasing the number of streaming contexts.
Quoting from the documentation.
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Therefore, if you are using a direct-stream-based Spark Streaming consumer, it should handle the parallelism.
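To illustrate the single-context option from the question, here is a hedged sketch (the broker address, group id, topic name, and the partition-to-collection mapping are all assumptions) that consumes the whole topic with one direct stream and filters by Kafka partition afterwards:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("PartitionFilterSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",   // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "partition-filter-sketch"    // assumed group id
)

// One direct stream for the whole topic: one RDD partition per Kafka partition.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

// Filter by Kafka partition afterwards instead of running two streaming contexts.
val dataSetA = stream.filter(_.partition == 0) // assumed: collection A lives in partition 0
val dataSetB = stream.filter(_.partition == 1) // assumed: collection B lives in partition 1

dataSetA.print()
dataSetB.print()
ssc.start()
ssc.awaitTermination()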

How is spark.streaming.blockInterval related to RDD partitions?

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?
Quoting Spark Streaming 2.2.0 documentation:
For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.
The number of blocks is determined by the block interval, and we can also define the number of RDD partitions. So, as I see it, they cannot be the same thing. What is the difference between them?
spark.streaming.blockInterval: the interval at which data received by Spark Streaming receivers is chunked into blocks of data before being stored in Spark. This applies when using the receiver-based approach - Receiver-based Approach.
KafkaUtils.createDirectStream() does not use a receiver, so with that DStream API, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume - Direct Approach (No Receivers).
That means the block interval configuration has no effect when using the direct approach.
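For the receiver-based case, a rough sketch (the host, port, and interval values are assumed) of how the two settings interact - the number of blocks, and hence tasks, per batch is approximately the batch interval divided by the block interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalSketch")
  .setMaster("local[2]")                         // at least 2 threads: one for the receiver, one for processing
  .set("spark.streaming.blockInterval", "200ms") // each receiver cuts a new block every 200 ms

// 2 s batch / 200 ms block => roughly 10 blocks (and therefore 10 map tasks) per receiver per batch.
val ssc = new StreamingContext(conf, Seconds(2))

// Receiver-based source, so blockInterval applies here; it is ignored by createDirectStream.
val lines = ssc.socketTextStream("localhost", 9999) // assumed host and port
lines.foreachRDD(rdd => println(s"partitions in this batch: ${rdd.getNumPartitions}"))

ssc.start()
ssc.awaitTermination()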

spark transform method behaviour in multiple partition

I am using Kafka Streaming to read data from a Kafka topic, and I want to join every RDD that I get in the stream to an existing RDD. So I think using "transform" is the best option (unless anyone disagrees and suggests a better approach).
I read the following example of the "transform" method on DStreams in Spark:
val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
  ...
}
But let's say I have 3 partitions in the Kafka topic, and I invoke 3 consumers to read from them. Now, this transform method will be called in three separate threads in parallel.
I am not sure whether joining the RDDs in this case will be thread-safe and whether it could result in data loss (considering that RDDs are immutable).
Also, if it is thread-safe, wouldn't the performance be low, since we are creating so many RDDs and then joining them?
Can anybody suggest?
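For reference, here is a self-contained version of the transform-and-join pattern from the docs snippet above; the source, reference data, and filter predicate are made-up placeholders, and this is only a sketch of the pattern, not an answer about the Kafka setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("TransformJoinSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Static reference RDD, built once on the driver and reused for every batch.
val spamInfoRDD = ssc.sparkContext.parallelize(Seq(("spammer", true), ("regular-user", false)))

val wordCounts = ssc.socketTextStream("localhost", 9999) // assumed source
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// The transform closure is invoked on the driver once per batch to build the RDD lineage;
// the join itself executes as distributed tasks over the partitions of both RDDs.
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter { case (_, (_, isSpam)) => !isSpam }
}

cleanedDStream.print()
ssc.start()
ssc.awaitTermination()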

Data distribution in Apache Spark

I'm new to Spark and have a general question. As far as I know, the whole file must be available on all worker nodes to be processed. If so, how do the workers know which partition they should read? The driver controls the partitions, but how does the driver tell them which partition to read?
Each RDD is divided into multiple partitions. To compute each partition, Spark generates a task and assigns it to a worker node. When the driver sends a task to the worker, it also specifies the PartitionID of that task.
The worker then executes the task by chaining the RDD's iterators all the way back to the InputRDD, passing along the PartitionID. The InputRDD determines which part of the input corresponds to the specified partition id and returns that data.
rddIter.next -> parentRDDIter.next -> grandParentRDDIter.next -> ... -> InputRDDIter.next
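A small sketch of this idea (assuming an existing SparkContext sc and a made-up input path): each task receives its partition id from the driver, and its iterator yields only the records of that one split rather than the whole file:

// assumed path and partition count
val rdd = sc.textFile("hdfs:///data/input.txt", minPartitions = 4)

rdd.mapPartitionsWithIndex { (partitionId, iter) =>
  // partitionId is the id the driver put into this task; iter only yields the
  // records of that one split of the input, not the whole file.
  Iterator(s"partition $partitionId holds ${iter.size} records")
}.collect().foreach(println)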
Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
https://github.com/jaceklaskowski/mastering-apache-spark-book
