Kafka Partition+Spark Streaming Context - apache-spark

Scenario - I have one topic with two partitions holding different data set collections, say A and B. I am aware that the DStream can consume messages at the partition level and at the topic level.
Query - Can we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and later filter the partition-level data? I am concerned about the performance impact of increasing the number of streaming contexts.

Quoting from the documentation.
Simplified Parallelism: No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many
RDD partitions as there are Kafka partitions to consume, which will
all read data from Kafka in parallel. So there is a one-to-one mapping
between Kafka and RDD partitions, which is easier to understand and
tune.
Therefore, if you are using a Direct Stream-based Spark Streaming consumer, it should handle the parallelism for you.
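To make this concrete, here is a minimal sketch of the second option (one StreamingContext for the whole topic, filtering per partition afterwards), assuming the spark-streaming-kafka-0-10 direct API; the broker address, topic name, and group ID are placeholders, not from the question:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("single-context-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical broker address, group ID and topic name.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One direct stream for the whole topic: Spark creates one RDD partition
// per Kafka partition, so both partitions are read in parallel.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

// Filter the partition-level data afterwards (say A on partition 0, B on partition 1).
stream.foreachRDD { rdd =>
  val a = rdd.filter(_.partition() == 0)
  val b = rdd.filter(_.partition() == 1)
  // process a and b ...
}

ssc.start()
ssc.awaitTermination()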

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one sink in a streaming query (I'm repeating it to myself to remember it better, as I seem to have been caught out multiple times with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not sharing data, for which the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so that the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select('value.cast("string").as("value")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
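For illustration, a hedged sketch of how those elided lines could be filled in; the broker address, topic name, filter predicates and console sinks are assumptions, not from the original answer. Each start() below launches its own streaming query (each with its own unique group ID, per the Spark code quoted above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("fan-out-example").getOrCreate()
import spark.implicits._

// Single source definition; broker address and topic name are hypothetical.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .load()

val strDf = df.select('value.cast("string").as("value"))

// Two derived frames, filtered differently (example predicates).
val df1 = strDf.filter('value.startsWith("A"))
val df2 = strDf.filter('value.startsWith("B"))

// Each sink starts its own streaming query on its own daemon thread.
val q1 = df1.writeStream.format("console").start()
val q2 = df2.writeStream.format("console").start()

spark.streams.awaitAnyTermination()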

Can Apache Spark repartition the data received from single Kafka partition

We have a Kafka topic with multiple partitions and the load is uneven/skewed in each partition. We can't change the partitioning strategy for Kafka.
I am looking for some ways to repartition the data so that the load is equally processed by all the nodes present within the Apache Spark cluster.
Is it possible to repartition the data across all the nodes received from a single Kafka partition?
If yes, is there a way we can do it efficiently, as we also have to maintain the state stores while aggregating the data?
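As a sketch only (broker address, topic name, partition count and the console sink are assumptions), one way to experiment with this in Structured Streaming is to call repartition on the frame read from Kafka before the stateful aggregation:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("repartition-example").getOrCreate()
import spark.implicits._

// Hypothetical broker and skewed topic; the source is read as-is.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "skewed-topic")
  .load()

// Redistribute records across (here) 16 Spark partitions before the stateful
// aggregation; the number is an assumption to tune for your cluster.
val counts = df
  .repartition(16)
  .select('key.cast("string").as("key"))
  .groupBy('key)
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()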

Does Spark Structured Streaming maintain the order of Kafka messages?

I have a Spark Structured Streaming application that consumes messages from multiple Kafka topics and writes the results to another Kafka topic. To maintain the integrity of the data, it's imperative that the order of messages in source partitions is maintained. So if message A precedes message B in a partition, processed(A) should be written to the output topic before processed(B) (processed(A) and processed(B) will also go to the same output partition, as the same hash string is used).
Does Spark Structured Streaming guarantee this?

How is spark.streaming.blockInterval related to RDD partitions?

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?
Quoting Spark Streaming 2.2.0 documentation:
For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.
The number of blocks is determined by the block interval, and we can also define the number of RDD partitions. So, as I understand it, they cannot be the same thing. What is the difference between them?
spark.streaming.blockInterval: Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. This applies when using the receiver-based approach - Receiver-based Approach
And KafkaUtils.createDirectStream() does not use a receiver; hence, with the direct DStream API, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume. - Direct Approach (No Receivers)
That means the block interval configuration has no effect with the direct approach.
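As an illustration of where spark.streaming.blockInterval does apply, here is a sketch using a receiver-based source; the host, port, and interval values are assumptions:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// blockInterval only matters for receiver-based input streams.
// With a 2s batch interval and a 200ms block interval, each batch is split
// into roughly 2000 / 200 = 10 blocks, i.e. about 10 RDD partitions per receiver.
val conf = new SparkConf()
  .setAppName("block-interval-example")
  .set("spark.streaming.blockInterval", "200ms")

val ssc = new StreamingContext(conf, Seconds(2))

// socketTextStream is a receiver-based source (hostname/port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  // Observe how many partitions (blocks) each batch produced.
  println(s"partitions = ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()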

What is use of "spark.streaming.blockInterval" in Spark Streaming DirectAPI

I want to understand what role "spark.streaming.blockInterval" plays in the Spark Streaming Direct API. As per my understanding, "spark.streaming.blockInterval" is used for calculating partitions, i.e. #partitions = (receivers × batchInterval) / blockInterval, but in the Direct API the number of Spark Streaming partitions is equal to the number of Kafka partitions.
How is "spark.streaming.blockInterval" used in the Direct API?
spark.streaming.blockInterval :
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
And KafkaUtils.createDirectStream() does not use a receiver.
With directStream, Spark Streaming will create as many RDD partitions
as there are Kafka partitions to consume
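To see that the block interval has no effect with the direct approach, one can check the partition count of each batch RDD, which matches the number of Kafka partitions of the subscribed topic. A sketch, with hypothetical broker, topic and group names:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf()
  .setAppName("direct-partitions-example")
  // Changing this has no effect on the direct stream below.
  .set("spark.streaming.blockInterval", "1s")

val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "direct-example-group"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  // Equals the number of Kafka partitions of "my-topic",
  // regardless of batch or block interval.
  println(s"partitions = ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()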
