What is the use of "spark.streaming.blockInterval" in the Spark Streaming Direct API - apache-spark

I want to understand what role "spark.streaming.blockInterval" plays in the Spark Streaming Direct API. As per my understanding, "spark.streaming.blockInterval" is used for calculating partitions, i.e. #partitions = (receivers x batchInterval) / blockInterval, but with the Direct API the number of Spark Streaming partitions is equal to the number of Kafka partitions.
How is "spark.streaming.blockInterval" used in the Direct API?

spark.streaming.blockInterval:
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
And KafkaUtils.createDirectStream() does not use a receiver.
With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume.
So spark.streaming.blockInterval has no effect with the Direct API; it only applies to receiver-based input streams.
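As a minimal sketch of the direct approach (using the spark-streaming-kafka-0-10 API; the broker address, topic name, and group id below are placeholders), the per-batch RDD partition count simply mirrors the topic's Kafka partition count, and no receiver or block interval is involved:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectStreamPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-demo").setMaster("local[*]")
    val ssc  = new StreamingContext(conf, Seconds(10))   // batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",           // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "demo-group",                // placeholder group id
      "auto.offset.reset"  -> "latest"
    )

    // No receiver is involved here, so spark.streaming.blockInterval never comes into play.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    // Each batch RDD has exactly one partition per Kafka partition of "demo-topic".
    stream.foreachRDD { rdd =>
      println(s"partitions in this batch: ${rdd.getNumPartitions}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```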

Related

How to limit number of batches to run in Spark Structured Streaming forEachBatch?

I'm reading data from Kafka in batch fashion using readStream, then doing some transformations and writing the data using foreachBatch & writeStream.
I have a use case to hold the job for some time, so I want to limit the job to x number of batches. Is it possible to do this in Spark Structured Streaming? Specifically, Spark 2.4.8.
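The thread above does not include an answer here, but one possible approach (a sketch, not from the thread; the Kafka options, output and checkpoint paths, and batch limit are placeholders) is to count micro-batches inside foreachBatch and stop the query from the driver once the limit is reached:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.sql.{DataFrame, SparkSession}

object LimitedBatches {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-batches")
      .master("local[*]")
      .getOrCreate()

    val maxBatches   = 5                        // hypothetical limit: stop after this many micro-batches
    val batchCounter = new AtomicInteger(0)

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("subscribe", "demo-topic")                     // placeholder topic
      .load()

    // Explicit function type avoids the foreachBatch overload ambiguity on Scala 2.12.
    val process: (DataFrame, Long) => Unit = (batch, batchId) => {
      // normal per-batch transformations/writes would go here
      batch.selectExpr("CAST(value AS STRING) AS value")
        .write.mode("append").parquet("/tmp/limited-batches-out")  // placeholder sink
      batchCounter.incrementAndGet()
    }

    val query = input.writeStream
      .foreachBatch(process)
      .option("checkpointLocation", "/tmp/limited-batches-chk")    // placeholder checkpoint dir
      .start()

    // Poll from the driver and stop once the limit is reached; stopping from
    // outside foreachBatch avoids blocking the micro-batch thread.
    while (query.isActive && batchCounter.get() < maxBatches) {
      query.awaitTermination(2000)
    }
    if (query.isActive) query.stop()
    spark.stop()
  }
}
```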

Can Apache Spark repartition the data received from single Kafka partition

We have a Kafka topic with multiple partitions, and the load is uneven/skewed across partitions. We can't change the partitioning strategy for Kafka.
I am looking for some way to repartition the data so that the load is processed equally by all the nodes in the Apache Spark cluster.
Is it possible to repartition the data received from a single Kafka partition across all the nodes?
If yes, is there a way we can do it efficiently, given that we also have to maintain the state stores while aggregating the data?
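A sketch of one way this is commonly handled in Structured Streaming (not from the original thread; the broker, topic, parallelism, and checkpoint path are placeholders): call repartition() after reading from Kafka to spread records from the skewed source partitions, and note that a stateful aggregation shuffles by the grouping key anyway, so state-store load follows the key distribution rather than the Kafka partitioning:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RebalanceSkewedTopic {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rebalance-skewed-topic")
      .master("local[*]")
      .getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("subscribe", "skewed-topic")                   // placeholder topic
      .load()

    // repartition() shuffles the records read from the skewed Kafka partitions
    // across the cluster before any further stateless processing.
    val rebalanced = raw
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .repartition(32)                                        // target parallelism, tune to the cluster

    // A stateful aggregation shuffles by the grouping key anyway (into
    // spark.sql.shuffle.partitions tasks), so state-store load follows the
    // key distribution rather than the source partitioning.
    val counts = rebalanced.groupBy(col("key")).count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "/tmp/rebalance-chk")     // placeholder checkpoint dir
      .start()

    query.awaitTermination()
  }
}
```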

Kafka Partition + Spark Streaming Context

Scenario: I have 1 topic with 2 partitions holding different data set collections, say A and B. I am aware that the DStream can consume messages at the partition level and at the topic level.
Query: Can we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and later filter the partition-level data? I am concerned about the performance impact of increasing the number of streaming contexts.
Quoting from the documentation:
Simplified Parallelism: No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many
RDD partitions as there are Kafka partitions to consume, which will
all read data from Kafka in parallel. So there is a one-to-one mapping
between Kafka and RDD partitions, which is easier to understand and
tune.
Therefore, if you are using the Direct Stream based Spark Streaming consumer, a single streaming context should handle the parallelism for you.
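As an illustration of the single-context, single-stream approach (a sketch; broker, topic, and group id are placeholders), the two collections can be separated by filtering on the Kafka partition id of each ConsumerRecord:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SingleContextTwoPartitions {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("single-context").setMaster("local[*]"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",              // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "single-context-group")        // placeholder group id

    // One stream for the whole topic; both Kafka partitions are read in parallel.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("two-partition-topic"), kafkaParams))

    // Split the data set collections by Kafka partition id instead of creating two contexts.
    val collectionA = stream.filter(_.partition() == 0).map(_.value())
    val collectionB = stream.filter(_.partition() == 1).map(_.value())

    collectionA.print()
    collectionB.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```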

How is spark.streaming.blockInterval related to RDD partitions?

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?
Quoting Spark Streaming 2.2.0 documentation:
For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.
The number of blocks is determined by the block interval. We can also define the number of RDD partitions. So, as I see it, they cannot be the same. What is the difference between them?
spark.streaming.blockInterval: Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. This applies when using the receiver-based approach - Receiver-based Approach
And KafkaUtils.createDirectStream() does not use a receiver; with the direct approach, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume. - Direct Approach (No Receivers)
That means the block interval configuration is of no use with the Direct API.
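To see the block interval in action you need a receiver-based source. The sketch below (assumptions: socketTextStream as the receiver-based input, local master, placeholder host and port) shows the relationship the question asks about: each batch RDD gets roughly batchInterval / blockInterval partitions per receiver, e.g. 10 s / 500 ms = 20:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlockIntervalDemo {
  def main(args: Array[String]): Unit = {
    // blockInterval only matters for receiver-based input streams.
    val conf = new SparkConf()
      .setAppName("block-interval-demo")
      .setMaster("local[*]")
      .set("spark.streaming.blockInterval", "500ms")   // default is 200ms

    val ssc = new StreamingContext(conf, Seconds(10))  // batch interval

    // socketTextStream is a receiver-based source, so each 10 s batch is built
    // from 10 s / 500 ms = 20 blocks, i.e. ~20 partitions per receiver per batch.
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder host/port
    lines.foreachRDD(rdd => println(s"partitions: ${rdd.getNumPartitions}"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```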

Spark Kafka streaming multiple partition window processing

I am using Spark Kafka streaming with the createDirectStream approach (Direct Approach),
and my requirement is to create a window on the stream, but I have seen this in the Spark docs:
"However, be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window()."
How can I solve this, as my data is getting corrupted?
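One pattern from the Kafka integration guide is to read the offset ranges on the original, unshuffled RDDs (via transform and HasOffsetRanges) before applying window() or any other shuffling operation; the window itself does not corrupt data, it only breaks the one-to-one partition-to-Kafka mapping. A sketch, with placeholder broker, topic, group id, and window sizes:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object WindowWithOffsets {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-offsets").setMaster("local[*]"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",               // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "window-demo")                  // placeholder group id

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    // Capture offset ranges on the original, unshuffled RDDs: only there does the
    // HasOffsetRanges cast work and does the 1:1 Kafka-to-RDD-partition mapping hold.
    var offsetRanges = Array.empty[OffsetRange]
    val values = stream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(_.value())
    }

    // The window/shuffle happens afterwards; past this point the partition-to-Kafka
    // mapping no longer applies, but the records themselves are intact.
    val windowed = values.window(Seconds(30), Seconds(10))
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```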
