Can Apache Spark repartition the data received from a single Kafka partition? - apache-spark

We have a Kafka topic with multiple partitions, and the load is uneven/skewed across the partitions. We can't change the partitioning strategy on the Kafka side.
I am looking for a way to repartition the data so that the load is processed evenly by all the nodes in the Apache Spark cluster.
Is it possible to repartition the data received from a single Kafka partition across all the nodes?
If yes, is there a way to do it efficiently, given that we also have to maintain state stores while aggregating the data?
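One commonly suggested approach, shown as a minimal sketch below (it assumes spark-streaming-kafka-0-10, a placeholder broker, topic name and checkpoint path, and a simple keyed count standing in for the real stateful aggregation): read with createDirectStream, then call repartition() to spread the skewed partition across the cluster before the stateful step. The shuffle is the price of the rebalancing; because the state is keyed, it is unaffected by how the incoming records were partitioned.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object RepartitionSkewedTopic {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-skewed-topic")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoints")               // hypothetical path; required for stateful ops

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",         // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "skewed-topic-consumer",
      "auto.offset.reset"  -> "latest"
    )

    // One RDD partition per Kafka partition at this point (direct approach).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("skewed-topic"), kafkaParams))

    // Repartitioning spreads the records of the skewed Kafka partition across all
    // executors, at the cost of one shuffle per batch.
    val balanced = stream
      .map(record => (record.key, record.value))
      .repartition(ssc.sparkContext.defaultParallelism)

    // Keyed, stateful aggregation after the shuffle; the state is per key, so it is
    // independent of the physical partitioning of the input.
    val counts = balanced
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .updateStateByKey[Long]((batchCounts: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + batchCounts.sum))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}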

Related

Kafka Partition+Spark Streaming Context

Scenario: I have one topic with two partitions holding different data set collections, say A and B. I am aware that a DStream can consume messages at the partition level and at the topic level.
Query: Can we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and later filter the partition-level data? I am concerned about the performance impact of increasing the number of streaming contexts.
Quoting from the documentation.
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Therefore, if you are using a direct-stream-based Spark Streaming consumer, it should handle the parallelism for you.
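To illustrate the single-context option, here is a rough sketch (it reuses an ssc and kafkaParams set up as in the snippet further above; "my-topic" and the partition numbers are assumptions): one direct stream is created for the whole topic, and the two collections are then separated by the Kafka partition recorded on each ConsumerRecord.

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// `ssc` and `kafkaParams` as defined in the earlier sketch.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

// Each ConsumerRecord carries the Kafka partition it came from, so the two
// data set collections can be split without a second streaming context.
val collectionA = stream.filter(_.partition == 0).map(_.value)
val collectionB = stream.filter(_.partition == 1).map(_.value)

collectionA.foreachRDD(rdd => rdd.foreach(println))   // process collection A
collectionB.foreachRDD(rdd => rdd.foreach(println))   // process collection B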

How is spark.streaming.blockInterval related to RDD partitions?

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?
Quoting Spark Streaming 2.2.0 documentation:
For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.
The number of blocks is determined by the block interval, and we can also define the number of RDD partitions, so as I understand it they cannot be the same thing. What is the difference between them?
spark.streaming.blockInterval: Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. This applies when using the receiver-based approach - Receiver-based Approach
KafkaUtils.createDirectStream() does not use a receiver; with the direct DStream API, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume. - Direct Approach (No Receivers)
That means the block interval configuration has no effect with the direct DStream API.
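As a rough illustration of the difference (the numbers and the socket source are just examples): with the receiver-based approach, a 10-second batch and a 200 ms block interval yield 10 000 / 200 = 50 blocks, i.e. 50 tasks per receiver per batch, whereas a direct Kafka stream gets exactly one RDD partition per Kafka partition and ignores the setting.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("block-interval-demo")
  .set("spark.streaming.blockInterval", "200ms")   // only consulted by receiver-based streams

val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-based input: the received data is chunked every 200 ms, so each 10 s batch
// is split into roughly 50 blocks / RDD partitions per receiver.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

ssc.start()
ssc.awaitTermination()

// A KafkaUtils.createDirectStream() created on the same context would instead produce
// one RDD partition per Kafka partition, regardless of spark.streaming.blockInterval.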

How to process Kafka partitions separately and in parallel with Spark executors?

I use Spark 2.1.1.
I read messages from 2 Kafka partitions using Structured Streaming. I am submitting my application to a Spark Standalone cluster with one worker and 2 executors (2 cores each).
./bin/spark-submit \
--class MyClass \
--master spark://HOST:IP \
--deploy-mode cluster \
/home/ApplicationSpark.jar
I want the messages from each Kafka partition to be processed independently by a separate executor. What happens instead is that the executors read and .map the partition data separately, but after the mapping the unbounded table that is formed is shared and contains data from both partitions.
So when I run the structured query on the table, the query has to deal with data from both partitions (a larger amount of data).
select product_id, max(order_time), max(product_price), min(product_price)
from OrderRecords
group by window(order_time, '120 seconds'), product_id
where the Kafka topic is partitioned by product_id.
Is there any way to run the same structured query in parallel, but separately on the data from the Kafka partition to which each executor is mapped?
What happens instead is that the executors read and .map the partition data separately, but after the mapping the unbounded table that is formed is shared and contains data from both partitions. Hence when I run the structured query on the table, the query has to deal with data from both partitions (a larger amount of data).
That's the key to understanding what can be executed, and how, without causing a shuffle and sending data across partitions (possibly even over the wire).
The definitive answer depends on what your queries are. If they work on groups of records where the groups are spread across multiple topic partitions, and hence across two different Spark executors, you'd have to be extra careful with your algorithm/transformation to do the processing on the separate partitions (using only what's available within each partition) and aggregate only the results.
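For reference, here is a minimal Structured Streaming sketch of the same windowed aggregation (the broker address, topic name, and order schema are assumptions). The Kafka source also exposes a partition column, which an algorithm that wants to stay partition-local can carry along; note, however, that the groupBy below still introduces an exchange, so rows for a given product_id are pulled to one task even though they already live in a single Kafka partition.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("order-aggregation").getOrCreate()
import spark.implicits._

// Assumed shape of the JSON order messages.
val orderSchema = new StructType()
  .add("product_id", StringType)
  .add("order_time", TimestampType)
  .add("product_price", DoubleType)

val orders = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "HOST:9092")   // placeholder
  .option("subscribe", "orders")                    // placeholder topic
  .load()
  // `partition` is the Kafka partition the row came from; keep it if the algorithm needs it.
  .select($"partition", from_json($"value".cast("string"), orderSchema).as("o"))
  .select($"partition", $"o.product_id", $"o.order_time", $"o.product_price")

// Equivalent of the SQL in the question; the groupBy triggers a shuffle across executors.
val result = orders
  .withWatermark("order_time", "120 seconds")
  .groupBy(window($"order_time", "120 seconds"), $"product_id")
  .agg(max("order_time"), max("product_price"), min("product_price"))

result.writeStream.outputMode("complete").format("console").start().awaitTermination()

With only two Kafka partitions and two executors, the exchange is hard to avoid for a per-product aggregate; keeping the work truly partition-local would require aggregating within each partition first and merging the (much smaller) results afterwards, which is what the answer above describes.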

What is use of "spark.streaming.blockInterval" in Spark Streaming DirectAPI

I want to understand what role "spark.streaming.blockInterval" plays in the Spark Streaming direct API. As per my understanding, "spark.streaming.blockInterval" is used for calculating partitions, i.e. #partitions = (receivers × batchInterval) / blockInterval, but with the direct API the number of Spark Streaming partitions is equal to the number of Kafka partitions.
How is "spark.streaming.blockInterval" used in the direct API?
spark.streaming.blockInterval:
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
But KafkaUtils.createDirectStream() does not use a receiver:
With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume.
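In other words, with the direct API the setting is simply ignored; if more parallelism is needed per batch, the usual levers are adding Kafka partitions or an explicit shuffle, as in this small sketch (directStream is assumed to come from KafkaUtils.createDirectStream(), and 32 is an arbitrary value):

// Direct API: #RDD partitions == #Kafka partitions; spark.streaming.blockInterval is not consulted.
// To get more tasks per batch, either add Kafka partitions or shuffle explicitly:
val widened = directStream.repartition(32)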

Spark Kafka streaming multiple partition window processing

I am using Spark Kafka streaming with the createDirectStream approach (Direct Approach).
My requirement is also to create a window on the stream, but I have seen this in the Spark documentation:
"However, be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window()."
How can I solve this, as my data appears to get corrupted?
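One common reading of that warning is that the shuffle does not corrupt the records themselves; it only destroys the RDD-partition-to-Kafka-partition mapping, so anything that relies on that mapping (most notably the cast to HasOffsetRanges for offset tracking) must be done on the original stream, before window() or reduceByKey(). A hedged sketch follows, assuming directStream comes from createDirectStream as above and arbitrary window sizes:

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Capture offset ranges on the un-shuffled stream: only here does each RDD partition
// still correspond to exactly one Kafka partition.
var offsetRanges = Array.empty[OffsetRange]
val tagged = directStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd   // offsetRanges can then be committed in a later foreachRDD (omitted here)
}

// Windowed, keyed aggregation; the shuffle it causes is expected and does not corrupt
// the data, it only breaks the one-to-one partition mapping used above.
val windowedCounts = tagged
  .map(record => (record.key, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(5), Seconds(30))

windowedCounts.print()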

Resources