How is spark.streaming.blockInterval related to RDD partitions? - apache-spark

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?
Quoting Spark Streaming 2.2.0 documentation:
For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.
Number of blocks are determined according to block interval. And also we can define number of rdd partitions. So as I think, they cannot be same. What is the different between them?

spark.streaming.blockInterval: Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. This is when using receiver bases approach - Receiver-based Approach
And KafkaUtils.createDirectStream() do not use receiver, hence with DStream API, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume. - Direct Approach (No Receivers)
That means block interval configuration is of no use in DStream API.

Related

Can Apache Spark repartition the data received from single Kafka partition

We have a Kafka topic with multiple partitions and the load is uneven/skewed in each partition. We can't change the partitioning strategy for Kafka.
I am looking for some ways to repartition the data so that the load is equally processed by all the nodes present within the Apache Spark cluster.
Is it possible to repartition the data across all the nodes received from a single Kafka partition?
If yes, is there a way we can do it efficiently as we have to maintain the state stores also while aggregating the data

Kafka Partition+Spark Streaming Context

Scenario-I have 1 topic with 2 partitions with different data set collections say A,B.I am aware that the the dstream can consume the messages at the partition level and the topic level.
Query-Can we use two different streaming contexts for the each partition or a single streaming context for the entire topic and later filter the partition level data?I am concerned about the performance on increasing the no of streaming contexts.
Quoting from the documentation.
Simplified Parallelism: No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many
RDD partitions as there are Kafka partitions to consume, which will
all read data from Kafka in parallel. So there is a one-to-one mapping
between Kafka and RDD partitions, which is easier to understand and
tune.
Therefore, if you are using Direct Stream based Spark Streaming consumer it should handle the parallelism.

is spark kafka-stream-reader caching data

I found this is a good question to ask, I might be able to find answer in the spark-kafka-streaming source code, I will do that if no one could answer this.
imagine scenario like this:
val dstream = ...
dstream.foreachRDD(
rdd=>
rdd.count()
rdd.collect()
)
in the example code above, as we can see we are getting micro-batches from dstream and for each batch we are triggering 2 actions.
count() how many rows
collect() all the rows
according to Spark's lazy eval behaviour, both actions will trace back to the origin of the data source(which is kafka topic), and also since we don't have any persist() or wide transformations, there is no way in our code logic that would make spark cache the data it have read from kafka.
so here is the question. Will spark read from kafka twice or just once? this is very perf related since reading from kafka involves netIO and potentially puts more pressure on the kafka brokers. so if spark-kafka-streaming lib won't cache it, we should definitely cache()/persist() it before multi-actions.
any discussions are welcome. thanks.
EDIT:
just found some docs on spark official website, looks like executor receivers are caching the data. but I don't know if this is for separate receivers only. because I read that spark kafka streaming lib doesn't use separate receivers, it receives data and process the data on the same core.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
Input data: By default, the input data received through Receivers is stored in the executors’ memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads – the receiver must deserialize the received data and re-serialize it using Spark’s serialization format.
according to official docs from Spark:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
Input data: By default, the input data received through Receivers is stored in the executors’ memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads – the receiver must deserialize the received data and re-serialize it using Spark’s serialization format.
There is no implicit caching when working with DStreams so unless you cache explicitly, every evaluation will hit Kafka brokers.
If you evaluate multiple times, and brokers are not co-located with Spark nodes, you should definitely consider caching.

Functionality and excution of queueStream in SparkStreaming?

What is the functionality of the queueStream function in Spark StreamingContext. According to my understanding it is a queue which queues the incoming DStream. If that is the case then how it is handled in the cluster with many node. Does each node will have this queueStream and the DStream is partitioned among all the nodes in the cluster? How does this queueStream work in cluster setup?
I have read below explanation in the [Spark Streaming documentation][https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me to understand it.
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
val myQueueRDD= scala.collection.mutable.Queue[RDD[MyObject]]()
val myStream= ssc.queueStream(myQueueRDD)
for(count <- 1 to 100) {
val randomData= generateData() //Generated random data
val rdd= ssc.sparkContext.parallelize(randomData) //Creates the rdd of the random data.
myQueueRDD+= rdd //Addes data to queue.
}
myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))
How the above part of the code will get executed in the spark streaming context with respect to partitions on different nodes.
QueueInputDStream is intended for testing. It uses standard scala.collection.mutable.Queue to store RDDs which imitate incoming batches.
Does each node will have this queueStream and the DStream is partitioned among all the nodes in the cluster
No. There is only one copy of the queue and all data distribution is handled by RDDs. compute logic is very simple with dequeue (oneAtATime set to true) or union of the current queue (oneAtATime set to false) at each tick. This applies to DStreams in general - each stream is just a sequence of RDDs, which provide data distribution mechanism.
While it still follows InputDStream API, conceptually it is just a local collection from which you take elements every batchDuration.

What is use of "spark.streaming.blockInterval" in Spark Streaming DirectAPI

I want to understand, What role "spark.streaming.blockInterval" plays in Spark Streaming DirectAPI, as per my understanding "spark.streaming.blockInterval" is used for calculating partitions i.e. #partitions = (receivers x* batchInterval) /blockInterval, but in DirectAPI spark streaming partitions is equal to no. of kafka partitions.
How "spark.streaming.blockInterval" is used in DirectAPI ?
spark.streaming.blockInterval :
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
And KafkaUtils.createDirectStream() do not use receiver.
With directStream, Spark Streaming will create as many RDD partitions
as there are Kafka partitions to consume

Resources