Is the Spark Kafka stream reader caching data? - apache-spark

I think this is a good question to ask. I might be able to find the answer in the spark-kafka-streaming source code, and I will do that if no one can answer it here.
Imagine a scenario like this:
val dstream = ...
dstream.foreachRDD { rdd =>
  rdd.count()
  rdd.collect()
}
In the example code above, we are getting micro-batches from the DStream, and for each batch we trigger two actions:
count() how many rows there are
collect() all the rows
According to Spark's lazy evaluation behaviour, both actions will trace back to the origin of the data source (the Kafka topic), and since we don't have any persist() call or wide transformation, nothing in our code logic would make Spark cache the data it has read from Kafka.
So here is the question: will Spark read from Kafka twice, or just once? This is very performance-relevant, since reading from Kafka involves network I/O and potentially puts more pressure on the Kafka brokers. So if the spark-kafka-streaming lib won't cache the data, we should definitely cache()/persist() it before running multiple actions.
Any discussion is welcome. Thanks.
EDIT:
I just found some docs on the official Spark website; it looks like executor receivers are caching the data. But I don't know whether this applies only to separate receivers, because I have read that the Spark Kafka streaming lib doesn't use separate receivers: it receives and processes the data on the same core.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
Input data: By default, the input data received through Receivers is stored in the executors’ memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads – the receiver must deserialize the received data and re-serialize it using Spark’s serialization format.

According to the official Spark docs (the data-serialization passage quoted above), input data received through Receivers is stored in the executors' memory with StorageLevel.MEMORY_AND_DISK_SER_2 by default.

There is no implicit caching when working with DStreams, so unless you cache explicitly, every evaluation will hit the Kafka brokers.
If you evaluate multiple times, and brokers are not co-located with Spark nodes, you should definitely consider caching.
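
For example, here is a minimal sketch of caching explicitly before running multiple actions. It assumes the spark-streaming-kafka-0-10 integration, and the broker, group and topic names are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object CacheBeforeMultipleActions {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("cache-before-multiple-actions"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",          // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "cache-demo")               // placeholder group

    val dstream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    dstream.foreachRDD { rdd =>
      // ConsumerRecord is not serializable, so map to the values before caching.
      val values = rdd.map(_.value())
      values.persist(StorageLevel.MEMORY_AND_DISK_SER)

      val n    = values.count()   // first action: triggers the actual read from Kafka
      val rows = values.collect() // second action: served from the cached blocks
      println(s"batch had $n rows; first: ${rows.headOption.getOrElse("<empty>")}")

      values.unpersist()          // free the cached blocks before the next micro-batch
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Without the persist() call, both actions would independently recompute the lineage all the way back to the Kafka read.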

Related

Spark Direct Stream Kafka order of events

I have a question regarding reading data with Spark Direct Streaming (Spark 1.6) from Kafka 0.9 and saving it in HBase.
I am trying to update specific row keys in an HBase table as they are received from Kafka, and I need to ensure the order of events is kept (data received at t0 is saved in HBase before data received at t1).
The row key represents a UUID, which is also the key of the message in Kafka, so at the Kafka level I am sure that the events corresponding to a specific UUID are ordered at the partition level.
My problem begins when I start reading using Spark.
Using the direct stream approach, each executor will read from one partition. I am not doing any shuffling of the data (just parse and save), so my events won't get mixed up across the RDD, but I am worried that when the executor reads the partition it won't maintain the order, so I will end up with incorrect data in HBase when I save it.
How can I ensure that the order is kept at the executor level, especially if I use multiple cores in one executor (which, from my understanding, results in multiple threads)?
I think I can also live with 1 core if that fixes the issue, together with turning off speculative execution, enabling Spark's back-pressure optimization and keeping the maximum task retries at 1.
I have also thought about implementing a sort on the events at the Spark partition level using the Kafka offset (sketched below).
Any advice?
Thanks a lot in advance!
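
As a hedged sketch of that last idea (not a confirmed answer): with the direct approach each RDD partition maps to one Kafka TopicPartition, so the records of a partition can be sorted explicitly by offset before writing. It assumes the spark-streaming-kafka-0-10 style ConsumerRecord API, and saveToHBase is a hypothetical helper, not a real library call:

// stream is assumed to be a direct stream of ConsumerRecord[String, String],
// e.g. created with KafkaUtils.createDirectStream as in the sketch further up.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One RDD partition == one Kafka TopicPartition with the direct approach,
    // so sorting by offset restores the per-partition ordering from Kafka.
    // Note: toList materializes the partition in memory, acceptable only for small batches.
    records.toList
      .sortBy(_.offset())
      .foreach(record => saveToHBase(record.key(), record.value())) // hypothetical helper
  }
}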

How is spark.streaming.blockInterval related to RDD partitions?

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?
Quoting Spark Streaming 2.2.0 documentation:
For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.
The number of blocks is determined by the block interval, and we can also define the number of RDD partitions. So, as I see it, they cannot be the same thing. What is the difference between them?
spark.streaming.blockInterval: the interval at which data received by Spark Streaming receivers is chunked into blocks of data before being stored in Spark. This applies when using the receiver-based approach - Receiver-based Approach
KafkaUtils.createDirectStream() does not use a receiver; with that DStream API, Spark Streaming creates as many RDD partitions as there are Kafka partitions to consume. - Direct Approach (No Receivers)
That means the block interval configuration is of no use with the direct stream API.
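
A small illustrative sketch of where the two knobs live (the interval, app name and comments are placeholders and examples, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlockIntervalVsPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("block-interval-vs-partitions")
      // Only meaningful for receiver-based streams: a 10s batch with a 200ms
      // block interval produces roughly 10s / 200ms = 50 blocks, i.e. about
      // 50 tasks for the first map-like stage of each batch.
      .set("spark.streaming.blockInterval", "200ms")

    val ssc = new StreamingContext(conf, Seconds(10))

    // With KafkaUtils.createDirectStream (no receiver) the setting above is
    // ignored: each batch RDD has exactly one partition per Kafka partition.
  }
}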

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition and, ultimately, a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing and task allocation relate? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous API, that is, the receiver-based one, a task was dedicated to receiving the data, meaning a number of the task slots of your executors were reserved for data ingestion and the others were there for processing. This had an impact on how you sized your executors in terms of cores.
Take for example the advice on how to launch Spark Streaming with --master local. Everyone will tell you that for Spark Streaming you should use local[2] at minimum, because one of the cores will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.
So if the answer is that in this case the task does both the reading and the processing at once, the question that follows is: is that really smart? I mean, this sounds like it should be asynchronous: we want to be able to fetch while we process, so that for the next processing step the data is already there. However, if there is only one core, or more precisely one task, to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would have remained somewhat the same, in the sense that a task would be launched to read but the processing would be done in another task. That would mean that, if the processing task is not done yet, we can still keep reading, up to a certain memory limit.
Can someone outline clearly what exactly is going on here?
EDIT1
We don't even need that memory limit control, just the mere ability to fetch while the processing is going on and to stop right there. In other words, the two processes should be asynchronous, and the limit is simply to be one step ahead. If somehow this is not happening, I would find it extremely strange that Spark implements something that breaks performance like that.
Does the Spark task corresponding to a Kafka partition both read and process the data?
The relationship is very close to what you describe, if by talking about a task we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
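
As a rough illustration of that flow (reusing a direct stream like the dstream created in the caching sketch near the top; TaskContext is only used to print the partition id), everything inside foreachPartition below, including the Kafka fetch that feeds the iterator, runs in the single task assigned to that TopicPartition:

dstream.foreachRDD { rdd =>
  // No shuffle anywhere in this graph, so each RDD partition (= one Kafka
  // TopicPartition) is both read and processed by the same task on the same executor.
  rdd.foreachPartition { records =>
    val processed = records.size
    println(s"partition ${org.apache.spark.TaskContext.getPartitionId()} " +
      s"read and processed $processed records in one task")
  }
}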

Repartition in repartitionAndSortWithinPartitions happens on driver or on worker

I am trying to understand repartitionAndSortWithinPartitions in Spark Streaming, and whether the repartitioning happens on the driver or on the workers. If it happens on the driver, does the worker wait for all the data to arrive before the sorting happens?
Like any other transformation, it is handled by the executors; data is not passed via the driver. In other words, this is the standard shuffle mechanism, and there is nothing streaming-specific here.
The destination of each record is defined by:
Its key.
Partitioner used for a given shuffle.
Number of partitions.
and data is passed directly between executor nodes.
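
For context, a minimal sketch of the transformation itself (assuming an existing SparkContext sc, e.g. in spark-shell; the same mechanics apply to an RDD obtained inside foreachRDD in a streaming job):

import org.apache.spark.HashPartitioner

// A pair RDD keyed by an Int id; values are arbitrary payloads.
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (3, "d"), (2, "b")))

// Shuffle into 2 partitions by key hash and sort each partition by key as the
// shuffle output is written; this runs entirely on the executors, not the driver.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// Print the contents of each resulting partition.
sorted.glom().collect().foreach(partition => println(partition.mkString(", ")))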
From the comments it looks like you're more interested in the Spark Streaming architecture. If that's the case, you should take a look at Diving into Apache Spark Streaming’s Execution Model. To give you some overview, there are two different types of streams:
Receiver-based with a receiver node per stream.
Direct (without receiver) where only metadata is assigned to executors but data is fetched directly.

spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues like
Could not compute split, block input-0-1464774108087 not found
and I was wondering if there is a way to reprocess those batches on the side without messing with the currently running application. Just in general; it does not have to be the same exact exception.
Thanks in advance
Pradeep
This may happen in cases where your data ingestion rate into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that when memory runs low Spark can spill data to disk. This should prevent your error.
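
For example, with the receiver-based Kafka stream (which is what produces those input-* blocks), the storage level can be passed explicitly when the stream is created; the ZooKeeper quorum, group id and topic name below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverStorageLevel {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("receiver-storage-level"), Seconds(10))

    // Spill received blocks to disk instead of dropping them when memory is tight.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",              // ZooKeeper quorum (placeholder)
      "failed-batches-demo",       // consumer group id (placeholder)
      Map("events" -> 1),          // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK_SER)

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}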
Also, I don't think this error means that any data was lost while processing; rather, the input block added by your block manager just timed out before processing started.
Check the similar question on the Spark User list.
Edit:
Data is not lost; it was just not present where the task was expecting it to be. As per the Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.
