spark streaming DirectKafkaInputDStream: kafka data source can easily stress the driver node

I am building a prototype with Spark Streaming 1.5.0, using DirectKafkaInputDStream.
A simple stage that reads from Kafka via DirectKafkaInputDStream cannot handle a massive amount of messages: once the message rate reaches or exceeds a certain value, the stage takes longer than the batch interval. And that rate is much lower than I expect. (I have benchmarked my Kafka cluster separately with multiple consumer instances on different servers.)
JavaPairInputDStream<String, String> recipeDStream =
        KafkaUtils.createDirectStream(jssc,
                String.class,
                String.class,
                StringKeyDecoder.class,
                StringDecoder.class,
                kafkaParams, kafkaTopicsSet);
After reading this article, I realize that DirectKafkaInputDStream runs on the same node as the driver program. Is that true? If so, DirectKafkaInputDStream could easily be stressed, as it would read all messages on one node and then dispatch them to all executors.
That would mean JavaPairReceiverInputDStream has better performance when handling high-volume data, since receivers run on multiple executor instances.
Am I right? Can someone explain this? Thank you.

No, the direct stream only communicates from the driver to Kafka in order to find the latest available offsets. The actual messages are read only on the executors.
Switching from .createStream to .createDirectStream should in general perform better, not worse. If you've got a small reproducible example to the contrary, please share it on the Spark mailing list or JIRA.
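To see this for yourself, here is a minimal sketch building on the recipeDStream from the question (spark-streaming-kafka 0.8 API with Spark 1.5 assumed): log the offset ranges of each micro-batch. The driver only computes these ranges; the records themselves are fetched and processed by executor tasks, one per Kafka partition.
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

recipeDStream.foreachRDD(rdd -> {
    // The underlying KafkaRDD carries the offset ranges the driver computed.
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    for (OffsetRange r : ranges) {
        // One RDD partition per Kafka partition; each is read by an executor task.
        System.out.println(r.topic() + " partition " + r.partition()
                + " offsets [" + r.fromOffset() + ", " + r.untilOffset() + ")");
    }
    return null; // Spark 1.5's foreachRDD takes a Function<R, Void>
});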

Related

Kafka + spark streaming: kafka.common.OffsetOutOfRangeException

I'm new to this whole Kafka/Spark thing. I have Spark Streaming (PySpark) taking in data from a Kafka producer. It runs fine for a minute and then always throws a kafka.common.OffsetOutOfRangeException. The Kafka consumer is version 0.8 (0.10 is not supported, apparently, for PySpark). I have a master node with 3 workers on AWS Ubuntu 14.04. I don't know if this is relevant, but the Kafka messages here are relatively large (~1-10 kB) and I've adjusted the producer/broker/consumer configs accordingly. The data is being passed through fine, though maybe slower than what I think the producer is probably producing (this may be the source of the problem?).
A similar problem was solved by increasing the retention time/size here: Kafka OffsetOutOfRangeException
But my retention time is an hour and the size is 1 GB in each node's server.properties, and more importantly, changing the retention time/size makes no difference to Spark's time-to-failure.
Is there any other possibility for adjustment, maybe on the Spark Streaming configs? All the answers I see online have to do with Kafka provisioning, but it doesn't seem to make a difference in my case.
EDIT 1: I tried a) having multiple streams reading from the producer and b) slowing down the producer stream itself with time.sleep(1.0). Neither had a lasting effect.
i.e.
n_secs = 1
ssc = StreamingContext(sc, n_secs)
kds = [KafkaUtils.createDirectStream(ssc, ['test-video'], {
           'bootstrap.servers': 'localhost:9092',
           'group.id': 'test-video-group',
           'fetch.message.max.bytes': '15728640',
           'auto.offset.reset': 'largest'}) for _ in range(n_streams)]
stream = ssc.union(*kds)
Is it possible that your producer generates too many messages too fast, so that 1 GB is not enough on each broker? 1 GB seems very low in reality. By the time Spark Streaming has decided the offset range it needs to process in the micro-batch and tries to retrieve those messages from the broker based on the offsets, the messages are already gone due to the size limit. Please increase the broker retention size to something bigger, like 100 GB, and see if that fixes your problem.
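If it helps, the relevant knobs live in each broker's server.properties. A hedged sketch with illustrative values only; note that log.retention.bytes applies per partition, not per broker, so "1 GB per node" can translate into far less history per partition than expected:
# keep messages for a day instead of one hour
log.retention.hours=24

# roughly 100 GB per partition, in line with the suggestion above
log.retention.bytes=107374182400

# 1 GB segments; retention is enforced on whole closed segments
log.segment.bytes=1073741824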

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing, and task allocation work together? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning that a number of task slots on your executors were reserved for data ingestion and the others were there for processing. This had an impact on how you sized your executors in terms of cores.
Take, for example, the advice on how to launch Spark Streaming with --master local. Everyone will tell you that for Spark Streaming one should use at least local[2], because one of the cores will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.
So if the answer is that in this case a single task does both the reading and the processing, then the follow-up question is: is that really smart? I mean, this sounds like it should be asynchronous. We want to be able to fetch while we process, so that by the next batch the data is already there. However, if there is only one core to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would remain somewhat the same, in the sense that a task would be launched to read, but the processing would be done in another task. That would mean that, if the processing task is not done yet, we could still keep reading, up to a certain memory limit.
Can someone outline with clarity what is exactly going on here ?
EDIT 1
We don't even need that memory-limit control. Being able to fetch while the processing is going on, and stopping right there, would be enough. In other words, the two processes should be asynchronous and the limit is simply to stay one step ahead. If somehow this is not happening, I find it extremely strange that Spark would implement something that breaks performance like that.
Does the Spark task corresponding to a Kafka partition both read and process the data?
The relationship is very close to what you describe, if by "task" we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
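To make the read-and-process-in-one-task behaviour concrete, here is a minimal sketch against the 0.8 direct API (directStream is a hypothetical JavaPairDStream<String, String> obtained from createDirectStream): everything before the shuffle runs in the task that fetched the Kafka partition, and reduceByKey, or an explicit repartition, is where the one-to-one mapping ends.
import scala.Tuple2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Narrow transformation: executed by the same task that reads the Kafka partition.
JavaPairDStream<String, Integer> ones =
        directStream.mapToPair(record -> new Tuple2<>(record._2(), 1));

// Shuffle boundary: from here on, partitioning is no longer tied to Kafka partitions.
JavaPairDStream<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

// If you need more downstream parallelism than you have Kafka partitions,
// an explicit repartition() forces the same kind of shuffle.
JavaPairDStream<String, Integer> widened = ones.repartition(32);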

Pause Spark Kafka Direct Stream

I have the following code that creates a direct stream using the Kafka connector for Spark. However, I want to handle a situation where I can decide that the streaming needs to pause for a while on a conditional basis. Is there any way to achieve this?
Say my Kafka cluster is undergoing maintenance, so I want to stop processing between 10 AM and 12 PM and then pick up again at 12 PM from the last offset. How do I do that?
final JavaInputDStream<KafkaMessage> msgRecords = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        KafkaMessage.class, kafkaParams, topicsPartitions,
        message -> {
            return KafkaMessage.builder()
                    .
                    .build();
        }
);
There are two ways:
Stop the spark context (from a separate time-monitoring thread) during those times that you want processing to be stopped and start them back up when you need processing to resume. This is best suited for large intervals (on the order of hours). This is the most efficient in terms of Spark utilization so you would not be uselessly occupying slots in the Spark cluster.
Retrieve the Spark batch time in your transformations and, conditioned on the batch time, decide whether or not to proceed with the rest of the transformation. Unfortunately, fetching the Spark batch time is not trivial; it is only exposed if you use the variant of the DStream transform API that passes the batch time into your function (see the sketch below).
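A minimal sketch of the second option, building on the msgRecords stream from the question (inMaintenanceWindow is a hypothetical helper you would write). Note that batches falling inside the window are effectively skipped rather than deferred, so if that data must still be processed later, the stop/restart approach is the safer choice.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaDStream;

JavaDStream<KafkaMessage> gated = msgRecords.transform(
        (JavaRDD<KafkaMessage> rdd, Time batchTime) -> {
            if (inMaintenanceWindow(batchTime.milliseconds())) {
                // Skip this batch entirely: return an empty RDD so nothing downstream runs.
                return JavaSparkContext.fromSparkContext(rdd.context()).<KafkaMessage>emptyRDD();
            }
            return rdd;
        });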

Spark Streaming input rate drop

Running a Spark Streaming job, I have encountered the following behavior more than once. Processing starts well: the processing time for each batch is well below the batch interval. Then suddenly, the input rate drops to near zero. See these graphs.
This happens even though the program could keep up, and it slows down execution considerably. I believe the drop happens when there is not much unprocessed data left, but because the rate is so low, these final records take up most of the time needed to run the job. Is there any way to avoid this and speed things up?
I am using PySpark with Spark 1.6.2 and using the direct approach for Kafka streaming. Backpressure is turned on and there is a maxRatePerPartition of 100.
Setting backpressure is more meaningful with older Spark Streaming versions, where you need receivers to consume the messages from a stream. Since Spark 1.3 there is the receiver-less "direct" approach, which ensures stronger end-to-end guarantees, so you do not need to worry as much about backpressure; Spark does most of the fine-tuning.
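For reference, a minimal sketch of how the two settings the question mentions are usually applied (the application name is made up, and the values are simply the ones from the question):
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .setAppName("kafka-direct-ingest")                          // hypothetical app name
        .set("spark.streaming.backpressure.enabled", "true")        // let Spark adapt the ingest rate
        .set("spark.streaming.kafka.maxRatePerPartition", "100");   // upper bound in records/sec/partition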

KafkaUtils.createRDD vs KafkaDirectStreaming

I would like to know whether reading from a Kafka queue is faster using a batch Kafka RDD instead of KafkaDirectStream when I want to read the whole Kafka queue.
I've observed that reading from different partitions with the batch RDD does not result in concurrent Spark jobs. Are there any Spark properties to configure in order to allow this behaviour?
Thanks.
Try running your spark consumers in different threads or as different processes. That's the approach I take. I've observed that I get the best concurrency by allocating one consumer thread (or process) per topic partition.
Regarding your questions about batch vs KafkaDirectStream, I think even KafkaDirectStream is batch oriented. The batch interval can be specified in the streaming context, like this:
private static final int INTERVAL = 5000; // 5 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(INTERVAL));
There's a good image that describes how Spark Streaming is batch-oriented here:
http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#discretized-streams-dstreams
Spark is essentially a batch engine, and Spark Streaming takes batching closer to streaming by defining something called micro-batching. Micro-batching is nothing but specifying the batch interval to be very low (it can be as low as 50 ms, per the advice in the official documentation). So now all that matters is how long your micro-batch interval is going to be. If you keep it low, you will feel it is near real-time.
On the Kafka consumer front, the Spark direct receiver runs as a separate task in each executor. So if you have as many executors as partitions, it fetches data from all partitions and creates an RDD out of it.
If you are talking about reading from multiple queues, then you would create multiple DStreams, which would again need more executors to match the total number of partitions.
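For the batch side of the comparison, here is a minimal sketch of KafkaUtils.createRDD in the 0.8 API (topic name, broker address and offsets are made up, and sc is the JavaSparkContext created above). One RDD partition is created per OffsetRange, so concurrency comes from having enough executor cores to process those partitions in parallel.
import java.util.HashMap;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "localhost:9092");   // hypothetical broker address

OffsetRange[] ranges = {
        OffsetRange.create("my-topic", 0, 0L, 100000L),   // partition 0, offsets [0, 100000)
        OffsetRange.create("my-topic", 1, 0L, 100000L)    // partition 1, same range
};

JavaPairRDD<String, String> batch = KafkaUtils.createRDD(
        sc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, ranges);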