Pause Spark Kafka Direct Stream - apache-spark

I have the following code that creates a direct stream using the Kafka connector for Spark. However, I want to handle a situation where I can decide, on a conditional basis, that the streaming needs to pause for a while. Is there any way to achieve this?
Say my Kafka cluster is undergoing maintenance, so processing should stop between 10AM and 12PM and then pick up again at 12PM from the last offset. How do I do that?
final JavaInputDStream<KafkaMessage> msgRecords = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        KafkaMessage.class, kafkaParams, topicsPartitions,
        message -> {
            return KafkaMessage.builder()
                    // ... builder calls elided ...
                    .build();
        }
);

There are two ways:
Stop the Spark streaming context (from a separate time-monitoring thread) during the periods in which you want processing to be paused, and start it back up when processing should resume. This is best suited for large intervals (on the order of hours) and is the most efficient in terms of Spark utilization, since you are not uselessly occupying slots in the Spark cluster.
Retrieve the Spark batch time in your transformations and, conditioned on the batch time, decide whether or not to proceed with the rest of the transformation. Unfortunately, fetching the batch time is not trivial; it is only exposed if you use the variant of the DStream transform API that passes the batch time into your function, as sketched below.
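Here is a minimal sketch of that second option, assuming the Java DStream transform overload that receives the batch Time; isMaintenanceWindow(...) is a hypothetical helper you would implement (e.g. checking whether the batch time falls between 10AM and 12PM):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaDStream;

// Gate the stream on the batch time: maintenance-window batches become empty.
JavaDStream<KafkaMessage> gated = msgRecords.transform(
        (JavaRDD<KafkaMessage> rdd, Time batchTime) -> {
            if (isMaintenanceWindow(batchTime.milliseconds())) {
                // Keep nothing for this batch; downstream stages see an empty RDD.
                return rdd.filter(m -> false);
            }
            return rdd;
        });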

Related

How does the default (unspecified) trigger determine the size of micro-batches in Structured Streaming?

When a query in Spark Structured Streaming has no trigger setting, e.g.
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  //.trigger(???) // <--- Trigger intentionally omitted ----
  .start()
As of Spark 2.4.3 (Aug 2019), the Structured Streaming Programming Guide - Triggers says:
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On what basis does the default trigger determine the size of the micro-batches?
Let's say the input source is Kafka, and the job was interrupted for a day because of an outage. When the same Spark job is restarted, it will consume messages from where it left off. Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Assuming the job takes 10 hours to process that big batch, does the next micro-batch then hold 10 hours' worth of messages? And so on, gradually, over X iterations, until the backlog is caught up and the micro-batches become small again?
On what basis does the default trigger determine the size of the micro-batches?
It does not. Every trigger (however long) simply requests all sources for input datasets and whatever they give is processed downstream by operators. The sources know what to give as they know what has been consumed (processed) so far.
It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (by the way, there is also the Trigger.Once trigger).
Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped?
Almost (and it really doesn't have much, if anything, to do with Spark Structured Streaming).
The number of records the underlying Kafka consumer gets to process is configured by max.poll.records and possibly by some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).
Since Spark Structured Streaming's Kafka data source is simply a wrapper around the Kafka Consumer API, whatever happens in a single micro-batch is equivalent to a single Consumer.poll call.
You can configure the underlying Kafka consumer using options with the kafka. prefix (e.g. kafka.bootstrap.servers), which are applied to the Kafka consumers on the driver and executors.
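A minimal sketch of wiring such options in Java (broker address and topic name are placeholders); maxOffsetsPerTrigger is a separate option of the Kafka source itself that caps how many records a single micro-batch may pull, which is the usual way to avoid one gigantic first batch after a long outage:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("kafka-default-trigger").getOrCreate();

// Options with the "kafka." prefix are handed to the underlying Kafka consumer;
// the other options are interpreted by the Spark Kafka source itself.
Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
        .option("kafka.max.poll.records", "500")            // passed to the Kafka consumer
        .option("subscribe", "events")                      // placeholder topic
        .option("maxOffsetsPerTrigger", "100000")           // cap records per micro-batch
        .load();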

how to consume kafka message not from the beginning after a failed spark job

I'm a newbie to Kafka and Spark, wondering how to recover the offset from Kafka after a Spark job has failed.
Conditions:
the Kafka stream is around 5 GB/s, so it's not practical to consume from the beginning
the stream data has already been partially consumed, so how do I tell Spark to re-consume the messages / redo the failed task smoothly?
I'm not too sure which area to search in; maybe someone can point me in the right direction.
When we are dealing with Kafka, we must have 2 different topics: one for success and one for failure.
Let's say I have 2 topics, Topic-Success and Topic-Failed.
When the data stream is processed successfully, we can mark it and store it in the Topic-Success topic, and when the data stream cannot be processed, we store it in the Topic-Failed topic.
So when you want to re-consume the failed data stream, you can process just those records from the Topic-Failed topic. This way you avoid re-consuming all the data from the beginning.
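A minimal sketch of that dead-letter idea with a plain Kafka producer; the topic names mirror the ones above, and process(...) stands in for your own (hypothetical) processing step:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Inside your processing loop: route each record to the success or failed topic.
String value = "...";  // the consumed message
try {
    process(value);  // hypothetical processing step
    producer.send(new ProducerRecord<>("Topic-Success", value));
} catch (Exception e) {
    producer.send(new ProducerRecord<>("Topic-Failed", value));  // re-consume these later
}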
Hope this helps you.
In Kafka 0.10.x there is the concept of a consumer group, which is used to track the offsets of the messages.
If you have set enable.auto.commit=true and auto.offset.reset=latest, it will not consume from the beginning. With this approach you might also need to track your offsets yourself, since the process might fail after consumption. I would suggest using the method recommended in the Spark docs to commit the offsets yourself once your outputs are written:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
With CanCommitOffsets it lies in your hands to commit those offsets once your end-to-end pipeline has executed.
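For reference, a roughly equivalent Java sketch under the spark-streaming-kafka-0-10 API, assuming stream is a JavaInputDStream of ConsumerRecord:

import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

stream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // ... write this batch's output ...

    // Commit only after the outputs have completed.
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});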

Order of messages with Spark Executors

I have a Spark Streaming application which streams data from Kafka. I rely heavily on the order of the messages and hence have just one partition in the Kafka topic.
I am deploying this job in cluster mode.
My question is: since I am executing this in cluster mode, more than one executor can pick up tasks, so will I lose the order of the messages received from Kafka in that case? If not, how does Spark guarantee order?
You lose the distributed processing power with a single partition, so instead use multiple partitions, and I would suggest attaching a sequence number to every message, either a counter or a timestamp.
If you don't have a timestamp within the message, Kafka streaming provides a way to extract the message timestamp, and you can use it to order events by timestamp and then process them in sequence (see the sketch below).
Refer to the answer on how to extract the timestamp from a Kafka message.
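A minimal Java sketch of that idea, assuming the spark-streaming-kafka-0-10 integration, where each record arrives as a ConsumerRecord carrying a timestamp; the topic name, kafkaParams and jssc are placeholders assumed to be in scope:

import java.util.Arrays;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import scala.Tuple2;

Collection<String> topics = Arrays.asList("events");  // placeholder topic

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

// Key each record by its Kafka timestamp and sort every micro-batch by that key.
JavaPairDStream<Long, String> ordered = stream.transformToPair(rdd ->
        rdd.mapToPair(r -> new Tuple2<Long, String>(r.timestamp(), r.value()))
           .sortByKey());
ordered.print();

Note that this only orders records within each micro-batch, not across batches.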
To maintain order, using a single partition is the right choice. Here are a few other things you can try:
Turn off speculative execution (a configuration sketch follows this list):
spark.speculation - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
Adjust your batch interval / sizes so that each batch can finish processing without any lag.
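A minimal sketch of those two knobs in Java (the application name and the 5-second interval are arbitrary placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
        .setAppName("ordered-kafka-stream")   // placeholder app name
        .set("spark.speculation", "false");   // no speculative re-launches of slow tasks

// Pick a batch interval the job can comfortably finish within.
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));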
Cheers!

Kafka.Utils.createRDD Vs KafkaDirectStreaming

I would like to know whether read operations from a Kafka queue are faster using the batch Kafka RDD (KafkaUtils.createRDD) rather than KafkaDirectStream when I want to read the whole Kafka queue.
I've observed that reading from different partitions with the batch RDD does not result in concurrent Spark jobs. Is there some Spark property to configure in order to allow this behaviour?
Thanks.
Try running your spark consumers in different threads or as different processes. That's the approach I take. I've observed that I get the best concurrency by allocating one consumer thread (or process) per topic partition.
Regarding your question about batch vs KafkaDirectStream, I think even KafkaDirectStream is batch oriented. The batch interval can be specified in the streaming context, like this:
private static final int INTERVAL = 5000; // 5 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(INTERVAL));
There's a good image that describes how Spark Streaming is batch oriented here:
http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#discretized-streams-dstreams
Spark is essentially a batch engine, and Spark Streaming takes batching closer to streaming by defining something called micro-batching. Micro-batching is nothing but specifying a very low batch interval (which can be as low as 50 ms, per the advice in the official documentation). So what matters is how low your micro-batch interval is going to be; if you keep it low, it will feel near real-time.
On the Kafka consumer front, the Spark direct stream runs a separate task in each executor. So if you have as many executors as partitions, it fetches data from all partitions in parallel and creates an RDD out of it.
If you are talking about reading from multiple queues, then you would create multiple DStreams, which would again need more executors to match the total number of partitions.
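For the batch side of the comparison, here is a minimal sketch of KafkaUtils.createRDD under the old 0.8-style API used elsewhere in this thread; one OffsetRange per Kafka partition yields one RDD partition per Kafka partition, so the read is spread across executors (the topic name, offsets, kafkaParams and sc are placeholders assumed to be in scope):

import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

// One range per partition of the topic; the offsets here are placeholders.
OffsetRange[] offsetRanges = {
    OffsetRange.create("events", 0, 0L, 100000L),
    OffsetRange.create("events", 1, 0L, 100000L)
};

JavaPairRDD<String, String> batchRdd = KafkaUtils.createRDD(
        sc,                        // JavaSparkContext
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams,               // Map<String, String> with metadata.broker.list etc.
        offsetRanges);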

spark streaming DirectKafkaInputDStream: kafka data source can easily stress the driver node

I am building a prototype with Spark Streaming 1.5.0, using DirectKafkaInputDStream.
A simple stage that reads from Kafka through DirectKafkaInputDStream can't handle a massive amount of messages: the stage takes longer than the batch interval once the message rate reaches or exceeds a certain value, and that rate is much lower than I expect. (I have benchmarked my Kafka cluster separately with multiple consumer instances on different servers.)
JavaPairInputDStream<String, String> recipeDStream =
    KafkaUtils.createDirectStream(jssc,
        String.class,
        String.class,
        StringKeyDecoder.class,
        StringDecoder.class,
        kafkaParams, kafkaTopicsSet);
After reading this article, I got the impression that the DirectKafkaInputDStream runs on the same node as the driver program. Is that true? If so, DirectKafkaInputDStream could easily be stressed, since it would read all the messages on one node and then dispatch them to all the executors.
That would mean JavaPairReceiverInputDStream has better performance for high-volume data, since its receivers run on multiple executor instances.
Am I right? Can someone explain this? Thank you.
No, the direct stream only communicates from the driver to Kafka in order to find the latest available offsets. The actual messages are read only on the executors.
Switching from .createStream to .createDirectStream should in general perform better, not worse. If you've got a small reproducible example to the contrary, please share it on the Spark mailing list or JIRA.
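If the real problem is simply that each batch holds more data than fits in the batch interval, two standard Spark settings (not mentioned in the answer above, but commonly used with the direct stream) can cap or adapt the ingest rate; a minimal sketch:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .setAppName("direct-stream-prototype")                      // placeholder app name
        // Limit each Kafka partition to N messages per second per batch.
        .set("spark.streaming.kafka.maxRatePerPartition", "10000")
        // Let Spark adapt the ingest rate to the observed processing speed (Spark 1.5+).
        .set("spark.streaming.backpressure.enabled", "true");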
