Confused about Spark streaming application execution time

I have a simple Spark Streaming WordCount application that reads data from a Kafka topic. In this application, checkpointing is enabled so that an accumulated word count can be maintained across batches. The batch interval is 1000 ms. The following picture shows a table (delay, execution time, total delay, events) of the micro-batches in this streaming application. What confuses me is that every 10 seconds there is a micro-batch that takes around 4 seconds to execute, much longer than the other micro-batches. Why does this happen? My application is just a very simple word count program.
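For reference, a minimal sketch of the kind of application described (broker address, topic name, and checkpoint path are illustrative; the stateful updateStateByKey is what makes checkpointing mandatory). Note that for stateful DStreams, Spark checkpoints the state RDDs at an interval that defaults to a multiple of the batch interval of at least 10 seconds, which may well be what the expensive batch every 10 seconds corresponds to:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1)) // 1000 ms batch interval, as in the question
ssc.checkpoint("/tmp/wordcount-checkpoint")      // hypothetical path; required for stateful ops

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092", // assumption
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "wordcount")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("words"), kafkaParams))

// Accumulated count per word across all batches; this stateful operation
// is what requires checkpointing to be enabled.
val totals = stream
  .flatMap(_.value.split("\\s+"))
  .map(word => (word, 1L))
  .updateStateByKey[Long]((batch: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + batch.sum))

totals.print()
ssc.start()
ssc.awaitTermination()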

Related

Spark-Streaming Application Optimisation using Repartition

I am trying to optimize my Spark Streaming application, and repartitioning does improve it. However, I am not able to understand how exactly repartitioning works here and optimizes the streaming process.
Can anyone help me understand the scenario below?
I created two Kafka topics, say SrcTopic and DestTopic, each with 6 partitions. While processing the data from SrcTopic to DestTopic, my streaming application has a batch interval of 5 minutes and maxOffsetsPerTrigger set to 10000, so it processes data every 5 minutes, takes at most 10K records per batch, and produces them to DestTopic. This processing works as expected and takes on average 250-300 seconds per complete batch (consume from SrcTopic, produce to DestTopic).
Then I updated my Spark Streaming job, deleted the checkpoints, and processed data again for the same source and destination (all topic configurations are exactly the same as in the first run). The only change is that before writing the data to DestTopic I repartition my DataFrame (df.repartition(6)) and then sink it into the Kafka topic, again with a batch interval of 5 minutes and maxOffsetsPerTrigger of 10000. This processing also works as expected, but takes on average only 25-30 seconds per complete batch.
Now my doubt is:
The number of partitions is exactly the same for both runs.
Both runs have 6 partitions in SrcTopic and DestTopic.
I checked the record count of each partition (0-5); it is the same in both cases (with and without repartition).
Both applications run with exactly the same configuration.
What extra work is repartition doing here, such that it takes 10 times less time than the normal partitioning?
Can you help me understand this process?
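For context, a rough sketch of the change described (the broker address and checkpoint path are illustrative, and df is assumed to already carry the Kafka value column). One plausible explanation for the speedup is that repartition forces a shuffle that spreads each batch evenly across 6 tasks before the write, so the Kafka producers work in parallel on balanced slices instead of whatever skewed layout the upstream stages produced:

import org.apache.spark.sql.streaming.Trigger

val query = df
  .repartition(6) // shuffle into 6 evenly sized partitions before the sink
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // assumption
  .option("topic", "DestTopic")
  .option("checkpointLocation", "/tmp/checkpoints/dest") // hypothetical
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()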

What is the difference between duration vs processing time vs batch duration in spark ui?

As shown in the picture below, what is the difference between duration, batch duration, and processing time in the Spark UI?
Thanks
Spark UI Picture
The batch duration of 1 minute tells you that your Spark Streaming application works in batches of 1 minute, meaning it plans an RDD every minute. You set this duration in your code when creating the StreamingContext.
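For a DStream application, that looks roughly like this (the app name is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("MyStreamingApp")
val ssc = new StreamingContext(conf, Minutes(1)) // batch duration of 1 minute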
The processing time tells you that it took Spark 34 seconds to process all input data (provided as input data size).
The duration gives you an understanding of the time it took to finish a particular job within your application.
Duration is wall-clock time. Processing time is the sum of all job durations.

How does the default (unspecified) trigger determine the size of micro-batches in Structured Streaming?

When a streaming query in Spark Structured Streaming has no trigger setting:
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  //.trigger(???) // <--- Trigger intentionally omitted ----
  .start()
As of Spark 2.4.3 (Aug 2019), the Structured Streaming Programming Guide - Triggers says:
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On what basis does the default trigger determine the size of the micro-batches?
Let's say the input source is Kafka, and the job was interrupted for a day because of some outage. When the same Spark job is restarted, it will consume messages from where it left off. Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Assuming the job takes 10 hours to process that big batch, does the next micro-batch then contain 10 hours' worth of messages, and so on for X iterations, gradually catching up on the backlog until it arrives at smaller micro-batches?
On what basis does the default trigger determine the size of the micro-batches?
It does not. Every trigger (however long) simply requests input datasets from all sources, and whatever they give is processed downstream by the operators. The sources know what to give because they know what has been consumed (processed) so far.
It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (by the way, there is the Trigger.Once trigger).
Does that mean the first micro-batch will be a gigantic batch with 1 day of msg which accumulated in the Kafka topic while the job was stopped?
Almost (and it really does not have much, if anything, to do with Spark Structured Streaming).
The number of records the underlying Kafka consumer gets to process in a single poll is configured by max.poll.records and perhaps by some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).
Since Spark Structured Streaming's Kafka data source is simply a wrapper around the Kafka Consumer API, whatever happens in a single micro-batch is equivalent to a single Consumer.poll call.
You can configure the underlying Kafka consumer using options with the kafka. prefix (e.g. kafka.bootstrap.servers); these are applied to the Kafka consumers on the driver and executors.
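As a sketch (broker address and topic name are assumptions), both kinds of options go on the source: kafka.-prefixed ones are passed through to the underlying consumer, while source-level options such as maxOffsetsPerTrigger cap how many offsets a single micro-batch may read, which is the usual way to avoid a gigantic first batch after a long outage:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption
  .option("subscribe", "events")                    // assumption
  .option("kafka.max.poll.records", "500")          // forwarded to the Kafka consumer
  .option("maxOffsetsPerTrigger", "100000")         // upper bound per micro-batch
  .load()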

Dynamically update batch size of input for spark kafka consumer

I use createDirectStream in my Spark Streaming application. I set the batch interval to 7 seconds, and most of the time a batch job finishes within about 5 seconds. However, in very rare cases a batch job takes 60 seconds, and this delays some subsequent batches.
To cut down the total delay, I would like to process, in one go, more of the streaming data that piles up during the delayed jobs. This would help the stream return to normal as soon as possible.
So I want to know whether there is a method to dynamically update/merge the input batch size for Spark and Kafka when a delay appears.
You can set the spark.streaming.backpressure.enabled option to true.
When backpressure is enabled and a batch delay occurs, Spark initially starts with a small batch size and then dynamically ramps up to a larger one.
The Spark configuration documentation describes the option as follows:
Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
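A minimal sketch of enabling it (the app name and rate values are illustrative and depend on your workload):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional hard ceilings the backpressure controller will not exceed:
  .set("spark.streaming.receiver.maxRate", "1000")          // records/sec per receiver
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // records/sec per Kafka partition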

Spark streaming : batch processing time slowly increase

I use Spark with the Cassandra Spark connector and direct Kafka.
And I see batch processing time slowly increasing over time,
even when there is nothing incoming from Kafka to process.
I think it is only a few milliseconds per batch, but after a long time a batch can take several extra seconds, until it reaches the batch interval and finally crashes.
At first I thought it was a memory leak, but in that case I would expect the processing time to grow exponentially rather than roughly linearly.
I don't really know whether it is the stages that become longer and longer, or the latency
between stages that increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
A closer look at the evolution of the processing time of each batch, compared with the total job processing time, shows that even though the batch processing time increases, the job processing times do not.
For example, for a batch that takes 7 s, the sum of the individual job processing times is 1.5 s.
Is it because the computing time on the driver side increases, rather than on the executor side?
And is this driver computing time not shown in the job processing UI?
If that is the case, how can I correct it?
I finally found the solution to my problem.
I had this code in the function that adds the filter and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because this function is called on each batch, the same objects are registered over and over inside TypeConverter.
I don't know exactly how the Cassandra Spark converters work, but they appear to use reflection internally on these objects.
Performing that slow reflection once per batch, over an ever-growing set of registered converters, made the batch processing time keep increasing.
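The fix, then, is to register the converters once at application startup instead of in code that runs every batch; a sketch (keeping the converter names from above, which are specific to this application):

// At application initialization, before the StreamingContext starts:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])

// The per-batch filter/transform function no longer calls
// TypeConverter.registerConverter, so the registered-converter list
// (and the reflection cost that goes with it) stays constant.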
