Spark Streaming Processing Time vs Total Delay vs Processing Delay - apache-spark

I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused about the difference between the Processing Time, Total Delay, and Processing Delay of the last batch.
I have looked at the Spark Streaming guide, which mentions Processing Time as a key metric for figuring out whether the system is falling behind, but other places, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming with an explanation of what each of them means.
I would appreciate it if someone could outline what each of these three metrics means, or point me to any resources that can help me understand that.

Let's break down each metric. To do that, let's define a basic streaming application which reads a batch from some arbitrary source every 4 seconds and computes the classic word count:
inputDStream.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFiles("hdfs://...")
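For completeness, here is a minimal sketch of the surrounding setup with the 4-second batch interval (the socket source, host, and port are assumptions made purely for illustration; any input DStream behaves the same way):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc  = new StreamingContext(conf, Seconds(4))   // a new batch is formed every 4 seconds

// Illustrative source; replace with your actual receiver or direct stream.
val inputDStream = ssc.socketTextStream("localhost", 9999)

// ... the word-count pipeline above goes here ...

ssc.start()
ssc.awaitTermination()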
Processing Time: The time it takes to compute a given batch for all its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and it assumes, as a prerequisite, that the job has already been submitted.
Scheduling Delay: The time it takes the Spark Streaming scheduler to submit the jobs of the batch. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, which makes the scheduling delay of the next batch 4 seconds long.
Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch takes another 8 seconds to compute, the total delay is now 4 + 8 = 12 seconds long.
A live example from a working Streaming application:
We see that:
The bottom batch took 11 seconds to process, so the next batch's scheduling delay is 11 - 4 = 7 seconds.
If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1), 7 + 1 = 8.
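If you want these numbers programmatically rather than from the Streaming UI, one option (a sketch added here for reference, not part of the original answer) is to register a StreamingListener and read the per-batch BatchInfo, which exposes exactly these metrics; note that BatchInfo.processingDelay is what the UI labels Processing Time:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // All three values are Option[Long], in milliseconds.
    println(s"batch=${info.batchTime} " +
            s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms " +
            s"processingTime=${info.processingDelay.getOrElse(-1L)} ms " +
            s"totalDelay=${info.totalDelay.getOrElse(-1L)} ms")
  }
})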

We're experiencing a stable processing time, but an increasing scheduling delay.
Based on the answer, the scheduling delay should be influenced only by the processing time of previous runs.
Spark is running only the streaming job, nothing else.
The time window is 1 minute, processing 120K records.

If your window is 1 minute and the average processing time is 1 minute 7 seconds, you have a problem: each batch will delay the next one by 7 seconds.
Your processing time graph shows a stable processing time, but one that is always higher than the batch time.
I think that after a given amount of time your driver will crash with "GC overhead limit exceeded", as it will be full of pending batches waiting to be executed.
You can change this by reducing the processing time so that it goes under the expected micro-batch maximum duration (which requires code and/or resource allocation changes), by increasing the micro-batch size, or by moving to continuous streaming.
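As a rough sketch of the second option, combined with backpressure to cap ingestion (the interval and rate values below are illustrative assumptions, not a recommendation for this specific workload), enlarging the batch interval and enabling backpressure keeps the queue from growing unboundedly while you work on the underlying throughput:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ThrottledStream")
  // Let Spark adapt the ingestion rate to what previous batches could actually handle.
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard cap per Kafka partition, in records per second (value is illustrative).
  .set("spark.streaming.kafka.maxRatePerPartition", "500")

// A 2-minute batch interval instead of 1 minute gives each batch more headroom.
val ssc = new StreamingContext(conf, Seconds(120))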

Related

Delay in starting the next stage in Spark job

While looking into the stage details for a Spark job which takes much longer than usual, I observed that stage n does not start even after all of stages 0 to n-1 have completed.
The enclosed details are from the Spark details of a job/build -> stage progress.
I am unable to find the reason behind this lag, where stage 8 starts after a long delay (12:48 AM vs 1:25 AM). As you can see, all the stages above stage 8 complete in seconds or minutes, and the delay of 37 minutes between the highlighted stages is something that puzzles me.
Any help is highly appreciated.
It's possible that the lag between the two stages is I/O. I would recommend repartitioning your dataset so that each file is around 128 MB. Opening, writing, and closing 1884 files takes time; with 5.2 GB of data you could do this with around 40 files.
df.repartition(40)
should help.
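As a slightly fuller (hypothetical) sketch of that suggestion, the repartition goes just before the write; the path and output format here are assumptions for illustration, and 40 comes from 5.2 GB / 128 MB ≈ 40:

// Repartition so the write produces ~40 files of roughly 128 MB each.
df.repartition(40)
  .write
  .mode("overwrite")
  .parquet("hdfs://.../output")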

Spark Structured Streaming metrics: Why process rate can be greater than input rate?

How come the process rate can be greater than the input rate?
From my understanding, the process rate is the rate at which Spark can process arriving data, i.e. the processing capacity. If so, the process rate must on average be lower than or equal to the input rate. If it is lower, we know we need more processing power, or to rethink the trigger time.
I am basing my understanding on this blog post and common sense, but I might be wrong. I was also looking for the formal formula in the source code while writing this question.
This is an example where the process rate is constantly greater than the input rate:
You can see that on average we have 200-300 records being processed per second, whereas we have 80-120 records arriving per second.
Setup background: Spark 3.x reading from Kafka and writing to Delta.
Thank you all.
A process rate higher than the input rate means Spark is processing data much faster than it arrives, i.e. it could process 300-400 records per second even though the event rate is only 100 per second. For example, let's say the input rate is ~100 records per second and Spark is able to process those 100 records within half a second; it could then process 100 more in the second half of that second, and on average this would lead to a process rate of ~200 records per second.
The attached screenshot could be interpreted as follows:
Spark could process ~3000 records within each batch (~200/s * ~15 s) given the ~15 s processing time per batch (based on the ~15000 ms seen in the latency chart), but it is actually processing only around ~1000 records within each batch in those 15 seconds.
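For reference, both rates are reported in the StreamingQueryProgress object. A small sketch (assuming Spark 3.x, as in the question) that logs them per micro-batch; the difference in denominators is exactly why the processed rate can exceed the input rate:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // inputRowsPerSecond: rows that arrived, divided by the wall-clock time since the last trigger.
    // processedRowsPerSecond: the same rows, divided by the time the batch actually took to process.
    println(s"batch=${p.batchId} input=${p.inputRowsPerSecond}/s processed=${p.processedRowsPerSecond}/s")
  }
})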

Spark structured streaming asynchronous batch blocking

I'm using Apache Spark Structured Streaming to read from Kafka. Sometimes my micro-batches take longer to process than the specified interval, due to heavy write I/O operations. I was wondering if there is an option to start the next batch before the first one has finished, but have the second batch blocked by the first?
I mean that if the first batch takes 7 seconds and the trigger is set to 5 seconds, the second batch would start at the fifth second. But if the second batch finishes early, it would be blocked so that it doesn't write before its previous batch (because of the need to keep the correct message order).
No. The next batch only starts once the previous one has completed; I think you mean the trigger interval. It would become a mess otherwise.
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
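For reference, the trigger only sets the earliest point at which the next micro-batch may start; if the previous batch overruns, the next one simply starts as soon as it finishes, never concurrently. A minimal sketch (the console sink is just an illustrative placeholder):

import org.apache.spark.sql.streaming.Trigger

val query = df.writeStream
  .format("console")                              // illustrative sink
  .trigger(Trigger.ProcessingTime("5 seconds"))   // fires every 5 seconds at most; batches never overlap
  .start()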

sub-second latency causing delay in spark application

I have a Spark batch job that runs every minute and processes ~200k records per batch. The usual processing delay of the app is ~30 seconds. In the app, for each request, we make a write request to DynamoDB. At times, the server-side DDB write latency is ~5 ms instead of 3.5 ms (a ~30% increase over the usual latency of 3.5 ms). This is causing the overall delay of the app to jump by 6 times (~3 minutes).
How does the sub-second latency of a DDB call impact the overall latency of the app by 6 times?
PS: I have verified the root cause by overlapping the CloudWatch graphs of DDB put latency and the Spark app processing delay.
Thanks,
Vinod.
Just a ballpark estimate:
If the average latency is 3.5 ms and about half of your 200k records are processed in 5 ms instead of 3.5 ms, this would leave us with:
200,000 * 0.5 * (5 - 3.5) = 150,000 (ms)
of total delay, which is 150 seconds or 2.5 minutes. I don't know how well the process is parallelized, but this seems to be within the expected delay.

Spark streaming slow down

In our spark app we're consuming Kafka stream and storing data to Cassandra DB.
First, we ran the stream without backpressure and experienced a weird anomaly where the processing time was constant at ~1 minute, but the scheduling delay kept increasing. In this way the queue was piling up, eventually crashing the stream.
Any thoughts on why this could be happening? If it's not the processing, what can cause such dramatic delays?
Then we tried the same setup with backpressure (and an increased maxRatePerPartition); initially, everything was running well. Backpressure did its throttling job and we were able to process at a constant rate of ~100K/minute.
Then, after a few hours, something happened and the rate dropped rapidly to 5K/minute. The processing time was only 5-6 seconds with no scheduling delay, but backpressure absurdly kept the rate at 5K/minute and never increased it. Actually, there was no reason to throttle down to 5K at all.
Our Setup:
Window: 1 minute
spark.streaming.kafka.maxRatePerPartition = 500 (4 partitions * 60 sec * 500 = 120K / window)
spark.streaming.backpressure.enabled = true
spark.streaming.kafka.allowNonConsecutiveOffsets = true
spark.streaming.kafka.consumer.cache.enabled = false
Spark cluster with one master and 2 worker nodes
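For reference, a sketch of how that setup maps onto a SparkConf; the last setting (the PID rate estimator's minimum rate) is an assumption worth checking when backpressure pins the rate far below the real capacity and never recovers:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf()
  .setAppName("KafkaToCassandra")
  .set("spark.streaming.kafka.maxRatePerPartition", "500")   // 4 partitions * 60 s * 500 = 120K / window
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
  // Assumption: a floor for the PID rate estimator so throttling cannot get stuck arbitrarily low.
  .set("spark.streaming.backpressure.pid.minRate", "100")

val ssc = new StreamingContext(conf, Minutes(1))             // 1-minute batch interval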
