Spark Streaming - Job Duration vs Submitted - apache-spark

I am trying to optimize a Spark Streaming application which collects data from a Kafka cluster, processes it and saves the results to various database tables. The Jobs tab in the Spark UI shows the duration of each job as well as the time it was submitted.
I would expect that for a specific batch, a job starts processing when the previous job is done. However, in the attached screenshot, the "Submitted" time of a job is not right after the previous job finishes. For example, job 1188 has a duration of 1 second and it was submitted at 12:02:12. I would expect that the next job would be submitted one second later, or at least close to it, but instead it was submitted six seconds later.
Any ideas on how this delay can be explained? These jobs belong to the same batch and are done sequentially. I know that there is some scheduling delay between jobs and tasks, but I would not expect it to be that large. Moreover, the Event Timeline of a stage does not show a large scheduling delay.
I am using PySpark in standalone mode.
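For context, each output operation on the DStream becomes its own Spark job in every batch, and with the default spark.streaming.concurrentJobs=1 those jobs are submitted one after another, so any driver-side work between them (planning the write, committing offsets, opening connections) can show up as a gap between "Submitted" times. A rough sketch of such a setup (Scala, though the mechanism is the same in PySpark; the queueStream source and the table names are placeholders, not taken from the actual application):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoJobsPerBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("two-jobs-per-batch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Stand-in for KafkaUtils.createDirectStream(...)
    val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(Seq("a", "b")))
    val stream = ssc.queueStream(queue)

    // Each output operation below is turned into a separate job for every batch.
    // The jobs of one batch are submitted sequentially, so any driver-side work
    // between them appears as a delay before the next "Submitted" timestamp.
    stream.foreachRDD { rdd => rdd.foreachPartition(_ => () /* write to table A */) }
    stream.foreachRDD { rdd => rdd.foreachPartition(_ => () /* write to table B */) }

    ssc.start()
    ssc.awaitTermination()
  }
}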

Related

Spark Streaming first job requires more time than following jobs

I noticed that when I start the Spark Streaming application, the first job takes more time than the following ones, even when there is no input data. I also noticed that the first job after input data arrives has a longer processing time than the following ones. Is there a reason for this behavior?
Thank You
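(Not from the original post, just a common mitigation, sketched here: the first job typically pays for executor launch, class loading and connector initialisation, so running a cheap action before starting the streaming context keeps that cost out of the first micro-batch. "ssc" stands for the application's StreamingContext; the range is arbitrary.)

// Warm up the executors before the first micro-batch.
ssc.sparkContext.parallelize(1 to 1000).count() // trivial action, forces executors to start

ssc.start()
ssc.awaitTermination()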

Understanding Spark UI for a streaming application

I am trying to understand what the entries in my Spark UI signify.
Calling an action results in the creation of a job. I am finding it hard to understand:
How many of these jobs get created? Is that proportional to the number of micro-batches?
What does the Duration column signify?
What is the effect of setting the batch duration when instantiating the streaming context? Where is that visible in the Spark UI?
new StreamingContext(sparkSession.sparkContext, Seconds(50))
1. The jobs are proportional to the micro-batches: if your streaming context's batch interval is 50 seconds, you will get roughly two jobs per minute.
2. Duration is the amount of time taken to process a single micro-batch (job). Ideally, the time taken to process a micro-batch should be less than the batch interval; with a 50-second interval, each micro-batch job should finish well within that time.
3. If you open the Streaming tab in the UI while the application is running, you can see that a micro-batch is created every 50 seconds.
When you click on a job, you get the details of the stages of that single micro-batch/job. I guess that is what your screenshot shows. There, Duration is the time taken by each stage of the job to complete.
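To make point 1 concrete, here is a hedged sketch of a context with a 50-second batch interval and a single output operation; the socket source, host and port are placeholders. It produces roughly one job per 50-second micro-batch, which is what the Jobs tab lists, while the Streaming tab shows one batch per interval:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkSession = SparkSession.builder().master("local[2]").appName("streaming-ui-demo").getOrCreate()
val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(50))

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.count().print() // one output operation => one job per micro-batch

ssc.start()
ssc.awaitTermination()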

spark streaming - kafka direct stream - Can finished batches re-run after restoring from checkpoint

Consider a scenario where I've enabled concurrent jobs, so my batches can execute in any order and do not wait for previous batches to finish. What happens when a driver failure occurs while the batch at time t is still executing and the batch at time t+1 has already finished? Assuming checkpointing is enabled, does the recovered driver launch only one job, for the pending batch at time t, and not bother with the batch at time t+1? Or does it consider the batch at time t+1 incomplete as well? I am interested in this because I would like my output operations on the stream to write data in the same order as the input.
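(For reference, a hedged sketch of the setup the question describes: concurrent jobs enabled plus checkpoint-based recovery via getOrCreate. The checkpoint path, batch interval and the socket source standing in for the Kafka direct stream are placeholders.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/app-checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setMaster("local[2]") // placeholder; normally supplied by spark-submit
    .setAppName("ordered-output")
    .set("spark.streaming.concurrentJobs", "2") // lets batches overlap, as in the question
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  val stream = ssc.socketTextStream("localhost", 9999) // stand-in for the Kafka direct stream
  stream.foreachRDD { rdd => rdd.foreachPartition(_ => () /* ordered write */) }
  ssc
}

// After a driver failure, getOrCreate rebuilds the context from the checkpoint
// and re-schedules whichever batches it considers incomplete.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()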

How to write a functional test with Spark

I have a Spark batch job that talks to Cassandra. After the batch job completes, I need to verify a few entries in Cassandra, and this cycle repeats 2-3 times. How do I know when the batch job ends? I don't want to track the status of the batch job by adding an entry in the database.
How do I write a functional test in Spark?
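(One possible approach, sketched under assumptions: if the test runs the job in-process, the call only returns after all actions have finished, so Cassandra can be verified immediately afterwards without polling a status table. MyBatchJob.run, the connection host and the keyspace/table names are hypothetical placeholders; the read uses the spark-cassandra-connector data source.)

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class BatchJobFunctionalTest extends AnyFunSuite {
  test("batch job writes the expected rows to Cassandra") {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("batch-job-functional-test")
      .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
      .getOrCreate()

    // Hypothetical entry point of the job under test; when this call returns,
    // the batch is finished.
    MyBatchJob.run(spark)

    val written = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "results")) // placeholder names
      .load()

    assert(written.count() > 0)
    spark.stop()
  }
}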

Why are the durations of tasks belonging to the same job quite different in Spark Streaming?

Look at the picture below: these 24 tasks belong to the same job, the amount of data to be processed by each task is basically the same, and the time spent in GC is very short. My question is: why are the durations of tasks belonging to the same job so different?
Maybe you can check the Event Timeline for the tasks in your Spark UI and see why the slow tasks are running slow.
Are they taking more time in serialization/deserialization?
Is it because of scheduler delay?
Or is it the executor computing time?
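(If the Event Timeline is not enough, the same breakdown can be pulled programmatically; a hedged sketch using a SparkListener and TaskMetrics fields, where the 1000 ms threshold is arbitrary.)

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class SlowTaskLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    val m = taskEnd.taskMetrics
    if (info != null && m != null && info.duration > 1000) { // only log tasks slower than 1 s
      println(
        s"task ${info.taskId}: total=${info.duration}ms " +
        s"deserialize=${m.executorDeserializeTime}ms " +
        s"run=${m.executorRunTime}ms gc=${m.jvmGCTime}ms " +
        s"resultSer=${m.resultSerializationTime}ms " +
        s"gettingResult=${info.gettingResultTime}ms")
    }
  }
}

// Register it on the existing SparkContext:
// sc.addSparkListener(new SlowTaskLogger())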
