Understanding Spark UI for a streaming application

I am trying to understand what the entries in my Spark UI signify.
Calling an action results in the creation of a job. I am finding it hard to understand:
How many of these jobs get created? Is the number proportional to the number of micro-batches?
What does the Duration column signify?
What is the effect of setting the batch duration when instantiating the streaming context, and where is it visible in the Spark UI?
new StreamingContext(sparkSession.sparkContext, Seconds(50))

1. The number of jobs is proportional to the number of micro-batches: if your streaming context's batch interval is 50 seconds, you will get 2 jobs per minute.
2. Duration is the amount of time taken to process a single micro-batch/job. Ideally it should be less than the interval specified for the micro-batches: if the interval is 50 seconds, each micro-batch's job should complete well within that time.
3. If you open the Streaming tab in the UI while the job is running, you can see that a new micro-batch is created every 50 seconds.
When you click on a job, you get the details of the stages of that single micro-batch/job; I guess you have shared a screenshot of exactly that. There, the duration is the time taken by each stage in the job to complete.
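To make the mapping concrete, here is a minimal sketch of such an application; the socket source, port, and word-count logic are hypothetical, chosen only to give each batch something to do:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 50-second batch interval, the Streaming tab shows one new batch
// every 50 seconds, and the Jobs tab shows the job(s) triggered by the
// print() output action for each of those batches.
val sparkSession = SparkSession.builder.appName("BatchIntervalDemo").getOrCreate()
val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(50))

val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // output action: triggers at least one job per micro-batch

ssc.start()
ssc.awaitTermination()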

Related

What is the difference between duration vs. processing time vs. batch duration in the Spark UI?

As in the picture below, what is the difference between duration, batch duration, and processing time in the Spark UI?
[Spark UI picture]
The batch duration of 1 min tells you that your Spark Streaming application works in batches of 1 minute, meaning it plans an RDD every minute. You set this duration in your code when creating the StreamingContext.
The processing time tells you that it took Spark 34 seconds to process all input data (provided as input data size).
The duration gives you an understanding of the time it took to finish a particular job within your application.
Duration is wall-clock time. Processing time is the sum of all the jobs' durations.
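For reference, a 1-minute batch duration like the one in the screenshot would come from the StreamingContext creation (not the SparkSession), along these lines, assuming an existing SparkSession named sparkSession:

import org.apache.spark.streaming.{Minutes, StreamingContext}

// A 1-minute batch duration: the Streaming tab plans one new batch per minute.
val ssc = new StreamingContext(sparkSession.sparkContext, Minutes(1))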

5 Minutes Spark Batch Job vs Streaming Job

I am trying to figure out what should be a better approach.
I have a Spark batch job which is scheduled to run every 5 minutes, and it takes 2-3 minutes to execute.
Since Spark 2.0 added support for dynamic allocation via spark.streaming.dynamicAllocation.enabled, is it a good idea to make it a streaming job which pulls data from the source every 5 minutes?
What are things I should keep in mind while choosing between streaming/batch job?
Spark Streaming is an outdated technology. Its successor is Structured Streaming.
If you process data every 5 minutes, you are doing batch processing. You can use the Structured Streaming framework and trigger it every 5 minutes to imitate batch processing, but I usually wouldn't do that.
Structured Streaming has many more limitations than normal Spark. For example, you can only write to Kafka or to files; for anything else you need to implement the sink yourself using the Foreach sink. Also, if you use a file sink, you cannot update it, only append to it. There are also operations that are not supported in Structured Streaming, and some actions you cannot perform unless you do an aggregation first.
I might use Structured Streaming for batch processing if I read from or write to Kafka, because they work well together and everything is pre-implemented. Another advantage of using Structured Streaming is that you automatically continue reading from the place where you stopped.
For more information refer to the Structured Streaming Programming Guide.
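As a rough sketch of the "trigger every 5 minutes" idea with Kafka on both ends (the broker address, topic names, and checkpoint path here are all hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("FiveMinuteMicroBatch").getOrCreate()

// Read from Kafka and write back to Kafka, processing a micro-batch every
// 5 minutes; the checkpoint lets a restarted query resume where it stopped.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events-in")
  .load()

input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "events-out")
  .option("checkpointLocation", "/tmp/checkpoints/five-minute-job")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start()
  .awaitTermination()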
Deciding between streaming and batch, one needs to look at various factors. I am listing some below; based on your use case, you can decide which is more suitable.
1) Input data characteristics - continuous input vs. batch input
If input data is arriving in batches, use batch processing.
If input data is arriving continuously, stream processing may be more useful. Consider the other factors to reach a conclusion.
2) Output latency
If the required output latency is very low, consider stream processing.
If output latency does not matter, choose batch processing.
3) Batch size (time)
A general rule of thumb: use batch processing if the batch size is greater than 1 minute; otherwise stream processing is required. This is because triggering/spawning a batch process adds latency to the overall processing time.
4) Resource usage
What is the usage pattern of resources in your cluster?
Are there other batch jobs that execute when one batch job is done? If multiple batch jobs run one after another and use the cluster's resources optimally, batch jobs are the better option. If instead a batch job runs at its scheduled time and the cluster's resources sit idle afterwards, and data is arriving continuously, consider running a streaming job: fewer resources may be required for processing, and output becomes available with lower latency.
There are other things to consider: replay, manageability (streaming is more complex), the existing skills of the team, etc.
Regarding spark.streaming.dynamicAllocation.enabled: I would avoid using it, because if the input rate varies a lot, executors will be killed and created very frequently, which adds to latency.

Spark Streaming - Job Duration vs Submitted

I am trying to optimize a Spark Streaming application which collects data from a Kafka cluster, processes it and saves the results to various database tables. The Jobs tab in the Spark UI shows the duration of each job as well as the time it was submitted.
I would expect that for a specific batch, a job starts processing when the previous job is done. However, in the attached screenshot, the "Submitted" time of a job is not right after the previous job finishes. For example, job 1188 has a duration of 1 second and was submitted at 12:02:12. I would expect the next job to be submitted one second later, or at least close to that, but instead it was submitted six seconds later.
Any ideas on how this delay can be explained? These jobs belong to the same batch and run sequentially. I know there is some scheduling delay between jobs and tasks, but I would not expect it to be that large. Moreover, the Event Timeline of a stage does not show a large scheduling delay.
I am using PySpark in standalone mode.

Dynamically update batch size of input for spark kafka consumer

I use createDirectStream in my Spark Streaming application. I set the batch interval to 7 seconds, and most of the time the batch job finishes within about 5 seconds. However, in very rare cases a batch job takes around 60 seconds, which delays some of the subsequent batches.
To cut down the total delay, I would like to process more of the streaming data that has piled up over the delayed batches at one time. This would help the stream return to normal as soon as possible.
So, I want to know whether there is a method to dynamically update/merge the input batch size for Spark and Kafka when delays appear.
You can set the spark.streaming.backpressure.enabled option to true.
With backpressure enabled, when batch delays occur, Spark initially starts with a small batch size and then dynamically grows it to a larger one as the system catches up.
See the Spark configuration documentation, which describes the option as follows:
Enables or disables Spark Streaming's internal backpressure mechanism
(since 1.5). This enables the Spark Streaming to control the receiving
rate based on the current batch scheduling delays and processing times
so that the system receives only as fast as the system can process.
Internally, this dynamically sets the maximum receiving rate of
receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see below).
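A minimal sketch of enabling this for a direct Kafka stream with the 7-second interval from the question (the app name and the rate cap value are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable backpressure so Spark throttles intake after a slow batch, and cap
// the per-partition rate so no single micro-batch can grow arbitrarily large.
val conf = new SparkConf()
  .setAppName("BackpressureDemo")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // hypothetical cap

val ssc = new StreamingContext(conf, Seconds(7))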

Why are the durations of tasks belonging to the same job quite different in Spark Streaming?

Look at the picture below: these 24 tasks belong to the same job, the amount of data to be processed by each task is basically the same, and the time spent on GC is very short. My question is: why are the durations of tasks belonging to the same job so different?
Maybe you can check the Event Timeline for the tasks in your Spark UI to see why the slow tasks are running slow.
Are they spending more time in serialization/deserialization?
Is it because of scheduler delay?
Or is it the executor computing time?
