What is scheduler delay in the Spark UI's event timeline?

I am using a YARN environment to run Spark programs, with the option --master yarn-cluster.
When I open a Spark application's application master, I see a lot of Scheduler Delay in a stage. Some of the delays are even longer than 10 minutes. What are they, and why do they take so long?
Operations like aggregateByKey usually spend much more time (i.e. scheduler delay) before the executors really start running tasks. Why is that?

Open the "Show Additional Metrics" (click the right-pointing triangle so it points down) and mouse over the check box for "Scheduler Delay". It shows this tooltip:
Scheduler delay includes time to ship the task from the scheduler to the executor, and time to send the task result from the executor to
the scheduler. If scheduler delay is large, consider decreasing the
size of tasks or decreasing the size of task results.
The scheduler is part of the master that divides the job into stages of tasks and works with the underlying cluster infrastructure to distribute them around the cluster.
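As a rough illustration of what the tooltip describes, scheduler delay can be thought of as the part of a task's wall-clock duration that is not accounted for by the other phases shown in the timeline. The sketch below is an assumption based on that description, not Spark's actual code; the function and parameter names are hypothetical and all values are in milliseconds.

```python
def scheduler_delay_ms(duration, deserialize_time, run_time,
                       result_serialize_time, getting_result_time):
    """Time not attributed to any other task phase (hypothetical helper)."""
    accounted = (deserialize_time + run_time
                 + result_serialize_time + getting_result_time)
    # Clamp at zero: the accounted time can slightly exceed the measured
    # duration when the individual timings do not line up exactly.
    return max(0, duration - accounted)


# Example: a 12 s task of which only 1.5 s is accounted for by other phases.
delay = scheduler_delay_ms(duration=12_000, deserialize_time=200,
                           run_time=1_200, result_serialize_time=50,
                           getting_result_time=50)
print(delay)  # 10500
```

Read this way, a large scheduler delay points at time spent shipping tasks and results around, which matches the tooltip's advice about task and result sizes.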

Have a look at TaskSetManager's class comment:
..Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of each task, retries tasks if they fail (up to a limited number of times), and handles locality-aware scheduling for this TaskSet via delay scheduling...
I assume it is the result of the following paper, which Matei Zaharia (co-founder and Chief Technologist of Databricks, which develops Spark) also worked on: https://cs.stanford.edu/~matei/
Thus, Spark checks the locality of the partition for a pending task. If the locality level is low (e.g. not on the local JVM), the task is not killed or ignored outright. Instead, its launch is delayed, which is fair.
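The idea of delay scheduling can be sketched with a toy model: the longer a pending task has waited without being launched, the worse the locality level it is allowed to run at. This is a simplification under stated assumptions, not how TaskSetManager is implemented: it uses one shared wait value (3 seconds here, the spark.locality.wait default) for every level and ignores that the real scheduler tracks launches per task set. The function name is hypothetical.

```python
# Locality levels from best to worst, as named in the Spark UI.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(waited_s, locality_wait_s=3.0):
    """Worst locality level a pending task may launch at after waiting
    `waited_s` seconds (toy model, one shared wait value for all levels)."""
    if locality_wait_s <= 0:
        return "ANY"  # no waiting: any free executor may take the task
    steps = int(waited_s // locality_wait_s)
    return LEVELS[min(steps, len(LEVELS) - 1)]


for t in (0, 2.9, 3, 6, 9, 30):
    print(t, allowed_level(t))
# 0 PROCESS_LOCAL
# 2.9 PROCESS_LOCAL
# 3 NODE_LOCAL
# 6 RACK_LOCAL
# 9 ANY
# 30 ANY
```

In this model a task whose preferred executor stays busy sits idle for several seconds before a less local executor is allowed to pick it up, which shows up as delay before the task starts.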


Limit cores per Apache Spark job

I have a dataset for which I'd like to run multiple jobs in parallel.
I do this by launching each action in its own thread to get multiple Spark jobs per Spark application like the docs say.
Now the task I'm running doesn't benefit endlessly from throwing more cores at it - at like 50 cores or so the gain of adding more resources is quite minimal.
So for example if I have 2 jobs and 100 cores I'd like to run both jobs in parallel each of them only occupying 50 cores at max to get faster results.
One thing I could probably do is to set the number of partitions to 50 so the jobs could only spawn 50 tasks(?). But apparently there are some performance benefits to having more partitions than available cores, to get better overall utilization.
But other than that I didn't spot anything useful in the docs to limit the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications to split up the executors).
Is there any good way to do this?
Perhaps asking Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools, but I've not used them, perhaps that gives you some more flexibility on top of fair scheduling.
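For the pools mentioned above, Spark reads pool definitions from an XML allocation file (pointed to by spark.scheduler.allocation.file), and each thread picks its pool by setting the spark.scheduler.pool local property on the SparkContext. The sketch below only generates such a file with the standard library; the pool names job_a and job_b and the helper name are hypothetical. As far as I know, pools control each job's relative share (weight, minShare) rather than imposing a hard cap on cores.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Build a fair scheduler allocation document from
    {pool_name: (scheduling_mode, weight, min_share)}."""
    root = ET.Element("allocations")
    for name, (mode, weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: one per parallel job, with equal weight.
xml_text = build_allocations({
    "job_a": ("FAIR", 1, 0),
    "job_b": ("FAIR", 1, 0),
})
print(xml_text)
```

With equal weights, two jobs running in separate pools would each get roughly half of the cores while both have pending tasks, and either one can use the whole cluster when the other is idle.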
Seems like conflicting requirements with no silver bullet:
1. Parallelize as much as possible.
2. Limit any one job from hogging resources IF (and only if) another job is running as well.
If you increase the number of partitions then you'll address #1 but not #2.
If you specify spark.cores.max then you'll address #2 but not #1.
If you do both (more partitions and limit spark.cores.max) then you'll address #2 but not #1.
If you only increase the number of partitions, then the only thing you're risking is that a long-running big job will delay the completion/execution of some smaller jobs. Overall it'll take the same amount of time to run two jobs on given hardware in any order, as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
Bottom line, IMO
don't touch spark.cores.max.
increase partitions if you're not using all your cores.
use fair scheduling
if you have strict latency/response-time requirements then use separate auto-scaling clusters for long running and short running jobs

Spark not using all CPUs available

I am running a query using Hive on Spark which is exhibiting some strange behavior. I've run it multiple times and observed the same behavior. The query:
reads from a large Hive external table
Spark creates ~990,000 tasks
runs in a YARN queue with > 2900 CPUs available
uses 700 executors with 4 CPUs per executor
All is well at the start of the job. After ~1.5 hours of 2800 CPUs cranking, the job is ~80% complete (800k/990k tasks). From there, things start to nosedive: Spark stops using all of the CPUs available to it to work on tasks. With ~190k tasks to go, Spark will gradually drop from using 2800 CPUs to double digits (usually bottoming out around 20 total CPUs). This makes the last 190k tasks take significantly longer to finish than the previous 800k.
I could see as the job got very close to completing that Spark would be unable to parallelize a small amount of remaining tasks across a large number of CPUs. But with 190k tasks left to be started, it seems way too early for that.
Things I've checked:
No other job is pre-empting its resources in YARN. (In addition, if this were the case, I would expect the job to randomly lose/regain resources, instead of predictably losing steam at the 80% mark).
This occurs whether dynamic allocation is enabled or disabled. If disabled, Spark has all 2800 CPUs available for the entire run time of the job - it just doesn't use them. If enabled, Spark does spin down executors as it decides it no longer needs them.
If data skew were the issue, I could see some tasks taking longer than others to finish. But it doesn't explain why Spark wouldn't be using idle CPUs to start on the backlog of tasks still to go.
Does anyone have any advice?
For posterity, this answer from Travis Hegner contained the answer.
Setting spark.locality.wait=0s fixes this issue. I'm also not sure why a 3 second wait causes such a pile up in Spark's ability to schedule tasks, but setting to 0 makes the job run extremely well.
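For a job launched with spark-submit, a setting like this can be passed as a --conf argument. The small helper below is a hypothetical convenience for building those arguments, not part of any Spark API. In the Hive on Spark case from the question, the property would have to be supplied through Hive's own configuration instead, which is not shown here.

```python
def conf_flags(settings):
    """Turn {key: value} into spark-submit style --conf arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(conf_flags({"spark.locality.wait": "0s"})))
# --conf spark.locality.wait=0s
```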

Why is there a delay in the launch of spark executors?

While trying to optimise a Spark job, I am having trouble understanding a delay of 3-4s in the launch of the second executor, and of 6-7s for the third and fourth executors.
This is what I'm working with:
Spark 2.2
Two worker nodes having 8 cpu cores each. (master node separate)
Executors are configured to use 3 cores each.
Following is the screenshot of the jobs tab in Spark UI.
The job is divided into three stages. As seen, second, third and fourth executors are added only during the second stage.
Following is the snap of the Stage 0.
And following the snap of the Stage 1.
As seen in the image above, executor 2 (on the same worker as the first) takes around 3s to launch. Executors 3 and 4 (on the second worker) take even longer, approximately 6s.
I tried playing around with the spark.locality.wait variable : values of 0s, 1s, 1ms. But there does not seem to be any change in the launch times of the executors.
Is there some other reason for this delay? Where else can I look to understand this better?
You might be interested to check Spark's executor request policy, and review the settings spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout for your application.
A Spark application with dynamic allocation enabled requests
additional executors when it has pending tasks waiting to be
scheduled. ...
Spark requests executors in rounds. The actual request is triggered
when there have been pending tasks for
spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then
triggered again every
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds
thereafter if the queue of pending tasks persists. Additionally, the
number of executors requested in each round increases exponentially
from the previous round. For instance, an application will add 1
executor in the first round, and then 2, 4, 8 and so on executors in
the subsequent rounds.
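The request policy quoted above can be sketched as a small simulation. This is a toy model of the quoted text only, assuming dynamic allocation is enabled and that both timeouts are 1 second (an assumption about the defaults; check your own configuration). The function name is hypothetical, and the times are when executors are requested, not when they finish starting.

```python
def executor_request_schedule(target, backlog_timeout_s=1.0,
                              sustained_timeout_s=1.0):
    """Return [(time_s, requested_this_round, total_requested)] until
    `target` executors have been requested (toy model of the quoted policy)."""
    schedule = []
    total, batch, now = 0, 1, backlog_timeout_s
    while total < target:
        add = min(batch, target - total)  # never request more than needed
        total += add
        schedule.append((now, add, total))
        batch *= 2                        # exponential growth per round
        now += sustained_timeout_s
    return schedule


for row in executor_request_schedule(target=4):
    print(row)
# (1.0, 1, 1)
# (2.0, 2, 3)
# (3.0, 1, 4)
```

Under these assumptions the fourth executor is not even requested until about 3 seconds after the backlog appears, before any container start-up time is added on top.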
Another potential source of delay could be spark.locality.wait. Since in Stage 1 you have quite a few tasks with sub-optimal locality levels (Rack local: 59), and the default for spark.locality.wait is 3 seconds, it could actually be the primary reason for the delays that you're seeing.
It takes time for YARN to create the executors, and nothing can be done about this overhead. If you want to optimize, you can set up a Spark server and then send requests to that server, which saves the warm-up time.

Could FAIR scheduling mode make Spark Streaming jobs that read from different topics run in parallel?

I use Spark 2.1 and Kafka 0.9.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish.
According to this, if I have multiple jobs from multiple threads in the case of Spark Streaming (one topic from each thread), is it possible that multiple topics can run simultaneously if I have enough cores in my cluster, or would it just do a round robin across pools but run only one job at a time?
I have two topics, T1 and T2, both with one partition. I have configured a pool with schedulingMode set to FAIR. I have 4 cores registered with Spark. Each topic has two actions (hence two jobs, 4 jobs in total across topics). Let's say J1 and J2 are the jobs for T1, and J3 and J4 are the jobs for T2. What Spark does in FAIR mode is execute J1 J3 J2 J4, but at any time only one job is executing. As each topic has only one partition, only one core is being used and 3 are just free. This is something I don't want.
Is there any way I can avoid this?
if I have multiple jobs from multiple threads...is it possible that multiple topics can run simultaneously
Yes. That's the purpose of FAIR scheduling mode.
As you may have noticed, I removed "Spark Streaming" from your question since it does not contribute in any way to how Spark schedules Spark jobs. It does not really matter whether you start your Spark jobs from a "regular" application or Spark Streaming one.
Quoting Scheduling Within an Application (highlighting mine):
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
And then the quote you used in your question, which should now be clearer:
it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources.
So, speaking about Spark Streaming you'd have to configure FAIR scheduling mode and Spark Streaming's JobScheduler should submit Spark jobs per topic in parallel (haven't tested it out myself so it's more theory than practice).
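The difference between the FIFO and FAIR descriptions quoted above can be sketched with a toy task assigner. This is only an illustration of the quoted wording, not Spark's scheduler: it hands out free cores once, ignores pools and weights, and the function name is hypothetical.

```python
from collections import deque


def assign_slots(jobs, free_cores, mode):
    """Pick which job each free core takes its next task from.

    jobs: {job_name: pending_task_count}, in submission order.
    Returns a list of job names, one per launched task.
    """
    pending = dict(jobs)
    launched = []
    if mode == "FIFO":
        # Earlier jobs get priority on every core while they have tasks.
        for name in pending:
            while pending[name] > 0 and len(launched) < free_cores:
                pending[name] -= 1
                launched.append(name)
    elif mode == "FAIR":
        # Round robin: one task per job per turn.
        queue = deque(name for name, n in pending.items() if n > 0)
        while queue and len(launched) < free_cores:
            name = queue.popleft()
            pending[name] -= 1
            launched.append(name)
            if pending[name] > 0:
                queue.append(name)
    else:
        raise ValueError(mode)
    return launched


print(assign_slots({"J1": 8, "J2": 2}, 4, "FIFO"))  # ['J1', 'J1', 'J1', 'J1']
print(assign_slots({"J1": 8, "J2": 2}, 4, "FAIR"))  # ['J1', 'J2', 'J1', 'J2']
print(assign_slots({"J1": 1, "J3": 1}, 4, "FIFO"))  # ['J1', 'J3']
```

The last line matches the one-partition-per-topic case from the question: when both single-task jobs are pending at the same time, even FIFO has spare cores for the second one. In this model, seeing only one core busy suggests the jobs are not being submitted concurrently, rather than a scheduling-mode problem.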
I think that fair scheduler alone will not help, as it's the Spark Streaming engine that takes care of submitting the Spark Jobs and normally does so in a sequential mode.
There's an undocumented configuration parameter in Spark Streaming, spark.streaming.concurrentJobs, which is set to 1 by default. It controls the parallelism level of jobs submitted to Spark.
By increasing this value, you may see parallel processing of the different Spark stages of your streaming job.
I would think that combining this configuration with the fair scheduler in Spark, you will be able to achieve controlled parallel processing of the independent topic consumers. This is mostly uncharted territory.
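The effect of a job-submission pool of size 1 can be illustrated without Spark. The sketch below is an analogy only, assuming streaming jobs are handed to a fixed-size thread pool whose size plays the role of spark.streaming.concurrentJobs; the function name is hypothetical and time.sleep stands in for a Spark job.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(concurrent_jobs, n_jobs=4, job_seconds=0.2):
    """Run n_jobs through a pool of `concurrent_jobs` threads and return the
    largest number of jobs that were ever running at the same time."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def job():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(job_seconds)  # stand-in for a Spark job
        with lock:
            running -= 1

    # Leaving the with-block waits for all submitted jobs to finish.
    with ThreadPoolExecutor(max_workers=concurrent_jobs) as pool:
        for _ in range(n_jobs):
            pool.submit(job)
    return peak


print(peak_concurrency(1))  # 1
print(peak_concurrency(2))
```

With a pool of size 1, only one job is ever in flight, no matter how many cores are free or which scheduling mode is configured, which is the behaviour described in the question.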

Why are the durations of tasks belonging to the same job quite different in Spark Streaming?

Look at the picture below. These 24 tasks belong to the same job, the amount of data processed by each task is basically the same, and the time spent on GC is very short. So why are the durations of tasks belonging to the same job so different?
Maybe you can check the Event Timeline for tasks in your Spark UI and see why the slow tasks are running slowly.
Are they taking more time in serialization/deserialization?
Is it because of scheduler delay?
Or is it the executor computation time?
