Scheduler delay time in spark and YARN - apache-spark

I'm doing some instrumentation in Spark and I've realised that some of my tasks take really long times to complete because the Scheduler Delay Time that can be extracted from the TaskMetrics.
I know there are some questions already about this topic like this What is scheduler delay in spark UI's event timeline but the answers have not been accepted and it says that a task waiting for an open slot is considered scheduler delay, which I think is not true (as far as I know if a task doesn't have a slot into an executor it doesn't start generating metrics).
I'm a bit confused with from where does this Delay really starts. I was wondering if this Delay time takes also into account the period between an app being accepted by the YARN client and submitting the first job of the app. Or in other words, between this moment where the app is accepted:
and this one where is running:

I checked directly by launching one app with few resources available in the cluster. It stayed in the queue until enough executors could be launched for the stage. Then the yarn.Client launched the stage in the cluster. The metrics in spark don't consider this time in the queue as any delay. Also it doesn't matter if you have more tasks than cores like the stack overflow answer I posted above. The tasks will be allocated in the executors as they become available.
In short, scheduler delay time only considers sending the task to the executor. If there is a delay in here, YARN is not the bottleneck but the load in the nodes involved ( normally the driver and the worker nodes with the executors for the app)

Related

How long will a spark job wait for resource from yarn when lack of resources?

When a Spark job can't get enough resource to start, and it hang there to wait.
How long will it wait? How can I control the timeout for a hanging spark job?
Thanks
Under YARN, a job can hang indefinitely waiting for resources. However, since large tasks are typically broken in smaller tasks they wouldn’t normally be blocked unless the smallest task exceeds the max container limit.

Does the successful tasks also gets reprocessed on an executor crash?

I am seeing about 3018 tasks failed for the job as about 4 executors died.
The Executors summary (as below in Spark UI) have a completely different statistics. Out of 3018, about 2994 properly completed. My question is,
Will they be re-tried again?
Is there a config to override/limit this?
After monitoring the job and manually validation the attempt counts event for successful tasks, realised
Will they be re-tried again?
- Yes, even the successful tasks are retried.
Is there a config to override/limit this?
- Did not find any config to override this behaviour.
If an executer (kubernetes pod) dies (like with an OOM or timeout), all the tasks, even if successfully completed are re-executed. One of the main reason is, the shuffle writes from the executers are lost with the executor itself!!!

Sudden surge in number of YARN apps on HDInsight cluster

For some reason sometimes the cluster seems to misbehave for I suddenly see surge in number of YARN jobs.We are using HDInsight Linux based Hadoop cluster. We run Azure Data Factory jobs to basically execute some hive script pointing to this cluster. Generally average number of YARN apps at any given time are like 50 running and 40-50 pending. None uses this cluster for ad-hoc query execution. But once in few days we notice something weird. Suddenly number of Yarn apps start increasing, both running as well as pending, but especially pending apps. So this number goes more than 100 for running Yarn apps and as for pending it is more than 400 or sometimes even 500+. We have a script that kills all Yarn apps one by one but it takes long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster. It may be possible that for some time cluster's response time is delayed (Hive component especially) but in that case even if ADF keeps retrying several times if a slice is failing, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run when it can? That's probably the only explanation why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job which then submits a child job which is the actual Hive script you are trying to run. The yarn queue can get deadlocked if there are too many parents jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, thus no work is actually being done. Note that if you kill the Templeton job this will result in Data Factory marking the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to be able to allow some of the pending child jobs to run and slowly draining the queue.
As a correct long term fix, you need to configure WebHCat so that parent templeton job is submitted to a separate Yarn queue. You can do this by (1) creating a separate yarn queue and (2) set templeton.hadoop.queue.name to the newly created queue.
To create queue you can do this via the Ambari > Yarn Queue Manager.
To update WebHCat config via Ambari go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

Spark Standalone cluster master web UI inaccessible after an application finishes

I have a spark application that finishes without error, but once it's done and saved all of its outputs and the process terminates, the Spark standalone cluster master process becomes a CPU hog, using 16 CPU's full time for hours, and the web UI becomes unresponsive. I have no idea what it could be doing, is there some complicated clean up step?
Some more details:
I've got a Spark standalone cluster (27 workers/nodes) that I've been successfully submitting jobs to for a while. I recently scaled up the size of my applications, the largest now takes 3.5 hours using 100 cores over 27 workers, and each worker has ~dozens of GB of shuffle read/write over the course of the job. Otherwise, the application is no different than the smaller jobs that have run successfully before.
This is a known issue with Spark's standalone cluster, and is caused by the massive event log created by large applications. You can read more at the issue tracking link below.
https://issues.apache.org/jira/browse/SPARK-12299
At the current time, the best work-around is to disable event logging for large jobs.

Does Spark's Fair Scheduler pool provides inter- or intra-application scheduling?

I am quite confused,because these pools are getting created for each spark application, and also if I keep minshare for a pool greater than the total cores of the cluster, the pool got created.
So if these pools are intra application do I need to, assign different pools to different spark jobs manually, because if I use sparkcontext.setlocalproperty for setting the pool, then all the stages of that application goes to that pool.
Point is that can we have jobs from two different application, to go in the same pool, so if I have application a1 and used sparkcontext.(pool,p1), and other application a2 and used sparkcontext.(pool,p1), would jobs for both applocation will go to the same pool p1 or p1 for a1 is different from p1 for a2.
As described in Spark's official documentation in Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
and later in the same document:
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
With that, the scheduling happens within resources given to a Spark application and how much it gets depends on CPUs/vcores and memory available in a cluster manager.
The Fair Scheduler mode is essentially for Spark applications with parallel jobs.

Resources