What is 'Active Jobs' in Spark History Server Spark UI Jobs section - apache-spark

I'm trying to understand Spark History server components.
I know that, History server shows completed Spark applications.
Nonetheless, I see 'Active Jobs' set to 1 for a completed Spark application. I'm trying to understand what is 'Active Jobs' mean in Jobs section.
Also, Application completed within 30 minutes, but when I opened History Server after 8 hours, 'Duration' shows 8.0h.
Please see the screenshot.
Could you please help me understand 'Active Jobs', 'Duration' and 'Stages: Succeeded/Total' items in above image?

Finally after some research, found answer for my question.
A Spark application consists of a driver and one or more executors. The driver program instantiates SparkContext, which coordinates the executors to run the Spark application. This information is displayed on Spark History Server Web UI 'Active Jobs' section.
The executors run tasks assigned by the driver.
When Spark application runs on YARN, it has its own implementation of yarn client and yarn application master.
YARN application has a yarn client, yarn application master and list of container running on node managers.
In my case Yarn is running in standalone mode, thus driver program is running as a thread of the yarn application master. The Yarn client pulls status from the application master and application master coordinates the containers to run the tasks.
This running job could be monitored in YARN applications page in the Cloudera Manager Admin Console, while it is running.
If application succeeds, then History server will show list of 'Completed Jobs' and also 'Active Jobs' section will be removed.
If application fails at the containers level and YARN communicates this information to Driver then, History server will show list of 'Failed Jobs' and also 'Active Jobs' section will be removed.
Nonetheless, if application fails at the containers level and YARN couldn't communicate that to driver, then Driver instantiated job gets into oblivion state. It thinks job is still being run and keeps waiting to hear from YARN application master for the job status. Hence, in History Server, it still shows up in 'Active Jobs' as running.
So my take away from this is:
To check the status of running job, go to YARN applications page in the Cloudera Manager Admin Console or use YARN CLI command.
After job completion/failure, Open the Spark History Server to get more details on resources usage, DAG and execution timeline information.

Invoking an action(count is action in your case) inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages.
A stage is a physical unit of the execution plan. In shorts, Stage is a set of parallel tasks i.e. one task per partition. Basically, each job which gets divided into smaller sets of tasks is a stage. Although, it totally depends on each other. However, it somewhat same as the map and reduce stages in MapReduce.
each type of Spark Stages in detail:
a. ShuffleMapStage in Spark
ShuffleMapStage is considered as an intermediate Spark stage in the physical execution of DAG.
Basically, it produces data for another stage(s).
consider ShuffleMapStage in Spark as input for other following Spark stages in the DAG of stages.
However, it is possible that there is n number of multiple pipeline operations, in ShuffleMapStage.
like map and filter, before shuffle operation. Furthermore, we can share single ShuffleMapStage among different jobs.
b. ResultStage in Spark
By running a function on a spark RDD Stage which executes a Spark action in a user program is a ResultStage.It is considered as a final stage in spark. ResultStage implies as a final stage in a job that applies a function on one or many partitions of the target RDD in Spark, helps for computation of the result of an action.
coming back to the question of active jobs on history sever there some notes listed on official docs
as history server.Also there is jira [SPARK-7889] issue regarding the same link.
for more details follow the link
source-1

Related

YARN and MapReduce Framework

I am aware of the basics of YARN framework, however I still feel lack of some understanding, in regards to MapReduce.
With YARN, I have read that MapReduce is just one of the applications which can run on top of YARN; for example, with YARN, on same cluster various different jobs can run, MapReduce Jobs, Spark Jobs etc.
Now, the point is, each type of job has its "own" kind of "Job phases", for example, when we talk about MapReduce, it has various phases like, Mapper, Sorting, Shuffle, Reducer etc.
Specific to this scenario, who "decides", "controls" these phases? Is it MapReduce Framework?
As I understand, YARN is an infrastructure on which different jobs run; so when we submit a MapReduce Job, does it first go to MapReduce framework and then the code is executed by YARN? I have this doubt, because YARN is general purpose execution engine, so it won't be having knowledge of mapper, reducer etc., which is specific to MapReduce (and so different kind of Jobs), so does MapReduce Framework run on top of YARN, with YARN help executing the Jobs, and MapReduce Framework is aware of the phases it has to go through for a particular kind of Job?
Any clarification to understand this would be of great help.
If you take a look at this picture from Hadoop documentation:
You'll see that there's no particular "job orchestration" component, but a resource requesting component, called application master. As you mentioned, YARN does resource management and with regards to application orchestration, it stops at an abstract level.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
When applied to Spark, some of the components in that picture would be:
Client: the spark-submit process
App Master: Spark's application master that runs driver and application master (cluster mode) or just application master (client mode)
Container: spark workers
Spark's YARN infrastructure provides the application master (in YARN terms), which knows about Spark's architecture. So when the driver runs, either in cluster mode or in client mode, it still decides on jobs/stages/tasks. This must be application/framework-specific (Spark being the "framework" when it comes to YARN).
From Spark documentation on YARN deployment:
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN
You can extend this abstraction to map-reduce, given your understanding of that framework.
So when we submit a MapReduce job it will first go to the Resource Manager which is the master daemon of YARN. The Resource Manager then selects a Node Manager(which are slave processes of YARN) to start a container on which it will ask the Node Manager to start a very lightweight process known as Application Master. Then the Resource Manager will ask the Application Master to start execution of the job.
The Application Master will first go through the driver part of the job from where it would get to know of the resources that would be used for the job, and accordingly it will request the Resource Manager for those resources. Now the Resource Manager can assign the resources to the Application Master immediately or if the cluster is to occupied then that request would be rescheduled based on various scheduling algorithms.
After getting the resources the Application Master would go to the Name Node to get the metadata of all the blocks that would be required to be processed for this job.
After receiving the Metadata the Application Master would ask the Node Managers of the nodes where the blocks are stored(if those nodes are too busy then a node in the same rack, otherwise any random node depending on rack awareness) and ask the Node Managers to launch containers for processing their respective blocks.
The blocks would get processed independently and in parallel in their respective nodes. After the entire processing is done the result would be stored in HDFS.

How to tell if your spark job is waiting for resources

I have a pyspark job that I submit to a standalone spark cluster - this is an auto scaling cluster on ec2 boxes so when jobs are submitted and not enough nodes are available, after a few minutes a few more boxes spin up and become available.
We have a #timeout decorator on the main part of the spark job to timeout and error when it's exceeded a certain time threshold (put in place because of some jobs hanging). The issue is that sometimes a job may not have gotten to actually starting because its waiting on resources yet #timeout function is evaluated and jobs error out as a result.
So I'm wondering if there's anyway to tell from within the application itself, with code, if the job is waiting for resources?
To know the status of the application then you need to access the Spark Job History server from where you can get the current status of the job.
You can solve your problem as follows:
Get Application ID of your job through sc.applicationId.
Then use this application Id with Spark History Server REST APIs to get the status of the submitted job.
You can find the Spark History Server Rest APIs at link.

Sudden surge in number of YARN apps on HDInsight cluster

For some reason sometimes the cluster seems to misbehave for I suddenly see surge in number of YARN jobs.We are using HDInsight Linux based Hadoop cluster. We run Azure Data Factory jobs to basically execute some hive script pointing to this cluster. Generally average number of YARN apps at any given time are like 50 running and 40-50 pending. None uses this cluster for ad-hoc query execution. But once in few days we notice something weird. Suddenly number of Yarn apps start increasing, both running as well as pending, but especially pending apps. So this number goes more than 100 for running Yarn apps and as for pending it is more than 400 or sometimes even 500+. We have a script that kills all Yarn apps one by one but it takes long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster. It may be possible that for some time cluster's response time is delayed (Hive component especially) but in that case even if ADF keeps retrying several times if a slice is failing, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run when it can? That's probably the only explanation why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job which then submits a child job which is the actual Hive script you are trying to run. The yarn queue can get deadlocked if there are too many parents jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, thus no work is actually being done. Note that if you kill the Templeton job this will result in Data Factory marking the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to be able to allow some of the pending child jobs to run and slowly draining the queue.
As a correct long term fix, you need to configure WebHCat so that parent templeton job is submitted to a separate Yarn queue. You can do this by (1) creating a separate yarn queue and (2) set templeton.hadoop.queue.name to the newly created queue.
To create queue you can do this via the Ambari > Yarn Queue Manager.
To update WebHCat config via Ambari go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

Spark on Yarn: How to prevent multiple spark jobs being scheduled

With spark on yarn - I dont see a way to prevent concurrent jobs being scheduled. I have my architecture setup for doing purely batch processing.
I need this for the following reasons:
Resource Constraints
UserCache for spark grows really quickly. Having multiple jobs run causes an explosion of space on cache.
Ideally I'd love to see if there is a config that would ensure only one job to run at any time on Yarn.
You can run create a queue which can host only one application master and run all Spark jobs on that queue. Thus, if a Spark job is running the other will be accepted but they won't be scheduled and running until the running execution has finished...
Finally found the solution - was in yarn documents: yarn.scheduler.capacity.max-applications has to be set to 1 instead of 10000.

Is there a way to understand which executors were used for a job in java/scala code?

I'm having trouble evenly distributing streaming receivers among all executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors, I create 8 streaming custom receivers and spark is supposed to launch these receivers one per executor. However this doesn't happen all the time and sometimes all receivers are launched on the same executor (here's the jira bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job and if I got all the executors, create the streaming receivers.
For doing that anyway I need to understand if there is a way to understand which executors were used for a job in java/scala code.
I believe it is possible to look what executors where doing what jobs by accessing Spark UI and Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
In the following screen you can see what executors are active. In case there are cores/nodes that are not being used, you can detect them by just looking what cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks that are being running on it.

Resources