How to know if a machine in a Spark cluster 'participate's a job - apache-spark

I wanted to know when it is safe to remove a node from a machine from a cluster.
My assumption is that it could be safe to remove a machine if the machine does not have any containers, and it does not store any useful data.
By the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get the information of each node like
<node>
<rack>/default-rack</rack>
<state>RUNNING</state>
<id>host1.domain.com:54158</id>
<nodeHostName>host1.domain.com</nodeHostName>
<nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
<lastHealthUpdate>1476995346399</lastHealthUpdate>
<version>3.0.0-SNAPSHOT</version>
<healthReport></healthReport>
<numContainers>0</numContainers>
<usedMemoryMB>0</usedMemoryMB>
<availMemoryMB>8192</availMemoryMB>
<usedVirtualCores>0</usedVirtualCores>
<availableVirtualCores>8</availableVirtualCores>
<resourceUtilization>
<nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
<nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
<nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
<aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
<aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
<containersCPUUsage>0.0</containersCPUUsage>
</resourceUtilization>
</node>
If numContainers is 0, I assume it does not run containers. However can it still store any data on disk that other downstream tasks can read?
I did not get if Spark lets us know this. I assume if a machine still stores some data useful for the running job, the machine may maintain a heart beat with Spark Driver or some central controller? Can we check this by scanning tcp or udp connections?
Is there any other way to check if a machine in a Spark cluster participates a job?

I am not sure whether you just want to know if a node is running any task (is that's what you mean by 'participate') or you want to know if it is safe to remove a node from the Spark cluster
I will try to explain the latter point.
Spark has the ability to recover from the failure, which also applies to any node being removed from the cluster.
The node removed can be an executor or an application master.
If an application master is removed, the entire job fails. But is you are using yarn as a resource manager, the job is retried and yarn gives a new application master. The number if retries is configured in :
yarn.resourcemanager.am.max-attempts
By default, this value is 2
If a node on which a task is running is removed, the resource manager (which is handled by yarn) will stop getting heartbeats from that node. Application master will know it is supposed to reschedule the failed job as it will no longer receive progress status from the previous node. It will then request resource manager for resources and then reschedule the job.
As far as data on these nodes is concerned, you need to understand how the tasks and their output are handled. Every node has its own local storage to store the output of the tasks running on them. After the tasks are run successfully, the OutputCommitter will move the output from local storage to the shared storage (HDFS) of the job from where the data is picked for the next step of the job.
When a task fails (may be because the node that runs this job failed or was removed), the task is rerun on another available node.
In fact, the application master will also rerun the successfully run tasks on this node as their output stored on the node's local storage will not longer be available.

Related

What is the difference between Driver and Application manager in spark

I couldn't figure out what is the difference between Spark driver and application master. Basically the responsibilities in running an application, who does what?
In client mode, client machine has the driver and app master runs in one of the cluster nodes. In cluster mode, client doesn't have any, driver and app master runs in same node (one of the cluster nodes).
What exactly are the operations that driver do and app master do?
References:
Spark Driver memory and Application Master memory
Spark yarn cluster vs client - how to choose which one to use?
As per the spark documentation
Spark Driver :
The Driver(aka driver program) is responsible for converting a user
application to smaller execution units called tasks and then schedules
them to run with a cluster manager on executors. The driver is also
responsible for executing the Spark application and returning the
status/results to the user.
Spark Driver contains various components – DAGScheduler,
TaskScheduler, BackendScheduler and BlockManager. They are responsible
for the translation of user code into actual Spark jobs executed on
the cluster.
Where in Application Master is
The Application Master is responsible for the execution of a single
application. It asks for containers from the Resource Scheduler
(Resource Manager) and executes specific programs on the obtained containers.
Application Master is just a broker that negotiates resources with the Resource Manager and then after getting some container it make sure to launch tasks(which are picked from scheduler queue) on containers.
In a nutshell Driver program will translate your custom logic into stages, job and task.. and your application master will make sure to get enough resources from RM And also make sure to check the status of your tasks running in a container.
as it is already said in your provided references the only different between client and cluster mode is
In client, mode driver will run on the machine where we have executed/run spark application/job and AM runs in one of the cluster nodes.
(AND)
In cluster mode driver run inside application master, it means the application has much more responsibility.
References :
https://luminousmen.com/post/spark-anatomy-of-spark-application#:~:text=The%20Driver(aka%20driver%20program,status%2Fresults%20to%20the%20user.
https://www.edureka.co/community/1043/difference-between-application-master-application-manager#:~:text=The%20Application%20Master%20is%20responsible,class)%20on%20the%20obtained%20containers.

How is abnormal Driver termination handled for a Spark App in Yarn cluster mode

We're using AWS EMR for our spark jobs. All our jobs are submitted in yarn cluster mode, so the driver will run in one of the cluster nodes. We use on-demand node for master, and spot-instances for the core nodes. Now, although we almost always choose instances with < 5% interruption rate, sometimes it so happens that a significant fraction of our cluster nodes get terminated prematurely (probably because of higher demands).
So, I was wondering, in the above situation, what happens if a node containing the driver process goes down? Is there any chance of recovery for the spark job in that case? Or is the job gone forever?
The Spark driver is a single point of failure because it holds all cluster state for the running App.
In practice non-ephemeral storage can be used for check-pointing batch Apps after expensive expensive transformations. That said, trying to re-start after such a situation can be done, but when I looked into it, it is quite difficult to say the least. I asked such a question under my name some time ago, you can find it. I am quite technical but felt: gosh what a lot of hard work.
So, the recovery means rolling your own stuff, or accepting a re-run. Since I last evaluated EMR I see that the driver can run on the Master and that can be failed-over, but that is not the same thing as far as I can see, nor what you wish.
EMR has node leveling for CORE nodes in Yarn. Your spark driver/ Application master only gets created in CORE nodes. And HDFS also resides in CORE nodes only.
So to handle your situation in a best way, you may consider to use both CORE and TASK group.
What you can do to tackle this -
MASTER: On-demand
CORE: On-demand. Minimum no of Instances can be 1.
TASK: Spot with autoscaling with minimal EBS volume. Minimum no of Instances can be 0 this case.
This will reduce your cost also ensure that node containing the driver process never goes down.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html

YARN and MapReduce Framework

I am aware of the basics of YARN framework, however I still feel lack of some understanding, in regards to MapReduce.
With YARN, I have read that MapReduce is just one of the applications which can run on top of YARN; for example, with YARN, on same cluster various different jobs can run, MapReduce Jobs, Spark Jobs etc.
Now, the point is, each type of job has its "own" kind of "Job phases", for example, when we talk about MapReduce, it has various phases like, Mapper, Sorting, Shuffle, Reducer etc.
Specific to this scenario, who "decides", "controls" these phases? Is it MapReduce Framework?
As I understand, YARN is an infrastructure on which different jobs run; so when we submit a MapReduce Job, does it first go to MapReduce framework and then the code is executed by YARN? I have this doubt, because YARN is general purpose execution engine, so it won't be having knowledge of mapper, reducer etc., which is specific to MapReduce (and so different kind of Jobs), so does MapReduce Framework run on top of YARN, with YARN help executing the Jobs, and MapReduce Framework is aware of the phases it has to go through for a particular kind of Job?
Any clarification to understand this would be of great help.
If you take a look at this picture from Hadoop documentation:
You'll see that there's no particular "job orchestration" component, but a resource requesting component, called application master. As you mentioned, YARN does resource management and with regards to application orchestration, it stops at an abstract level.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
When applied to Spark, some of the components in that picture would be:
Client: the spark-submit process
App Master: Spark's application master that runs driver and application master (cluster mode) or just application master (client mode)
Container: spark workers
Spark's YARN infrastructure provides the application master (in YARN terms), which knows about Spark's architecture. So when the driver runs, either in cluster mode or in client mode, it still decides on jobs/stages/tasks. This must be application/framework-specific (Spark being the "framework" when it comes to YARN).
From Spark documentation on YARN deployment:
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN
You can extend this abstraction to map-reduce, given your understanding of that framework.
So when we submit a MapReduce job it will first go to the Resource Manager which is the master daemon of YARN. The Resource Manager then selects a Node Manager(which are slave processes of YARN) to start a container on which it will ask the Node Manager to start a very lightweight process known as Application Master. Then the Resource Manager will ask the Application Master to start execution of the job.
The Application Master will first go through the driver part of the job from where it would get to know of the resources that would be used for the job, and accordingly it will request the Resource Manager for those resources. Now the Resource Manager can assign the resources to the Application Master immediately or if the cluster is to occupied then that request would be rescheduled based on various scheduling algorithms.
After getting the resources the Application Master would go to the Name Node to get the metadata of all the blocks that would be required to be processed for this job.
After receiving the Metadata the Application Master would ask the Node Managers of the nodes where the blocks are stored(if those nodes are too busy then a node in the same rack, otherwise any random node depending on rack awareness) and ask the Node Managers to launch containers for processing their respective blocks.
The blocks would get processed independently and in parallel in their respective nodes. After the entire processing is done the result would be stored in HDFS.

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I'm assuming there are two choices for a reason. If so, how do you choose which one to use?
Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.
There are a few similar questions on stackoverflow, however those questions focus on the difference between the two approaches, but don't focus on when one approach is more suitable than the other.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications." -- Submitting Applications
What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally with spark-submit, or, also in the cluster.
When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to master node) with 100 times the network speed, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode does not probably transfer the jar to the master. At that point the difference between the 2 is minimal. Probably client mode is better when the driver program is idle most of the time, to make full use of cores on the local machine and perhaps avoid transferring the jar to the master (even on loopback interface a big jar takes quite a bit of seconds). And with client mode you can transfer (rsync) the jar on any cluster node.
On the other hand, if the driver is very intensive, in cpu or I/O, cluster mode may be more appropriate, to better balance the cluster (in client mode, the local machine would run both the driver and as many workers as possible, making it over loaded and making it that local tasks will be slower, making it such that the whole job may end up waiting for a couple of tasks from the local machine).
Conclusion :
To sum up, if I am in the same local network with the cluster, I would
use the client mode and submit it from my laptop. If the cluster is
far away, I would either submit locally with cluster mode, or rsync
the jar to the remote cluster and submit it there, in client or
cluster mode, depending on how heavy the driver program is on
resources.*
AFAIK With the driver program running in the cluster, it is less vulnerable to remote disconnects crashing the driver and the entire spark job.This is especially useful for long running jobs such as stream processing type workloads.
Spark Jobs Running on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
Understanding the difference requires an understanding of YARN’s Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The application is responsible for requesting resources from the ResourceManager, and, when allocated them, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client — the process starting the application can go away and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.
yarn-cluster mode
The yarn-cluster mode is not well suited to using Spark interactively, but the yarn-client mode is. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:
yarn-client mode
This table offers a concise list of differences between these modes:
Reference: https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ - Apache Spark Resource Management and YARN App Models (web.archive.com mirror)
In yarn-cluster mode, the driver program will run on the node where application master is running where as in yarn-client mode the driver program will run on the node on which job is submitted on centralized gateway node.
Both answers by Ram and Chavati/Wai Lee are useful and helpful, but here I just want to added a couple of other points:
Much has been written about the proximity of driver to the executors, and that is a big factor. The other factors are:
Will the driver process be around until execution of job is finished?
How's returned data being handled?
#1. In spark client mode, the driver process must be up and running the whole time when the spark job is in execution. So if you have a truly long job that say take many hours to run, you need to make sure the driver process is still up and running, and that the driver session is not auto-logout.
On the other hand, after submitting a job to run in cluster mode, the process can go away. The cluster mode will keep running. So this is typically how a production job will run: the job can be triggered by a timer, or by an external event and then the job will run to its completion without worrying about the lifetime of the process submitting the spark job.
#2. In client mode, you can call sc.collect() to gather all the data back from all executors, and then write/save the returned data to a local Linux file on local disk. Now this may not work for cluster mode, as the 'driver' typically run in a different remote host. The data written up therefore need to be persisted in a common mounted volume (such as GPFS, NFS) or in distributed file system like HDFS. If the job is running under Hadoop/YARN, the more common way for cluster mode is simply ask each executor to persist the data to HDFS, and not to run collect( ) at all. Collect() actually have scalability issue when there are a large number of executors returning large amount of data - it can overwhelm the driver process.

Sudden surge in number of YARN apps on HDInsight cluster

For some reason sometimes the cluster seems to misbehave for I suddenly see surge in number of YARN jobs.We are using HDInsight Linux based Hadoop cluster. We run Azure Data Factory jobs to basically execute some hive script pointing to this cluster. Generally average number of YARN apps at any given time are like 50 running and 40-50 pending. None uses this cluster for ad-hoc query execution. But once in few days we notice something weird. Suddenly number of Yarn apps start increasing, both running as well as pending, but especially pending apps. So this number goes more than 100 for running Yarn apps and as for pending it is more than 400 or sometimes even 500+. We have a script that kills all Yarn apps one by one but it takes long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster. It may be possible that for some time cluster's response time is delayed (Hive component especially) but in that case even if ADF keeps retrying several times if a slice is failing, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run when it can? That's probably the only explanation why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job which then submits a child job which is the actual Hive script you are trying to run. The yarn queue can get deadlocked if there are too many parents jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, thus no work is actually being done. Note that if you kill the Templeton job this will result in Data Factory marking the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to be able to allow some of the pending child jobs to run and slowly draining the queue.
As a correct long term fix, you need to configure WebHCat so that parent templeton job is submitted to a separate Yarn queue. You can do this by (1) creating a separate yarn queue and (2) set templeton.hadoop.queue.name to the newly created queue.
To create queue you can do this via the Ambari > Yarn Queue Manager.
To update WebHCat config via Ambari go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

Resources