Configure for hadoop yarn -stop to gracefully shutdown app - apache-spark

I need to use yarn application -stop to gracefully stop my Spark application. According to this thread we need to set
yarn.application.admin.client.class.SPARK
first. But (1) how and where do I set it?
The thread also says some framework implementation needs to be available, and again (2) how and where do I get that? There's scarce information about this in any guide.
Would you please give some help?

Related

How do you kill a Spark job from the CLI?

Killing Spark job using command Prompt
This is the thread that I hoped would answer my question. But all four answers explain how to kill the entire application.
How can I stop a single job? A count, for example?
I can do it from the Spark Web UI by clicking "kill" on the respective job. I suppose it must be possible to list running jobs and interact with them also directly via CLI.
Practically speaking, I am working in a Notebook with PySpark on a Glue endpoint. If I kill the application, the entire endpoint dies and I have to spin up a new cluster. I just want to stop a job. Cancelling it within the Notebook only detaches the Notebook from it; the job keeps running, blocking any further commands from being executed.
Spark History Server provides a REST API interface. Unfortunately, it only exposes monitoring capabilities for applications, jobs, stages, etc.
There is also a REST Submission interface that provides capabilities to submit, kill and check the status of applications. It is undocumented AFAIK, and is only supported on Spark standalone and Mesos clusters, not YARN. (That's why there is no "kill" link in the Jobs UI screen for Spark on YARN, I guess.)
So you can try using that "hidden" API, but if you know your application's Spark UI URL and job id of a job you want to kill, the easier way is something like:
$ curl -G http://<Spark-Application-UI-host:port>/jobs/job/kill/?id=<job_id>
Since I don't work with Glue, I'd be interested to find out myself how it's going to react, because the kill normally results in org.apache.spark.SparkException: Job <job_id> cancelled.
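For scripting, the same kill link can be built programmatically. This is a minimal pure-Python sketch; the host, port, and job id are placeholders you would fill in from your own Spark UI:

```python
from urllib.parse import urlencode

def job_kill_url(ui_base, job_id):
    """Build the Spark UI 'kill' link for a given job id.

    The UI serves the kill action at /jobs/job/kill/?id=<job_id>,
    the same endpoint the curl command above hits.
    """
    return "%s/jobs/job/kill/?%s" % (ui_base.rstrip("/"), urlencode({"id": job_id}))

print(job_kill_url("http://driver-host:4040", 7))
# -> http://driver-host:4040/jobs/job/kill/?id=7
```

If you control the driver code itself, note that SparkContext also exposes setJobGroup and cancelJobGroup, which let the application tag a batch of jobs and cancel just that group from within the notebook, without touching the UI.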
Building on the answer by mazaneicha: it appears that, for Spark 2.4.6 in standalone mode, for jobs submitted in client mode, the curl request to kill an app with a known applicationID is
curl -d "id=<your_appID>&terminate=true" -X POST <your_spark_master_url>/app/kill/
We had a similar problem with people not disconnecting their notebooks from the cluster and hence hogging resources.
We get the list of running applications by parsing the web UI. I'm pretty sure there are less painful ways to manage a Spark cluster...
List the job in Linux and kill it.
I would do
ps -ef | grep spark-submit
if it was started using spark-submit. Get the PID from the output and then
kill -9 <PID>
Kill a running job by:
Open the Spark application UI.
Go to the Jobs tab.
Find the job among the running jobs.
Click the kill link and confirm.

How does YARN decide which type of ApplicationMaster to launch?

I referred to this link and got a fair understanding of how YARN works. YARN is capable of running multi-tenant applications, for example MR, Spark, etc.
The key point is the Application specific ApplicationMaster (AM).
When a client submits a Job to the Resource Manager, how does the Resource Manager know what kind of application it is (MR, Spark) and consequently launch the appropriate ApplicationMaster?
Can anyone explain how the RM comes to know what kind of Job is being submitted to it?
EDIT:
This question is: how does the RM know what kind of Job has been submitted, not about any relationship between YARN and MR or Spark.
The RM receives a Job, so it has to launch a first Container which runs the application-specific ApplicationMaster; hence, how does the RM know what kind of Job has been submitted to it?
This is the question I am asking, and it is not the same as what it has been marked a duplicate of.
YARN does not need or want to know about the type of application running on it. It provides resources, and it is the concern of the application running on it to understand how to obtain resources from YARN in order to run what it needs to run (YARN's architecture does not suggest that YARN wants to know what or how tasks run on it).
There's more information here on how to write components that integrate with YARN.
As I understand from the two-step YARN application writing guide, one needs to write a YARN client as well as a YARN ApplicationMaster.
The application client determines what to run as the application master:
// Construct the command to be executed on the launched container
String command =
    "${JAVA_HOME}" + "/bin/java" +
    " MyAppMaster" +
    " arg1 arg2 arg3" +
    ...
Where MyAppMaster is the application-specific master class.
The second thing is the task that runs in the container; note the kind of command that the application master provides to launch the container (which runs the actual task executors):
// Set the necessary command to execute on the allocated container
String command = "/bin/sh ./MyExecShell.sh";
As you can see, this is application-provided code that knows about the task (or the type of application, to use the question's words). Further down on the same page, you can see how applications can be submitted to YARN.
Now to put that in the perspective of Spark: Spark has its own ApplicationMaster class (check here, or the entire package). These are hidden from the Spark application developer because the framework provides built-in integration with YARN, which happens to be just one of Spark's supported resource managers.
If you were to write your own YARN client that executes, say, Python code, then you'd have to follow the steps in the YARN application client/master documentation in order to supply YARN with the commands, configuration, resources, and environment to be used to execute your application's specific logic or tasks.
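The key point in the snippets above is that the client hands the RM an opaque launch command; the RM never inspects the application type. That can be illustrated with a toy helper (this is plain string assembly, not the YARN API, and MyAppMaster is the hypothetical class from the snippet above):

```python
def build_am_command(main_class, args):
    """Mirror the Java snippet above: the client assembles a shell command
    (JVM + application-specific AM class + its arguments), and that string
    is all the ResourceManager ever sees."""
    return "${JAVA_HOME}/bin/java " + main_class + " " + " ".join(args)

print(build_am_command("MyAppMaster", ["arg1", "arg2", "arg3"]))
# -> ${JAVA_HOME}/bin/java MyAppMaster arg1 arg2 arg3
```

Swap MyAppMaster for Spark's ApplicationMaster class and you get, conceptually, what spark-submit does for you when targeting YARN.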

Shutdown spark structured streaming gracefully

There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala).
Is the shutdown process different in structured streaming? Or is it simply not implemented yet?
This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming are claimed to recover state and offsets without any issues.
Try this github example code.
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/jobs/RealTimeFraudDetection/StructuredStreamingFraudDetection.scala#L76
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/GracefulShutdown.scala#L26
This feature is not implemented yet, and killing the job from the Resource Manager while a batch is running will also give you duplicates.
Correction: the duplicates will only be in the output directory; write-ahead logs handle everything beautifully, so you don't need to worry about anything. Feel free to kill it any time.
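A common workaround (the pattern used in the GracefulShutdown example linked above) is to poll an external flag, such as a marker file, and call stop() on the streaming query so it halts cleanly between micro-batches. Here is a minimal sketch of that loop; since a live Spark session isn't assumed, a stand-in query object is used in place of the object returned by writeStream.start(), and the marker path is hypothetical:

```python
import os
import tempfile
import time

class FakeQuery:
    """Stand-in for a StreamingQuery; real code would use the handle
    returned by df.writeStream...start()."""
    def __init__(self):
        self.active = True
    def is_active(self):
        return self.active
    def stop(self):
        self.active = False

def run_until_marker(query, marker_path, poll_secs=0.01):
    # Poll for a shutdown marker; stopping the query between micro-batches
    # lets checkpointing record consistent offsets before exit.
    while query.is_active():
        if os.path.exists(marker_path):
            query.stop()
            break
        time.sleep(poll_secs)

marker = os.path.join(tempfile.gettempdir(), "stop_streaming_job")  # hypothetical path
open(marker, "w").close()   # simulate an operator requesting shutdown
q = FakeQuery()
run_until_marker(q, marker)
os.remove(marker)
print("query active:", q.is_active())
# -> query active: False
```

In a real job you would run this monitor loop on the driver (e.g. in place of awaitTermination) and create the marker file, or flip an equivalent flag in ZooKeeper/HDFS, when you want the stream to shut down.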

Is there a way to monitor RAM and CPU usage of Apache Spark applications?

I need to monitor the RAM and CPU of Spark applications that run on a standalone Spark cluster.
I have tried to use the Java console and it works very well, but I need to monitor various applications, and I have to set a different console port for each one.
Behind a firewall this becomes a very long and tedious job.
Is there a way to monitor applications from Spark UI for example or something else?
If you use Ubuntu, you may use htop.
To install, do
sudo apt-get install htop
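Beyond htop, the Spark UI itself exposes a REST monitoring API under /api/v1; its executors endpoint reports per-executor fields such as memoryUsed and totalCores, which covers the RAM side per application. A small sketch that builds the request URL (the host, port, and app id below are placeholders):

```python
def executors_endpoint(ui_base, app_id):
    """Build the URL for Spark's monitoring REST API executors listing.

    The API lives under /api/v1 on the UI port; this endpoint returns
    one JSON record per executor, including memory and core usage.
    """
    return "%s/api/v1/applications/%s/executors" % (ui_base.rstrip("/"), app_id)

print(executors_endpoint("http://master-host:4040", "app-20240101120000-0001"))
# -> http://master-host:4040/api/v1/applications/app-20240101120000-0001/executors
```

You can hit that URL with curl or any HTTP client and scrape the metrics for every running application without opening a separate JMX port per app.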

Make YARN clean up appcache before retry

The situation is the following:
A YARN application is started. It gets scheduled.
It writes a lot to its appcache directory.
The application fails.
YARN restarts it. It goes pending, because there is not enough disk space anywhere to schedule it. The disks are filled up by the appcache from the failed run.
If I manually intervene and kill the application, the disk space is cleaned up. Now I can manually restart the application and it's fine.
I wish I could tell the automated retry to clean up the disk. Alternatively I suppose it could count that used disk as part of the new allocation, since it belongs to the application anyway.
I'll happily take any solution you can offer. I don't know much about YARN. It's an Apache Spark application started with spark-submit in yarn-client mode. The files that fill up the disk are the shuffle spill files.
So here's what happens:
When you submit a YARN application, it creates a private local resource folder (the appcache directory).
Inside this directory the Spark block manager creates directories for storing block data. As mentioned:
local directories and won't be deleted on JVM exit when using the external shuffle service.
This directory can be cleaned up via:
A shutdown hook. This is what happens when you kill the application.
The YARN DeletionService. This should run automatically on application finish. Make sure yarn.nodemanager.delete.debug-delay-sec=0; otherwise you may be hitting an unresolved YARN bug.
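For reference, that property is set in yarn-site.xml on each NodeManager; 0 (the default) means the DeletionService removes local container directories as soon as the application finishes, rather than holding them for debugging:

```xml
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
```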
