Get driver id in Spark Standalone cluster in cluster mode - apache-spark

How can I get the driver ID of an already submitted/running application programmatically, or is there an endpoint I am missing?
I have gone through Spark's hidden REST API, which exposes only three endpoints: /create, /status, and /kill. Now, to kill the application, I need the driver ID/submission ID of the application.
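If the application is submitted through that same hidden REST API, note that the /create response already contains the driver ID as "submissionId" (see the standalone REST question further down), so it can be captured at submit time and reused for /status and /kill. A minimal sketch in Python, assuming the usual REST port 6066 and a hypothetical master host name:
import requests
MASTER = "http://spark-master:6066"  # hypothetical host
# response from an earlier POST to MASTER + "/v1/submissions/create"
create_response = {"submissionId": "driver-20171222112336-0208"}  # example value
driver_id = create_response["submissionId"]
# the same ID works for the /status and /kill endpoints
status = requests.get(MASTER + "/v1/submissions/status/" + driver_id).json()
print(status.get("driverState"))
# requests.post(MASTER + "/v1/submissions/kill/" + driver_id)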
EDIT :

Related

spark-submit in cluster deploy mode get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but after several days I am still unclear about the right approach.
My use case: I have a lot of Spark jobs submitted using the 'spark2-submit' command, and I need the application ID printed to the console once they are submitted. The jobs are submitted in cluster deploy mode. (In normal client mode, it does get printed.)
Constraints for the solution: I am not supposed to change the application code (that would take a long time, because there are many applications running); I can only provide log4j properties or some small custom additions.
My approaches:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
An excerpt from my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener, and I am able to get the Spark application ID after the application finishes, but not printed to the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to console.
        }
    }
}
3) I have set spark.eventLog.enabled to true and pointed spark.eventLog.dir at a directory in HDFS for the event logs in the spark-submit command.
If anyone could help me find an approach to a solution, it would be really helpful. And if I am doing something fundamentally wrong, any insight would help.
Thanks.
After being stuck in the same place for some days, I was finally able to find a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread on the machine the user submits from. I was passing the log4j configurations to the driver and executors, but missed that the log4j configuration for this "Client" was not being set.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
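For reference, a minimal sketch of what such a client-side log4j.properties might contain, assuming the stock log4j 1.x ConsoleAppender (the yarn Client logger is the one that prints the submitted application ID):
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO, console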
To clarify:
client mode means the Spark driver is running on the same machine you ran spark-submit from
cluster mode means the Spark driver is running out on the cluster somewhere
You mentioned that the ID gets logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode; you just can't see it because the driver is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
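As a rough, hypothetical illustration of the second idea, a PySpark-style sketch that publishes the application ID to a shared HDFS location as soon as the driver starts (the path is an assumption; the same can be done from Scala/Java or via a custom log4j appender):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()
sc = spark.sparkContext
app_id = sc.applicationId  # e.g. "application_..." on YARN
# one small text file under a well-known HDFS prefix, keyed by app name
sc.parallelize([app_id], 1).saveAsTextFile("hdfs:///tmp/app-ids/" + sc.appName)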

Apache Spark - How to get app-id from submissionId over Rest API?

I want to get the progress of an application running on an Apache Spark standalone cluster. Is there any way to get the app-id for this purpose over the REST API? E.g.:
http://spark-master:6066/v1/submissions/create
The response contains a submission ID, e.g.:
"submissionId" : "driver-20171222112336-0208"
When the status of the driver is requested using that submission ID, e.g.:
http://spark-master:6066/v1/submissions/status/driver-20171222112336-0208
The response (after the driver state has changed to RUNNING) contains the worker host and port and a worker ID, e.g.:
"workerHostPort": "192.168.1.20:43817",
"workerId": "worker-20170707120806-192.168.1.20-43817"
After that, the applications running on that worker can be listed, e.g.:
http://192.168.1.20:4040/api/v1/applications
This seems to give a list of applications (a JSON array with one element), e.g.:
[
  {
    "id": "app-20171222112343-0125",
    "name": "test",
    "attempts": [
      ..
At this step, a list of applications is returned rather than a single app-id, and nothing in it relates back to the driver.
http://spark-master:8080/json
also gives a list of active applications, again with no information relating them to drivers, e.g.:
{
  ..
  "activeapps": [..],
  ..
So, for a standalone Apache Spark cluster, how can I get the application ID [app-id] when a job is submitted to Spark via the hidden REST API?
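The APIs above don't expose a direct mapping, but as an illustration, the call chain just described can be scripted; matching the entry by application "name" is only a heuristic (ports 6066 and 4040 and the example values are taken from the URLs above):
import requests

master = "http://spark-master:6066"
submission_id = "driver-20171222112336-0208"  # from the /create response
app_name = "test"                             # the app name that was submitted

status = requests.get(master + "/v1/submissions/status/" + submission_id).json()
host = status["workerHostPort"].split(":")[0]

apps = requests.get("http://" + host + ":4040/api/v1/applications").json()
print([a["id"] for a in apps if a.get("name") == app_name])  # e.g. ["app-20171222112343-0125"]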

pass custom exitcode from yarn-cluster mode spark to CLI

I started a yarn cluster mode spark job through spark-submit.
To indicate partial failure etc. I want to pass an exit code from the driver to the script calling spark-submit.
I tried both System.exit and throwing SparkUserAppException in the driver, but in both cases the CLI only got 1, not the exit code I passed.
I think it is impossible to pass a custom exit code, since any exit code returned by the driver is converted to a YARN status, and YARN converts any failed exit code to 1 (or FAILED).
By looking at the Spark code, I can conclude this:
It is possible in client mode. Look at the runMain() method of the SparkSubmit class.
Whereas in cluster mode, it is not possible to get the exit status of the driver, because your driver class runs out on the cluster (inside the YARN ApplicationMaster) rather than in the process that called spark-submit.
There is an alternative solution that may or may not be suitable for your use case:
Host a REST API with an endpoint that receives status updates from your driver code. In case of any exception, have your driver code use this endpoint to report the status.
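A minimal, hypothetical sketch of the driver-side half of that idea (the endpoint URL, payload shape, and run_job() are placeholders, not an existing API):
import requests

STATUS_URL = "http://my-status-service/api/job-status"  # hypothetical endpoint

def report(job_id, status, detail=""):
    requests.post(STATUS_URL, json={"jobId": job_id, "status": status, "detail": detail})

try:
    run_job()              # your existing driver logic
    report("job-123", "SUCCESS")
except Exception as e:     # partial failure, bad input, etc.
    report("job-123", "FAILED", str(e))
    raise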
You can save the exit code to an output file (on HDFS or the local FS) and make your script wait for this file to appear, read it, and proceed. This is definitely not an elegant way, but it may help you move forward.
When saving the file, pay attention to the permissions on that location: your Spark process needs read/write access.
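A minimal sketch of the calling script's side of that approach, assuming the driver writes a one-line exit-code file to a path both sides can see (path and timeout are placeholders):
import os, sys, time

EXIT_FILE = "/shared/job-output/exit_code"  # written by the driver on completion
deadline = time.time() + 3600               # give up after an hour

while not os.path.exists(EXIT_FILE):
    if time.time() > deadline:
        sys.exit(1)                         # treat a missing file as failure
    time.sleep(10)

with open(EXIT_FILE) as f:
    sys.exit(int(f.read().strip()))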

How to leverage a spark cluster from a web app?

A lot of people have asked this question, but there is no clear answer, only links and references, and most of them are not recent. The question is this:
I have a web app that needs to leverage a Spark cluster to run a Spark SQL query. My understanding is that the spark-submit script is asynchronous, hence this won't work here. How do I leverage Spark in such a setup? Can I just write code in the web app the way I do in a self-contained Spark application, i.e. create a context, set the master URL, and do what I need to do? Will this work in a web app? If yes, when would I need a job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think the options are:
Through a REST API like Livy (Livy is an open source Spark REST server for submitting and interacting with your Spark jobs from anywhere) or a Spark job server (REST APIs). See how they connect to Spark interactively from a kernel: https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s and https://developer.ibm.com/open/apache-toree/ . A minimal Livy sketch follows this list.
Through JDBC (running via the Thrift JDBC/ODBC server).
Through SSH: submit the job and wait for the YARN status (SSH to the cluster and do a spark-submit through YARN; YARN gives you an application ID and you can keep track of the application status with the yarn application -status command).
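A rough sketch of the first (Livy) option as it might look from a web backend, assuming Livy's batch endpoint on its default port 8998 (host, jar path, and class name are placeholders):
import requests, time

livy = "http://livy-server:8998"
batch = requests.post(livy + "/batches",
                      json={"file": "hdfs:///apps/my-app.jar",
                            "className": "my.spark.app.Main"}).json()

# poll until Livy reports a terminal state
while True:
    state = requests.get(livy + "/batches/%d/state" % batch["id"]).json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)
print(state)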

Stopping a Running Spark Application

I'm running a Spark cluster in standalone mode.
I've submitted a Spark application in cluster mode using options:
--deploy-mode cluster --supervise
So that the job is fault tolerant.
Now I need to keep the cluster running but stop the application from running.
Things I have tried:
Stopping the cluster and restarting it, but the application resumes execution when I do that.
Using kill -9 on a daemon named DriverWrapper, but the job resumes again after that.
I've also removed temporary files and directories and restarted the cluster, but the job resumes again.
So the running application is really fault tolerant.
Question:
Based on the above scenario, can someone suggest how I can stop the job from running, or what else I can try to stop the application while keeping the cluster running?
Something just occurred to me: if I call sparkContext.stop() that should do it, but that requires a bit of work in the code, which is OK; still, can you suggest any other way that does not require a code change?
If you wish to kill an application that is failing repeatedly, you may do so through:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
You can find the driver ID through the standalone Master web UI at http://<master url>:8080.
From the Spark docs.
Revisiting this because I wasn't able to use the existing answer without debugging a few things.
My goal was to programmatically kill a driver that runs persistently once a day, deploy any updates to the code, then restart it. So I won't know ahead of time what my driver ID is. It took me some time to figure out that you can only kill drivers if you submitted them with the --deploy-mode cluster option. It also took me some time to realize that there is a difference between the application ID and the driver ID, and while you can easily correlate an application name with an application ID, I have yet to find a way to divine the driver ID through the API endpoints and correlate it to either an application name or the class you are running. So while spark-class org.apache.spark.deploy.Client kill <master url> <driver ID> works, you need to make sure you are deploying your driver in cluster mode and are using the driver ID, not the application ID.
Additionally, there is a submission endpoint that spark provides by default at http://<spark master>:6066/v1/submissions and you can use http://<spark master>:6066/v1/submissions/kill/<driver ID> to kill your driver.
Since I wasn't able to find the driver ID that correlated to a specific job from any api endpoint, I wrote a python web scraper to get the info from the basic spark master web page at port 8080 then kill it using the endpoint at port 6066. I'd prefer to get this data in a supported way, but this is the best solution I could find.
#!/usr/bin/python
import sys, re, requests, json
from selenium import webdriver

# class names of the jobs to kill are passed on the command line
classes_to_kill = sys.argv
spark_master = 'masterurl'

# scrape the "Running Drivers" table from the standalone master web UI
driver = webdriver.PhantomJS()
driver.get("http://" + spark_master + ":8080/")

for running_driver in driver.find_elements_by_xpath("//*/div/h4[contains(text(), 'Running Drivers')]"):
    for driver_id in running_driver.find_elements_by_xpath("..//table/tbody/tr/td[contains(text(), 'driver-')]"):
        for class_to_kill in classes_to_kill:
            right_class = driver_id.find_elements_by_xpath("../td[text()='" + class_to_kill + "']")
            if len(right_class) > 0:
                driver_to_kill = re.search(r'^driver-\S+', driver_id.text).group(0)
                print("Killing " + driver_to_kill)
                # kill through the submission endpoint on port 6066
                result = requests.post("http://" + spark_master + ":6066/v1/submissions/kill/" + driver_to_kill)
                print(json.dumps(json.loads(result.text), indent=4))

driver.quit()
https://community.cloudera.com/t5/Support-Questions/What-is-the-correct-way-to-start-stop-spark-streaming-jobs/td-p/30183
According to this link, if your master is YARN, use the following to stop the job:
yarn application -list
yarn application -kill application_id
