I am wondering if it is possible to submit, monitor & kill spark applications from another service.
My requirements are as follows:
I wrote a service that:
parses user commands,
translates them into understandable arguments for an already prepared Spark-SQL application,
submits the application along with its arguments to the Spark cluster using spark-submit from ProcessBuilder (roughly as sketched below), and
plans to run the generated application's driver in cluster mode.
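Here is a rough sketch of how the service launches spark-submit (the spark-submit path, master URL, class name, jar location, and application arguments are just placeholders for my real ones):

// Build the spark-submit command line and launch it as a child process.
val args = Seq(
  "/opt/spark/bin/spark-submit",
  "--master", "spark://node-1:7077",
  "--deploy-mode", "cluster",
  "--class", "com.example.SparkSqlApp",      // placeholder main class
  "hdfs:///apps/spark-sql-app.jar",          // placeholder application jar
  "--query", "SELECT ...")                   // arguments derived from the user command

val process = new ProcessBuilder(args: _*)
  .redirectErrorStream(true)                 // merge stderr into stdout
  .start()

// Capture the output so the submission id (driver-...) can be parsed later.
val output = scala.io.Source.fromInputStream(process.getInputStream).mkString
process.waitFor()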
Other requirements are:
Query an application's status, for example the percentage of work remaining
Kill queries accordingly
What I found in the Spark standalone documentation suggests killing an application using:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
and that I should find the driver ID through the standalone Master web UI at http://<master url>:8080.
So, what am I supposed to do?
Related SO questions:
Spark application finished callback
Deploy Apache Spark application from another application in Java, best practice
You could use a shell script to do this.
The deploy script:
#!/bin/bash
spark-submit --class "xx.xx.xx" \
--deploy-mode cluster \
--supervise \
--executor-memory 6G hdfs:///spark-stat.jar > output 2>&1
cat output
and you will get output like this:
16/06/23 08:37:21 INFO rest.RestSubmissionClient: Submitting a request to launch an application in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submission successfully created as driver-20160623083722-0026. Polling submission state...
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submitting a request for the status of submission driver-20160623083722-0026 in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: State of driver driver-20160623083722-0026 is now RUNNING.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Driver is running on worker worker-20160621162532-192.168.1.200-7078 at 192.168.1.200:7078.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20160623083722-0026",
"serverSparkVersion" : "1.6.0",
"submissionId" : "driver-20160623083722-0026",
"success" : true
}
Based on this, you can create your kill-driver script:
#!/bin/bash
driverid=`cat output | grep submissionId | grep -Po 'driver-\d+-\d+'`
spark-submit --master spark://node-1:6066 --kill $driverid
Make sure to give the scripts execute permission using chmod +x.
A "dirty" trick to kill spark apps is by kill the jps named SparkSubmit. The main problem is that the app will be "killed" but at spark master log it will appear as "finished"...
user@user:~$ jps
20894 Jps
20704 SparkSubmit
user@user:~$ kill 20704
To be honest I don't like this solution, but for now it is the only way I know to kill an app.
Here's what I do:
To submit apps, use the (hidden) Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
This way you get a driver ID (under submissionId) which you can use to kill your job later (you shouldn't kill the application, especially if you're using "supervise" in standalone mode).
This API also lets you query the driver status (a sketch follows this list).
Query status for apps using the (also hidden) UI Json API: http://[master-node]:[master-ui-port]/json/
This service exposes all information available on the master UI in JSON format.
You can also use the "public" REST API to query Applications on Master or Executors on each worker, but this won't expose Drivers (at least not as of Spark 1.6)
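For illustration, here is a rough Scala sketch of hitting the kill and status endpoints of that submission API with nothing but the JDK's HttpURLConnection; the master host and the driver ID are placeholders (the ID is the submissionId returned when the driver was created):

import java.net.{HttpURLConnection, URL}
import scala.io.Source

val restUrl = "http://node-1:6066"              // master REST endpoint, default port 6066
val driverId = "driver-20160623083722-0026"     // the submissionId you got back

def call(path: String, method: String): String = {
  val conn = new URL(restUrl + path).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod(method)
  val body = Source.fromInputStream(conn.getInputStream).mkString
  conn.disconnect()
  body
}

// Query the driver's status.
val status = call(s"/v1/submissions/status/$driverId", "GET")

// Kill the driver; the kill endpoint expects a POST.
val killResponse = call(s"/v1/submissions/kill/$driverId", "POST")

Both calls return a small JSON response (SubmissionStatusResponse / KillSubmissionResponse) that your service can parse.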
You can fire YARN commands from ProcessBuilder to list the applications, filter them by the application name you already have, extract the appId, and then use further YARN commands to poll the status, kill the application, and so on; see the sketch below.
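A rough sketch of that approach in Scala (using scala.sys.process; the application name is a placeholder):

import scala.sys.process._

val appName = "my-spark-sql-app"    // the name the application was submitted with

// List YARN applications and pick the line matching our app name;
// the first column of each line is the application id.
val listOutput = Seq("yarn", "application", "-list").!!
val appId = listOutput.linesIterator
  .find(_.contains(appName))
  .map(_.trim.split("\\s+")(0))
  .getOrElse(sys.error(s"No running application named $appName"))

// Poll the status, or kill the application, with further yarn commands.
val status = Seq("yarn", "application", "-status", appId).!!
Seq("yarn", "application", "-kill", appId).!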
You can find the driver ID in [spark]/work/; the ID is the directory name. Then kill the job with spark-submit, for example as sketched below.
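A minimal sketch of that, assuming a standalone worker with the default work directory layout and the master REST endpoint on node-1:6066 (both are placeholders):

import java.io.File
import scala.sys.process._

// Each driver gets a directory under [spark]/work named after its driver ID.
val workDir = new File("/opt/spark/work")
val driverIds = Option(workDir.listFiles()).getOrElse(Array.empty[File])
  .map(_.getName)
  .filter(_.startsWith("driver-"))

// Kill one of them through spark-submit's --kill flag.
driverIds.headOption.foreach { id =>
  Seq("spark-submit", "--master", "spark://node-1:6066", "--kill", id).!
}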
I also had the same kind of problem, where I needed to map my application-id and driver-id and add them to a CSV for another application's availability in standalone mode.
I was able to get the application id easily by using the command sparkContext.applicationId.
In order to get the driver-id I thought of using the shell command pwd. When your program runs, the driver logs are written in a directory named with the driver-id, so I extracted the folder name to get the driver-id.
import scala.sys.process._

// The driver's working directory is named after the driver ID.
val pwdCmd = "pwd"
val getDriverId = pwdCmd.!!
val driverId = getDriverId.trim.split("/").last
kill -9 $(jps | grep SparkSubmit | grep -Eo '[0-9]{1,7}')
Related
I have a Spark Structured Streaming application that works fine when run via spark-submit and correctly processes data. When run via the Spark operator, no data is processed and I am unable to find any error in the logs.
kubectl get pods -n spark-operator
log from kubectl logs spark-minio-driver -n spark-operator
ends with running the jar
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator driver --properties-file /opt/spark/conf/spark.properties --class RunParseLogs local:///opt/spark/work-dir/LogParser.jar
but nothing further is seen in the logs. It seems stuck, and based on the Spark operator examples I would expect this log to continue with the usual Spark logs about running the application.
The log from the Spark operator (kubectl logs latest-spark-operator-94bc4f779-dxhck -n spark-operator) shows one error, but after resubmitting the app it goes away.
This does not say much. What might be the issue? Are there any other reasons why the app might be stuck?
I'm trying to submit a cluster-mode spark 2 application from a Java Spring app using InProcessLauncher. I was previously using the SparkLauncher class, which worked, but it fires up a long-lived SparkSubmit java process for each job, which was eating up too many resources with lots of jobs in play.
My code sets sparkLauncher.setMaster("yarn") and sparkLauncher.setDeployMode("cluster"), roughly as in the sketch below.
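Stripped down, the submission code looks roughly like this (shown in Scala for brevity; the resource and class names are placeholders):

import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

val handle: SparkAppHandle = new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("hdfs:///apps/my-job.jar")   // placeholder jar location
  .setMainClass("com.example.MyJob")           // placeholder main class
  .setConf("spark.executor.memory", "2g")
  .startApplication()

// The handle is kept around for monitoring and killing the job later.
println(handle.getState)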
I set the HADOOP_CONF_DIR env variable to the directory containing my config (yarn-site.xml etc) before starting my Spring app, and it logs that it is picking up this variable:
INFO System Environment - HADOOP_CONF_DIR = /etc/hadoop/conf
Yet when it comes to submitting, I see INFO o.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032 - i.e. it is using the default 0.0.0.0 rather than the actual ResourceManager IP, and of course it fails. It seems not to be picking up the Hadoop config.
I can submit jobs from the same shell directly using spark-submit, and even by directly invoking java -cp /usr/hdp/current/spark2-client/conf/:/usr/hdp/current/spark2-client/jars/*:/etc/hadoop/conf/ org.apache.spark.deploy.SparkSubmit .... So I'm not sure why my Spring App isn't picking up the same config.
I managed to get my app to pick up the hadoop config by adding the conf folders to the classpath. This is something spark-submit does for you when launching as a separate process, but doesn't happen when using InProcessLauncher.
Because my Spring Boot app is launched using -jar xxx.jar, I couldn't use -cp on the command line (cannot be combined with -jar), but had to add it to the manifest in the jar. I did this by adding the following to build.gradle (which is using the Spring Boot gradle plugin):
bootJar {
manifest {
attributes 'Class-Path': '/usr/hdp/current/spark2-client/conf/ /etc/hadoop/conf/'
}
}
I'd like to retrieve the status of a spark job running in cluster mode on a mesos master via the following:
spark-submit --master mesos://<ip>:7077 --status "driver-...-..."
It exits 0 with no logging, no matter what the driver's status is.
I know that it's doing something right, since if I run the command with an invalid Mesos IP/port, I get
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:165)
and if I run with an invalid submission id, I get
2018-10-02 18:47:01 ERROR RestSubmissionClient:70 - Error: Server responded with message of unexpected type SubmissionStatusResponse.
Any idea why spark-submit --status isn't returning anything?
I found a workaround by accessing the dispatcher's api directly:
curl -s "http://$DISPATCHER/v1/submissions/status/$SUBMISSION_ID"
Still no clear answer as to why spark-submit --status does not behave as documented, though.
I'm not sure what version of Spark you are using; my investigation is based on spark-2.4.0. The described behaviour is valid for both the Spark standalone and Mesos deployment targets.
org.apache.spark.deploy.rest.RestSubmissionClient is used as the handler for REST submission requests and logs the response at INFO level.
org.apache.spark.deploy.SparkSubmit is used as a main class when invoking spark-submit and its logger is the top level root logger for all other loggers.
Programmatically, if a specific logger for SparkSubmit is not set in conf/log4j.properties (the same holds when this file is absent), the default level is WARN.
Going further, in the absence of a specific logger for RestSubmissionClient, it inherits its root logger's level, which is SparkSubmit's logger.
You can still see errors because, again, WARN is the default.
To be able to see the logs for REST submissions, you may want to adjust ${SPARK_HOME}/conf/log4j.properties with either
log4j.logger.org.apache.spark.deploy.rest.RestSubmissionClient=INFO
or log4j.logger.org.apache.spark.deploy.rest=INFO for other classes in that package.
Add the following
log4j.logger.org.apache.spark.deploy.rest.RestSubmissionClient=INFO and log4j.logger.org.apache.spark.deploy.rest=INFO
to the log4j.properties file present under /etc/spark/conf, and then check the status again:
spark-submit --master spark://:6066 --status driver-20210516043704-0012
A Spark application can run many jobs. My Spark is running on YARN, version 2.2.0.
How to get job running status and other info for a given application id, possibly using REST API?
This might be late, but I'm putting it here for convenience; hope it helps. You can use the REST API command below to get the status of any job running on YARN.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343/state'
O/P - {"state":"RUNNING"}
Throughout the job's life cycle, the state will move through NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, or KILLED.
You can use jq for a formatted output.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343'| jq .app.state
O/P - "RUNNING"
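If you want the same check from code instead of curl, here is a minimal Scala sketch against the same endpoint (it assumes the resource manager is reachable without Kerberos, unlike the --negotiate curl above, and a real service would use a proper JSON parser):

import java.net.{HttpURLConnection, URL}
import scala.io.Source

val appId = "application_121766109986_12343"
val url = new URL(s"http://resourcemanagerhost:8088/ws/v1/cluster/apps/$appId/state")

val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("GET")
val body = Source.fromInputStream(conn.getInputStream).mkString
conn.disconnect()

// body looks like {"state":"RUNNING"}; crude extraction of the value:
val state = body.split("\":\"").last.stripSuffix("\"}")
println(state)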
YARN has a Cluster Applications API. This shows the state along with other information. To use it:
$ curl 'RMURL/ws/v1/cluster/apps/APP_ID'
with your application id as APP_ID.
The image shows the 8081 UI. The master shows a running application when I start a Scala shell or PySpark shell, but when I use spark-submit to run a Python script, the master doesn't show any running application. This is the command I used: spark-submit --master spark://localhost:7077 sample_map.py. The web UI is at :4040. I want to know if I'm doing it the right way to submit scripts, or if spark-submit never really shows a running application.
localhost:8080 or <master_ip>:8080 doesn't open for me but <master_ip>:8081 opens. It shows the executor info.
These are my configurations in spark-env.sh:
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_MASTER_WEBUI_PORT=4040
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/opt/worker
export SPARK_DAEMON_MEMORY=512m
export SPARK_LOCAL_DIRS=/tmp/spark
export SPARK_MASTER_IP 'splunk_dep'
I'm using CentOS , python 2.7 and spark-2.0.2-bin-hadoop2.7.
You can open the Spark master's web UI, which is http://localhost:8080 by default, to see running apps (in standalone cluster mode):
If multiple apps are running, they will bind to ports 4040, 4041, 4042, and so on.
You can access this interface by simply opening http://:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
For a local run, use this:
val sparkConf = new SparkConf().setAppName("Your app Name").setMaster("local")
val sc = new SparkContext(sparkConf)
When you use spark-submit:
val sparkConf = new SparkConf().setAppName("Your app Name")
val sc = new SparkContext(sparkConf)
This won't show up in a local test, but when you compile with this and submit the job via spark-submit, it will show in the UI.
Hope this explains it.
Are you accessing the Spark UI while the application is running or after it has completed its execution?
Try adding some code that waits for a key press (so the Spark execution won't end) and see if that solves your problem.
Just go to localhost:8080 and check that there is one completed application, which is the one you submitted.