A Spark application can run many jobs. My Spark is running on YARN, version 2.2.0.
How do I get the running status and other info of the jobs for a given application id, possibly using a REST API?
This might be late, but I'm putting it here for convenience; hope it helps. You can use the REST API command below to get the status of any job running on YARN.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343/state'
O/P - {"state":"RUNNING"}
Throughout the application's lifecycle the state will move through NEW, NEW_SAVING, SUBMITTED, ACCEPTED, and RUNNING, and end in FINISHED, FAILED, or KILLED.
You can use jq for a formatted output.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343'| jq .app.state
O/P - "RUNNING"
YARN has a Cluster Applications API that shows the state along with other information. To use it:
$ curl 'RMURL/ws/v1/cluster/apps/APP_ID'
where RMURL is your ResourceManager address and APP_ID is your application id.
The response includes fields such as state, finalStatus, progress, user, queue, and trackingUrl.
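For illustration, here is a minimal Python sketch of the same call; the ResourceManager host, port, and application id are placeholders to substitute, and on a Kerberized cluster you would additionally need SPNEGO authentication (which the --negotiate flag handles in the curl examples above):
import requests

# Placeholders: substitute your ResourceManager host/port and application id.
RM_URL = "http://resourcemanagerhost:8088"
APP_ID = "application_121766109986_12343"

resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{APP_ID}")
resp.raise_for_status()

# The Cluster Application API wraps the application object in an "app" key.
app = resp.json()["app"]
print(app["state"], app.get("finalStatus"), app.get("progress"))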
I am trying to run a bash script that uses spark-submit to run a PySpark script, but it was not successful. I want to check the YARN logs using "yarn logs -applicationId". My question is: how can I find the appropriate application id?
Below are some parts of the error I got:
1. Using YARN logs:
In the logs you can see the tracking URL: http://<nn>:8088/proxy/application_*****/
If you copy and open that link, you can see all the logs for the application in the ResourceManager UI.
2. Using the Spark application:
From the SparkContext we can get the applicationId.
print(spark.sparkContext.applicationId)
3. Using the yarn application command:
Use the yarn application -list command to get all the running YARN applications on the cluster, then filter them by your application name. The full set of options is shown by:
yarn application --help
-appStates <States> Works with -list to filter applications
based on input comma-separated list of
application states. The valid application
state can be one of the following:
ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,
RUNNING,FINISHED,FAILED,KILLED
-appTypes <Types> Works with -list to filter applications
based on input comma-separated list of
application types.
-help Displays help for all commands.
-kill <Application ID> Kills the application.
-list List applications. Supports optional use
of -appTypes to filter applications based
on application type, and -appStates to
filter applications based on application
state.
-movetoqueue <Application ID> Moves the application to a different
queue.
-queue <Queue Name> Works with the movetoqueue command to
specify which queue to move an
application to.
-status <Application ID> Prints the status of the application.
List all the finished applications:
yarn application -appStates FINISHED -list
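If you need to do this from a script rather than interactively, here is a minimal Python sketch that shells out to the same CLI and picks the application id by name (the application name is a placeholder, and this is only a sketch, not a hardened implementation):
import subprocess

APP_NAME = "my_spark_app"  # placeholder: the name you gave your application

# `yarn application -list` prints one row per application; the first column is the id.
out = subprocess.run(
    ["yarn", "application", "-list", "-appStates", "RUNNING"],
    capture_output=True, text=True, check=True,
).stdout

app_ids = [line.split()[0]
           for line in out.splitlines()
           if line.startswith("application_") and APP_NAME in line]
print(app_ids)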
You can also use curl to get the required details of your application using the YARN REST API.
state="RUNNING"                        # e.g. RUNNING, FINISHED, FAILED
user=""                                # user id that started the job
applicationTypes="spark"               # type of application
applicationName="<application_name>"   # your application name
url="http://<host_name>:8088/ws/v1/cluster/apps?state=${state}&user=${user}&applicationTypes=${applicationTypes}"   # build the REST API URL
applicationId=$(curl "${url}" | python -m json.tool | jq -r '.apps."app" | .[] | select(.name | contains('\"${applicationName}\"')) | .id')
Output
> echo $applicationId
application_1593019621736_42096
I am trying to load data from BigQuery into a Jupyter Notebook, where I will do some manipulation and plotting. The dataset is 25 million rows with 10 columns, which definitely exceeds my machine's memory capacity (16 GB).
I have read this post about using HDFStore, but the problem is that I still need to read the data into the Jupyter Notebook to do the manipulation.
I am using Google Cloud Platform, so setting up a huge cluster in Dataproc might be an option, though that could be costly.
Has anyone had a similar issue and found a solution?
Concerning products within Google Cloud Platform, you can create a Datalab instance to run your notebooks and specify the desired machine type with the --machine-type flag (docs). You can use a high-memory machine if needed.
Of course, you can also use Dataproc as you already proposed. For an easier setup you can use the predefined initialization action with the following parameter upon cluster creation:
--initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh
Edit
As you are using a GCE instance, you can also use a script to auto-shutdown the VM when you are not using it. You can edit ~/.bash_logout so that it checks whether it's the last session and, if so, stops the VM:
if [ $(who | wc -l) == 1 ];
then
  gcloud compute instances stop $(hostname) --zone $(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/zone 2>/dev/null | cut -d/ -f4) --quiet
fi
Or, if you prefer a curl approach:
curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" https://www.googleapis.com/compute/v1/projects/$(gcloud config get-value project 2>/dev/null)/zones/$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/zone 2>/dev/null | cut -d/ -f4)/instances/$(hostname)/stop -d ""
Keep in mind that you might need to update Cloud SDK components to get the gcloud command to work. Either use:
gcloud components update
or
sudo apt-get update && sudo apt-get --only-upgrade install kubectl google-cloud-sdk google-cloud-sdk-datastore-emulator google-cloud-sdk-pubsub-emulator google-cloud-sdk-app-engine-go google-cloud-sdk-app-engine-java google-cloud-sdk-app-engine-python google-cloud-sdk-cbt google-cloud-sdk-bigtable-emulator google-cloud-sdk-datalab -y
You can include one of these commands, along with the ~/.bash_logout edit, in your startup script.
We are running a Spark job which connects to Oracle and fetches some data. Attempt 0 or 1 of the JDBCRDD task always fails with the error below; in a subsequent attempt the task completes. As suggested on a few portals we even tried the -Djava.security.egd=file:///dev/urandom java option, but it didn't solve the problem. Can someone please help us fix this issue?
java.sql.SQLRecoverableException: IO Error: Connection reset by peer, Authentication lapse 59937 ms.
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:794)
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688)
The issue was with java.security.egd only. Setting it through the command line, i.e. -Djava.security.egd=file:///dev/urandom, was not working, so I set it through System.setProperty within the job. After that, the job no longer throws SQLRecoverableException.
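For completeness, the same JVM option can also be passed through Spark's standard spark.driver.extraJavaOptions / spark.executor.extraJavaOptions settings so that it reaches the executor JVMs where the JDBCRDD tasks run. Whether this resolves the issue in a given environment is an assumption (the answer above ultimately relied on System.setProperty inside the job); a minimal PySpark sketch:
from pyspark.sql import SparkSession

# Sketch only: pass -Djava.security.egd to the driver and executor JVMs.
# Note: driver options set here only take effect if the driver JVM has not
# started yet; otherwise pass them via spark-submit --conf instead.
spark = (
    SparkSession.builder
    .appName("oracle-jdbc-job")  # hypothetical application name
    .config("spark.driver.extraJavaOptions", "-Djava.security.egd=file:///dev/urandom")
    .config("spark.executor.extraJavaOptions", "-Djava.security.egd=file:///dev/urandom")
    .getOrCreate()
)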
This exception has nothing to do with Apache Spark. "SQLRecoverableException: IO Error:" is simply the Oracle JDBC driver reporting that its connection to the DBMS was closed out from under it while in use. The real problem is at the DBMS, for example the session died abruptly. Please check the DBMS error log and share it with the question.
You can find a similar problem here:
https://access.redhat.com/solutions/28436
The fastest way is to export the environment variable SPARK_SUBMIT_OPTS before running your job, like this:
export SPARK_SUBMIT_OPTS=-Djava.security.egd=file:///dev/urandom
I'm using Docker, so for me the full command is:
docker exec -it spark-master \
  bash -c "export SPARK_SUBMIT_OPTS=-Djava.security.egd=file:///dev/urandom && \
  /spark/bin/spark-submit --verbose --master spark://172.16.9.213:7077 /scala/sparkjob/target/scala-2.11/sparkjob-assembly-0.1.jar"
This exports the variable and then submits the job.
How do I get reports from a Puppet node using the PuppetDB reports API?
After playing around with the query parameters, it seems the following does the job:
curl -X GET puppet.foo.com:8180/pdb/query/v4/reports -d 'limit=1' -d 'query=["=", "certname", "node.foo.com"]'
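If you prefer to issue the same query from Python, here is a minimal sketch; the host, port, and certname are placeholders, and it relies on the PuppetDB v4 API accepting the query as URL-encoded JSON:
import json
import requests

# Placeholders: substitute your PuppetDB host/port and the node's certname.
PDB = "http://puppet.foo.com:8180"
QUERY = ["=", "certname", "node.foo.com"]

resp = requests.get(
    f"{PDB}/pdb/query/v4/reports",
    params={"query": json.dumps(QUERY), "limit": 1},
)
resp.raise_for_status()
print(resp.json())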
I am wondering if it is possible to submit, monitor, and kill Spark applications from another service.
My requirements are as follows:
I wrote a service that
parses user commands
translates them into understandable arguments for an already prepared Spark-SQL application
submits the application along with the arguments to the Spark cluster using spark-submit from a ProcessBuilder
and plans to run the generated applications' drivers in cluster mode.
Other requirements:
Query the status of the applications, for example the percentage remaining
Kill queries accordingly
What I found in the Spark standalone documentation suggests killing an application using:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
and says the driver ID should be found through the standalone Master web UI at http://<master url>:8080.
So, what am I supposed to do?
Related SO questions:
Spark application finished callback
Deploy Apache Spark application from another application in Java, best practice
You could use a shell script to do this.
The deploy script:
#!/bin/bash
spark-submit --class "xx.xx.xx" \
--deploy-mode cluster \
--supervise \
--executor-memory 6G hdfs:///spark-stat.jar > output 2>&1
cat output
and you will get output like this:
16/06/23 08:37:21 INFO rest.RestSubmissionClient: Submitting a request to launch an application in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submission successfully created as driver-20160623083722-0026. Polling submission state...
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submitting a request for the status of submission driver-20160623083722-0026 in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: State of driver driver-20160623083722-0026 is now RUNNING.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Driver is running on worker worker-20160621162532-192.168.1.200-7078 at 192.168.1.200:7078.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20160623083722-0026",
"serverSparkVersion" : "1.6.0",
"submissionId" : "driver-20160623083722-0026",
"success" : true
}
And based on this, create your kill-driver script:
#!/bin/bash
driverid=`cat output | grep submissionId | grep -Po 'driver-\d+-\d+'`
spark-submit --master spark://node-1:6066 --kill $driverid
Make sure the scripts have execute permission by using chmod +x.
A "dirty" trick to kill spark apps is by kill the jps named SparkSubmit. The main problem is that the app will be "killed" but at spark master log it will appear as "finished"...
user@user:~$ jps
20894 Jps
20704 SparkSubmit
user@user:~$ kill 20704
To be honest, I don't like this solution, but for now it's the only way I know to kill an app.
Here's what I do:
To submit apps, use the (hidden) Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
This way you get a DriverID (under submissionId), which you can use to kill your job later (you shouldn't kill the application, especially if you're using "supervise" in standalone mode)
This API also lets you query the Driver Status
Query status for apps using the (also hidden) UI Json API: http://[master-node]:[master-ui-port]/json/
This service exposes all information available on the master UI in JSON format.
You can also use the "public" REST API to query Applications on Master or Executors on each worker, but this won't expose Drivers (at least not as of Spark 1.6)
You can fire yarn commands from a ProcessBuilder to list the applications, then filter by the application name that is available to you, extract the appId, and then use yarn commands to poll the status, kill it, etc.
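A small Python sketch of that approach (shelling out to the yarn CLI much like ProcessBuilder would from Java; the application id is a placeholder):
import subprocess

APP_ID = "application_1593019621736_42096"  # placeholder: the id extracted from -list

# Poll the application's status report.
status = subprocess.run(
    ["yarn", "application", "-status", APP_ID],
    capture_output=True, text=True, check=True,
).stdout
print(status)

# Kill it if required:
# subprocess.run(["yarn", "application", "-kill", APP_ID], check=True)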
You can find the driver id in [spark]/work/. The id is the directory name. Kill the job with spark-submit --kill.
I also have the same kind of problem, where I need to map my application-id and driver-id and add them to a CSV for another application's use in standalone mode.
I was able to get the application id easily by using sparkContext.applicationId.
In order to get the driver-id I thought of using the shell command pwd: when your program runs, the driver logs are written in a directory named after the driver-id, so I extracted the folder name to get the driver-id.
import scala.sys.process._
val pwdCmd = "pwd"
val pwdOutput = pwdCmd.!!.trim   // !! returns stdout, including a trailing newline
val driverId = pwdOutput.split("/").last
kill -9 $(jps | grep SparkSubmit | grep -Eo '[0-9]{1,7}')