How to check yarn logs application id - apache-spark

I am trying to run a bash script that runs spark-submit with a PySpark script, but it was not successful. I want to check the YARN logs using "yarn logs -applicationId ". My question is: how can I find the appropriate application ID?
Below are some parts of the error I got

1. Using YARN logs:
In the logs you can see the tracking URL: http://<nn>:8088/proxy/application_*****/
If you copy and open that link, you can see all the logs for the application in the ResourceManager.
2. Using the Spark application:
From the SparkContext we can get the applicationId:
print(spark.sparkContext.applicationId)
3. Using the yarn application command:
Use the yarn application -list command to get all the running YARN applications on the cluster, then filter with the options below (from yarn application -help):
-appStates <States>            Works with -list to filter applications based on
                               input comma-separated list of application states.
                               The valid application state can be one of the
                               following: ALL, NEW, NEW_SAVING, SUBMITTED,
                               ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
-appTypes <Types>              Works with -list to filter applications based on
                               input comma-separated list of application types.
-help                          Displays help for all commands.
-kill <Application ID>         Kills the application.
-list                          List applications. Supports optional use of
                               -appTypes to filter applications based on
                               application type, and -appStates to filter
                               applications based on application state.
-movetoqueue <Application ID>  Moves the application to a different queue.
-queue <Queue Name>            Works with the movetoqueue command to specify
                               which queue to move an application to.
-status <Application ID>       Prints the status of the application.
List all the finished applications:
yarn application -appStates FINISHED -list
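Putting this together for the original bash-script scenario: below is a minimal sketch that captures the spark-submit client output, greps the application ID out of it, and then fetches the YARN logs. The job name, script path, and file names are hypothetical, and it assumes the YARN client messages (which contain the application_<...> ID) end up in the captured output.
#!/bin/bash
set -euo pipefail
log_file=submit_output.log
# Redirect both stdout and stderr, since Spark's log4j output goes to stderr.
spark-submit --master yarn --deploy-mode cluster \
  --name my_pyspark_job \
  my_script.py > "$log_file" 2>&1 || true   # keep going even if the job fails
# First application_<cluster-ts>_<seq> token seen in the client output.
app_id=$(grep -oE 'application_[0-9]+_[0-9]+' "$log_file" | head -n 1)
echo "Application ID: $app_id"
yarn logs -applicationId "$app_id" > "yarn_${app_id}.log"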

You can also use curl to get the required details of your application via the YARN REST API.
state="RUNNING"                        # RUNNING, FINISHED, FAILED, KILLED, ...
user=""                                # user ID that started the job
applicationTypes="spark"               # type of application
applicationName="<application_name>"   # your application name
url="http://<host_name>:8088/ws/v1/cluster/apps?state=${state}&user=${user}&applicationTypes=${applicationTypes}"   # build the REST API URL
applicationId=$(curl -s "${url}" | jq -r --arg name "${applicationName}" '.apps.app[] | select(.name | contains($name)) | .id')
Output
> echo $applicationId
application_1593019621736_42096
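Once you have the ID in a shell variable like this, you can plug it straight into the yarn commands shown earlier, for example:
# Check the status and pull the aggregated logs for the ID found via the REST API.
yarn application -status "$applicationId"
yarn logs -applicationId "$applicationId" > "${applicationId}.log"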

Related

how to get spark-submit logging result

When I submit a Spark job from the terminal, it prints logging output in the terminal, as shown in the image.
How can I capture that output and assign it to a value or object?
You can redirect the output of the spark-submit command to a log file like this:
spark-submit ... > log.txt
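Note that much of the console output from spark-submit comes from log4j and is written to stderr, so redirecting stdout alone may leave log.txt nearly empty. A variant that captures both streams (file names are illustrative; the ... stands for your usual arguments, as above):
# Capture stdout and stderr together:
spark-submit ... > log.txt 2>&1
# Or keep them in separate files:
spark-submit ... 1> stdout.log 2> stderr.log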
You can see the logs from the application, and you can always write them to text files:
yarn logs -applicationId <application ID> [OPTIONS]
The application ID can be found through the cluster UI.

How to see more realtime logs when using spark-submit?

I am using spark-submit with a configuration file and the package option, and it is taking a very long time to run.
How do I turn on more logging (in real time) so that I can see where the bottleneck is (e.g. maybe a request is being made to a specific server where I do not have access, etc.)?
I would ideally want to see everything, from which libraries are being loaded to which request is being made to which server.
Thanks.
In most cases, you can see all relevant information either on the Spark UI for currently running jobs (usually, this service is reachable at port 4040 of your driver) or (if your system has one) on the Spark History Server.
You can use the below parameters when you are using Spark on YARN.
--driver-java-options "-Dlog4j.error=true" --verbose
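Building on the first option: if --verbose is not enough, a commonly used alternative (not part of the answer above) is to ship a custom log4j configuration with a lower log level. A sketch assuming Spark 2.x with log4j 1.x; the properties file name and application script are hypothetical:
# verbose-log4j.properties: root logger at DEBUG, printed to the console (stderr).
cat > verbose-log4j.properties <<'EOF'
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
EOF
spark-submit --verbose \
  --files verbose-log4j.properties \
  --driver-java-options "-Dlog4j.configuration=file:verbose-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:verbose-log4j.properties" \
  your_app.py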
Or
You can always do the below to get logs from YARN
Use the following command format to view all logs of a particular type for a running application:
yarn logs -applicationId <Application ID> -log_files <log_file_type>
For example, to view only the stderr error logs:
yarn logs -applicationId <Application ID> -log_files stderr
The -log_files option also supports Java regular expressions, so the following format would return all types of log files:
yarn logs -applicationId <Application ID> -log_files .*
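For example, to dump the stdout and stderr of all containers of a running application into local files (using the example application ID from earlier):
yarn logs -applicationId application_1593019621736_42096 -log_files stdout > app_stdout.log
yarn logs -applicationId application_1593019621736_42096 -log_files stderr > app_stderr.log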

How to get status of Spark jobs on YARN using REST API?

A Spark application can run many jobs. My Spark is running on YARN, version 2.2.0.
How do I get the job's running status and other info for a given application ID, possibly using the REST API?
This might be late, but I'm putting it here for convenience. Hope it helps. You can use the REST API command below to get the status of any job running on YARN.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343/state'
O/P - {"state":"RUNNING"}
Throughout the application lifecycle the state will vary across NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, and KILLED.
You can use jq for a formatted output.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343'| jq .app.state
O/P - "RUNNING"
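If you need to block until the application reaches a terminal state (for example from a scheduler script), you can poll the same endpoint in a loop; a minimal sketch, with the ResourceManager host and application ID taken from the example above:
#!/bin/bash
# Poll the ResourceManager until the application leaves its active states.
rm_host="resourcemanagerhost:8088"
app_id="application_121766109986_12343"
while true; do
  state=$(curl --negotiate -s -u : "http://${rm_host}/ws/v1/cluster/apps/${app_id}/state" | jq -r .state)
  echo "$(date '+%H:%M:%S') state=${state}"
  case "$state" in
    FINISHED|FAILED|KILLED) break ;;
  esac
  sleep 30
done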
YARN has a Cluster Applications API. This shows the state along with other information. To use it:
$ curl 'RMURL/ws/v1/cluster/apps/APP_ID'
with your application ID as APP_ID.
It provides the state along with other details of the application.

how to force kill a YARN application in NEW_SAVING state?

I have been killing YARN applications using the command yarn application -kill <app_id>.
I submitted a job which is currently in the NEW_SAVING state and I want to kill it.
When I try yarn application -kill I get the below message continuously:
INFO impl.YarnClientImpl: Waiting for application application_XXXX_XXXX to be killed.
Any idea how I can kill it forcefully?
The output of yarn application -list contains the following information for each YARN application:
Application-Id
Application-Name
Application-Type
User
Queue
State
Final-State
Progress
Tracking-URL
You can list the applications and filter with awk on the required column. For example, to list the applications that are in the NEW_SAVING state (the sixth column):
yarn application -list | awk '$6 == "NEW_SAVING" { print $1 }' > applications_list.txt
Then you can iterate through the file and kill the applications as below:
while read p; do
echo $p
yarn application -kill $p
done <applications_list.txt
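If you prefer a one-liner, -list itself accepts -appStates (see the help output further up), so you can let YARN do the filtering and pipe the IDs straight into the kill command; a variant sketch of the same approach:
# Kill every application currently in the NEW_SAVING state.
yarn application -list -appStates NEW_SAVING 2>/dev/null \
  | awk '$1 ~ /^application_/ {print $1}' \
  | xargs -r -n1 yarn application -kill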
Do you have access to the YARN cluster UI? You could kill the application from the UI. Usually, that works better for me than yarn application -kill.

Submit & kill a Spark application programmatically from another application

I am wondering if it is possible to submit, monitor & kill spark applications from another service.
My requirements are as follows:
I wrote a service that
parses user commands,
translates them into understandable arguments for an already prepared Spark-SQL application,
submits the application along with its arguments to the Spark cluster using spark-submit from a ProcessBuilder,
and plans to run the generated applications' drivers in cluster mode.
Other requirements:
Query the application's status, for example the percentage remaining
Kill queries accordingly
What I found in the Spark standalone documentation suggests killing an application using:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
and that the driver ID should be found through the standalone Master web UI at http://<master url>:8080.
So, what am I supposed to do?
Related SO questions:
Spark application finished callback
Deploy Apache Spark application from another application in Java, best practice
You could use a shell script to do this.
The deploy script:
#!/bin/bash
# Submit in cluster mode and capture the client output, which contains the submission ID.
spark-submit --class "xx.xx.xx" \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 6G hdfs:///spark-stat.jar > output 2>&1
cat output
and you will get output like this:
16/06/23 08:37:21 INFO rest.RestSubmissionClient: Submitting a request to launch an application in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submission successfully created as driver-20160623083722-0026. Polling submission state...
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submitting a request for the status of submission driver-20160623083722-0026 in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: State of driver driver-20160623083722-0026 is now RUNNING.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Driver is running on worker worker-20160621162532-192.168.1.200-7078 at 192.168.1.200:7078.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20160623083722-0026",
  "serverSparkVersion" : "1.6.0",
  "submissionId" : "driver-20160623083722-0026",
  "success" : true
}
Based on this, create your kill-driver script:
#!/bin/bash
# Pull the driver ID out of the captured submission output, then kill that driver.
driverid=$(cat output | grep submissionId | grep -Po 'driver-\d+-\d+')
spark-submit --master spark://node-1:6066 --kill $driverid
Make sure you give the scripts execute permission with chmod +x.
A "dirty" trick to kill spark apps is by kill the jps named SparkSubmit. The main problem is that the app will be "killed" but at spark master log it will appear as "finished"...
user@user:~$ jps
20894 Jps
20704 SparkSubmit
user@user:~$ kill 20704
To be honest, I don't like this solution, but for now it's the only way I know to kill an app.
Here's what I do:
To submit apps, use the (hidden) Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
This way you get a driver ID (under submissionId) which you can use to kill your job later (you shouldn't kill the application, especially if you're using "supervise" in standalone mode).
This API also lets you query the Driver Status
Query status for apps using the (also hidden) UI Json API: http://[master-node]:[master-ui-port]/json/
This service exposes all information available on the master UI in JSON format.
You can also use the "public" REST API to query Applications on Master or Executors on each worker, but this won't expose Drivers (at least not as of Spark 1.6)
You can fire YARN commands from a ProcessBuilder to list the applications, filter them based on your application name (which you have available), extract the appId, and then use YARN commands to poll the status, kill the application, etc.
You can find the driver ID in [spark]/work/. The ID is the directory name. Kill the job with spark-submit --kill <driver ID>.
I also had the same kind of problem, where I needed to map my application ID and driver ID and add them to a CSV for other applications to use, in standalone mode.
I was able to get the application ID easily by using sparkContext.applicationId.
To get the driver ID I thought of using the shell command pwd: when your program runs, the driver logs are written to a directory named after the driver ID, so I extracted the folder name to get the driver ID.
import scala.sys.process._

// In standalone cluster mode the driver's working directory is named after its driver ID.
val pwdCmd = "pwd"
val getDriverId = pwdCmd.!!
val driverId = getDriverId.trim.split("/").last
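The same idea works from a shell wrapper, assuming (as above) that the process is started inside the driver-<...> working directory:
# The standalone cluster-mode driver's working directory is named after its driver ID.
driver_id=$(basename "$PWD")
echo "$driver_id"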
This finds the PID of the SparkSubmit JVM with jps and force-kills it (same caveat as the jps trick above: the master may report the app as finished):
kill -9 $(jps | grep SparkSubmit | grep -Eo '[0-9]{1,7}')
