How to get spark-submit logging result - apache-spark

When I submit a Spark job from the terminal, it prints a logging result in the terminal like the image shows.
How can I capture that output and assign it to a value or object?

You can redirect the output of the spark-submit command to a log file like this:
spark-submit ... > log.txt
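If you want the output in a variable rather than a file, you can also launch spark-submit from a small wrapper and capture whatever it prints. A minimal Python sketch, assuming spark-submit is on the PATH and using a hypothetical my_job.py (Spark's log4j output typically goes to stderr, hence the merge into stdout):
import subprocess

# Run spark-submit and capture everything it prints.
result = subprocess.run(
    ["spark-submit", "--master", "yarn", "my_job.py"],  # my_job.py is a placeholder
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,   # merge log4j output (stderr) into stdout
    text=True,
)

log_output = result.stdout  # the full logging result as a single string
print(log_output)
You can then parse log_output for whatever value you need, e.g. the application ID.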

You can also see the logs from the application through YARN, and you can always write them to text files:
yarn logs -applicationId <application ID> [OPTIONS]
The application ID can be found through the cluster UI.

Related

spark operator stuck with spark structured streaming app - not processing data, no errors in logs

I have a Spark Structured Streaming application that works fine when run via spark-submit and correctly processes data. When run via the Spark operator, no data is processed and I am unable to find any errors in the logs.
kubectl get pods -n spark-operator
The log from kubectl logs spark-minio-driver -n spark-operator ends with running the jar:
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator driver --properties-file /opt/spark/conf/spark.properties --class RunParseLogs local:///opt/spark/work-dir/LogParser.jar
but nothing further is seen in the logs. It seems stuck, and based on the Spark operator examples I would expect this log to continue with the usual Spark logs about running the application.
The log from the Spark operator itself (kubectl logs latest-spark-operator-94bc4f779-dxhck -n spark-operator) shows one error, but after resubmitting the app it goes away.
This does not say much. What might be the issue? Are there any other reasons why the app might be stuck?

How to check the YARN logs application ID

I am trying to run a bash script that uses spark-submit to run a PySpark script, but it was not successful. I want to check the YARN logs using "yarn logs -applicationId". My question is: how can I find the appropriate application ID?
Below are some parts of the error I got.
1. Using YARN logs:
In the logs you can see the tracking URL: http://<nn>:8088/proxy/application_*****/
If you copy and open that link you can see all the logs for the application in the ResourceManager.
2. Using the Spark application:
From the SparkContext you can get the application ID:
print(spark.sparkContext.applicationId)
3. Using the yarn application command:
Use the yarn application -list command to get all the running YARN applications on the cluster, then filter them with the options shown in the help output:
yarn application --help
 -appStates <States>            Works with -list to filter applications based on
                                input comma-separated list of application states.
                                The valid application state can be one of the
                                following: ALL, NEW, NEW_SAVING, SUBMITTED,
                                ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
 -appTypes <Types>              Works with -list to filter applications based on
                                input comma-separated list of application types.
 -help                          Displays help for all commands.
 -kill <Application ID>         Kills the application.
 -list                          List applications. Supports optional use of
                                -appTypes to filter applications based on
                                application type, and -appStates to filter
                                applications based on application state.
 -movetoqueue <Application ID>  Moves the application to a different queue.
 -queue <Queue Name>            Works with the movetoqueue command to specify
                                which queue to move an application to.
 -status <Application ID>       Prints the status of the application.
List all the finished applications:
yarn application -appStates FINISHED -list
You can also use curl to get the required details of your application through the YARN REST API:
state="RUNNING" // RUNNING, FAILED, COMPLETED.
user="" // userid from where you started job.
applicationTypes="spark" // Type of application
applicationName="<application_name>" // Your application name
url="http://<host_name>:8088/ws/v1/cluster/apps?state=${state}&user=${user}&applicationTypes=${applicationTypes}" // Build Rest API
applicationId=$(curl "${url}" | python -m json.tool | jq -r '.apps."app" | .[] | select(.name | contains('\"${applicationName}\"')) | .id')
Output
> echo $applicationId
application_1593019621736_42096
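If you prefer doing the same lookup from code instead of curl and jq, here is a rough Python sketch against the same ResourceManager REST endpoint (the host, port and application name are placeholders):
import requests

rm = "http://<host_name>:8088"        # ResourceManager address (placeholder)
app_name = "<application_name>"       # the name you gave your Spark application (placeholder)

# Ask the ResourceManager for all running Spark applications.
resp = requests.get(
    f"{rm}/ws/v1/cluster/apps",
    params={"states": "RUNNING", "applicationTypes": "SPARK"},
)
resp.raise_for_status()

apps = resp.json().get("apps") or {}
for app in apps.get("app", []):
    if app_name in app["name"]:
        print(app["id"])              # e.g. application_1593019621736_42096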

How do I manage print output in Spark jobs?

I'd like to view the output of print statements in my Spark application, which uses Python/PySpark. Am I correct that these outputs aren't considered part of logging? I changed my conf/log4j.properties file to write to a specific file, but only the INFO and other log messages are being written to the designated log file.
How do I go about directing the output from print statements to a file? Do I have to do the typical redirection, like this: /usr/bin/spark-submit --master yarn --deploy-mode client --queue default /home/hadoop/app.py > /home/hadoop/output?
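print output is written to the Python process's stdout and never passes through log4j, so changing log4j.properties will not capture it. Prints on the driver go to the driver's stdout (your console in client mode), and prints inside executor-side functions go to each executor's stdout, which you can retrieve with yarn logs. Besides the shell redirection above, one option for driver-side output is Python's own logging module; a minimal sketch, using a hypothetical /home/hadoop/driver_prints.log path:
import logging

# A plain Python logger on the driver; this is independent of Spark's log4j configuration.
logging.basicConfig(
    filename="/home/hadoop/driver_prints.log",  # placeholder path
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("row count = %d", 42)  # use this instead of print() on the driver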

How to see more realtime logs when using spark-submit?

I am using spark-submit with a configuration file and the packages option, and it is taking a very long time to run.
How do I turn on more logging (in real time) so that I can see where the bottleneck is (e.g. maybe a request is being made to a specific server that I do not have access to)?
I would ideally want to see everything, from which libraries are being loaded to which requests are being made to which servers.
Thanks.
In most cases, you can see all relevant information either on the Spark UI for currently running jobs (usually, this service is reachable at port 4040 of your driver) or (if your system has one) on the Spark History Server.
You can use the below parameters when you are using Spark on YARN.
--driver-java-options "-Dlog4j.error=true" --verbose
Or
You can always do the following to get logs from YARN.
Use the following command format to view all logs of a particular type for a running application:
yarn logs -applicationId <Application ID> -log_files <log_file_type>
For example, to view only the stderr error logs:
yarn logs -applicationId <Application ID> -log_files stderr
The -log_files option also supports Java regular expressions, so the following format would return all types of log files:
yarn logs -applicationId <Application ID> -log_files .*
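If you control the application code, another option is to raise the log level at runtime from inside the job itself: sparkContext.setLogLevel overrides the configured log4j level for that application. A small PySpark sketch (the app name is arbitrary; DEBUG is very verbose, INFO is usually enough):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verbose-logging-demo").getOrCreate()

# Raise the log level for this application only; overrides the log4j settings.
spark.sparkContext.setLogLevel("INFO")  # or "DEBUG" for maximum detail

# ... run your job here ...

spark.stop()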

Submit & Kill Spark Application program programmatically from another application

I am wondering if it is possible to submit, monitor & kill spark applications from another service.
My requirements are as follows:
I wrote a service that
parses user commands,
translates them into understandable arguments for an already prepared Spark-SQL application, and
submits the application along with the arguments to the Spark cluster using spark-submit from ProcessBuilder.
It plans to run the generated applications' drivers in cluster mode.
Other requirements:
query the application's status, for example the percentage remaining
kill queries accordingly
What I found in the Spark standalone documentation suggests killing an application using:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
and says I should find the driver ID through the standalone Master web UI at http://<master url>:8080.
So, what am I supposed to do?
Related SO questions:
Spark application finished callback
Deploy Apache Spark application from another application in Java, best practice
You could use a shell script to do this.
The deploy script:
#!/bin/bash
spark-submit --class "xx.xx.xx" \
--deploy-mode cluster \
--supervise \
--executor-memory 6G hdfs:///spark-stat.jar > output 2>&1
cat output
and you will get output like this:
16/06/23 08:37:21 INFO rest.RestSubmissionClient: Submitting a request to launch an application in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submission successfully created as driver-20160623083722-0026. Polling submission state...
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Submitting a request for the status of submission driver-20160623083722-0026 in spark://node-1:6066.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: State of driver driver-20160623083722-0026 is now RUNNING.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Driver is running on worker worker-20160621162532-192.168.1.200-7078 at 192.168.1.200:7078.
16/06/23 08:37:22 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20160623083722-0026",
"serverSparkVersion" : "1.6.0",
"submissionId" : "driver-20160623083722-0026",
"success" : true
}
Based on this, create your kill-driver script:
#!/bin/bash
driverid=`cat output | grep submissionId | grep -Po 'driver-\d+-\d+'`
spark-submit --master spark://node-1:6066 --kill $driverid
Make sure the scripts are given execute permission using chmod +x.
A "dirty" trick to kill spark apps is by kill the jps named SparkSubmit. The main problem is that the app will be "killed" but at spark master log it will appear as "finished"...
user@user:~$ jps
20894 Jps
20704 SparkSubmit
user@user:~$ kill 20704
To be honest I don't like this solution, but for now it is the only way I know to kill an app.
Here's what I do:
To submit apps, use the (hidden) Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
This way you get a driver ID (under submissionId) which you can use to kill your job later (you shouldn't kill the application, especially if you're using "supervise" in standalone mode).
This API also lets you query the Driver Status
Query status for apps using the (also hidden) UI Json API: http://[master-node]:[master-ui-port]/json/
This service exposes all information available on the master UI in JSON format.
You can also use the "public" REST API to query Applications on Master or Executors on each worker, but this won't expose Drivers (at least not as of Spark 1.6)
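As a rough illustration of that submission API (endpoint paths as described in the linked post; the master host/port and the submission ID are placeholders), checking a driver's status and killing it can look like this in Python:
import requests

master = "http://node-1:6066"                   # standalone master REST port (placeholder)
submission_id = "driver-20160623083722-0026"    # the submissionId returned when you submitted

# Query the driver's status.
status = requests.get(f"{master}/v1/submissions/status/{submission_id}").json()
print(status.get("driverState"))                # e.g. RUNNING, FINISHED, KILLED

# Kill the driver.
kill = requests.post(f"{master}/v1/submissions/kill/{submission_id}").json()
print(kill.get("success"))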
You can fire YARN commands from ProcessBuilder to list the applications, filter them by the application name you already have, extract the appId, and then use YARN commands to poll the status, kill the application, etc.
You can find the driver ID in [spark]/work/. The ID is the directory name. Kill the job with spark-submit --kill <driver ID>.
I also had the same kind of problem, where I needed to map my application ID and driver ID and add them to a CSV so they were available to another application, in standalone mode.
I was able to get the application ID easily by using sparkContext.applicationId.
In order to get the driver ID I used the shell command pwd: when your program runs, the driver logs are written in a directory named after the driver ID, so I extracted the folder name to get the driver ID:
import scala.sys.process._
// Run `pwd` on the driver; in standalone cluster mode the working directory is named after the driver ID.
val pwdCmd = "pwd"
val pwdOutput = pwdCmd.!!
val driverId = pwdOutput.trim.split("/").last
kill -9 $(jps | grep SparkSubmit | grep -Eo '[0-9]{1,7}')
