How to get spark SUBMISSION_ID with spark-submit?


Several tools require a SUBMISSION_ID, such as spark-submit --status and the Spark REST API. But how can I get this SUBMISSION_ID when I use the spark-submit command to submit Spark jobs?
P.S.:
I use Python's popen to start a spark-submit job. I want the SUBMISSION_ID so that my Python program can monitor the Spark job status via the REST API: <ip>:6066/v1/submissions/status/<SUBMISSION_ID>

Thanks to @Pandey for the clue. The answer https://stackoverflow.com/a/37980813/5634636 helped me a lot.
TL;DR
If you want to submit a Spark job locally, the answer https://stackoverflow.com/a/37980813/5634636 does work.
The only catch is that you must submit your job in cluster mode,
i.e., pass the parameter --deploy-mode cluster.
If you want to submit a Spark job remotely, use the Spark submission API. See https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/ for details.
Detailed description
NOTE: I only tested these approaches on Apache Spark 2.3.1. I can't guarantee that they work on other versions.
Let me state my requirements first. There are 3 features I wanted:
submit a Spark job remotely
check the job status at any time (RUNNING, ERROR, FINISHED...)
get the error message if something goes wrong
Submit locally
NOTE: this answer only works in cluster mode
The Spark tool spark-submit will help.
To submit a job, see
https://spark.apache.org/docs/2.4.0/submitting-applications.html#launching-applications-with-spark-submit
To check the status, see https://stackoverflow.com/a/37420931/5634636. This approach requires a submission ID. The answer https://stackoverflow.com/a/37980813/5634636 explains how to get a submission ID in cluster mode. A submission ID looks like driver-20190315142356-0004.
The error message is included in the job status message.
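Since the question starts spark-submit from Python, the local approach can be sketched roughly as below. This is only a sketch under assumptions: it assumes the submission ID shows up somewhere in spark-submit's console output in the driver-<timestamp>-<sequence> form shown above, and the names extract_submission_id and submit_and_get_id are hypothetical helpers, not part of Spark.

```python
import re
import subprocess

# Matches IDs shaped like driver-20190315142356-0004 (format taken from the example above).
SUBMISSION_ID_RE = re.compile(r"driver-\d+-\d+")


def extract_submission_id(output):
    """Return the first submission ID found in the given text, or None."""
    match = SUBMISSION_ID_RE.search(output)
    return match.group(0) if match else None


def submit_and_get_id(spark_submit_args):
    """Run spark-submit in cluster mode and scan its combined output for the ID.

    spark_submit_args is a list of the remaining arguments (master URL, class,
    jar path, ...); which ones you need depends on your setup.
    """
    proc = subprocess.run(
        ["spark-submit", "--deploy-mode", "cluster", *spark_submit_args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr, since log output may go there
        universal_newlines=True,
    )
    return extract_submission_id(proc.stdout)
```

The ID returned by submit_and_get_id can then be passed to spark-submit --status or to the status URL from the question.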
Submit remotely
The Spark submission API is recommended. There does not seem to be any documentation for it on the official Apache Spark website, so some people call it the hidden API. For details, see: https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/
To submit a Spark job, use the submit API.
To get the status of the job, use the status API: http://<master-ip>:6066/v1/submissions/status/<submission-id>. The submission ID is returned in the JSON response when you submit a job.
The error message is included in the status message.
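A rough Python sketch of the submit step follows. Everything specific here is an assumption that is not shown in this answer: the /v1/submissions/create path and the request field names follow the shape described in write-ups such as the linked blog post, and the helper names, jar path, and class name are hypothetical. Check them against your Spark version before relying on them.

```python
import json
import urllib.request


def build_create_request(app_resource, main_class, app_args, master_url,
                         spark_version="2.3.1"):
    """Build the JSON body for a submission (field names are assumptions)."""
    return {
        "action": "CreateSubmissionRequest",
        "appResource": app_resource,
        "mainClass": main_class,
        "appArgs": list(app_args),
        "clientSparkVersion": spark_version,
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.master": master_url,
            "spark.submit.deployMode": "cluster",
            "spark.app.name": main_class,
            "spark.jars": app_resource,
        },
    }


def submit(master_host, body, port=6066):
    """POST the body to the (assumed) create endpoint; return the submission ID."""
    request = urllib.request.Request(
        "http://%s:%d/v1/submissions/create" % (master_host, port),
        data=json.dumps(body).encode("utf-8"),  # passing data makes this a POST
        headers={"Content-Type": "application/json;charset=UTF-8"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply.get("submissionId")
```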
More about the error message: note the difference between the ERROR and FAILED states. In short, FAILED means that something went wrong while the Spark job was executing (e.g. an uncaught exception), while ERROR means that something went wrong during submission (e.g. an invalid jar path). The error message is included in the status JSON. If you want to see why a job FAILED, the reason can be accessed via http://<driver-ip>:<ui-port>/log/<submission-id>.
Here is an example of error status (**** is an incorrect jar path which is miswritten intentionally):
{
"action" : "SubmissionStatusResponse",
"driverState" : "ERROR",
"message" : "Exception from the cluster:\njava.io.FileNotFoundException: File hdfs:**** does not exist.\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)\n\torg.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860)\n\torg.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:727)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:695)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:488)\n\torg.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)\n\torg.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)",
"serverSparkVersion" : "2.3.1",
"submissionId" : "driver-20190315160943-0005",
"success" : true,
"workerHostPort" : "172.18.0.4:36962",
"workerId" : "worker-20190306214522-172.18.0.4-36962"
}
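To act on a response like the one above from Python, a small parser along these lines may help. It is a sketch only: status_url and summarize_status are hypothetical helper names, and the field names are the ones visible in the example response.

```python
import json


def status_url(master_host, submission_id, port=6066):
    """Build the status URL in the form used earlier in this answer."""
    return "http://%s:%d/v1/submissions/status/%s" % (master_host, port, submission_id)


def summarize_status(response_text):
    """Pull the interesting fields out of a status response body."""
    status = json.loads(response_text)
    state = status.get("driverState")
    return {
        "submission_id": status.get("submissionId"),
        "state": state,
        # ERROR: something went wrong during submission (e.g. invalid jar path).
        "submit_error": state == "ERROR",
        # FAILED: something went wrong while the job was executing.
        "run_failed": state == "FAILED",
        "message": status.get("message", ""),
    }
```

Note that "success" is true in the example even though driverState is ERROR, so the sketch keys off driverState rather than success.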

Related

Instrumenting Spark JDBC with javaagent

I am attempting to instrument JDBC calls using the Kamon JDBC Kanela agent in my Spark app.
I am able to successfully instrument JDBC calls in a non-spark test app by passing in -javaagent:kanela-agent-1.0.1.jar on the command line when I run the app from the JAR. When I do this, I see the Kanela banner display in the console, and can see that my failed statement processor is getting called when there is a SQL error.
From my research, I should be able to inject a javaagent into the executor of a Spark app by passing in the following to spark-submit: --conf "spark.executor.extraJavaOptions=-javaagent:kanela-agent-1.0.1.jar". However, when I do this, although the Kamon banner IS displaying on the console upon my call to Kamon.init(), my failed statement processor is NOT getting called when there is a SQL error.
Things I'm wondering:
Is there something about the way that spark-jdbc makes these JDBC calls that would prevent a javaagent from "seeing" them?
Does my call to Kamon.init() somehow only apply to code in the Spark driver, and not the executor?
Any other reason that you can think of that would be preventing this from working?

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job to a Dataproc cluster using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator. When a job fails, I can see a link to the failure log in the Google Cloud console under Dataproc -> Jobs:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.

Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, could be worth sending a pull request to Airflow yourself! Or otherwise filing a feature request https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
with params
jobId, projectId, region
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URL (it's not just a single file, it's a set of files).
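If the failure message text is available to your Python code (for example from a task log), the pieces above can be put together roughly like this. The helper names are hypothetical, the regular expression is only an assumption about how the gs:// link appears in the message, and nothing here calls an Airflow or Dataproc API.

```python
import re

# Grab a gs:// URI up to the next whitespace or quote character.
GCS_URI_RE = re.compile(r"gs://[^\s'\"]+")


def extract_log_uri(message):
    """Return the first gs:// URI in the failure message, or None."""
    match = GCS_URI_RE.search(message)
    return match.group(0) if match else None


def console_url(job_id, project_id, region):
    """Fill in the Dataproc UI URL template given above."""
    return "https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s" % (
        job_id, project_id, region)


def gsutil_cat_command(log_uri):
    """Build the gsutil command; the trailing glob covers the set of output files."""
    return ["gsutil", "cat", log_uri + "*"]
```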

"FAILED: Execution Error, return code 3" after setting Hive engine from mr to Spark

I am trying to use the Spark engine for my Hive query.
It is an old query, and I don't want to convert the whole code to a Spark job.
But when I run the query, it gives the following error:
Status: Failed
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
The only thing I have changed is the execution engine:
set hive.execution.engine=spark;
The above change works for other similar queries. So I don't think that it's a configuration issue...
Or am I not aware that it is?
Has anybody faced this issue before?

Check the logs of the job to see the true error. Return code 1, 2 and 3 are all generic errors in both MR and Spark.

Use beeline's verbose mode to run the query.
Check the query exception logs, HiveServer logs, Spark logs and Spark web UI worker logs (these often have the exact stack trace).
Try running Spark in local mode.
Which versions of Hive, Spark and Hadoop do you use?

Execute the command below in a Hive client connected to HiveServer2 over JDBC:
set hive.auto.convert.join=false;
It works for me.
Here is the detailed reason: https://www.cnblogs.com/CYan521/p/16716361.html

spark-submit in cluster deploy mode get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but after being stuck for some days I still have no clue how to approach it.
My use case: I have lots of Spark jobs submitted with the 'spark2-submit' command, and I need the application ID printed to the console once they are submitted. The Spark jobs are submitted in cluster deploy mode. (In normal client mode, it is printed.)
Points I need to consider for the solution: I am not supposed to change the application code (it would take a long time, because there are many applications running). I can only provide log4j properties or some custom code.
My approach:-
1) I have tried changing the log4j levels and various log4j parameters but the logging still goes to the centralized log directory.
Part from my log4j.properties:-
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener. With it I am able to get the Spark application ID after the application finishes, but not on the console.
Code logic :-
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to console.
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified a directory in HDFS for the event logs in the spark-submit command.
If anyone could help me find an approach to the solution, it would be really helpful. Or, if I am doing something very wrong, any insights would help me.
Thanks.

After being stuck at the same place for some days, I finally found a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread on the machine from which the user submits. I was passing the log4j configs to the driver and executors, but I had missed that the log4j config for the "Client" itself was not set.
So we need to use :-
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
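If the submission is driven from Python, the same idea could be sketched as follows. Assumptions: the application ID appears in the client's console output once the log4j config above takes effect, and it has the usual YARN form application_<timestamp>_<sequence>. The helper names are hypothetical.

```python
import os
import re
import subprocess

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_RE = re.compile(r"application_\d+_\d+")


def client_env(log4j_path, base_env=None):
    """Copy the environment and set SPARK_SUBMIT_OPTS as in the command above."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_SUBMIT_OPTS"] = (
        "-Dlog4j.debug=true -Dlog4j.configuration=%s" % log4j_path)
    return env


def find_application_id(text):
    """Return the first application ID in the text, or None."""
    match = APP_ID_RE.search(text)
    return match.group(0) if match else None


def submit_and_find_app_id(spark_submit_args, log4j_path):
    """Run spark-submit, print the first application ID seen, return (id, exit code)."""
    proc = subprocess.Popen(
        ["spark-submit", *spark_submit_args],
        env=client_env(log4j_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    app_id = None
    for line in proc.stdout:  # keep draining so the pipe never fills up
        if app_id is None:
            app_id = find_application_id(line)
            if app_id is not None:
                print("application id:", app_id)
    return app_id, proc.wait()
```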

To clarify:
client mode means the Spark driver runs on the same machine you ran spark-submit from
cluster mode means the Spark driver runs somewhere out on the cluster
You mentioned that the ID is logged when you run the app in client mode and that you can see it in the console. Your output is also logged when you run in cluster mode; you just can't see it, because the driver is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.

pass custom exitcode from yarn-cluster mode spark to CLI

I started a Spark job in YARN cluster mode through spark-submit.
To indicate partial failure and similar conditions, I want to pass an exit code from the driver to the script that calls spark-submit.
I tried both System.exit and throwing SparkUserAppException in the driver, but in both cases the CLI only got 1, not the exit code I passed.
I think it is impossible to pass a custom exit code, since any exit code returned by the driver is converted to a YARN status, and YARN converts any failed exit code to 1 or failed.

From looking at the Spark code, I can conclude this:
It is possible in client mode. Look at the runMain() method of the SparkSubmit class.
In cluster mode, however, it is not possible to get the exit status of the driver, because the driver runs on the cluster (inside a YARN container) rather than in the spark-submit process.
There is an alternative solution that may or may not be suitable for your use case:
Host a REST API with an endpoint that receives status updates from your driver code. In the case of any exceptions, let your driver code use this endpoint to update the status.

You can save the exit code in an output file (on HDFS or the local FS) and make your script wait for this file to appear, read it, and proceed. This is definitely not an elegant way, but it may help you move forward.
When saving the file, pay attention to the permissions on this location. Your Spark process has to have RW access.
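The waiting side of this idea could look roughly like the sketch below. It assumes the driver writes the exit code as plain text to a path on a file system the calling script can read directly; for HDFS you would need an HDFS client instead of open(). The function name and defaults are hypothetical.

```python
import os
import time


def wait_for_exit_code(path, timeout_s=600.0, poll_s=1.0):
    """Poll until the file exists and is non-empty, then return its content as an int."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            with open(path) as handle:
                text = handle.read().strip()
            if text:  # ignore a file that was created but not yet written
                return int(text)
        if time.monotonic() >= deadline:
            raise TimeoutError("no exit code found at %s" % path)
        time.sleep(poll_s)
```

The calling script can then pass the returned value to sys.exit so that the custom code reaches whatever invoked it.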
