Spark - pass property to spark-submit - apache-spark

I am using Spark 1.5.1 with --master yarn-cluster. What I am trying to accomplish is to pass a variable to the spark-submit command that uniquely identifies the spawned application. I submit Spark jobs from an external application via a web service (we have a simple Dropwizard web layer with an endpoint that submits applications). Another web service returns the status of the operation for a given identifier. The flow:
SUBMIT JOB:
MyApp -> "/Dropwizard/submit-job?id=100" -> Dropwizard -> "spark-submit --conf=id=100" -> Spark
GET STATUS
MyApp -> "/Dropwizard/status?id=100" -> Dropwizard -> "this will get information from files that are created when application runs. Files will have id in their names"
The problem is that sparkContext.getConf().get("id") returns null.
Can you please give me a clue how to use --conf, or suggest another way to resolve the problem?

It should be --conf id=100 (a space after --conf, not an equals sign), as shown in the samples here.
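For example, a minimal sketch (the spark.myapp.id key, class name, and jar name are placeholders; note that spark-submit normally only forwards configuration keys that start with spark., so a bare key like id may be dropped with a warning):

// Submit with: spark-submit --master yarn-cluster --conf spark.myapp.id=100 --class com.example.MyJob my-job.jar
import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    // getOption avoids an exception if the key was not passed
    val id = sc.getConf.getOption("spark.myapp.id").getOrElse("unknown")
    println("Running with id=" + id)
    sc.stop()
  }
}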

Related

SnappyData REST API to Submit Job

I am trying to submit a Snappy job using the REST API.
We have been able to submit a SnappyJob using the snappy-job submit command-line tool.
I could not find any documentation on how to do the same thing through the REST API.
I found it mentioned somewhere in the forum that SnappyData uses the spark-jobserver REST API.
Could you point me to the documentation / user guide on how to do that?
SnappyData internally uses spark-jobserver for submitting jobs. Hence, all the spark-jobserver REST APIs are accessible on SnappyData's lead node.
You can refer to all spark-jobserver API here: https://github.com/SnappyDataInc/spark-jobserver#api
Here are some useful curl commands to clarify it further:
deploy application jar on job-server:
curl --data-binary @/path/to/application.jar localhost:8090/jars/testApp
testApp is the name of the job server app which will be used to submit the job
create context:
curl -X POST "localhost:8090/contexts/testSnappyContext?context-factory=org.apache.spark.sql.SnappySessionFactory"
testSnappyContext is the name of the context which will be used to submit the job.
Also, note that we are passing a custom context-factory argument here which is necessary for submitting snappy job.
submit the job:
curl -d "configKey1=configValue1,configKey2=configValue2" "localhost:8090/jobs?appName=testApp&classPath=com.package.Main&context=testSnappyContext"
com.package.Main is the fully-qualified name of the class which is extending org.apache.spark.sql.SnappySQLJob.
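For illustration, a rough sketch of such a class (method signatures follow the SnappyData documentation and may differ between versions; the class name and config keys are just the placeholders from the curl command above):

import com.typesafe.config.Config
import org.apache.spark.sql._

class Main extends SnappySQLJob {
  // Values passed in the body of the submit request (configKey1=..., configKey2=...) arrive here.
  override def runSnappyJob(session: SnappySession, jobConfig: Config): Any = {
    val value1 = jobConfig.getString("configKey1")
    session.sql(s"SELECT '$value1' AS config_value").show()
  }

  // Reject the job here if required configuration is missing.
  override def isValidJob(session: SnappySession, config: Config): SnappyJobValidation =
    SnappyJobValid()
}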
stop the job:
curl -X DELETE localhost:8090/jobs/bfed84a1-0b06-47ca-81a7-9b8defb51e38
bfed84a1-0b06-47ca-81a7-9b8defb51e38 is the job-id which you will get in the response of job submit request
stop the context:
curl -X DELETE localhost:8090/contexts/testSnappyContext
undeploy the application jar:
The version of job-server used by SnappyData doesn't expose a RESTful API for undeploying the jar. However, deploying any jar with the same app name (testApp in our example) will override the previously deployed jar for the same app.

How to get spark SUBMISSION_ID with spark-submit?

Many places need a SUBMISSION_ID, such as spark-submit --status and the Spark REST API. But how can I get this SUBMISSION_ID when I use the spark-submit command to submit Spark jobs?
P.S.:
I use Python's popen to start a spark-submit job. I want the SUBMISSION_ID so my Python program can monitor the Spark job status via the REST API: <ip>:6066/v1/submissions/status/<SUBMISSION_ID>
Thanks to the clue from @Pandey. The answer https://stackoverflow.com/a/37980813/5634636 helped me a lot.
TL;DR
If you want to submit a Spark job locally, the answer https://stackoverflow.com/a/37980813/5634636 indeed works.
The only catch is that you must submit your job in cluster mode, i.e., with the parameter --deploy-mode cluster.
If you want to submit a Spark job remotely, use the Spark submission API. It helps a lot. See https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/ for details.
Detailed description
NOTE: I have only tested these approaches on Apache Spark 2.3.1. I can't guarantee that they will work in other versions as well.
Let me state my requirements first. There are 3 features I wanted:
submit a Spark job remotely
check the job status at any time (RUNNING, ERROR, FINISHED...)
get the error message if something goes wrong
Submit locally
NOTE: this answer only works in cluster mode
The Spark tool spark-submit will help.
To submit a job, see
https://spark.apache.org/docs/2.4.0/submitting-applications.html#launching-applications-with-spark-submit
To check the status, see https://stackoverflow.com/a/37420931/5634636. For this you need a submission ID. The answer https://stackoverflow.com/a/37980813/5634636 explains how to get a submission ID in cluster mode. The submission ID looks like driver-20190315142356-0004.
The error message is included in the job status message.
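As a rough sketch of the local approach (Scala here rather than Python; com.example.MyJob, the jar path, and the master URL are placeholders, and the regex assumes the submission ID shows up in spark-submit's output in the driver-<timestamp>-<seq> form mentioned above, which can vary with the Spark version and cluster manager):

import scala.sys.process._

val cmd = Seq(
  "spark-submit",
  "--master", "spark://master-host:6066",
  "--deploy-mode", "cluster",
  "--class", "com.example.MyJob",
  "hdfs:///jars/my-job.jar")

// Capture everything spark-submit prints so we can search it afterwards.
val output = new StringBuilder
val exitCode = cmd.!(ProcessLogger(line => output.append(line).append('\n')))

// Pull out a submission ID such as driver-20190315142356-0004.
val submissionId = "driver-\\d{14}-\\d{4}".r.findFirstIn(output.toString)
println(s"exit code: $exitCode, submission id: $submissionId")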
Submit remotely
The Spark submission API is recommended. There doesn't seem to be any documentation for it on the official Apache Spark website, so some people call it a hidden API. For details, see: https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/
To submit a Spark job, use the submit API: http://<master-ip>:6066/v1/submissions/create
To get the status of the job, use the status API: http://<master-ip>:6066/v1/submissions/status/<submission-id>. The submission-id is returned in a JSON response when you submit the job.
The error message is included in the status message.
More about the error message: note the difference between the statuses ERROR and FAILED. In short, FAILED means that something went wrong while executing the Spark job (e.g. an uncaught exception), while ERROR means that something went wrong during submission (e.g. an invalid jar path). The error message is included in the status JSON. If you want to see the reason for a FAILED state, it can be accessed via http://<driver-ip>:<ui-port>/log/<submission-id>.
Here is an example of an error status (**** is an intentionally incorrect jar path):
{
"action" : "SubmissionStatusResponse",
"driverState" : "ERROR",
"message" : "Exception from the cluster:\njava.io.FileNotFoundException: File hdfs:**** does not exist.\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)\n\torg.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860)\n\torg.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:727)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:695)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:488)\n\torg.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)\n\torg.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)",
"serverSparkVersion" : "2.3.1",
"submissionId" : "driver-20190315160943-0005",
"success" : true,
"workerHostPort" : "172.18.0.4:36962",
"workerId" : "worker-20190306214522-172.18.0.4-36962"
}
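For example, a minimal sketch of polling the status endpoint from code (plain Scala/JDK HTTP here; <master-ip> and the submission ID are placeholders, and a real JSON parser would be preferable to the regex):

import scala.io.Source

val submissionId = "driver-20190315160943-0005"
val statusUrl = s"http://<master-ip>:6066/v1/submissions/status/$submissionId"

// Fetch the status JSON and extract driverState (RUNNING, FINISHED, ERROR, FAILED, ...).
val json = Source.fromURL(statusUrl).mkString
val driverState = """"driverState"\s*:\s*"(\w+)""".r
  .findFirstMatchIn(json).map(_.group(1)).getOrElse("UNKNOWN")
println(s"driverState = $driverState")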

spark-submit in cluster deploy mode get application id to console

I am stuck on one problem which I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but I am still clueless about the approach, as I have been stuck for some days.
My use case: I have lots of Spark jobs submitted using the 'spark2-submit' command, and I need to get the application ID printed to the console once they are submitted. The Spark jobs are submitted in cluster deploy mode. (In normal client mode, it gets printed.)
Points I need to consider while creating a solution: I am not supposed to change application code (as it would take a long time, because there are many applications running); I can only provide log4j properties or some custom code on the submitting side.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener, and I am able to get the Spark application ID after the application finishes, but not printed to the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to console.
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified an HDFS directory to write the event logs to in the spark-submit command.
If anyone could help me in finding an approach to the solution, it would be really helpful. Or if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I was finally able to find a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread from the machine the user submits from. I was passing the log4j configs to the driver and executors, but missed the fact that the log4j config for the "Client" was missing.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
To clarify:
client mode means the Spark driver runs on the same machine you ran spark-submit from
cluster mode means the Spark driver runs out on the cluster somewhere
You mentioned that the ID gets logged when you run the app in client mode and you can see it in the console. Your output is also logged when you run in cluster mode; you just can't see it because the driver is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
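A rough sketch of that second idea, assuming the listener can be registered with spark.extraListeners so no application code changes are needed (the class name, HDFS path, and configuration below are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

// Register with: spark2-submit --conf spark.extraListeners=com.example.AppIdRecorder ...
// (the class must be on the driver's classpath, e.g. via --jars)
class AppIdRecorder(conf: SparkConf) extends SparkListener {
  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    val appId = event.appId.getOrElse("unknown")
    // Write the application ID to a shared location that the submitting machine can read.
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path(s"/tmp/submitted-app-ids/$appId"))
    out.writeUTF(appId)
    out.close()
  }
}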

submit job to remote hazelcast cluster

I'm new to Hazelcast Jet and have a very basic question. I have a 3-node Jet cluster set up. I have sample code that reads from Kafka and drains to an IMap. When I run it from the command line (using jet-submit.sh and JetBootstrap.getInstance() to acquire the Jet instance), it works perfectly fine. When I run the same code using Jet.newJetClient() to acquire the instance (Run As -> Java Application in Eclipse), I get:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field com.hazelcast.jet.core.ProcessorMetaSupplier.
Could you please let me know where I am going wrong?
One of your lambda functions captures an outside variable, probably defined at class level, and that class is not Serializable or not added to the Job config when submitting from client. This is done automatically when submitting via the script.
Please see http://docs.hazelcast.org/docs/jet/0.6.1/manual/#remember-that-a-jet-job-is-distributed
When you use a client instance to submit the job, you have to add all classes that contain the code called by the job to the JobConfig:
JobConfig config = new JobConfig();
config.addClass(...);
config.addJar(...);
...
client.newJob(pipeline, config);
For example, if you use a lambda for stage.map(), the class containing the lambda has to be added.
The jet-submit.sh script makes this easier by automatically adding the entire submitted .jar file.

Run spark-submit in Scala code

Is it possible to execute the spark-submit script below from within code and then get the application ID that will be assigned by YARN?
bin/spark-submit \
  --class com.my.application.XApp \
  --master yarn-cluster \
  --executor-memory 100m \
  --num-executors 50 \
  hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar \
  1000
This is to enable the user to start and stop the job via a REST API.
I found:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
But I couldn't find a method to get the application ID, and it also seems like app.jar has to be prebuilt before executing the above code?
Yes, your application jar does need to be prebuilt in those cases. Something like the Spark Job Server or the IBM Spark Kernel may be closer to what you want (although they reuse a SparkContext).
SparkLauncher will only submit your built application. To get the application ID, you need to access the SparkContext within your application jar.
In your example, you could access the application ID in "/my/app.jar" (perhaps in "my.spark.app.Main") with:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
...
val sc = new SparkContext(new SparkConf())
sc.applicationId
This application ID will be the YARN application ID when the application is built and submitted in yarn-cluster mode.
See the Spark Scala API docs.
Support for accessing launched applications seems to be coming in Spark 1.6 (SPARK-8673). A Scala example derived from this test suite is below.
val handle = new SparkLauncher()
  ... // application configuration
  .setMaster("yarn-client")
  .startApplication()

try {
  handle.getAppId() should startWith ("application_")
  handle.stop()
} finally {
  handle.kill()
}
Listeners may be attached to launched applications; this listener API is the recommended way to monitor launched applications. See this pull request for details.
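For example, a rough sketch of attaching a listener when starting the application (the jar path and main class are the placeholders from the earlier snippet):

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val listener = new SparkAppHandle.Listener {
  // Called whenever the application changes state (CONNECTED, SUBMITTED, RUNNING, FINISHED, ...).
  override def stateChanged(handle: SparkAppHandle): Unit =
    println(s"state=${handle.getState} appId=${handle.getAppId}")
  // Called when other information, such as the application ID, becomes available.
  override def infoChanged(handle: SparkAppHandle): Unit =
    println(s"appId=${handle.getAppId}")
}

val handle = new SparkLauncher()
  .setAppResource("/my/app.jar")
  .setMainClass("my.spark.app.Main")
  .setMaster("yarn-client")
  .startApplication(listener)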
SparkContext.applicationId is a unique identifier for the Spark application. Its format depends on the scheduler implementation (e.g. for a local Spark app it is something like 'local-1433865536131'; for YARN, something like 'application_1433865536131_34483').
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
