SnappyData REST API to Submit Job - apache-spark

I am trying to submit Snappy Job using REST API.
We have been able to submit SnappyJob using snappy-job submit Command
Line tool.
I could not find any documentation how to do the same thing through
REST API.
I found somewhere mentioned in the forum that SnappyData is using the
spark jobserver REST API.
Could you point to the Documentation / User Guide how to do that?

Snappydata internally uses spark-jobserver for submitting the jobs. Hence, all the spark-jobserver REST APIs are accessible on Snappydata's lead node.
You can refer to all spark-jobserver API here: https://github.com/SnappyDataInc/spark-jobserver#api
Here are some useful curl commands to clarify it further:
deploy application jar on job-server:
curl --data-binary #/path/to/applicaton.jar localhost:8090/jars/testApp
testApp is the name of the job server app which will be used to submit the job
create context:
curl -X POST "localhost:8090/contexts/testSnappyContext?context-factory=org.apache.spark.sql.SnappySessionFactory"
testSnappyContext is the name of the context which will be used to submit the job.
Also, note that we are passing a custom context-factory argument here which is necessary for submitting snappy job.
submit the job:
curl -d "configKey1=configValue1,configKey2=configValue2" "localhost:8090/jobs?appName=testApp&classPath=com.package.Main&context=testSnappyContext"
com.package.Main is the fully-qualified name of the class which is extending org.apache.spark.sql.SnappySQLJob.
stop the job
curl -X DELETE localhost:8090/jobs/bfed84a1-0b06-47ca-81a7-9b8defb51e38
bfed84a1-0b06-47ca-81a7-9b8defb51e38 is the job-id which you will get in the response of job submit request
stop the context
curl -X DELETE localhost:8090/contexts/testSnappyContext
undeploying the application jar
The version of job-server being used by snappydata doesn't have a RESTful API exposed for undeploying the jar. However, deploying any jar with the same app name (testApp in our example) will override the previously deployed jar for the same app.

Related

Using logback.xml in Apache Livy

I am trying to use logback.xml with Apache Livy ReST API and I'm having trouble getting it to work. I've tried submitting the logback.xml path as follows.
data = {
"file" : "<PATH_TO_JAR_FILE>",
"className" : "<CLASS_NAME>",
"files": ["file:///path/to/log4j.properties", "file:///path/to/logback.xml"],
"conf": {"driver-java-options":"-Dlogback.configurationFile=file:///path/to/logback.xml"}
}
However, I get the warning Ignoring non-spark config property: driver-java-options=-Dlogback.configurationFile. I can add the logback.xml file using spark.driver.extraJavaOptions instead of driver-java-options, but it doesn't make a difference since the JVM has already started. So my question is, How do I get Livy to use the logback.xml.
I see that you're trying to configure logging of Apache Livy server. But using its REST API (IIUC you are sending POST /batches or /sessions) you are submitting Spark job and feeding logback.xml to Spark Driver process, which is separate from Livy Server process.
What you need to do is to pass -Dlogback.configurationFile=... option to Livy Server process on its startup. Most probably (depending on your Livy version and environment) you should set export LIVY_SERVER_JAVA_OPTS=-Dlogback.configurationFile=... environment variable and start Livy with $LIVY_HOME/bin/livy-server start. Refer the script on GitHub.

How to get spark SUBMISSION_ID with spark-submit?

Many places need SUBMISSION_ID, like spark-submit --status and Spark REST API. But how can I get this SUBMISSION_ID when I use spark-submit command to submit spark jobs?
P.S.:
I use python [popen][2] to start a spark-submit job. I want SUBMISSION_ID so my python program can monitor spark job status via REST API: <ip>:6066/v1/submissions/status/<SUBMISSION_ID>
Thanks to the clue by #Pandey. The answer https://stackoverflow.com/a/37980813/5634636 helps me a lot.
TL;DR
If you want to submit spark job locally, the answer https://stackoverflow.com/a/37980813/5634636 indeed works.
The only point is you must use cluster mode to submit your job,
i.e., use parameter --deploy-mode cluster.
If you want to submit a spark job remotely, use Spark submission API. It will help a lot. See https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/ for details.
Detailed description
NOTE: I only test my approaches on Apache Spark 2.3.1. I can't guarantee that it will work in other versions as well.
Let's clear my requirement first. There're 3 features I wanted:
remote submit a spark job
check job status anytime (RUNNING, ERROR, FINISHED...)
get the error message if there is something error
Submit locally
NOTE: this answer only works in cluster mode
The Spark tool spark-submit will help.
To submit a job, see
https://spark.apache.org/docs/2.4.0/submitting-applications.html#launching-applications-with-spark-submit
To check the status, see https://stackoverflow.com/a/37420931/5634636. In this way, you need a SubmissionID. This answer https://stackoverflow.com/a/37980813/5634636 told you how to get a submission id in cluster mode. The submission id looks like driver-20190315142356-0004.
The error message is included in the job status message.
Submit remotely
Spark submission API is recommended. It seems that there is not any documentation on Apache Spark official website, so some people call it hidden API. For details, see: https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/
To submit Spark job, use submit API
To get the status of the job, use status API: http://<master-ip>:6066/v1/submissions/status/<submission-id>. The submission-id will be returned in a json when you submit jobs.
The error message is included in the status message.
More about the error message: note the difference between status ERROR and FAILED. In short, the FAILED means that there is something wrong during executing Spark jobs (e.g. uncaught exceptions), while the ERROR means there's something error during submitting (e.g. invalid jar path). The error message is included in the status json. If you want to view the FAILED reason, it can be accessed via http://<driver-ip>:<ui-port>/log/<submission-id>.
Here is an example of error status (**** is an incorrect jar path which is miswritten intentionally):
{
"action" : "SubmissionStatusResponse",
"driverState" : "ERROR",
"message" : "Exception from the cluster:\njava.io.FileNotFoundException: File hdfs:**** does not exist.\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)\n\torg.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860)\n\torg.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:727)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:695)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:488)\n\torg.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)\n\torg.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)",
"serverSparkVersion" : "2.3.1",
"submissionId" : "driver-20190315160943-0005",
"success" : true,
"workerHostPort" : "172.18.0.4:36962",
"workerId" : "worker-20190306214522-172.18.0.4-36962"
}

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator to Dataproc cluster. When some of the jobs fail in googlecloud->dataproc->jobs I can see a link to the log with failure:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.
Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, could be worth sending a pull request to Airflow yourself! Or otherwise filing a feature request https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
with params
jobId, projectId, region
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URL (it's not just a single file, it's a set of files).

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting Spark job using Apache Livy batches POST method.
This HTTP request is send using AirFlow. After submitting job, I am tracking status using batch Id.
I want to show driver ( client logs) logs on Air Flow logs to avoid going to multiple places AirFLow and Apache Livy/Resource Manager.
Is this possible to do using Apache Livy REST API?
Livy has an endpoint to get logs /sessions/{sessionId}/log & /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create python functions like the one shown below to get logs:
http = HttpHook("GET", http_conn_id=http_conn_id)
def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
if not extra_options:
extra_options = {}
self.http.method = method
response = http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
return response
def _get_batch_session_logs(self, batch_id):
method = "GET"
endpoint = "batches/" + str(batch_id) + "/log"
response = self._http_rest_call(method=method, endpoint=endpoint)
# return response.json()
return response
Livy exposes REST API in 2 ways: session and batch. In your case, since we assume you are not using session, you are submitting using batches. You can post your batch using the curl command:
curl http://livy-server-IP:8998/batches
Once you have submitted the job, you would get the batch ID in return. Then you can curl using the command:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFLow) from AWS Marketplace which provides Airflow with a custom Livy operator. Livy operator submits and tracks the status of the job every 30 sec (configurable), and it also provides spark logs at the end of the spark job in Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs at one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).

How to leverage a spark cluster from a web app?

A lot of people have asked this question but there is no clear answer except links and references and also most of them are not recent. The question is this :
I have a web app that needs to leverage a spark cluster to run a spark-sql query. My understanding is that submit-job script is asynchronous hence this won't work here. How do I leverage spark in such a setup? Can I just write code in the web app like I do in a self-contained spark application i.e. create a context, set the master URL and do what I need to do ? Will this work in a web app ? If yes, then when would I need the job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;
public class MyLauncher {
public static void main(String[] args) throws Exception {
Process spark = new SparkLauncher()
.setAppResource("/my/app.jar")
.setMainClass("my.spark.app.Main")
.setMaster("local")
.setConf(SparkLauncher.DRIVER_MEMORY, "2g")
.launch();
spark.waitFor();
}
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think options will be
Through rest api like Livy (Livy is a new open source Spark REST
Server for submitting and interacting with your Spark jobs from
anywhere. ) or spark server (REST APIs) - See how they connect to
spark interactively from using kernel -
https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s
https://developer.ibm.com/open/apache-toree/
Through jdbc (Running via the Thrift JDBC/ODBC server)
Through ssh and submit a job and wait for yarn status (this will
be SSH to the cluster and do a spark submit through YARN - YARN
give you an application ID and you can keep track of application
status with yarn application status command)

Resources