How to pull the client (driver) logs of Spark jobs submitted via the Apache Livy batches POST method using Airflow

I am submitting a Spark job using the Apache Livy batches POST method.
The HTTP request is sent using Airflow. After submitting the job, I track its status using the batch ID.
I want to show the driver (client) logs in the Airflow logs, to avoid having to check multiple places (Airflow and Apache Livy / Resource Manager).
Is this possible using the Apache Livy REST API?

Livy has endpoints to fetch logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python functions like the ones shown below (sketched here inside an illustrative class for a custom Airflow operator) to fetch the logs:
import json

from airflow.hooks.http_hook import HttpHook  # in Airflow 2.x: airflow.providers.http.hooks.http


class LivyLogMixin:  # illustrative name; mix these methods into your custom operator
    def __init__(self, http_conn_id):
        # Airflow HTTP connection whose host points at the Livy server
        self.http = HttpHook("GET", http_conn_id=http_conn_id)

    def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
        if not extra_options:
            extra_options = {}
        self.http.method = method
        response = self.http.run(endpoint, json.dumps(data) if data else None,
                                 headers, extra_options=extra_options)
        return response

    def _get_batch_session_logs(self, batch_id):
        method = "GET"
        endpoint = "batches/" + str(batch_id) + "/log"
        response = self._http_rest_call(method=method, endpoint=endpoint)
        # return response.json()
        return response
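As an illustration (a sketch only, assuming these methods live in a custom Airflow operator, so self.log is the operator's logger), you can then print the returned log lines straight into the Airflow task log:

    # Hypothetical helper inside the same operator class
    def _print_batch_logs(self, batch_id):
        response = self._get_batch_session_logs(batch_id)
        # GET /batches/{batchId}/log returns JSON with a "log" array of lines
        for line in response.json().get("log", []):
            self.log.info(line)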

Livy exposes its REST API in two ways: sessions and batches. In your case, since you are not using a session, you are submitting via batches. You can post your batch with a curl command such as the following (the JSON body must at least specify the application file):
curl -X POST -H "Content-Type: application/json" -d '{"file": "/path/to/your-app.jar", "className": "com.example.YourApp"}' http://livy-server-IP:8998/batches
Once you have submitted the job, the response contains the batch ID. You can then fetch the logs with:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
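If you want to script the same flow instead of using curl, here is a minimal sketch with Python's requests (the server address, jar path, and class name are placeholders, not values from your setup):

import requests

livy = "http://livy-server-IP:8998"

# Submit the batch; the response JSON contains the batch id.
batch = requests.post(
    f"{livy}/batches",
    json={"file": "/path/to/your-app.jar", "className": "com.example.YourApp"},
    headers={"Content-Type": "application/json"},
).json()
batch_id = batch["id"]

# Fetch the driver/client log lines accumulated so far for that batch.
logs = requests.get(f"{livy}/batches/{batch_id}/log").json()
for line in logs.get("log", []):
    print(line)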
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFlow) from the AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job, tracks its status every 30 seconds (configurable), and surfaces the Spark logs in the Airflow UI logs at the end of the Spark job.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs in one place, instead of switching between Airflow and EMR/Spark logs (Ambari/Resource Manager).

Related

How to call Cluster API and start cluster from within Databricks Notebook?

Currently we are using a bunch of notebooks to process our data in Azure Databricks, mainly with Python/PySpark.
What we want to achieve is to make sure that our clusters are started (warmed up) before initiating the data processing. For that reason we are exploring ways to access the Clusters API from within Databricks notebooks.
So far we tried running the following:
import subprocess

cluster_id = "XXXX-XXXXXX-XXXXXXX"
subprocess.run(
    [f'databricks clusters start --cluster-id "{cluster_id}"'], shell=True
)
which, however, returns the following, and nothing really happens afterwards; the cluster is not started:
CompletedProcess(args=['databricks clusters start --cluster-id "0824-153237-ovals313"'], returncode=127)
Is there any convenient way to call the Clusters API from within a Databricks notebook (or perhaps via a curl command), and how is this achieved?
Most probably the error comes from incorrectly configured credentials or environment; note that a shell return code of 127 means the databricks command itself was not found.
Instead of using the command-line application, it's better to call the Start endpoint of the Clusters REST API. This could be done with something like this:
import requests

ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
cluster_id = "some_id"  # put your cluster ID here

requests.post(
    f'https://{host_name}/api/2.0/clusters/start',
    json={'cluster_id': cluster_id},
    headers={'Authorization': f'Bearer {host_token}'}
)
and then you can monitor the status using the Get endpoint until it gets into the RUNNING state:
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()
status = response['state']
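For completeness, a small polling sketch (assuming the host_name, host_token, and cluster_id variables from above, and the standard cluster state names) could look like this:

import time

# Poll the Get endpoint until the cluster is up, or bail out if it terminated.
while True:
    response = requests.get(
        f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
        headers={'Authorization': f'Bearer {host_token}'}
    ).json()
    status = response['state']
    if status == 'RUNNING':
        break
    if status in ('TERMINATED', 'ERROR'):
        raise RuntimeError(f'Cluster did not start, state: {status}')
    time.sleep(30)  # assumed poll interval, adjust to taste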

SnappyData REST API to Submit Job

I am trying to submit a Snappy job using the REST API.
We have been able to submit a Snappy job using the snappy-job submit command-line tool.
I could not find any documentation on how to do the same thing through the REST API.
I found it mentioned somewhere in the forum that SnappyData uses the spark-jobserver REST API.
Could you point me to the documentation / user guide on how to do that?
SnappyData internally uses spark-jobserver for submitting jobs. Hence, all the spark-jobserver REST APIs are accessible on SnappyData's lead node.
You can refer to all spark-jobserver API here: https://github.com/SnappyDataInc/spark-jobserver#api
Here are some useful curl commands to clarify it further:
deploy the application jar on the job server:
curl --data-binary @/path/to/application.jar localhost:8090/jars/testApp
testApp is the name of the job-server app which will be used to submit the job.
create context:
curl -X POST "localhost:8090/contexts/testSnappyContext?context-factory=org.apache.spark.sql.SnappySessionFactory"
testSnappyContext is the name of the context which will be used to submit the job.
Also, note that we are passing a custom context-factory argument here which is necessary for submitting snappy job.
submit the job:
curl -d "configKey1=configValue1,configKey2=configValue2" "localhost:8090/jobs?appName=testApp&classPath=com.package.Main&context=testSnappyContext"
com.package.Main is the fully-qualified name of the class which is extending org.apache.spark.sql.SnappySQLJob.
stop the job:
curl -X DELETE localhost:8090/jobs/bfed84a1-0b06-47ca-81a7-9b8defb51e38
bfed84a1-0b06-47ca-81a7-9b8defb51e38 is the job-id which you will get in the response of job submit request
stop the context:
curl -X DELETE localhost:8090/contexts/testSnappyContext
undeploy the application jar:
The version of job-server being used by snappydata doesn't have a RESTful API exposed for undeploying the jar. However, deploying any jar with the same app name (testApp in our example) will override the previously deployed jar for the same app.
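If you would rather drive the same flow from code than from curl, here is a minimal sketch using Python's requests (the host, jar path, app name, context name, and class name simply mirror the placeholders used in the curl commands above):

import requests

base = "http://localhost:8090"

# Deploy the application jar under the app name "testApp".
with open("/path/to/application.jar", "rb") as jar:
    requests.post(f"{base}/jars/testApp", data=jar)

# Create a context with the Snappy-specific context factory.
requests.post(
    f"{base}/contexts/testSnappyContext",
    params={"context-factory": "org.apache.spark.sql.SnappySessionFactory"},
)

# Submit the job; the response contains the job id used for status/stop calls.
resp = requests.post(
    f"{base}/jobs",
    params={
        "appName": "testApp",
        "classPath": "com.package.Main",
        "context": "testSnappyContext",
    },
    data="configKey1=configValue1,configKey2=configValue2",
)
print(resp.json())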

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator to a Dataproc cluster. When some of the jobs fail, I can see a link to the failure log under Google Cloud -> Dataproc -> Jobs:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.
Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, could be worth sending a pull request to Airflow yourself! Or otherwise filing a feature request https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
with the jobId, projectId, and region parameters filled in.
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URI (it's not just a single file, it's a set of files).
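As a sketch only (not an existing Airflow feature), you could also read the driver output yourself from within a task, assuming the driveroutput URI from the failure message above and the google-cloud-storage client library:

from google.cloud import storage

# driveroutput prefix taken from the job-failure message; replace with your own
log_uri = "gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput"
bucket_name, _, prefix = log_uri[len("gs://"):].partition("/")

client = storage.Client()
# The driver output is a set of numbered files, hence the prefix listing.
for blob in client.list_blobs(bucket_name, prefix=prefix):
    print(blob.download_as_text())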

How to set proxy user in Livy Job submit through its Java API

I am using Livy's Java API to submit a Spark job on YARN on my cluster. Currently the jobs are being submitted as the 'livy' user, but I want to submit the job as a proxy user through Livy.
It is possible to do this by sending a POST request to the Livy server and passing a field in the POST data. I was wondering whether this could be done through Livy's Java API.
I am using the standard way to submit a job:
import java.io.File;
import java.net.URI;

import org.apache.livy.LivyClient;
import org.apache.livy.LivyClientBuilder;

LivyClient client = new LivyClientBuilder()
        .setURI(new URI(livyUrl))
        .build();
try {
    System.err.printf("Uploading %s to the Spark context...\n", piJar);
    client.uploadJar(new File(piJar)).get();

    System.err.printf("Running PiJob with %d samples...\n", samples);
    double pi = client.submit(new PiJob(samples)).get();

    System.out.println("Pi is roughly: " + pi);
} finally {
    client.stop(true);
}
Posting answer to my own question.
Currently there is no way to set the proxy user through the LivyClientBuilder.
A workaround for this is:
Create the session through the REST API (a POST request to <livy-server>/sessions/) and read the session ID from the response. The proxy user can be set via the REST API by passing it in the POST data: {"kind": "spark", "proxyUser": "lok"}
Once the session is created, connect to it using that ID via LivyClientBuilder (the livyUrl would be <livy-server>/sessions/<id>/).
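As a sketch of the first step (the Livy host below is a placeholder), the session could be created and its ID read back like this, before handing the session URL to LivyClientBuilder:

import requests

livy_server = "http://livy-server:8998"  # assumed Livy server address

# Step 1: create the session via REST, setting the proxy user in the POST data.
resp = requests.post(
    f"{livy_server}/sessions",
    json={"kind": "spark", "proxyUser": "lok"},
    headers={"Content-Type": "application/json"},
)
session_id = resp.json()["id"]

# Step 2: point LivyClientBuilder at the existing session, as described above.
print(f"{livy_server}/sessions/{session_id}/")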

How to run Spark Application as daemon

I have a basic question about running a Spark application.
I have a Java client which sends me requests to query data residing in HDFS.
The requests come in as REST API calls over HTTP, and I need to interpret each request, form Spark SQL queries, and return the response to the client.
I don't understand how I can make my Spark application a daemon which waits for requests and can execute the queries using a pre-instantiated SQL context.
The best option I've seen for this use case is Spark Job Server, which will be the daemon app, with your driver code deployed to it as a named application.
This option also gives you further features, such as persistence.
With job server, you don't need to code your own daemon and your client apps can send REST requests directly to it, which in turn will execute the spark-submit tasks.
You can have a thread that runs in an infinite loop and does the calculations with Spark.
while (true) {
    val request = incomingQueue.poll()
    // Process the request with Spark
    val result = ...
    outgoingQueue.put(result)
}
Then, in the thread that handles the REST request, you put the request on the incomingQueue and wait for the result from the outgoingQueue.
// Create the request from the REST call
val request = ...
incomingQueue.put(request)
val result = outgoingQueue.poll()
return result
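A minimal, Spark-free sketch of this request/response pattern in Python (the Spark processing is replaced by a placeholder function; using one reply queue per request avoids mixing up results between concurrent callers):

import queue
import threading

incoming_queue = queue.Queue()

def process_with_spark(request):
    # Placeholder: run the Spark SQL query with the pre-instantiated
    # session/context here and build the response.
    return f"result for {request}"

def worker():
    # Daemon loop: block until a request arrives, process it, hand back the result.
    while True:
        request, reply_queue = incoming_queue.get()
        reply_queue.put(process_with_spark(request))

threading.Thread(target=worker, daemon=True).start()

def handle_rest_request(request):
    # Called from the thread handling the REST call: enqueue and wait for the answer.
    reply_queue = queue.Queue(maxsize=1)
    incoming_queue.put((request, reply_queue))
    return reply_queue.get()

print(handle_rest_request("SELECT 1"))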
