How to set proxy user in Livy Job submit through its Java API - apache-spark

I am using Livy's Java API to submit a Spark job on YARN on my cluster. Currently the jobs are submitted as the 'livy' user, but I want to submit them as a proxy user from Livy.
This is possible through the REST API by sending a POST request to the Livy server and passing a proxy-user field in the POST data. I was wondering whether the same can be done through Livy's Java API.
I am using the standard way to submit a job:
import java.io.File;
import java.net.URI;

import org.apache.livy.LivyClient;
import org.apache.livy.LivyClientBuilder;

LivyClient client = new LivyClientBuilder()
    .setURI(new URI(livyUrl))
    .build();

try {
    System.err.printf("Uploading %s to the Spark context...\n", piJar);
    client.uploadJar(new File(piJar)).get();

    System.err.printf("Running PiJob with %d samples...\n", samples);
    double pi = client.submit(new PiJob(samples)).get();

    System.out.println("Pi is roughly: " + pi);
} finally {
    client.stop(true);
}

Posting an answer to my own question.
Currently there is no way to set the proxy user through the LivyClientBuilder. A workaround is:
Create the session through the REST API (POST request to <livy-server>/sessions/) and read the session ID from the response. The proxy user can be set via the REST API by passing it in the POST data: {"kind": "spark", "proxyUser": "lok"}
Once the session is created, connect to it using the ID via LivyClientBuilder (the livyUrl would be <livy-server>/sessions/<id>/), as in the sketch below.
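A minimal sketch of this workaround, assuming the Livy server address is held in livyUrl, that plain java.net.HttpURLConnection is acceptable for the REST call, and that the org.json library is available for parsing the response (these choices are illustrative, not part of the original answer):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import org.apache.livy.LivyClient;
import org.apache.livy.LivyClientBuilder;
import org.json.JSONObject;

public class ProxyUserSession {
    public static void main(String[] args) throws Exception {
        String livyUrl = "http://livy-server:8998"; // placeholder host

        // 1. Create the session over REST so that proxyUser can be set.
        HttpURLConnection conn =
                (HttpURLConnection) new URL(livyUrl + "/sessions/").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write("{\"kind\": \"spark\", \"proxyUser\": \"lok\"}"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // 2. Read the session id from the JSON response.
        String body;
        try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())
                .useDelimiter("\\A")) {
            body = s.hasNext() ? s.next() : "";
        }
        int sessionId = new JSONObject(body).getInt("id");

        // 3. Attach the Java API client to the existing session and use it as usual.
        LivyClient client = new LivyClientBuilder()
                .setURI(new URI(livyUrl + "/sessions/" + sessionId + "/"))
                .build();
        // ... uploadJar / submit jobs here as in the snippet above.
    }
}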

Related

SnappyData REST API to Submit Job

I am trying to submit a Snappy job using the REST API.
We have been able to submit a Snappy job using the snappy-job submit command line tool.
I could not find any documentation on how to do the same thing through the REST API.
I found it mentioned somewhere in the forum that SnappyData uses the spark-jobserver REST API.
Could you point me to the documentation / user guide on how to do that?
SnappyData internally uses spark-jobserver for submitting jobs, hence all the spark-jobserver REST APIs are accessible on SnappyData's lead node.
You can refer to all the spark-jobserver APIs here: https://github.com/SnappyDataInc/spark-jobserver#api
Here are some useful curl commands to clarify it further:
deploy the application jar on the job-server:
curl --data-binary @/path/to/application.jar localhost:8090/jars/testApp
testApp is the name of the job-server app which will be used to submit the job.
create context:
curl -X POST "localhost:8090/contexts/testSnappyContext?context-factory=org.apache.spark.sql.SnappySessionFactory"
testSnappyContext is the name of the context which will be used to submit the job.
Also note that we are passing a custom context-factory argument here, which is necessary for submitting a Snappy job.
submit the job:
curl -d "configKey1=configValue1,configKey2=configValue2" "localhost:8090/jobs?appName=testApp&classPath=com.package.Main&context=testSnappyContext"
com.package.Main is the fully-qualified name of the class which is extending org.apache.spark.sql.SnappySQLJob.
stop the job:
curl -X DELETE localhost:8090/jobs/bfed84a1-0b06-47ca-81a7-9b8defb51e38
bfed84a1-0b06-47ca-81a7-9b8defb51e38 is the job-id which you will get in the response of the job-submit request.
stop the context:
curl -X DELETE localhost:8090/contexts/testSnappyContext
undeploy the application jar:
The version of job-server used by SnappyData doesn't expose a REST API for undeploying a jar. However, deploying any jar with the same app name (testApp in our example) will override the previously deployed jar for that app.
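If you need to drive these calls from application code rather than curl, the submit step can be reproduced with any HTTP client. Here is a minimal sketch using plain java.net.HttpURLConnection, reusing the host, app name, class path and context from the curl example above (adjust these placeholders for your deployment):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class SnappyJobSubmit {
    public static void main(String[] args) throws Exception {
        // Same endpoint and query parameters as the "submit the job" curl example above.
        URL url = new URL("http://localhost:8090/jobs"
                + "?appName=testApp"
                + "&classPath=com.package.Main"
                + "&context=testSnappyContext");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // Job configuration goes in the request body, just like curl -d.
        try (OutputStream os = conn.getOutputStream()) {
            os.write("configKey1=configValue1,configKey2=configValue2"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // The response contains the job-id used later to poll or stop the job.
        try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())
                .useDelimiter("\\A")) {
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}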

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting a Spark job using the Apache Livy batches POST method.
This HTTP request is sent using Airflow. After submitting the job, I am tracking its status using the batch id.
I want to show the driver (client) logs in the Airflow logs, to avoid jumping between multiple places (Airflow and Apache Livy/Resource Manager).
Is this possible using the Apache Livy REST API?
Livy has endpoints to get logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python helpers (here wrapped in a small example class) like the ones shown below to get the logs:
import json

from airflow.hooks.http_hook import HttpHook  # Airflow 1.x import path


class LivyLogFetcher:
    def __init__(self, http_conn_id):
        self.http = HttpHook("GET", http_conn_id=http_conn_id)

    def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
        if not extra_options:
            extra_options = {}
        self.http.method = method
        response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
        return response

    def _get_batch_session_logs(self, batch_id):
        method = "GET"
        endpoint = "batches/" + str(batch_id) + "/log"
        response = self._http_rest_call(method=method, endpoint=endpoint)
        # return response.json()
        return response
Livy exposes its REST API in two modes: sessions and batches. In your case, since you are not using a session, you are submitting through batches. You can post your batch with a command along these lines (the file field must point to your application jar or Python file):
curl -X POST -H "Content-Type: application/json" -d '{"file": "<path-to-your-application>"}' http://livy-server-IP:8998/batches
Once you have submitted the job, you would get the batch ID in return. Then you can curl using the command:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
If you want to avoid the above steps, you can use a ready-made AMI (namely LightningFlow) from the AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job and tracks its status every 30 seconds (configurable), and it also surfaces the Spark logs at the end of the Spark job in the Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs in one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).

How to leverage a spark cluster from a web app?

A lot of people have asked this question, but there is no clear answer except links and references, and most of them are not recent. The question is this:
I have a web app that needs to leverage a Spark cluster to run a Spark SQL query. My understanding is that the spark-submit script is asynchronous, hence this won't work here. How do I leverage Spark in such a setup? Can I just write code in the web app like I do in a self-contained Spark application, i.e. create a context, set the master URL and do what I need to do? Will this work in a web app? If yes, then when would I need a job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setAppResource("/my/app.jar")
                .setMainClass("my.spark.app.Main")
                .setMaster("local")
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .launch();
        spark.waitFor();
    }
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think the options are:
1. Through a REST API like Livy (Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere) or Spark Job Server (REST APIs). See how they connect to Spark interactively using a kernel: https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s and https://developer.ibm.com/open/apache-toree/
2. Through JDBC, running via the Thrift JDBC/ODBC server (see the sketch after this list).
3. Through SSH: submit a job and wait for the YARN status (i.e. SSH to the cluster and do a spark-submit through YARN; YARN gives you an application ID and you can keep track of the application status with the yarn application -status command).
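As an illustration of the JDBC option, here is a minimal sketch, assuming a Spark Thrift JDBC/ODBC server is running on its default port 10000 and that the Hive JDBC driver is on the web app's classpath; the host, user and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Spark Thrift server speaks the HiveServer2 protocol,
        // so the standard Hive JDBC driver and jdbc:hive2:// URLs are used.
        String url = "jdbc:hive2://spark-thrift-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "some_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM some_table")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}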

How to run Spark Application as daemon

I have a basic question about running a Spark application.
I have a Java client which sends me requests to query data residing in HDFS.
The request comes in as a REST API call over HTTP, and I need to interpret the request, form Spark SQL queries and return the response back to the client.
I am unable to understand how I can make my Spark application a daemon which waits for requests and can execute the queries using a pre-instantiated SQL context.
The best option I've seen for this use case is Spark Job Server, which will be the daemon app, with your driver code deployed to it as a named application.
This option gives you even more features, such as persistence.
With job server, you don't need to code your own daemon and your client apps can send REST requests directly to it, which in turn will execute the spark-submit tasks.
You can have a thread that runs in an infinite loop and does the calculation with Spark:
while (true) {
    val request = incomingQueue.poll()
    // Process the request with Spark
    val result = ...
    outgoingQueue.put(result)
}
Then, in the thread that handles the REST request, you put the request in the incomingQueue and wait for the result from the outgoingQueue.
// Create the request from the REST call
val request = ...
incomingQueue.put(request)
val result = outgoingQueue.poll()
return result
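To make the queue-based pattern above a bit more concrete, here is a minimal Java sketch, assuming a BlockingQueue hand-off between the REST layer and a single Spark thread, and a SparkSession created once at startup (the queue names and query type are illustrative only):

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDaemon {
    // Hand-off queues between the REST-handling threads and the Spark thread
    // (illustrative; any thread-safe hand-off mechanism would do).
    static final BlockingQueue<String> incomingQueue = new LinkedBlockingQueue<>();
    static final BlockingQueue<List<Row>> outgoingQueue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        // The session (and with it the SQL context) is instantiated once and reused.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-daemon")
                .getOrCreate();

        while (true) {
            // Block until a REST thread hands over a SQL query string.
            String query = incomingQueue.take();
            List<Row> result = spark.sql(query).collectAsList();
            outgoingQueue.put(result);
        }
    }
}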

Hazelcast and Spring Session: REST API returns empty values

I am trying to integrate Spring Session and Hazelcast. I am using a very simple configuration:
com.hazelcast.config.Config cfg = new com.hazelcast.config.Config();
NetworkConfig netConfig = new NetworkConfig();
netConfig.setPort(SocketUtils.findAvailableTcpPort());
System.out.println("Hazelcast port #: " + netConfig.getPort());
cfg.setNetworkConfig(netConfig);
SerializerConfig serializer = new SerializerConfig()
        .setTypeClass(Object.class)
        .setImplementation(new ObjectStreamSerializer());
cfg.getSerializationConfig().addSerializerConfig(serializer);
return Hazelcast.newHazelcastInstance(cfg);
It is from the Spring docs example. Everything is ok, but when I try to get a session from Hazelcast with its REST API it returns an empty reply: curl: (52) Empty reply from server
$ curl -X GET http://localhost:port/hazelcast/rest/maps/spring:session:sessions/session-id
Where port is the port selected with SocketUtils.findAvailableTcpPort() and session-id is the session id from the browser.
How can I access my saved sessions with the Hazelcast REST API?
Update:
By adding cfg.setProperty("hazelcast.rest.enabled","true"); all problems disappeared.
You have to activate the REST API service which is disabled by default (for security reasons). Please see http://docs.hazelcast.org/docs/3.6/manual/html-single/index.html#system-properties and search for hazelcast.rest.enabled.
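When set through the Config object, the property needs to be in place before Hazelcast.newHazelcastInstance(cfg) is called. A minimal sketch of the fixed configuration from the question (network and serializer setup elided):

com.hazelcast.config.Config cfg = new com.hazelcast.config.Config();
cfg.setProperty("hazelcast.rest.enabled", "true"); // REST endpoints are disabled by default
// ... NetworkConfig and SerializerConfig setup as in the question ...
return Hazelcast.newHazelcastInstance(cfg);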
