I am getting started with Apache Livy. Following the online documentation, I was able to submit a Spark job through curl (I have posted another question about converting that curl command to a REST call). My plan was to try it out with curl first and then convert the curl command to a REST API call from Scala. However, after spending an entire day trying to figure out how to convert the Livy curl call to REST, I feel like my understanding is wrong.
I am looking at this example from Cloudera, and I see that we have to create a LivyClient instance and use it to upload the application code to the Spark context. Is this the correct way? I have a use case where I need to trigger my Spark job from the cloud. Do I need to put the dependencies in the cloud and add them with uploadJar, as in the Cloudera example?
There are two options for interacting with the Livy server:
1. Use the Livy client, which makes it easier to interact with the Livy server.
2. Use the REST API that Livy exposes, which can be called programmatically.
Please check the link below:
https://livy.incubator.apache.org/docs/latest/rest-api.html
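To make the REST option a bit more concrete, here is a minimal sketch (in Python, standard library only) of what a batch submission to Livy's POST /batches endpoint could look like. The server URL, jar path and class name are made-up placeholders, and I am assuming the jar already sits somewhere the cluster can read (for example HDFS), so nothing is uploaded from the client. The same JSON body can be sent from Scala with any HTTP client.

```python
import json
import urllib.request


def build_batch_request(livy_url, jar_path, class_name, args=()):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {
        "file": jar_path,          # application jar, assumed readable by the cluster
        "className": class_name,   # main class of the Spark application
        "args": list(args),        # command-line arguments for the application
    }
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request):
    """Send the request and return Livy's JSON reply as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical values; replace them with your own server and jar.
    req = build_batch_request(
        "http://livy.example:8998",
        "hdfs:///jobs/my-spark-app.jar",
        "com.example.MySparkJob",
        ["2016-01-01"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # submit_batch(req) would actually send it to a running Livy server.
```

Building the request separately from sending it makes it easy to inspect the URL and body before pointing it at a real server.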
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for developing non-interactive jobs that will run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. It is just like running spark-shell: you run some commands and then use collect() to bring all the data back to be shown in the local environment, and all of that runs in a single JVM. The service would request executors on a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not recommended for large datasets, which are what Spark is meant for. Depending on the data you need (e.g. if you rely heavily on Spark SQL), it may be better to query a database directly.
I want to use Apache Spark in my Node.js application.
I have tried EclairJS, but I am having some issues implementing it.
EclairJS appears to be dead.
If you want to access Spark from Node, I would recommend using Livy.
Livy is a service that runs a Spark session and exposes a REST API to that session.
There seems to be a Node client already: https://www.npmjs.com/package/node-livy-client
(I have never used the Node client, so I can't say if it's any good.)
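Since the REST API is plain HTTP plus JSON, the calls look the same from any language. Below is a rough sketch in Python (standard library only) of the three requests involved in using an interactive Livy session: create a session, post a statement, and fetch the statement's result. The server URL and the ids are placeholders I made up, and a real client would also need to poll until the session is idle and the statement has finished.

```python
import json
import urllib.request


def _json_post(url, payload):
    """Build a POST request carrying a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(livy_url, kind="spark"):
    """POST /sessions: ask Livy to start an interactive session."""
    return _json_post(livy_url.rstrip("/") + "/sessions", {"kind": kind})


def run_statement_request(livy_url, session_id, code):
    """POST /sessions/{id}/statements: run a snippet of code in the session."""
    url = f"{livy_url.rstrip('/')}/sessions/{session_id}/statements"
    return _json_post(url, {"code": code})


def statement_result_request(livy_url, session_id, statement_id):
    """GET /sessions/{id}/statements/{statementId}: fetch state and output."""
    url = f"{livy_url.rstrip('/')}/sessions/{session_id}/statements/{statement_id}"
    return urllib.request.Request(url, method="GET")


def send(request):
    """Send a request and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    base = "http://livy.example:8998"  # hypothetical Livy server
    for req in (
        create_session_request(base),
        run_statement_request(base, 0, "sc.parallelize(1 to 10).sum()"),
        statement_result_request(base, 0, 0),
    ):
        print(req.get_method(), req.full_url)
```

From Node, the same three requests can be made with whatever HTTP client you already use.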
I am using Apache Spark in Bluemix.
I want to implement a scheduler for Spark SQL jobs. I saw this link to a blog that describes scheduling, but it's not clear how I should update the manifest. Maybe there is some other way to schedule my jobs.
The manifest file guides the deployment of Cloud Foundry (cf) apps. So in your case, it sounds like you want to deploy a cf app that acts as a Spark SQL scheduler, and use the manifest file to declare that your app doesn't need any of the web app routing stuff, or anything else meant for user-facing apps, because you just want to run a background scheduler. This is all well and good, and the cf docs will help you make that happen.
However, you cannot run a Spark SQL scheduler against the Bluemix Spark service today, because it only supports Jupyter notebooks through the Data-Analytics section of Bluemix; i.e., only a notebook UI. You need a Spark API that you could drive from your scheduler cf app, e.g. a spark-submit type of thing where you can create your Spark context and then run programs, like the Spark SQL you mention. This API is supposed to be coming to the Apache Spark Bluemix service.
UPDATE: spark-submit was made available sometime around the end of 1Q16. It is a shell script, but inside it makes REST calls via curl. The REST API doesn't yet seem to be supported, so you could either call the script from your scheduler, or take the risk of calling the REST API directly and hope it doesn't change and break you.
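For the "call the script from your scheduler" route, a very small sketch of the idea in Python could look like the following. The script path and its arguments are hypothetical (I am not listing the script's real flags here), and a real scheduler app would want proper logging, retries and a cron-like trigger instead of a fixed sleep.

```python
import subprocess
import time


def run_job(command, runner=subprocess.run):
    """Run one submission command; return True when it exits with status 0."""
    result = runner(command, check=False)
    return result.returncode == 0


def run_every(command, interval_seconds, iterations,
              runner=subprocess.run, sleep=time.sleep):
    """Run the command `iterations` times, pausing between runs.

    Returns one True/False outcome per run.
    """
    outcomes = []
    for i in range(iterations):
        outcomes.append(run_job(command, runner))
        if i < iterations - 1:
            sleep(interval_seconds)
    return outcomes


if __name__ == "__main__":
    # Hypothetical invocation: the script location and the job arguments are
    # placeholders, not the documented interface of the Bluemix script.
    command = ["./spark-submit.sh", "my-sparksql-job.jar"]
    print("would run:", " ".join(command))
    # run_every(command, interval_seconds=3600, iterations=24)
```

The runner and sleep parameters are only there so the loop can be exercised without launching real jobs.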
Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try my hand at the Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, and then process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, only Python boilerplate is supported at present. There are a few pieces of JNI beneath my Java API as well.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive purposes?
Can I create something like a pipeline (where the output of one stage goes to another) with Bluemix, or do I need to code it myself?
I will appreciate any and all help coming my way with respect to above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code or a batch program with spark-submit, alongside the notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
Regards
Anup
I know that Spark applications can be executed on YARN using spark-submit --master yarn.
The question is:
Is it possible to run a Spark application on YARN using the yarn command?
If so, the YARN REST API could be used as an interface for running Spark and MapReduce applications in a uniform way.
I see this question is a year old, but for anyone else who stumbles across it, it looks like this should be possible now. I've been trying to do something similar and have been attempting to follow the "Starting Spark jobs directly via YARN REST API" tutorial from Hortonworks.
Essentially what you need to do is upload your jar to HDFS, create a Spark Job JSON file per the YARN REST API Documentation, and then use a curl command to start the application. An example of that command is:
curl -s -i -X POST -H "Content-Type: application/json" ${HADOOP_RM}/ws/v1/cluster/apps \
    --data-binary @spark-yarn.json
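As a rough illustration of the same flow in code, here is a Python sketch (standard library only) of the two ResourceManager calls: one to ask for a new application id, and one to submit the application description. The ResourceManager address, application name and ApplicationMaster command are placeholders I made up. A real Spark submission also needs the local-resources and environment sections of the container spec (the jar in HDFS, the classpath and so on), which I have left out here.

```python
import json
import urllib.request


def new_application_request(rm_url):
    """POST /ws/v1/cluster/apps/new-application: ask YARN for an application id."""
    return urllib.request.Request(
        rm_url.rstrip("/") + "/ws/v1/cluster/apps/new-application",
        data=b"",
        method="POST",
    )


def build_submission(app_id, name, am_command, memory_mb=1024, vcores=1):
    """Build a minimal application submission body.

    `am_command` is the shell command that starts the ApplicationMaster;
    it is passed in rather than guessed, because it depends on the cluster.
    """
    return {
        "application-id": app_id,
        "application-name": name,
        "application-type": "SPARK",
        "am-container-spec": {"commands": {"command": am_command}},
        "resource": {"memory": memory_mb, "vCores": vcores},
    }


def submit_request(rm_url, submission):
    """POST /ws/v1/cluster/apps: submit the application description."""
    return urllib.request.Request(
        rm_url.rstrip("/") + "/ws/v1/cluster/apps",
        data=json.dumps(submission).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    rm = "http://rm.example:8088"  # hypothetical ResourceManager address
    body = build_submission(
        "application_0000000000000_0001",  # normally taken from the first call
        "my-spark-job",
        "echo placeholder-for-the-real-AM-command",
    )
    req = submit_request(rm, body)
    print(req.get_method(), req.full_url)
    print(json.dumps(body, indent=2))
```

Writing the output of build_submission to a file gives you something shaped like the spark-yarn.json that the curl command posts.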
Just like all YARN applications, Spark implements a Client and an ApplicationMaster when deploying on YARN. If you look at the implementation in the Spark repository, you'll get a clue as to how to create your own Client/ApplicationMaster:
https://github.com/apache/spark/tree/master/yarn/src/main/scala/org/apache/spark/deploy/yarn. But out of the box it does not seem possible.
I have not seen the latest package, but a few months back such a thing was not possible "out of the box" (this info comes straight from Cloudera support). I know it's not what you were hoping for, but that's what I know.
Thanks for the question.
As suggested above, the AM is a good route for writing and submitting one's application without invoking spark-submit.
The community has built around the spark-submit command for YARN, adding flags that make it easier to supply the jars, configs, etc. that the application needs to execute successfully. See "Submitting Applications".
An alternative you could try: run the Spark job as an action in an Oozie workflow. See "Oozie Spark Extension".
Depending on what you wish to achieve, either route looks good.
Hope it helps.