Apache Spark and Livy cluster

Scenario:
I have a Spark cluster and I also want to use Livy. I am new to Livy.
Problem:
I built my Spark cluster using Docker Swarm, and I will also create a service for Livy.
Can Livy communicate with an external Spark master and send jobs to it? If so, what configuration is needed, or does Livy have to be installed on the Spark master node?

I know this is a little late, but I hope it helps you.
You can run Livy as a Docker service and send jobs through it, and you can also submit jobs through the Livy REST API.
The Livy server can live outside the Spark cluster; you only need to give Livy a configuration file that points to your Spark cluster.
It looks like you are running Spark standalone. The easiest way to get Livy working is to have it live on the Spark master node. If you already have YARN on your cluster machines, you can install Livy on any node and run Spark applications in yarn-cluster or yarn-client mode.
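As an illustration of that configuration file, a minimal livy.conf for the standalone case might look like the sketch below; the host name is a placeholder, and livy.spark.master, livy.spark.deploy-mode and livy.server.port are standard keys from livy.conf.

# sketch of livy.conf, assuming a standalone master reachable at spark-master:7077
livy.spark.master = spark://spark-master:7077
livy.spark.deploy-mode = client
# port the Livy REST API listens on (8998 is the default)
livy.server.port = 8998

With that in place, jobs can be sent to the cluster from any machine that can reach the Livy REST port.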

Related

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should just run spark-submit.
And what is the best way to track a Spark job from Airflow once I have submitted it?
My assumption is you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the Airflow web UI (View Logs).
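As a rough sketch of that polling loop (the Livy host and batch id are placeholders, not from the original post):

import time
import requests

LIVY_URL = "http://livy-host:8998"   # assumed Livy server address
batch_id = 42                        # id returned by the earlier POST /batches call

terminal_states = {"success", "dead", "killed"}
offset = 0
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    log = requests.get(f"{LIVY_URL}/batches/{batch_id}/log",
                       params={"from": offset, "size": 100}).json()
    for line in log.get("log", []):
        print(line)                  # printed lines show up under View Logs in the web UI
    offset += len(log.get("log", []))
    if state in terminal_states:
        break
    time.sleep(30)
print(f"Batch {batch_id} finished in state: {state}")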
Other considerations
Livy doesn't support reusing a SparkSession across POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests (see the sketch below)
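A rough sketch of that pattern (the host and the submitted code are illustrative):

import time
import requests

LIVY_URL = "http://livy-host:8998"   # assumed Livy server address

# Create a long-lived PySpark session; the SparkSession it holds is reused
# by every statement submitted to it.
session_id = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait (simplistically, with no timeout) until the session can accept statements.
while requests.get(f"{LIVY_URL}/sessions/{session_id}/state").json()["state"] != "idle":
    time.sleep(5)

# Each statement runs inside the same SparkSession, so cached data and temp
# views survive between requests.
requests.post(f"{LIVY_URL}/sessions/{session_id}/statements",
              json={"code": "df = spark.range(100)\nprint(df.count())"})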
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Remote spark-submit to YARN running on EMR

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice? It looks like it's a little complicated to deploy Spark remotely from another cluster, and that will create some configuration-file duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could also ship an hdfs-site.xml and hive-site.xml from Airflow (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not every NodeManager may have a Hive client installed on it).
I prefer submitting Spark jobs using the SSHOperator and running the spark-submit command, which saves you from copying yarn-site.xml around. Also, I would not create a separate cluster for Airflow if the only task I perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
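A minimal sketch of that setup (the connection id, class and jar path are placeholders; the import path assumes the Airflow 2 SSH provider, older 1.x installs use airflow.contrib instead):

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG("spark_ssh_submit",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:

    # Run spark-submit on the edge node that already has yarn-site.xml configured,
    # so nothing has to be copied onto the Airflow machine.
    submit_job = SSHOperator(
        task_id="spark_submit_via_ssh",
        ssh_conn_id="spark_edge_node",                       # assumed Airflow connection
        command=("spark-submit --master yarn --deploy-mode client "
                 "--class com.example.MyJob /path/to/app.jar"),
    )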
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for a hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.
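For instance, a bare-bones hand-rolled operator around the Livy /batches endpoint might look like this sketch (the URL, fields and names are illustrative, and there is no error handling or state polling):

import requests
from airflow.models import BaseOperator


class LivyBatchSubmitOperator(BaseOperator):
    """Submit a jar to a remote Spark cluster through Livy's batches API."""

    def __init__(self, livy_url, jar_path, main_class, **kwargs):
        super().__init__(**kwargs)
        self.livy_url = livy_url
        self.jar_path = jar_path
        self.main_class = main_class

    def execute(self, context):
        resp = requests.post(f"{self.livy_url}/batches",
                             json={"file": self.jar_path, "className": self.main_class})
        resp.raise_for_status()
        batch_id = resp.json()["id"]
        self.log.info("Submitted Livy batch %s", batch_id)
        return batch_id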

Is there any way to submit spark job using API

I am able to submit a Spark job on a Linux server using the console. But is there an API or framework that would let me submit a Spark job to the Linux server programmatically?
You can point your application at the standalone master on port 7077 and submit Spark jobs to your cluster without using spark-submit:
import org.apache.spark.sql.SparkSession

// connect directly to the standalone master instead of going through spark-submit
val spark = SparkSession
  .builder()
  .master("spark://master-machine:7077")
  .getOrCreate()
You can also look into the Livy server. It is GA in the Hortonworks and Cloudera distributions of Apache Hadoop, and we have had good success with it. Its documentation is good enough to get started with. Spark jobs start instantaneously when submitted via Livy, since it has multiple SparkContexts running inside it.
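For example, a minimal programmatic submission through Livy's REST API could look like the following sketch (the server address, jar location and class name are placeholders):

import requests

payload = {
    "file": "hdfs:///jobs/my-app.jar",     # jar must be reachable from the cluster
    "className": "com.example.MyJob",
    "args": ["2021-01-01"],
    "executorMemory": "2g",
    "executorCores": 2,
}
resp = requests.post("http://livy-host:8998/batches", json=payload)
print(resp.json())                         # includes the batch id for later status checks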

Mesos Configuration with existing Apache Spark standalone cluster

I am a beginner with Apache Spark!
I have set up a Spark standalone cluster using 4 PCs.
I want to use Mesos with my existing Spark standalone cluster, but I read that I need to install Mesos first and then configure Spark.
I have also read the Spark documentation on running on Mesos, but it was not helpful for me.
So how do I configure Mesos with an existing Spark standalone cluster?
Mesos is an alternative cluster manager to the standalone Spark manager. You don't use it with the standalone manager; you use it instead of it.
To create a Mesos cluster, follow https://mesos.apache.org/gettingstarted/
Make sure the Mesos native library is distributed and available on the machine you use to submit jobs.
For cluster mode, start the Mesos dispatcher (sbin/start-mesos-dispatcher.sh).
Submit the application using the Mesos master URI (client mode) or the dispatcher URI (cluster mode); a rough client-mode sketch follows.
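A rough client-mode sketch in PySpark (the master host is a placeholder, and MESOS_NATIVE_JAVA_LIBRARY is assumed to point at the native library on the submitting machine):

from pyspark.sql import SparkSession

# Connect to the Mesos master instead of a spark:// standalone master.
spark = (SparkSession.builder
         .master("mesos://mesos-master:5050")
         .appName("mesos-client-example")
         .getOrCreate())

print(spark.range(1000).count())
spark.stop()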

How to submit spark Job from locally and connect to Cassandra cluster

Can anyone please let me know how to submit a Spark job from my local machine and connect to a Cassandra cluster?
Currently I log in to a Cassandra node through PuTTY and submit the Spark job with the dse spark-submit command below.
Command:
dse spark-submit --class ***** --total-executor-cores 6 --executor-memory 2G **/**/**.jar --config-file build/job.conf --args
With the above command, my Spark job is able to connect to the cluster and execute, but I sometimes face issues.
So I want to submit the Spark job from my local machine. Can anyone please guide me on how to do this?
There are several things you could mean by "run my job locally"
Here are a few of my interpretations
Run the Spark Driver on a Local Machine but access a remote Cluster's resources
I would not recommend this for a few reasons, the biggest being that all of your job management will still be handled between your remote machine and the executors in the cluster. This would be the equivalent of having a Hadoop JobTracker running in a different cluster than the rest of the Hadoop distribution.
To accomplish this, though, you need to run spark-submit with a specific master URI. Additionally, you would need to specify a Cassandra node via spark.cassandra.connection.host:
dse spark-submit --master spark://sparkmasterip:7077 --conf spark.cassandra.connection.host=aCassandraNode --flags jar
It is important that you keep the jar LAST. All arguments after the jar are interpreted as arguments for the application and not spark-submit parameters.
Run Spark Submit on a local Machine but have the Driver run in the Cluster (Cluster Mode)
Cluster mode means your local machine sends the jar and environment string over to the Spark master. The Spark master then chooses a worker to actually run the driver, and the driver is started as a separate JVM by that worker. This is triggered using the --deploy-mode cluster flag, in addition to specifying the master and the Cassandra connection host:
dse spark-submit --master spark://sparkmasterip:7077 --deploy-mode cluster --conf spark.cassandra.connection.host=aCassandraNode --flags jar
Run the Spark Driver in Local Mode
Finally, there exists a local mode for Spark which starts the entire Spark framework in a single JVM. This is mainly used for testing. Local mode is activated by passing --master local.
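A small PySpark sketch of that last option, assuming the spark-cassandra-connector package is on the classpath (the host, keyspace and table names are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")                                      # whole framework in one local JVM
         .config("spark.cassandra.connection.host", "10.0.0.5")   # any reachable Cassandra node
         .appName("local-cassandra-test")
         .getOrCreate())

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())
df.show(5)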
For more information check out the Spark Documentation on submitting applications
http://spark.apache.org/docs/latest/submitting-applications.html
