How to set up Apache Livy and Spark in Docker?

I am trying to set up Livy and Spark on Docker.
Right now I have a local Spark setup; spark-shell runs fine from my Windows CMD. I also created a Livy container on Docker (from the tobilg/livy image), since Livy doesn't run on Windows directly. The Livy container itself works, but the two don't seem to connect with each other, so it looks like I have to run Spark in a container as well.
Please help me out here.
Thanks in advance.
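One way to approach this (a minimal sketch, not from the thread; the image names, config path, and ports are assumptions) is to put both containers on the same Docker network and point Livy at the Spark master through livy.conf:
docker network create spark-net
# Start a standalone Spark master (bitnami/spark shown as an example image).
docker run -d --name spark-master --network spark-net -p 8080:8080 -e SPARK_MODE=master bitnami/spark
# Point Livy at that master. livy.spark.master is a standard Livy setting,
# but the config path inside the tobilg/livy image may differ.
echo "livy.spark.master = spark://spark-master:7077" > livy.conf
docker run -d --name livy --network spark-net -p 8998:8998 -v "$PWD/livy.conf:/opt/livy/conf/livy.conf" tobilg/livy
With both containers on spark-net, Livy can resolve the master by its container name, and sessions created through Livy's REST API on port 8998 are submitted to that master.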

Related

Installing Spark/Zeppelin on Standalone node

I have a Cloudera cluster which is managed by an admin team; however, there is no Zeppelin installed in the cluster.
I would like to install Zeppelin on a separate node and connect it to the Cloudera cluster.
Is it feasible to install Zeppelin on a node which is not part of the cluster and submit Spark jobs to the cluster from there?
Any reference is really appreciated.
Thanks
Zeppelin is just another Spark client.
For example, on the machine that you want to run Zeppelin on, first make sure that spark-shell and spark-submit work as expected; then the Zeppelin configuration becomes much easier.
An easy way to manage that would be to have the admins use Cloudera Manager to install the Spark (and Hive and Hadoop) client libraries onto this standalone node. They could then give you SSH access, or you could tell them how to install Zeppelin itself.
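As a sketch of that verification step (the CDH parcel paths below are typical but are assumptions):
# Confirm the Spark client works on the standalone node before touching Zeppelin.
spark-shell --master yarn
spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 10
# Then point Zeppelin at the same client install, e.g. in conf/zeppelin-env.sh:
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark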

Apache Spark and Livy cluster

Scenario:
I have a Spark cluster and I also want to use Livy.
I am new to Livy.
Problem:
I built my Spark cluster using Docker Swarm, and I will also create a service for Livy.
Can Livy communicate with an external Spark master and send jobs to it? If so, what configuration is needed? Or does Livy have to be installed on the Spark master node?
I think this is a little late, but I hope it helps you.
Sorry for my English, I am Mexican. You can use Docker to send jobs via Livy, and you can also submit jobs through the Livy REST API.
The Livy server can live outside of the Spark cluster; you only need to give Livy a conf file that points to your Spark cluster.
It looks like you are running Spark standalone. The easiest way to configure Livy is to have it live on the Spark master node; but if you already have YARN on your cluster machines, you can install Livy on any node and run Spark applications in yarn-cluster or yarn-client mode.
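A sketch of that conf-file approach (hostnames and paths are illustrative); livy.spark.master and livy.spark.deploy-mode are standard settings in Livy's conf/livy.conf, and batch jobs are then submitted over the REST API:
# In $LIVY_HOME/conf/livy.conf on the Livy host:
# livy.spark.master = spark://spark-master.example.com:7077   (standalone)
# livy.spark.master = yarn                                    (with YARN)
# livy.spark.deploy-mode = cluster
# A batch job is then submitted through Livy's REST API:
curl -X POST -H "Content-Type: application/json" -d '{"file": "hdfs:///jobs/my-job.jar", "className": "com.example.MyJob"}' http://livy-host:8998/batches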

How to connect to Spark (CDH 5.8 Docker VMs at remote)? Do I need to map port 7077 on the container?

Currently, I can access HDFS from inside my application, but instead of running my local Spark, I'd also like to use Cloudera's Spark, as it is enabled in Cloudera Manager.
Right now I have the HDFS defined in core-site.xml, and I run my app with --master yarn, so I don't need to set the machine address for my HDFS files. Configured this way, my Spark job runs locally and not in the "cluster", and I don't want that for now. When I try to set --master to [namenode]:[port] it does not connect. I wonder whether I'm pointing at the correct port, whether I have to map this port on the Docker container, or whether I'm missing something in the YARN setup.
Additionally, I've been testing the SnappyData (Inc) solution as a Spark SQL in-memory database. My goal is to run the Snappy JVMs locally while redirecting Spark jobs to the VM cluster. The whole idea here is to test performance against a Hadoop implementation. This setup is not a final product (if Snappy is local and Spark is "really" remote, I believe it won't be efficient; in that scenario I would move the Snappy JVMs into the same cluster).
Thanks in advance!
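For reference, with YARN the master is always --master yarn; Spark locates the ResourceManager from the Hadoop client configuration rather than from a spark://[namenode]:[port] URL. A minimal sketch (the config path is an assumption):
# Point Spark at the cluster's client configs instead of a host:port URL.
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit --master yarn --deploy-mode cluster --class com.example.MyJob my-job.jar
With --deploy-mode cluster the driver runs inside the cluster, so no container port needs to be mapped back to the submitting machine for the driver.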

How to connect to Spark EMR from the locally running Spark Shell

I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or on the EMR cluster.
Assuming I run spark-shell on my local computer, how can I tell it to connect to the Spark EMR cluster? What would be the exact configuration options and/or commands to run?
It looks like others have also failed at this and ended up running the Spark driver on EMR, but then making use of e.g. Zeppelin or Jupyter running on EMR.
Setting up our own machines as Spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this was impossible to do, and we gave up after trying many configuration changes. The driver would start up and then keep waiting unsuccessfully, trying to connect to the slaves.
Most of our Spark development is on pyspark using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn’t risk losing our work if the cluster were to go down. So, we created an EBS volume and attached it to the master node and placed all of our work on this volume. [...]
source
Note: If you go down this route, I would consider using S3 for storing notebooks; then you don't have to manage EBS volumes.
One way of doing this is to add your Spark job as an EMR step on your EMR cluster. For this, you need the AWS CLI installed on your local computer
(see here for an installation guide), and your jar file on S3.
Once you have the AWS CLI, assuming the Spark class to run is com.company.my.MySparkJob and your jar file is located on S3 at s3://hadi/my-project-0.1.jar, you can run the following command from your terminal:
aws emr add-steps --cluster-id j-************* --steps Type=Spark,Name=My_Spark_Job,Args=[--class,com.company.my.MySparkJob,s3://hadi/my-project-0.1.jar],ActionOnFailure=CONTINUE
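The add-steps call prints the ID of the new step; its progress can then be polled with the AWS CLI (the step ID is elided here like the cluster ID above):
aws emr describe-step --cluster-id j-************* --step-id s-*************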

Spark Mesos Dispatcher

My team is deploying a new Big Data architecture on the Amazon cloud. We have Mesos up and running Spark jobs.
We are submitting Spark jobs (i.e. jars) from a bastion host inside the same cluster. Doing so, however, makes the bastion host the driver program; this is called client mode (if I understood correctly).
We would like to try cluster mode, but we don't understand where to start the dispatcher process.
The documentation says to start it in the cluster, but I'm confused since our masters don't have Spark installed and we use ZooKeeper for master election. Starting it on a slave node is not a viable option, since a slave can fail and we don't want to expose a slave IP or public DNS name to the bastion host.
Is it correct to start the dispatcher on the bastion host?
Thank you very much
The documentation is not very detailed.
However, we are quite happy with what we discovered:
according to the documentation, cluster mode is not supported for Mesos clusters (nor for Python applications).
However, we started the dispatcher using --master mesos://zk://...
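(The dispatcher ships with Spark as sbin/start-mesos-dispatcher.sh; a sketch, with illustrative ZooKeeper hosts:)
./sbin/start-mesos-dispatcher.sh --master mesos://zk://zk1:2181,zk2:2181/mesos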
For submitting applications, you need the following:
spark-submit --deploy-mode cluster <other options> --master mesos://<dispatcher_ip>:7077 <ClassName> <jar>
If you run this command from a bastion machine, it won't work, because the Mesos master will look for the submittable jar at the same path as on the bastion. We ended up exposing the file as a downloadable URL.
Hope this helps
I haven't used cluster mode in Mesos, and the cluster mode description is not very detailed. There isn't even a --help option on the script, like there should be, IMHO. However, if you don't pass the --master argument, it errors out with a help message, and it turns out there is a --zk option for specifying the ZooKeeper URL.
What might work is to launch this script on the bastion itself with the appropriate --master and --zk options. Would that work for you?
You could use a Docker image with Spark and your application.jar instead of uploading the jar to S3. I haven't tried it yet, but I think it should work. The relevant environment variable is SPARK_DIST_CLASSPATH in spark-env.sh. I use a Spark distribution compiled without Hadoop, together with Apache Hadoop 2.7.1:
# Prepend the Hadoop classpath and the application jar to Spark's classpath.
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath):/opt/hadoop/share/hadoop/tools/lib/*:/opt/application.jar
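If you bake the jar into an image, the Spark-on-Mesos setting that tells executors which Docker image to run is spark.mesos.executor.docker.image; a sketch, with an illustrative image name and jar path:
spark-submit --deploy-mode cluster --master mesos://<dispatcher_ip>:7077 --conf spark.mesos.executor.docker.image=myrepo/spark-app:latest --class com.example.MyJob /opt/application.jar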
