Running Spark on a cluster of machines

I want to run Spark on four computers. I have read the theory of running Spark on a cluster using Mesos, YARN, and SSH, but I am looking for a practical method and tutorial. The operating systems of these machines are macOS and Ubuntu, and I've written my code in IntelliJ IDEA using Scala.
Can anybody help me?
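For a small Mac/Ubuntu cluster like this, the simplest practical route is usually Spark Standalone (the same option recommended further down): unpack the same Spark release on every machine, start a master on one of them and workers on the others, then submit your Scala jar to the master. A minimal sketch, where the hostnames, paths, and application class are placeholders (and start-worker.sh is called start-slave.sh on older Spark releases):

# on the machine chosen as master
$SPARK_HOME/sbin/start-master.sh        # master URL: spark://<master-host>:7077, web UI on port 8080

# on each of the other three machines
$SPARK_HOME/sbin/start-worker.sh spark://<master-host>:7077

# package the Scala project (e.g. with sbt package) and submit it to the cluster
$SPARK_HOME/bin/spark-submit \
  --class com.example.MyApp \
  --master spark://<master-host>:7077 \
  target/scala-2.12/myapp_2.12-0.1.jar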

Related

Importing vs installing Spark

I am new to the Spark world and, to some extent, to coding.
This question might seem too basic, but please clear up my confusion.
I know that we have to import Spark libraries to write a Spark application. I use IntelliJ and sbt.
After writing the application, I can run it and see the output with "Run".
My question is: why should I install Spark separately on my (local) machine if I can just import it as a library and run my application?
Also, why does it need to be installed on the cluster, since we can just submit the jar file and a JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually, you don't really need to install Spark on your machine if you are, for example, running it from Scala/Java: you can just import spark-core (or any other dependencies) into your project, and once you start your Spark job from your main class it will create a standalone Spark runner inside your JVM and run your job on it (local[*]).
There are still several reasons for having Spark installed on your local machine.
One of them is running Spark jobs with PySpark, which requires the Spark/Python libraries and a runner (local[*] or a remote master).
Another reason is if you want to run your job on-premise.
It might be easier to create a cluster in your local data center, appoint your machine as the master and connect the other machines to it as workers. (This solution might be a bit naive, but you asked for basics, so it might spark your curiosity to read more about the infrastructure design of a data processing system.)
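To make the first point concrete, here is a minimal Scala sketch that runs Spark purely from library dependencies, with no separate installation; the object name, app name, and the Spark version in the sbt dependency are placeholders:

// build.sbt (assumed version): libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
import org.apache.spark.sql.SparkSession

object LocalRunner {
  def main(args: Array[String]): Unit = {
    // local[*] starts an embedded Spark runner inside this JVM, using all available cores
    val spark = SparkSession.builder()
      .appName("local-demo")
      .master("local[*]")
      .getOrCreate()

    spark.range(0, 10).show()  // small sanity check
    spark.stop()
  }
}

Running this main class from IntelliJ or sbt is enough; nothing has to be installed outside the project's dependencies.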

Is Apache Spark recommended to run on Windows?

I have a requirement to run Spark on Windows in a production environment. I would like to know whether Apache Spark on Windows is recommended and, if not, the reason why.

Force H2O Sparkling Water cluster to start on a specific machine in YARN mode

Tools used:
Spark 2
Sparkling Water (H2O)
Zeppelin notebook
PySpark code
I'm starting H2O in INTERNAL mode from my Zeppelin notebook, since my environment is YARN. I'm using the basic command:
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
import h2o
My problem is that the Zeppelin server is installed on a weak machine, and when I run my code from Zeppelin the H2O cluster starts on that machine, using its IP automatically. The driver runs there, and I'm limited by the driver memory that H2O consumes. I have 4 strong worker-node machines with 100 GB of memory and many cores, and the cluster uses them while I run my models, but I would like the H2O cluster to start on one of these worker machines and run the driver there. I didn't find a way to force H2O to do that.
I wonder if there is a solution, or if I must install the Zeppelin server on a worker machine.
Help will be appreciated if a solution is possible.
Start your job in yarn-cluster mode. This will make the driver run as another YARN container.
Here is another Stack Overflow post describing the difference:
Spark yarn cluster vs client - how to choose which one to use?
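As a rough illustration, outside Zeppelin (where the deploy mode is normally set in the Spark interpreter configuration), an equivalent spark-submit in cluster mode looks something like this; the script name and driver memory are placeholders:

# yarn-cluster mode: the driver runs in a YARN container on one of the cluster nodes,
# not on the Zeppelin host, so its memory is no longer limited by that weak machine
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 16g \
  my_h2o_job.py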

Can driver process run outside of the Spark cluster?

I read an answer to "What conditions should cluster deploy mode be used instead of client?":
(In client mode) You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
Also, the Spark docs say:
In client mode, the driver is launched in the same process as the client that submits the application.
Does it mean that I can submit Spark tasks from any machine, as long as it is reachable from the master and has a Spark environment?
Or, in other words, can the driver process run outside of the Spark cluster?
Yes, the driver can run on your laptop. Keep in mind though:
The Spark driver will need the Hadoop configuration to be able to talk to YARN and HDFS. You could copy it from the cluster and point to it via HADOOP_CONF_DIR.
The Spark driver will listen on a lot of ports and expect the executors to be able to connect to it. It will advertise the hostname of your laptop. Make sure it can be resolved and all ports accessed from the cluster environment.
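A minimal sketch of what that looks like in practice; the paths, hostname, and port numbers are placeholders, and the relevant settings are spark.driver.host, spark.driver.port, and spark.blockManager.port:

# copy the cluster's Hadoop/YARN configuration to the laptop and point Spark at it
export HADOOP_CONF_DIR=/path/to/copied/hadoop-conf

# client mode: the driver runs here on the laptop, the executors run inside the cluster
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.host=my-laptop.example.com \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40010 \
  --class com.example.MyApp \
  myapp.jar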
Yes, I'm running spark-submit jobs over the LAN using the option --deploy-mode cluster. Currently running into this issue, however: the server response (a JSON object) isn't very descriptive.

How to add slaves to local mode? How to set up a Spark cluster on Windows 7?

I am able to run Apache Spark on Windows with spark-shell --master local[2]. How can we add slaves to the master node?
I think YARN and Mesos are not available on Windows. What are the steps to set up a Spark cluster on Windows 7?
Switching to a Unix-based system is not an option for us as of now.
Finally found the relevant link to set up a Spark cluster on Windows:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-td12669.html
tl;dr You cannot add slaves to local mode.
local mode is the only non-cluster mode: all Spark services run within a single JVM. See Master URLs for the other options.
If you want a clustered deployment environment you should use Spark Standalone, Hadoop YARN, or Apache Mesos, as described in Cluster Mode Overview. I highly recommend using Spark Standalone first before moving on to the more advanced cluster managers.
I'm on Mac OS, so I can't be sure the cluster managers work reliably on Windows 7, but I did see Spark Standalone working on Windows. You should use spark-class to start the Master and slaves, as the startup scripts are written for Unix OSes.
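For reference, a sketch of what that looks like on Windows, run from the Spark installation directory on each machine; the master hostname is a placeholder:

REM on the machine acting as master (prints a spark://<host>:7077 URL; web UI on port 8080)
bin\spark-class org.apache.spark.deploy.master.Master

REM on each machine acting as a worker, pointing at the master URL
bin\spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077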

Resources