How to add slaves to local mode? How to set up a Spark cluster on Windows 7? - apache-spark

I am able to run Apache Spark on Windows with spark-shell --master local[2]. How can we add slaves to the master node?
I think YARN and Mesos are not available on Windows. What are the steps to set up a Spark cluster on Windows 7?
Switching to a Unix-based system is not an option for us as of now.

Finally found the relevant link for setting up a Spark cluster on Windows:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-td12669.html

tl;dr You cannot add slaves to local mode.
local mode is the only non-cluster mode: all Spark services run within a single JVM. Read Master URLs for the other options.
If you want a clustered deployment environment, you should use Spark Standalone, Hadoop YARN or Apache Mesos, as described in Cluster Mode Overview. I highly recommend starting with Spark Standalone before moving on to the more advanced cluster managers.
I'm on Mac OS, so I can't be sure the cluster managers work reliably on Windows 7, but I did see Spark Standalone working on Windows. You should use spark-class to start the Master and the slaves, as the startup scripts are for Unix OSes.
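For reference, a minimal sketch of what that looks like (the host name is a placeholder, and 7077 is the standalone Master's default port, so adjust it if you changed it). Start the Master first:
bin\spark-class org.apache.spark.deploy.master.Master
Then start a slave (worker) pointing at it:
bin\spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077
Once both are up, connect a shell to the cluster with spark-shell --master spark://<master-host>:7077.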

Related

How are spark jobs submitted in cluster mode?

I know there is information worth 10 Google pages on this, but all of them tell me to just put --master yarn in the spark-submit command. But in cluster mode, how can my local laptop even know what that means? Let's say I have my laptop and a running Dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is the same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to connect to it directly. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing, because in either case the Spark driver runs on the cluster (on the master or on a worker, respectively), not on your local laptop.
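As a rough sketch of that with the gcloud CLI (the cluster name, region, class and jar path below are placeholders, not values from the question):
gcloud dataproc jobs submit spark \
    --cluster=my-cluster --region=us-central1 \
    --class=com.example.MyApp \
    --jars=gs://my-bucket/my-app.jar \
    --properties=spark.submit.deployMode=cluster \
    -- arg1 arg2
gcloud hands the submission to the Dataproc service, so your laptop never needs direct network access to the YARN ResourceManager.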

How to set up YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7 with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open-source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks.
YARN is part of Hadoop, so a Hadoop installation is necessary to run Spark on YARN.
Check out the page on Hadoop Cluster Setup.
Then you can use this documentation to learn about Spark on YARN.
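Once HDFS and YARN are up, running a job on the cluster mostly comes down to pointing spark-submit at YARN. A hedged sketch (the class name, jar path and resource sizes are example values, not something the documentation prescribes):
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit --master yarn --deploy-mode cluster \
    --class com.example.MyApp \
    --num-executors 2 --executor-memory 2g \
    /path/to/my-app.jar
With --deploy-mode cluster the driver runs inside a YARN container on one of the nodes rather than on the machine you submit from.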
Another way to quickly learn about Hadoop, YARN and Spark is to use the Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
We are currently using a similar setup in AWS. AWS EMR is costly, hence
we set up our own cluster on EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions, but we chose HDP.
The setup included the following.
Master Setup
Spark (Along with History server)
Yarn Resource Manager
HDFS Name Node
Livy server
Slave Setup
Yarn Node Manager
HDFS Data Node
More information on manual installation can be found in the HDP documentation.
You can see part of that automation here.
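For the Livy part, once Spark on YARN works the integration is mostly configuration. A minimal livy.conf sketch (cluster deploy mode is an example choice, not a requirement):
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
Then start the Livy server on the master node and submit sessions/batches to its REST endpoint (port 8998 by default).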

Spark Mesos Dispatcher

My team is deploying a new Big Data architecture on Amazon Cloud. We have Mesos up and running Spark jobs.
We are submitting Spark jobs (i.e. jars) from a bastion host inside the same cluster. When we do so, however, the bastion host is the driver program; this is called client mode (if I understood correctly).
We would like to try the cluster mode, but we don't understand where to start the dispatcher process.
The documentation says to start it in the cluster, but I'm confused since our masters don't have Spark installed and we use ZooKeeper for master election. Starting it on a slave node is not a viable option, since a slave can fail and we don't want to expose a slave IP or public DNS to the bastion host.
Is it correct to start the dispatcher on the bastion host?
Thank you very much
The documentation is not very detailed.
However, we are quite happy with what we discovered:
according to the documentation, cluster mode is not supported for Mesos clusters (nor for Python applications),
yet we started the dispatcher using --master mesos://zk://...
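Concretely, something along these lines, using the start script shipped in Spark's sbin directory (the ZooKeeper hosts are placeholders):
./sbin/start-mesos-dispatcher.sh --master mesos://zk://zk1:2181,zk2:2181/mesos
By default the dispatcher listens for cluster-mode submissions on port 7077, which matches the master URL used in the spark-submit command below.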
For submitting applications, you need the following:
spark-submit --deploy-mode cluster <other options> --master mesos://<dispatcher_ip>:7077 --class <ClassName> <jar>
If you run this command from a bastion machine, it won't work, because the Mesos master will look for the submitted jar at the same path as on the bastion. We ended up exposing the file as a downloadable URL.
Hope this helps
I haven't used cluster mode in Mesos, and the cluster mode description is not very detailed. There isn't even a --help option on the script, as there should be, IMHO. However, if you don't pass the --master argument, it errors out with a help message, and it turns out there is a --zk option for specifying the ZooKeeper URL.
What might work is to launch this script on the bastion itself with the appropriate --master and --zk options. Would that work for you?
You could use a Docker image with Spark and your application.jar instead of uploading the jar to S3. I haven't tried it yet, but I think it should work. The environment variable is SPARK_DIST_CLASSPATH in spark-env.sh. I use a Spark distribution compiled without Hadoop together with Apache Hadoop 2.7.1:
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath):/opt/hadoop/share/hadoop/tools/lib/*:/opt/application.jar
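A minimal Dockerfile sketch of that idea (the base image name is a made-up placeholder, not something from this answer):
# assumes a base image that already bundles the "without Hadoop" Spark build and Hadoop 2.7.1
FROM my-registry/spark-no-hadoop:2.7.1
# bake the application jar into the image so every node sees it at the same local path
COPY application.jar /opt/application.jar
The spark-env.sh line above then puts /opt/application.jar and the Hadoop jars on SPARK_DIST_CLASSPATH inside the container.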

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using the Cloudera Quickstart VM 5.3.0 (running in VirtualBox 4.3 on Windows 7) and I wanted to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services; Spark is there, but in standalone mode. So I click on "Add a new service" and select "Spark". Then I have to select the set of dependencies for this service; I have no choice, I must pick HDFS/YARN/ZooKeeper.
Next step, I have to choose a History Server and a Gateway; I run the VM in local mode, so I can only choose localhost.
I click on "Continue" and this error occurs (+ 69 trace lines):
A server error has occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know if an internet connection is needed, but I should point out that I can't connect to the internet from the VM. (EDIT: even with an internet connection I get the same error.)
I have no idea how to add this service. I tried with and without a gateway and with many network options, but it never worked. I checked the known issues; nothing...
Does anyone know how I can solve this error or work around it? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different modes: (1) local, (2) Spark's own standalone manager, and (3) other cluster resource managers like Hadoop YARN, Apache Mesos, and Amazon EC2.
Spark works out of the box with CDH 5 for (1) and (2). You can start a local interactive Spark session in Scala using the spark-shell command, or pyspark for Python, without passing any arguments. I find the interactive Scala and Python interpreters help with learning to program with Resilient Distributed Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I don't mean to take credit for the bug you discovered, but I have posted it to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, first check whether all of the standalone Spark daemons are running with the following command (you can also check this in the Cloudera Manager (CM) UI):
[cloudera@quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the standalone Spark services so that we can try to submit the Spark job through YARN.
In order to run Spark inside a YARN container on the quick start cluster, we have to do the following:
Set HADOOP_CONF_DIR to the root of the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CDH 5. You can set this variable with export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify that you are using Hadoop YARN (a concrete example follows this list).
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare it with the Spark History Server. Hue should show the job placed in a generic YARN container, and the Spark History Server should not have a record of the submitted job.
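As a concrete smoke test, the bundled SparkPi example can be submitted this way; the examples jar location varies between CDH versions, so treat the path below as an assumption:
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit --class org.apache.spark.examples.SparkPi --master yarn /usr/lib/spark/lib/spark-examples.jar 10
If everything is wired up correctly, the application appears in the YARN ResourceManager UI (port 8088) rather than in the standalone Spark Master UI.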
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn

How do I start spark shell or submit a spark job from a machine that is not part of the cluster?

I have a cluster of 4 DSE 4.6 nodes with Cassandra/Spark in standalone mode. Submitting a job to Spark or opening a Spark shell from one of the cluster nodes works fine.
What I want to do now is open a Spark shell from a machine that is not part of the cluster, so I installed DSE on a new machine, but when I try to run
$ SPARK_MASTER=spark://MASTER_NODE dse spark
I get a bunch of connection errors that look like the spark shell is trying to connect to localhost.
Is there an inherent limitation in Spark that limits running the shell or submitting the jobs only from a node that is a member of the cluster?
Which version of Spark are you on?
Try changing SPARK_MASTER to just MASTER
I usually run
MASTER=spark://servername:7077 ./bin/spark-shell
and everything connects fine.
OK, I've found my issue (two actually):
I had a different JDK installed on the "client" machine;
the correct way of specifying the master is dse spark --master spark://MASTER_ADDRESS:7077.
Now everything works fine.
