How to set up Spark cluster on laptop? - apache-spark

I am completely new to Spark and am trying to run a tutorial example, which counts the number of lines containing 'a' and 'b' in a text file on the local file system.
I am running it with SparkContext with master = "local", i.e. Spark is running in the same JVM. Now I would like to try it in "cluster mode".
So I would like to run a Spark cluster consisting of a cluster manager and two worker nodes locally on my Mac laptop. What is the easiest way to do that?

Quoting the official documentation about Spark Standalone Mode:
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
In other words, you should start the standalone Master first (using ./sbin/start-master.sh) followed by starting one or more standalone Workers (using ./sbin/start-slave.sh).
Quoting the docs again:
Once you have started a worker, look at the master's web UI (http://localhost:8080 by default)
You're done. Congrats!
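For example, on a Mac laptop a minimal one-master, two-worker setup could look like the sketch below. The spark://localhost:7077 URL and the use of SPARK_WORKER_INSTANCES to get two workers on one machine are assumptions based on the standalone scripts' defaults; check the master's log or web UI for the real master URL.
# from the Spark installation directory: start the standalone Master
./sbin/start-master.sh
# start two Workers on the same machine, registering against the Master
# (SPARK_WORKER_INSTANCES is assumed to be honored by start-slave.sh)
SPARK_WORKER_INSTANCES=2 ./sbin/start-slave.sh spark://localhost:7077
# run the tutorial against the cluster instead of local mode
./bin/spark-shell --master spark://localhost:7077
The master's web UI at http://localhost:8080 should then list both workers as ALIVE and the shell as a running application.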

If you are looking to learn various ways to use Spark, I would suggest downloading the Cloudera QuickStart VM, which gives you a simple cluster setup.
All you need to do is download the QuickStart VM and play around with the settings accordingly.
The QuickStart VM can be downloaded from Cloudera's website.

Related

Can You Use a Script to Start Spark Cluster Nodes?

I'm running Hadoop and Spark on a four-node cluster in AWS EC2.
After doing a lot of web research, it seems the accepted way to start Spark on a cluster (once Hadoop is running) is to:
1) Log into the master node and run start-master.sh.
2) Log into each slave node and run start-slave.sh, passing it the DNS and port information for the master node.
My question is: if there are, let's say, 20 nodes, this is pretty tedious and time-consuming. Is there a way to start Spark from a single location the way Hadoop is started? When you run Hadoop from the master node, it starts all the slave nodes remotely. I'm looking for a solution like that, or for a Python script that can SSH into the nodes and start them.
You could use Apache Ambari to manage the whole cluster, which would SSH to all nodes for you.
Otherwise, you could use a system like Ansible to configure and start all the services.
It sounds like you're only using Spark Standalone, though, not YARN, because there is no start-slaves script for YARN.
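For Spark Standalone specifically, the bundled launch scripts already do the SSH fan-out for you. A rough sketch, assuming SPARK_HOME points at the install, passwordless SSH from the master to each worker, and placeholder hostnames (recent Spark versions call the file conf/workers instead of conf/slaves):
# on the master node: list every worker host, one per line
echo "worker1.example.com" >> $SPARK_HOME/conf/slaves
echo "worker2.example.com" >> $SPARK_HOME/conf/slaves
# start the master plus every listed worker over SSH in one step
$SPARK_HOME/sbin/start-all.sh
# or, with the master already running, start just the workers
$SPARK_HOME/sbin/start-slaves.sh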

How to run a Spark Standalone master on Kubernetes that will use the Kubernetes Cluster Manager to start workers

I have an application that currently uses Standalone Mode locally to use Spark functionality via the SparkContext. We are not using spark-submit to upload our jobs; we are running our application in a container on Kubernetes, so we would like to take advantage of the dynamic scheduling that Kubernetes provides to run the jobs.
We started out looking for a Helm chart to create a standalone cluster running on Kubernetes, similar to how you would have run a standalone cluster on machines (VMs or actual machines) a few years ago, and came across the following:
https://github.com/helm/charts/tree/master/stable/spark
Issues:
very old versions of Spark
not using the containers provided by Spark
this setup wastes a lot of resources if you need to keep large worker nodes reserved and running all the time regardless of your actual need
Next we started looking at the spark-operator approach here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Issues:
Doesn't support the way we interact with Spark; it takes the approach that all apps are standalone apps that are pushed to the cluster to run
No long-standing master that allows us to take advantage of cached resources in the cluster
Along this journey we discovered that Spark now supports a Kubernetes cluster manager (similar to the way it does with YARN and Mesos), so we are looking at whether this might be the best approach, but it still does not provide a standalone master that would allow for the in-memory caching. I have looked to see if there is a way that I could get org.apache.spark.deploy.master.Master to start and use
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager
So I guess what I'm trying to ask is: does anyone have experience trying to run a Standalone Master that uses a Kubernetes backend such as the KubernetesClusterManager, so that the worker nodes are dynamically created as pods running executors, while a permanent Standalone Master allows a SparkContext to connect to it remotely in client mode?
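For reference, the built-in Kubernetes cluster manager mentioned above is normally driven through spark-submit rather than a long-running standalone master. A hedged sketch of such a submission, where the API server address, image name, application class and jar path are all placeholders to substitute:
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name my-spark-app \
  --class com.example.MyApp \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/app/my-app.jar
# each executor runs as a pod that Kubernetes schedules and tears down when the
# application finishes; there is no persistent standalone Master in this model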

Spark program difference in local mode and cluster

If I write a Spark program and run it in standalone (local) mode, do I have to change my code when I want to deploy it on a cluster, or is no change needed? Is Spark programming independent of the cluster it runs on?
I don't think you need to make any changes. Your program should run the same way it runs in local mode.
Yes, Spark programs are independent of the cluster, unless you are using something specific to that cluster. Normally this is managed by YARN.
You just need to set the master option to yarn (or another resource manager) when you want to run it on a cluster.
If you want to run it locally, just use local[*], which uses a number of threads equal to your machine's cores.
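To illustrate, the only thing that typically changes between a local run and a cluster run is the --master argument handed to spark-submit; the class and jar names below are hypothetical placeholders:
# local mode: one JVM, with as many worker threads as the machine has cores
./bin/spark-submit --master "local[*]" --class com.example.MyApp my-app.jar
# cluster mode on YARN: same jar, same code, only the master changes
./bin/spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar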

How to add slaves to local mode? How to setup Spark cluster on Windows 7?

I am able to run Apache Spark on Windows with spark-shell --master local[2]. How can we add slaves to the master node?
I think YARN and Mesos are not available on Windows. What are the steps to set up a Spark cluster on Windows 7?
Switching to a Unix-based system is not an option available to us as of now.
Finally found the relevant link to set up a Spark cluster on Windows:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-td12669.html
tl;dr You cannot add slaves to local mode.
local mode is the only non-cluster mode, where all Spark services run within a single JVM. Read Master URLs for the other options.
If you want a clustered deployment environment you should use Spark Standalone, Hadoop YARN or Apache Mesos, as described in Cluster Mode Overview. I highly recommend using Spark Standalone first before going into more advanced cluster managers.
I'm on Mac OS, so I can't be sure the cluster managers work reliably on Windows 7, but I did see Spark Standalone working on Windows. You should use spark-class to start the Master and slaves, as the startup scripts are for Unix OSes.
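A rough sketch of that on Windows, assuming the Master ends up on localhost with the default port 7077 (the Master's console output shows the actual spark:// URL to use):
rem in one Command Prompt: start the standalone Master
bin\spark-class org.apache.spark.deploy.master.Master
rem in a second Command Prompt: start a Worker and point it at the Master
bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
rem connect the shell to the cluster instead of local[2]
bin\spark-shell --master spark://localhost:7077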

How can I verify that DSE Spark Shell is distributing across the cluster

Is it possible to verify from within the Spark shell whether the shell is connected to the cluster or is just running in local mode? I'm hoping to use that to investigate the following problem:
I've used DSE to set up a small 3-node Cassandra Analytics cluster. I can log onto any of the 3 servers, run dse spark, and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master webpage, I see that only one server is running the job.
I suspect that when I run dse spark it's running the shell in local mode. I looked up how to specify a master for the Spark 0.9.1 shell, and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide) it still runs only in local mode.
Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start/stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state, and no Running Applications. (You may have some DEAD workers or Completed Applications if previous Spark jobs had happened on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
The default configuration limits the Spark master to allowing each application only 1 core, so the work will only be given to a single node.
To change this, login to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
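Sketched as commands on the Spark master node (the sed expression assumes the option appears exactly as -Dspark.deploy.defaultCores=1 in that file, so verify the edit by hand):
# strip the 1-core-per-application limit from SPARK_MASTER_OPTS
sudo sed -i 's/ -Dspark.deploy.defaultCores=1//' /etc/dse/spark/spark-env.sh
# restart DSE on this node so the Spark master picks up the change
sudo service dse restart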
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.
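If you would rather cap applications than remove the limit entirely, one hedged option is to keep spark.deploy.defaultCores but raise it to a value that leaves headroom for other jobs, for example:
# in /etc/dse/spark/spark-env.sh on the master; the value 4 is illustrative
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -Dspark.deploy.defaultCores=4"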
