Can I have a master and worker on same node? - apache-spark

I have a 3 node spark standalone cluster and on the master node I also have a worker. When I submit a app to the cluster the two other workers start RUNNING, but the worker on the master node stay with status LOADING and eventually another worker is launched on one of the other machines.
Is having a worker and a master on the same node being the problem ?
If yes, is there a way to workout this problem or I should never have a worker and a master on the same node ?
P.S. The machines have 8 cores each and the workers are set to use 7 and not all of the RAM

Yes, you can, here is from Spark web doc:
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.

It is possible to have a machine hosting both Workers and a Master.
Is it possible that you misconfigured the spark-env.sh on that specific machine?

Related

What is the workflow that Hadoop uses to assign Master and Worker nodes through the Hadoop Configuration files?

I 'm a bit confused about how master and worker nodes are assigned to the respective connected machines (VMs) on the network in the cluster mode of Spark.
My question is when i launch a Spark job (using Spark-submit) what is the process workflow that is responsible of assigning a master node and a worker node.
Thanks !
The Driver and Executors requests containers from yarn to launch and do work. Yarn takes care of the allocations for you so you don't need to worry about where the master(driver)/slave(executor) are allocated.

How to run a Spark Standalone master on Kubernetes that will use the Kubernetes Cluser Manager to start workers

I have an application that currently uses Standalone Mode locally to use spark functionality via the SparkContext. We are not using spark-submit to upload our jobs, we are running our application in a container on kubernetes so we would like to take advantage of the dynamic scheduling that kubernetes provides to run the jobs.
We started out looking for a helm chart to create stand alone cluster running on kubernetes similar to how you would have run a standalone cluster on machines ( vms or actual machines ) a few years ago and came across the following
https://github.com/helm/charts/tree/master/stable/spark
Issues:
very old instances of spark
not using the containers provided by spark
this setup wastes a bunch of resources if you need to have large worker nodes reserved and running all the time regardless of your need
Next we started looking at the spark-operator approach here https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Issues:
Doesn't support the way we interact with spark, takes the approach that all the apps are standalone apps that are pushed to the cluster to run
No longstanding master that allows us to take advantage of cached resources in the cluster
Along this journey we discovered that spark now supports a kubernetes cluster manager ( similar to the way it does with yarn, mesos ) so we are looking that this might be the best approach, but this still does not provide a standalone master that would allow for the in memory caching. I have looked to see if there was a way that I could get the org.apache.spark.deploy.master.Master to start and use the
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager
So I guess what I'm trying to ask is does anyone have any experience in trying to run a Standalone Master, that would use the kubernetes backend such as "KubernetesClusterManager" in order to have the worker nodes dynamically created as pods and running executors while having a permanent Standalone Master that would allow a SparkContext to connect to it remotely in client mode.

Understanding Spark Submit Yarn Client vs Cluster mode [duplicate]

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?
We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:
A master machine, which also is where our application is run using spark-submit
2 identical worker machines
From the Spark Documentation, I read:
(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
However, I don't really understand the practical differences by reading this, and I don't get what are the advantages and disadvantages of the different deploy modes.
Additionally, when I start my application using start-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:
So I am not able to test both modes to see the practical differences. That being said, my questions are:
1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pro's and con's of using each one?
2) How to I choose which one my application is going to be running on, using spark-submit?
What are the practical differences between Spark Standalone client
deploy mode and cluster deploy mode? What are the pro's and con's of
using each one?
Let's try to look at the differences between client and cluster mode.
Client:
Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at it's disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of it's own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to reset it's execution.
Cluster:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader
Driver runs as a dedicated, standalone process inside the Worker.
Driver programs takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared place or in a folder for each of the workers.
Which one is better? Not sure, that's actually for you to experiment and decide. This is no better decision here, you gain things from the former and latter, it's up to you to see which one works better for your use-case.
How to I choose which one my application is going to be running on,
using spark-submit
The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:
/bin/spark-submit \
--class <main-class>
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Let's say you are going to perform a spark submit in EMR by doing SSH to the master node.
If you are providing the option --deploy-mode cluster, then following things will happen.
You won't be able to see the detailed logs in the terminal.
Since driver is not created in the Master itself, you won't be able to terminate the job from the terminal.
But in case of --deploy-mode client:
You will be able to see the detailed logs in the terminal.
You will be able to terminate the job from the terminal itself.
These are the basic things that I have noticed till now.
I'm also having the same scenario, here master node use a standalone ec2 cluster. In this setup client mode is appropriate. In this driver is launched directly with in the spark-submit process which acts as a client to the cluster. The Input & output of the application is attached to the console.Thus, this mode is especially suitable for applications that involve REPL.
Else if your application is submitted from a machine far from the worker machines then it is quite common to use cluster mode to minimize the network latency b/w driver & executor.

Installing Spark on four machines

I want o run my spark tasks using for Amazon EC2 instances which I know all their IPs.
I want to have one computer as master and the other three could run worker nodes..can someone help me how I should configure spark for this task..should be standalone? I know how to set master node using
setMaster("SPARK://masterIP:7070");
but how to define worker nodes and assign them to the above master node?
If you are configuring you spark cluster manually you can start a standalone master server by executing :
./sbin/start-master.sh
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
ADDING WORKERS :
Now you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh
Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
for more info you can check spark website starting-a-cluster-manually
EDIT
TO RUN WORKERS FROM MASTER
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. Note, the master machine accesses each of the worker machines via ssh (there should be password less ssh between master and worker machines).
After configuring conf/slaves file you should run two files :
sbin/start-master.sh - Starts a master instance on the machine the
script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine
specified in the conf/slaves file.
For more info check Cluster Launch Scripts

How can I verify that DSE Spark Shell is distributing across the cluster

Is it possible to verify from within the Spark shell what nodes if the shell is connected to the cluster or is running just in local mode? I'm hoping to use that to investigate the following problem:
I've used DSE to setup a small 3 node Cassandra Analytics cluster. I can log onto any of the 3 servers and run dse spark and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that the it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master webpage, I see that only one server is running the job.
I suspect that when I run dse spark it's running the shell in local mode. I looked up how to specific a master for the Spark 0.9.1 shell and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide) it still runs only in local mode.
Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start/stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state, and no Running Applications. (You may have some DEAD workers or Completed Applications if previous Spark jobs had happened on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
The default configuration is limiting the Spark master to only allowing each application to use 1 core, therefore the work will only be given to a single node.
To change this, login to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.

Resources