Installing Spark on four machines - apache-spark

I want o run my spark tasks using for Amazon EC2 instances which I know all their IPs.
I want to have one computer as master and the other three could run worker nodes..can someone help me how I should configure spark for this task..should be standalone? I know how to set master node using
setMaster("SPARK://masterIP:7070");
but how to define worker nodes and assign them to the above master node?

If you are configuring you spark cluster manually you can start a standalone master server by executing :
./sbin/start-master.sh
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
ADDING WORKERS :
Now you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh
Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
for more info you can check spark website starting-a-cluster-manually
EDIT
TO RUN WORKERS FROM MASTER
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. Note, the master machine accesses each of the worker machines via ssh (there should be password less ssh between master and worker machines).
After configuring conf/slaves file you should run two files :
sbin/start-master.sh - Starts a master instance on the machine the
script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine
specified in the conf/slaves file.
For more info check Cluster Launch Scripts

Related

What is the workflow that Hadoop uses to assign Master and Worker nodes through the Hadoop Configuration files?

I 'm a bit confused about how master and worker nodes are assigned to the respective connected machines (VMs) on the network in the cluster mode of Spark.
My question is when i launch a Spark job (using Spark-submit) what is the process workflow that is responsible of assigning a master node and a worker node.
Thanks !
The Driver and Executors requests containers from yarn to launch and do work. Yarn takes care of the allocations for you so you don't need to worry about where the master(driver)/slave(executor) are allocated.

What is the difference between Driver and Application manager in spark

I couldn't figure out what is the difference between Spark driver and application master. Basically the responsibilities in running an application, who does what?
In client mode, client machine has the driver and app master runs in one of the cluster nodes. In cluster mode, client doesn't have any, driver and app master runs in same node (one of the cluster nodes).
What exactly are the operations that driver do and app master do?
References:
Spark Driver memory and Application Master memory
Spark yarn cluster vs client - how to choose which one to use?
As per the spark documentation
Spark Driver :
The Driver(aka driver program) is responsible for converting a user
application to smaller execution units called tasks and then schedules
them to run with a cluster manager on executors. The driver is also
responsible for executing the Spark application and returning the
status/results to the user.
Spark Driver contains various components – DAGScheduler,
TaskScheduler, BackendScheduler and BlockManager. They are responsible
for the translation of user code into actual Spark jobs executed on
the cluster.
Where in Application Master is
The Application Master is responsible for the execution of a single
application. It asks for containers from the Resource Scheduler
(Resource Manager) and executes specific programs on the obtained containers.
Application Master is just a broker that negotiates resources with the Resource Manager and then after getting some container it make sure to launch tasks(which are picked from scheduler queue) on containers.
In a nutshell Driver program will translate your custom logic into stages, job and task.. and your application master will make sure to get enough resources from RM And also make sure to check the status of your tasks running in a container.
as it is already said in your provided references the only different between client and cluster mode is
In client, mode driver will run on the machine where we have executed/run spark application/job and AM runs in one of the cluster nodes.
(AND)
In cluster mode driver run inside application master, it means the application has much more responsibility.
References :
https://luminousmen.com/post/spark-anatomy-of-spark-application#:~:text=The%20Driver(aka%20driver%20program,status%2Fresults%20to%20the%20user.
https://www.edureka.co/community/1043/difference-between-application-master-application-manager#:~:text=The%20Application%20Master%20is%20responsible,class)%20on%20the%20obtained%20containers.

spark-jobserver: Worker does not connect back to the driver

I set up a small Spark environment on two machines. One runs a master and a worker, and the other one runs a worker only. I can use this cluster using the Spark Shell like:
spark-shell --master spark://mymaster.example.internal:7077
I can run computations in there that get distributed to the nodes correctly, so everything runs fine.
However, I am having trouble when using the spark-jobserver.
First try was to start the Docker container (with the environment variable SPARK_MASTER pointing to the correct master URL). When the job was started, the worker it was pushed to complained that it couldn't connect back to 172.18.x.y:nnnn. This was clear because this was the internal IP address of the Docker container the jobserver ran in.
So, I ran the jobserver container again with --network host so it attached itself to the host network. However, starting the job led to a Connection refused again, this time saying it couldn't connect to 172.30.10.10:nnnn. 172.30.10.10 is the IP address of the host I want to run the jobserver on and it IS reachable from both worker and master nodes (The Spark instances run in Docker containers too, but they are also attached to the host network).
Digging deeper, I tried to start a Docker container which just has a JVM and Spark inside, ran it with --network host too and launched a Spark job from inside. This worked.
What might I be missing?
It turned out that I missed starting the shuffle service. I configured my custom jobserver container to use dynamic allocation and this needs the external shuffle service to be started.

Can I have a master and worker on same node?

I have a 3 node spark standalone cluster and on the master node I also have a worker. When I submit a app to the cluster the two other workers start RUNNING, but the worker on the master node stay with status LOADING and eventually another worker is launched on one of the other machines.
Is having a worker and a master on the same node being the problem ?
If yes, is there a way to workout this problem or I should never have a worker and a master on the same node ?
P.S. The machines have 8 cores each and the workers are set to use 7 and not all of the RAM
Yes, you can, here is from Spark web doc:
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
It is possible to have a machine hosting both Workers and a Master.
Is it possible that you misconfigured the spark-env.sh on that specific machine?

How can I verify that DSE Spark Shell is distributing across the cluster

Is it possible to verify from within the Spark shell what nodes if the shell is connected to the cluster or is running just in local mode? I'm hoping to use that to investigate the following problem:
I've used DSE to setup a small 3 node Cassandra Analytics cluster. I can log onto any of the 3 servers and run dse spark and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that the it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master webpage, I see that only one server is running the job.
I suspect that when I run dse spark it's running the shell in local mode. I looked up how to specific a master for the Spark 0.9.1 shell and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide) it still runs only in local mode.
Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start/stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state, and no Running Applications. (You may have some DEAD workers or Completed Applications if previous Spark jobs had happened on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
The default configuration is limiting the Spark master to only allowing each application to use 1 core, therefore the work will only be given to a single node.
To change this, login to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.

Resources