Spark - How to run a standalone cluster locally - apache-spark

Is it possible to run a Spark standalone cluster locally on just one machine (which is basically different from just developing jobs locally, i.e. local[*])?
So far I have been running two different VMs to build a cluster; what if I could run a standalone cluster on the very same machine, with for instance three different JVMs running?
Could something like having multiple loopback addresses do the trick?

Yes, you can do it: launch one master and one worker node and you are good to go.
Launch the master:
./sbin/start-master.sh
Launch a worker:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
Run the SparkPi example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 lib/spark-examples-1.2.1-hadoop2.4.0.jar
Apache Spark Standalone Mode Documentation

A small update: as of the latest version (2.1.0), the default is to bind the master to the hostname, so when starting a worker locally, use the output of hostname:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://`hostname`:7077 -c 1 -m 512M
And to run an example, simply run the following command:
bin/run-example SparkPi

If you can't find the ./sbin/start-master.sh script on your machine, you can also start the master with
./bin/spark-class org.apache.spark.deploy.master.Master

More simply,
./sbin/start-all.sh
On your local machine a master and one worker will be launched.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://localhost:7077 \
examples/jars/spark-examples_2.12-3.0.1.jar 10000
This submits a sample application. For monitoring via the web UIs:
Master UI: http://localhost:8080
Worker UI: http://localhost:8081
Application UI: http://localhost:4040
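If you prefer a Python example, the same cluster can run the bundled pi.py (the examples/src/main/python/pi.py path below is the one shipped with the standard Spark distribution, so adjust it if your layout differs):
./bin/spark-submit --master spark://localhost:7077 examples/src/main/python/pi.py 100
When you are done, ./sbin/stop-all.sh stops the master and the worker again.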

Related

Memory, CPU, GPU profiling in a containerized Spark cluster

Any suggestion on which library/tool I should use for plotting RAM, CPU, and (optionally) GPU usage over time for a Spark app submitted through spark-submit to a Docker-containerized Spark cluster?
In the documentation, Apache suggests using memory_profiler with commands like:
python -m memory_profiler profile_memory.py
but after accessing my master node through a remote shell:
docker exec -it spark-master bash
I can't launch my Spark apps locally, because I need to use the spark-submit command to submit them to the cluster.
Any suggestions? I launch the apps without YARN but in cluster mode through
/opt/spark/spark-submit --master spark://spark-master:7077 appname.py
I would also like to know whether I can use memory_profiler even though I need to use spark-submit.
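A minimal sketch of one way to combine the two (the file name profiled_app.py and the job logic are made up here; memory_profiler only measures the process it runs in, i.e. the driver, not the executors). Importing the decorator explicitly means the script no longer has to be started via python -m memory_profiler, so it can still be launched through spark-submit:

from memory_profiler import profile  # pip install memory-profiler
from pyspark.sql import SparkSession

@profile  # prints a line-by-line memory report for run() to stdout when it returns
def run():
    spark = SparkSession.builder.appName("memory-profiled-app").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1_000_000))
    print(rdd.map(lambda x: x * x).sum())
    spark.stop()

if __name__ == "__main__":
    run()

It would then be submitted with the same command as before, e.g. /opt/spark/spark-submit --master spark://spark-master:7077 profiled_app.py, after installing memory-profiler inside the master container.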

Starting multiple workers on a master node in Standalone mode

I have a machine with 80 cores. I'd like to start a Spark server in standalone mode on this machine with 8 executors, each with 10 cores. But when I try to start my second worker on the master, I get an error.
$ ./sbin/start-master.sh
Starting org.apache.spark.deploy.master.Master, logging to ...
$ ./sbin/start-slave.sh spark://localhost:7077 -c 10
Starting org.apache.spark.deploy.worker.Worker, logging to ...
$ ./sbin/start-slave.sh spark://localhost:7077 -c 10
org.apache.spark.deploy.worker.Worker running as process 64606. Stop it first.
In the documentation, it clearly states "you can start one or more workers and connect them to the master via: ./sbin/start-slave.sh <master-spark-URL>". So why can't I do that?
A way to get the same parallelism is to start many workers.
You can do this by adding the following to the ./conf/spark-env.sh file (start-slave.sh / start-worker.sh will then launch that many worker instances on the machine):
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_CORES=10
SPARK_EXECUTOR_CORES=10
On a single machine it is quite complicated, but you can try Docker or Kubernetes: create multiple Docker containers for Spark workers.
Just specify a new identity for every new worker/master and then launch start-worker.sh:
export SPARK_IDENT_STRING=worker2
./spark-node2/sbin/start-worker.sh spark://DESKTOP-HSK5ETQ.localdomain:7077
thanks to https://stackoverflow.com/a/46205968/1743724
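If you want several extra workers from a single Spark installation, a small sketch of the same idea (the worker count, core count, and master URL are assumptions to adapt; if the default worker web UI port 8081 is taken, recent Spark versions normally fall back to the next free port):

# hypothetical loop: give each worker its own identity so the pid files don't clash
for i in 2 3 4 5 6 7 8; do
  SPARK_IDENT_STRING=worker$i ./sbin/start-worker.sh spark://$(hostname):7077 -c 10
done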

How to execute "spark-submit" against a remote Spark master?

Suppose I've got a remote Spark cluster. I can log in to a remote Spark cluster host with ssh and run spark-submit, for example like this:
$SPARK_HOME/bin/spark-submit /usr/lib/spark2/examples/src/main/python/pi.py
Now I've installed Spark on my laptop, but I don't run it there.
I want to run $SPARK_HOME/bin/spark-submit on my laptop against the remote Spark cluster host. How can I do that?
Yes, you can provide the remote master URL in this command, e.g.:
$SPARK_HOME/bin/spark-submit --master spark://url_to_master:7077 /usr/lib/spark2/examples/src/main/python/pi.py

Understanding spark --master

I have a simple Spark app that reads the master from a config file:
new SparkConf()
.setMaster(config.getString(SPARK_MASTER))
.setAppName(config.getString(SPARK_APPNAME))
What will happen when I run my app as follows:
spark-submit --class <main class> --master yarn <my jar>
Is my master going to be overwritten?
I prefer having the master provided in the standard way so I don't need to maintain it in my configuration, but then the question is: how can I run this job directly from IDEA? It isn't an application argument but a spark-submit argument.
Just for clarification, my desired end product should:
when run on the cluster using --master yarn, use that configuration
when run from IDEA, run with local[*]
Do not set the master in your code.
In production you can use the --master option of spark-submit, which tells Spark which master to use (yarn in your case). The value of spark.master in the spark-defaults.conf file will also do the job (--master takes priority over the property in the configuration file).
As for the IDE: I know that in Eclipse you can pass a VM argument in the Run Configuration, for example -Dspark.master=local[*] (https://stackoverflow.com/a/24481688/1314742).
In IDEA it is not much different; you can check how to add VM options there.

Spark Mesos Cluster Mode using Dispatcher

I have only a single machine and want to run Spark jobs with Mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out Mesos first to check whether it's able to utilize resources more efficiently (run multiple Spark jobs at the same time without static partitioning). I have tried a number of ways but without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (2 slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher:
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL:
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, I get another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, but they give other instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn one or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos, at the time of writing, does not support is launching the driver into the cluster; this is what the command-line argument --deploy-mode of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client, you can just omit it, or if you want to specify it explicitly, use:
./bin/spark-submit --deploy-mode client ...
I tried your scenario and it can work.
One thing I did differently: I used an IP address instead of "localhost" and "127.0.0.1".
So just try again and check http://your_dispatcher:8081 (in a browser) to see if it exists.
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see something like the following:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151006164749-0001",
  "success" : true
}
When I got the same error log as yours, I rebooted the machine and retried your steps. It also worked.
Try using port 6066 instead of 7077. Newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
