Starting multiple workers on a master node in Standalone mode - apache-spark

I have a machine with 80 cores. I'd like to start a Spark server in standalone mode on this machine with 8 executors, each with 10 cores. But, when I try to start my second worker on the master, I get an error.
$ ./sbin/start-master.sh
Starting org.apache.spark.deploy.master.Master, logging to ...
$ ./sbin/start-slave.sh spark://localhost:7077 -c 10
Starting org.apache.spark.deploy.worker.Worker, logging to ...
$ ./sbin/start-slave.sh spark://localhost:7077 -c 10
org.apache.spark.deploy.worker.Worker running as process 64606. Stop it first.
In the documentation, it clearly states "you can start one or more workers and connect them to the master via: ./sbin/start-slave.sh <master-spark-URL>". So why can't I do that?

One way to get the same parallelism is to start multiple worker instances on the machine. You can do this by adding the following to the ./conf/spark-env.sh file:
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_CORES=10
SPARK_EXECUTOR_CORES=10
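With those settings in place, a single start-slave.sh invocation launches all eight worker JVMs. A minimal sketch, assuming $SPARK_HOME points at your Spark installation and the master runs on the same host (in recent releases the script is called start-worker.sh; use the master URL printed in the master log if it is not bound to localhost):
# conf/spark-env.sh already contains the three settings above
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077   # starts 8 worker instances, 10 cores each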

On a single machine this is fairly involved, but you can use Docker or Kubernetes: create a separate Docker container for each Spark worker.

Just give every new worker/master a distinct identity and then launch start-worker.sh:
export SPARK_IDENT_STRING=worker2
./spark-node2/sbin/start-worker.sh spark://DESKTOP-HSK5ETQ.localdomain:7077
thanks to https://stackoverflow.com/a/46205968/1743724
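A sketch of extending that to a third worker on the same install; the identity string and the web UI port below are just illustrative, and --webui-port is only needed if the default clashes with an existing worker:
export SPARK_IDENT_STRING=worker3
$SPARK_HOME/sbin/start-worker.sh spark://DESKTOP-HSK5ETQ.localdomain:7077 --webui-port 8083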

Related

spark pi example runs but no worker resources allocated

I am running the Pi example. It executes fine and returns the result.
But for the worker I can only see that it is Alive; no resources are used and no job details are filled in.
I am running Spark locally:
start-master.sh -h 127.0.0.1
start-slave.sh spark://127.0.0.1:7077
You should pass your local master address to the spark-submit command with --master spark://127.0.0.1:7077. Alternatively, you can set the master address in code using .setMaster(...) and then run the program as a Python script.
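For example, a sketch of the spark-submit route, assuming the bundled Python Pi example is present in your distribution:
./bin/spark-submit --master spark://127.0.0.1:7077 examples/src/main/python/pi.py 100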

Why does stopping Standalone Spark master fail with "no org.apache.spark.deploy.master.Master to stop"?

Stopping standalone spark master fails with the following message:
$ ./sbin/stop-master.sh
no org.apache.spark.deploy.master.Master to stop
Why? There is one Spark Standalone master up and running.
The Spark master had been started under a different user, so the PID file
/tmp/Spark-ec2-user-org.apache.spark.deploy.master.Master-1.pid
was not accessible. I had to log in as the user who actually started the standalone cluster manager master.
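stop-master.sh only looks for that PID file. A quick way to check that it exists and is readable, assuming the defaults (PID files under /tmp, named after the user who started the daemon):
ls -l /tmp/spark-*-org.apache.spark.deploy.master.Master-1.pid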
In my case, I could open the master Web UI in a browser, and it clearly showed that the Spark master was running on port 7077.
However, stop-all.sh still failed with no org.apache.spark.deploy.master.Master to stop. So I tried a different approach: finding which process was listening on port 7077 with the command below:
lsof -i :7077
The result was a java process with PID 112099. I killed it with:
kill 112099
After this the Web UI stopped responding, confirming the Spark master had been killed.
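The same thing as a one-liner, assuming lsof is available and nothing else is bound to the port:
kill $(lsof -t -i :7077)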

Docker Container with Apache Spark in standalone cluster mode

I am trying to build a Docker image containing Apache Spark. It is built on the official openjdk-8-jre image.
The goal is to execute Spark in cluster mode, thus having at least one master (started via sbin/start-master.sh) and one or more slaves (sbin/start-slave.sh). See spark-standalone-docker for my Dockerfile and entrypoint script.
The build itself goes through; the problem is that when I run the container, it starts and then stops shortly afterwards. The cause is that the Spark master launch script starts the master in daemon mode and exits, so the container terminates because there is no longer a process running in the foreground.
The obvious solution is to run the Spark master process in the foreground, but I could not figure out how (Google did not turn up anything either). My workaround is to run tail -f on the Spark log directory.
Thus, my questions are:
How can you run Apache Spark Master in foreground?
If the first is not possible / feasible / whatever, what is the preferred (i.e. best practice) solution to keeping a container "alive" (I really don't want to use an infinite loop and a sleep command)?
UPDATED ANSWER (for Spark 2.4.0):
To start the Spark master in the foreground, just set the environment variable SPARK_NO_DAEMONIZE=true before running ./start-master.sh and you are good to go.
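A minimal container entrypoint sketch along those lines; the install path /opt/spark is an assumption:
#!/bin/sh
# run the master in the foreground so the container keeps running
export SPARK_NO_DAEMONIZE=true
exec /opt/spark/sbin/start-master.sh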
For more info, check $SPARK_HOME/sbin/spark-daemon.sh:
# Runs a Spark command as a daemon.
#
# Environment Variables
#
# SPARK_CONF_DIR Alternate conf dir. Default is ${SPARK_HOME}/conf.
# SPARK_LOG_DIR Where log files are stored. ${SPARK_HOME}/logs by default.
# SPARK_MASTER host:path where spark code should be rsync'd from
# SPARK_PID_DIR The pid files are stored. /tmp by default.
# SPARK_IDENT_STRING A string representing this instance of spark. $USER by default
# SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
# SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file.
##
How can you run Apache Spark Master in foreground?
You can use spark-class with Master.
bin/spark-class org.apache.spark.deploy.master.Master
and the same thing for workers:
bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL
If you're looking for a production-ready solution, you should consider using a proper init process / supervisor such as dumb-init or tini.
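A sketch of wiring that together in an image; the paths and the use of tini here are assumptions, not part of the original answer:
#!/bin/sh
# entrypoint.sh: keep the master in the foreground
exec /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
In the Dockerfile, run it under tini: ENTRYPOINT ["/usr/bin/tini", "--", "/opt/entrypoint.sh"]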

what happens when I kill a YARN node manager

Suppose there are 10 containers running on this machine (5 are MapReduce tasks and 5 are Spark-on-YARN executors).
If I kill the NodeManager, what happens to these 10 container processes?
And what should I do before restarting the NodeManager?
Killing the NodeManager only affects the containers on that particular node. All running containers on it are lost on restart/kill; they get relaunched once the node comes back up or the NodeManager process is started again (if the application/job is still running).
NOTE: the job's ApplicationMaster should not be running on this node.
What happens when the node with the ApplicationMaster dies? In that case YARN launches a new ApplicationMaster on some other node, and all the containers are relaunched.
Answering for the Hadoop 2.7.x distribution; see this article: http://hortonworks.com/blog/resilience-of-yarn-applications-across-nodemanager-restarts/
If you don't have yarn.nodemanager.recovery.enabled set to true, your containers will be KILLED (Spark, MapReduce, or anything else); however, your job will most likely continue to run.
You can check the property in your environment with hadoop conf | grep yarn.nodemanager.recovery.dir . If recovery is disabled (which it was for me by default), there is nothing you can do to prevent those containers from being killed on restart. However, you can set the flag, plus the other required properties, for future cases if you want containers to be recovered.
Look at this one too: http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_ha_yarn_work_preserving_recovery.html
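For reference, a sketch of the yarn-site.xml settings involved in NodeManager work-preserving restart; the property names are standard, but the directory and port values below are only examples:
<property><name>yarn.nodemanager.recovery.enabled</name><value>true</value></property>
<property><name>yarn.nodemanager.recovery.dir</name><value>/var/lib/hadoop-yarn/nm-recovery</value></property>
<property><name>yarn.nodemanager.address</name><value>0.0.0.0:45454</value></property>
(Pinning yarn.nodemanager.address to a fixed port instead of an ephemeral one is needed so running containers can reconnect to the NodeManager after it restarts.)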

Spark - How to run a standalone cluster locally

Is it possible to run a Spark standalone cluster locally on just one machine (which is basically different from just developing jobs locally, i.e. local[*])?
So far I am running 2 different VMs to build a cluster, what if I could run a standalone cluster on the very same machine, having for instance three different JVMs running?
Could something like having multiple loopback addresses do the trick?
Yes, you can do it: launch one master and one worker node and you are good to go.
launch master
./sbin/start-master.sh
launch worker
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
run SparkPi example
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 lib/spark-examples-1.2.1-hadoop2.4.0.jar
Apache Spark Standalone Mode Documentation
A small update: as of the latest version (2.1.0), the default is to bind the master to the hostname, so when starting a worker locally use the output of hostname:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://`hostname`:7077 -c 1 -m 512M
And to run an example, simply run the following command:
bin/run-example SparkPi
If you can't find the ./sbin/start-master.sh file on your machine, you can also start the master with
./bin/spark-class org.apache.spark.deploy.master.Master
More simply,
./sbin/start-all.sh
This launches a master and one worker on your local machine.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://localhost:7077 \
examples/jars/spark-examples_2.12-3.0.1.jar 10000
This submits a sample application. For monitoring via the Web UIs:
Master UI: http://localhost:8080
Worker UI: http://localhost:8081
Application UI: http://localhost:4040
