Spark worker in kubernetes cluster exits - apache-spark

Message on worker:
21/10/15 17:00:16 INFO Worker: Executor app-XXXXXXXXXXXXXX-XXXX/0 finished with state EXITED message Command exited with code 1 exitStatus 1
21/10/15 17:00:16 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
21/10/15 17:00:16 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-XXXXXXXXXXXXXX-XXXX, execId=0)
21/10/15 17:00:16 INFO Worker: Asked to launch executor app-XXXXXXXXXXXXXX-XXXX/1 for truework.ScalaStreaming
21/10/15 17:00:16 INFO SecurityManager: Changing view acls to: root
21/10/15 17:00:16 INFO SecurityManager: Changing modify acls to: root
21/10/15 17:00:16 INFO SecurityManager: Changing view acls groups to:
21/10/15 17:00:16 INFO SecurityManager: Changing modify acls groups to:
21/10/15 17:00:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/10/15 17:00:16 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=XXXXX" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#127.0.0.1:XXXX" "--executor-id" "1" "--hostname" "127.0.0.1" "--cores" "1" "--app-id" "app-XXXXXXXXXXXXXX-XXXX" "--worker-url" "spark://Worker#127.0.0.1:XXXXX
Has anyone had this problem?
I believe it's related to these IPs, because I run it on Kubernetes (microk8s kubectl).
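One thing worth checking, since the launch command above points the executor back at a driver URL on 127.0.0.1: the executor pod must be able to reach the driver over a routable address. A minimal sketch, assuming the driver runs in its own pod and the loopback address is the culprit (the master service name and jar below are placeholders, not taken from the question), is to set the standard spark.driver.host / spark.driver.bindAddress properties explicitly:
# hedged sketch; <master-service> and the jar name are placeholders
DRIVER_POD_IP=$(hostname -i)      # routable address of the pod running the driver
spark-submit \
  --master spark://<master-service>:7077 \
  --conf spark.driver.host=${DRIVER_POD_IP} \
  --conf spark.driver.bindAddress=0.0.0.0 \
  scala-streaming-app.jar
If the executor keeps exiting with code 1, its stderr under the worker's work/ directory for app-XXXXXXXXXXXXXX-XXXX usually shows whether it failed while connecting back to that driver URL.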

Related

How to solve "no org.apache.spark.deploy.worker.Worker to stop" issue?

I am using a Spark standalone cluster in Google Cloud, composed of 1 master and 4 worker nodes. When I start the cluster, I can see the master and workers running. But when I try to stop-all, I get the following issue. Maybe this is the reason I cannot run spark-submit. How can I solve this? The following is the terminal output.
sparkuser@master:/opt/spark/logs$ jps
1867 Jps
sparkuser@master:/opt/spark/logs$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
sparkuser@master:/opt/spark/logs$ jps -lm
1946 sun.tools.jps.Jps -lm
1886 org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
sparkuser@master:/opt/spark/logs$ cat spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
Spark Command: /usr/lib/jvm/jdk1.8.0_202/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/10/13 04:28:23 INFO Master: Started daemon with process name: 1886#master
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for TERM
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for HUP
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for INT
22/10/13 04:28:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/13 04:28:24 INFO SecurityManager: Changing view acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing view acls groups to:
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls groups to:
22/10/13 04:28:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/10/13 04:28:24 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/10/13 04:28:24 INFO Master: Starting Spark master at spark://master:7077
22/10/13 04:28:24 INFO Master: Running Spark version 3.2.2
22/10/13 04:28:25 INFO Utils: Successfully started service 'MasterUI' on port 8080.
22/10/13 04:28:25 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://localhost:8080
22/10/13 04:28:25 INFO Master: I have been elected leader! New state: ALIVE
sparkuser@master:/opt/spark/logs$ stop-all.sh
worker2: no org.apache.spark.deploy.worker.Worker to stop
worker4: no org.apache.spark.deploy.worker.Worker to stop
worker1: no org.apache.spark.deploy.worker.Worker to stop
worker3: no org.apache.spark.deploy.worker.Worker to stop
stopping org.apache.spark.deploy.master.Master
sparkuser@master:/opt/spark/logs$
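The "no org.apache.spark.deploy.worker.Worker to stop" message comes from spark-daemon.sh when it cannot find a worker PID file on the target host, which usually means the worker either never started or died shortly after start-all.sh. A small diagnostic sketch, assuming the default SPARK_PID_DIR of /tmp and the log directory shown above:
# run on each worker node (worker1..worker4)
jps -lm                                                                           # is a Worker process running at all?
ls /tmp/spark-*-org.apache.spark.deploy.worker.Worker-*.pid                       # the PID file stop-all.sh looks for
tail -n 50 /opt/spark/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out    # why it exited, if it did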

Why doesn't the number of workers match the number specified in the Slurm batch?

I have a strange problem. I have been stuck on it for a week and unfortunately have not found a solution.
I'm working with Spark 2.3.0, which is available on a Linux server that I access remotely (ssh).
To run my application (test.py) I wrote the following script:
#!/bin/bash
#SBATCH --account=def-moudi
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --mem=100G
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=6
#SBATCH --output=/project/6008168/moudi/job/spark-job/sparkjob-%j.out
#SBATCH --mail-type=ALL
#SBATCH --error=/project/6008168/moudi/job/spark-job/error6_hours.out
# load the Spark module
module load spark/2.3.0
module load python/3.7.0
source "/home/moudi/ENV3.7.0/bin/activate"
# identify the Spark cluster with the Slurm jobid
export SPARK_IDENT_STRING=$SLURM_JOBID
export JOB_HOME="$HOME/.spark/2.3.0/$SPARK_IDENT_STRING"
mkdir -p $JOB_HOME
## --------------------------------------
## 1. Start the Spark cluster master
## --------------------------------------
$SPARK_HOME/sbin/start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' \
  $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)
## --------------------------------------
## 2. Start the Spark cluster workers
## --------------------------------------
# get the resource details from the Slurm job
export SPARK_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
export SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))
export SPARK_DAEMON_MEMORY=${SPARK_MEM}m
export SPARK_WORKER_MEMORY=${SPARK_MEM}m
NWORKERS=${SLURM_NTASKS:-1} #just for testing you should delete this line
# start the workers on each node allocated to the job
export SPARK_NO_DAEMONIZE=1
srun -n ${NWORKERS} -N $SLURM_JOB_NUM_NODES --label \
  --output=$SPARK_LOG_DIR/spark-%j-workers.out \
  start-slave.sh -m ${SPARK_MEM}m -c ${SPARK_WORKER_CORES} ${MASTER_URL} &
## --------------------------------------
## 3. Submit a task to the Spark cluster
## --------------------------------------
spark-submit --master ${MASTER_URL} \
  --total-executor-cores $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)) \
  --executor-memory ${SPARK_WORKER_MEMORY} \
  --driver-memory ${SPARK_WORKER_MEMORY}m \
  --num-executors $((SLURM_NTASKS - 1)) \
  /project/6008168/moudi/test.py
## --------------------------------------
## 4. Clean up
## --------------------------------------
# stop the workers
scancel ${SLURM_JOBID}.0
# stop the master
$SPARK_HOME/sbin/stop-master.sh
When I run this script, I notice that only 8 workers start, which is not correct: there should be 11 workers. The workers' log output is the following:
2: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
3: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
0: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
1: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
5: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
0: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
0: ========================================
1: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
1: ========================================
2: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
2: ========================================
3: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
3: ========================================
5: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
5: ========================================
10: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
10: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
10: ========================================
3: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
1: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
5: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
2: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
3: 19/05/05 08:25:38 INFO Worker: Started daemon with process name: 190920#cdr562.int.cedar.computecanada.ca
1: 19/05/05 08:25:38 INFO Worker: Started daemon with process name: 190924#cdr562.int.cedar.computecanada.ca
2: 19/05/05 08:25:38 INFO Worker: Started daemon with process name: 190921#cdr562.int.cedar.computecanada.ca
5: 19/05/05 08:25:38 INFO Worker: Started daemon with process name: 190923#cdr562.int.cedar.computecanada.ca
3: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for TERM
1: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for TERM
3: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for HUP
3: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for INT
1: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for HUP
1: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for INT
2: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for TERM
5: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for TERM
2: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for HUP
2: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for INT
5: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for HUP
5: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for INT
0: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
0: 19/05/05 08:25:38 INFO Worker: Started daemon with process name: 190922#cdr562.int.cedar.computecanada.ca
0: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for TERM
0: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for HUP
0: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for INT
3: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls to: moudi
3: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls to: moudi
3: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls groups to:
1: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls to: moudi
3: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls groups to:
1: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls to: moudi
3: 19/05/05 08:25:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
1: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls groups to:
1: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls groups to:
1: 19/05/05 08:25:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
5: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls to: moudi
5: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls to: moudi
5: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls groups to:
5: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls groups to:
5: 19/05/05 08:25:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
2: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls to: moudi
2: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls to: moudi
2: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls groups to:
2: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls groups to:
2: 19/05/05 08:25:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
0: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls to: moudi
0: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls to: moudi
0: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls groups to:
0: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls groups to:
0: 19/05/05 08:25:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
10: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10: 19/05/05 08:25:38 INFO Worker: Started daemon with process name: 134076#cdr743.int.cedar.computecanada.ca
10: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for TERM
10: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for HUP
10: 19/05/05 08:25:38 INFO SignalUtils: Registered signal handler for INT
10: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls to: moudi
10: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls to: moudi
10: 19/05/05 08:25:38 INFO SecurityManager: Changing view acls groups to:
10: 19/05/05 08:25:38 INFO SecurityManager: Changing modify acls groups to:
10: 19/05/05 08:25:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
3: 19/05/05 08:25:38 INFO Utils: Successfully started service 'sparkWorker' on port 35634.
1: 19/05/05 08:25:38 INFO Utils: Successfully started service 'sparkWorker' on port 41932.
5: 19/05/05 08:25:38 INFO Utils: Successfully started service 'sparkWorker' on port 36466.
2: 19/05/05 08:25:38 INFO Utils: Successfully started service 'sparkWorker' on port 32857.
0: 19/05/05 08:25:38 INFO Utils: Successfully started service 'sparkWorker' on port 41950.
3: 19/05/05 08:25:39 INFO Worker: Starting Spark worker 172.16.138.49:35634 with 5 cores, 15.0 GB RAM
1: 19/05/05 08:25:39 INFO Worker: Starting Spark worker 172.16.138.49:41932 with 5 cores, 15.0 GB RAM
5: 19/05/05 08:25:39 INFO Worker: Starting Spark worker 172.16.138.49:36466 with 5 cores, 15.0 GB RAM
1: 19/05/05 08:25:39 INFO Worker: Running Spark version 2.3.0
3: 19/05/05 08:25:39 INFO Worker: Running Spark version 2.3.0
1: 19/05/05 08:25:39 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
3: 19/05/05 08:25:39 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
5: 19/05/05 08:25:39 INFO Worker: Running Spark version 2.3.0
5: 19/05/05 08:25:39 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
2: 19/05/05 08:25:39 INFO Worker: Starting Spark worker 172.16.138.49:32857 with 5 cores, 15.0 GB RAM
2: 19/05/05 08:25:39 INFO Worker: Running Spark version 2.3.0
2: 19/05/05 08:25:39 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
0: 19/05/05 08:25:39 INFO Worker: Starting Spark worker 172.16.138.49:41950 with 5 cores, 15.0 GB RAM
0: 19/05/05 08:25:39 INFO Worker: Running Spark version 2.3.0
0: 19/05/05 08:25:39 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
10: 19/05/05 08:25:39 INFO Utils: Successfully started service 'sparkWorker' on port 35803.
3: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
1: 19/05/05 08:25:39 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
3: 19/05/05 08:25:39 INFO Utils: Successfully started service 'WorkerUI' on port 8082.
5: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
5: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8082. Attempting port 8083.
5: 19/05/05 08:25:39 INFO Utils: Successfully started service 'WorkerUI' on port 8083.
2: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
2: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8082. Attempting port 8083.
2: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8083. Attempting port 8084.
2: 19/05/05 08:25:39 INFO Utils: Successfully started service 'WorkerUI' on port 8084.
4: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
3: 19/05/05 08:25:39 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cdr562.int.cedar.computecanada.ca:8082
1: 19/05/05 08:25:39 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cdr562.int.cedar.computecanada.ca:8081
3: 19/05/05 08:25:39 INFO Worker: Connecting to master cdr562.int.cedar.computecanada.ca:7077...
1: 19/05/05 08:25:39 INFO Worker: Connecting to master cdr562.int.cedar.computecanada.ca:7077...
5: 19/05/05 08:25:39 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cdr562.int.cedar.computecanada.ca:8083
5: 19/05/05 08:25:39 INFO Worker: Connecting to master cdr562.int.cedar.computecanada.ca:7077...
2: 19/05/05 08:25:39 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cdr562.int.cedar.computecanada.ca:8084
2: 19/05/05 08:25:39 INFO Worker: Connecting to master cdr562.int.cedar.computecanada.ca:7077...
0: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
0: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8082. Attempting port 8083.
0: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8083. Attempting port 8084.
0: 19/05/05 08:25:39 WARN Utils: Service 'WorkerUI' could not bind on port 8084. Attempting port 8085.
0: 19/05/05 08:25:39 INFO Utils: Successfully started service 'WorkerUI' on port 8085.
0: 19/05/05 08:25:39 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cdr562.int.cedar.computecanada.ca:8085
0: 19/05/05 08:25:39 INFO Worker: Connecting to master cdr562.int.cedar.computecanada.ca:7077...
11: starting org.apache.spark.deploy.worker.Worker, logging to /home/moudi/.spark/2.3.0/logs/spark-20562069-org.apache.spark.deploy.worker.Worker-1-cdr562.out
10: 19/05/05 08:25:39 INFO Worker: Starting Spark worker 172.16.138.230:35803 with 5 cores, 15.0 GB RAM
10: 19/05/05 08:25:39 INFO Worker: Running Spark version 2.3.0
10: 19/05/05 08:25:39 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
3: 19/05/05 08:25:39 INFO TransportClientFactory: Successfully created connection to cdr562.int.cedar.computecanada.ca/172.16.138.49:7077 after 39 ms (0 ms spent in bootstraps)
1: 19/05/05 08:25:39 INFO TransportClientFactory: Successfully created connection to cdr562.int.cedar.computecanada.ca/172.16.138.49:7077 after 45 ms (0 ms spent in bootstraps)
5: 19/05/05 08:25:39 INFO TransportClientFactory: Successfully created connection to cdr562.int.cedar.computecanada.ca/172.16.138.49:7077 after 43 ms (0 ms spent in bootstraps)
4: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
4: ========================================
2: 19/05/05 08:25:39 INFO TransportClientFactory: Successfully created connection to cdr562.int.cedar.computecanada.ca/172.16.138.49:7077 after 51 ms (0 ms spent in bootstraps)
0: 19/05/05 08:25:39 INFO TransportClientFactory: Successfully created connection to cdr562.int.cedar.computecanada.ca/172.16.138.49:7077 after 42 ms (0 ms spent in bootstraps)
10: 19/05/05 08:25:39 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
10: 19/05/05 08:25:39 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cdr743.int.cedar.computecanada.ca:8081
10: 19/05/05 08:25:39 INFO Worker: Connecting to master cdr562.int.cedar.computecanada.ca:7077...
3: 19/05/05 08:25:39 INFO Worker: Successfully registered with master spark://cdr562.int.cedar.computecanada.ca:7077
1: 19/05/05 08:25:39 INFO Worker: Successfully registered with master spark://cdr562.int.cedar.computecanada.ca:7077
5: 19/05/05 08:25:39 INFO Worker: Successfully registered with master spark://cdr562.int.cedar.computecanada.ca:7077
0: 19/05/05 08:25:39 INFO Worker: Successfully registered with master spark://cdr562.int.cedar.computecanada.ca:7077
2: 19/05/05 08:25:39 INFO Worker: Successfully registered with master spark://cdr562.int.cedar.computecanada.ca:7077
11: Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx15360m org.apache.spark.deploy.worker.Worker --webui-port 8081 -m 15360m -c 5 spark://cdr562.int.cedar.computecanada.ca:7077
11: ========================================
10: 19/05/05 08:25:39 INFO TransportClientFactory: Successfully created connection to cdr562.int.cedar.computecanada.ca/172.16.138.49:7077 after 48 ms (0 ms spent in bootstraps)
10: 19/05/05 08:25:39 INFO Worker: Successfully registered with master spark://cdr562.int.cedar.computecanada.ca:7077
4: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
4: 19/05/05 08:25:40 INFO Worker: Started daemon with process name: 191630#cdr562.int.cedar.computecanada.ca
4: 19/05/05 08:25:40 INFO SignalUtils: Registered signal handler for TERM
4: 19/05/05 08:25:40 INFO SignalUtils: Registered signal handler for HUP
4: 19/05/05 08:25:40 INFO SignalUtils: Registered signal handler for INT
11: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
11: 19/05/05 08:25:40 INFO Worker: Started daemon with process name: 134213#cdr743.int.cedar.computecanada.ca
4: 19/05/05 08:25:40 INFO SecurityManager: Changing view acls to: moudi
4: 19/05/05 08:25:40 INFO SecurityManager: Changing modify acls to: moudi
4: 19/05/05 08:25:40 INFO SecurityManager: Changing view acls groups to:
4: 19/05/05 08:25:40 INFO SecurityManager: Changing modify acls groups to:
4: 19/05/05 08:25:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
11: 19/05/05 08:25:40 INFO SignalUtils: Registered signal handler for TERM
11: 19/05/05 08:25:40 INFO SignalUtils: Registered signal handler for HUP
11: 19/05/05 08:25:40 INFO SignalUtils: Registered signal handler for INT
11: 19/05/05 08:25:40 INFO SecurityManager: Changing view acls to: moudi
11: 19/05/05 08:25:40 INFO SecurityManager: Changing modify acls to: moudi
11: 19/05/05 08:25:40 INFO SecurityManager: Changing view acls groups to:
11: 19/05/05 08:25:40 INFO SecurityManager: Changing modify acls groups to:
11: 19/05/05 08:25:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(moudi); groups with view permissions: Set(); users with modify permissions: Set(moudi); groups with modify permissions: Set()
4: 19/05/05 08:25:41 INFO Utils: Successfully started service 'sparkWorker' on port 41764.
11: 19/05/05 08:25:41 INFO Utils: Successfully started service 'sparkWorker' on port 42231.
4: 19/05/05 08:25:41 INFO Worker: Starting Spark worker 172.16.138.49:41764 with 5 cores, 15.0 GB RAM
4: 19/05/05 08:25:41 INFO Worker: Running Spark version 2.3.0
4: 19/05/05 08:25:41 INFO Worker: Spark home: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0
0: slurmstepd: error: *** STEP 20562069.0 ON cdr562 CANCELLED AT 2019-05-05T08:25:41 ***
Could anyone clarify why I only have 8 workers? Does my script have a wrong configuration that leads to only those 8 workers being created?
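For reference: if Slurm grants the full request, SLURM_NTASKS should be --nodes x --ntasks-per-node, i.e. 2 x 6 = 12 here, so the srun line asks for 12 workers while --num-executors requests SLURM_NTASKS - 1 = 11. A small debugging sketch, not a fix, just a way to see what the job actually got and how many workers registered (the "Registering worker" grep assumes the standard master log message):
echo "SLURM_JOB_NUM_NODES = ${SLURM_JOB_NUM_NODES}"
echo "SLURM_NTASKS        = ${SLURM_NTASKS}"        # expected 12 with --nodes=2 --ntasks-per-node=6
echo "SLURM_CPUS_PER_TASK = ${SLURM_CPUS_PER_TASK}"
grep -c "Registering worker" \
  ${SPARK_LOG_DIR}/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out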

Unable to start Spark's "start-all.sh" on EC2 (rhel7)

I am trying to run standalone Spark 2.1.1 by triggering /sbin/start-all.sh on an EC2 instance (RHEL 7). Whenever it runs, it asks for root@localhost's password, and even though I've given the correct password, it throws the following error: root@localhost's password: localhost: Permission denied, please try again.
Irrespective of this error, when I run jps in the console I can see the Master is running.
root@localhost# jps
27863 Master
28093 Jps
Further I checked the logs and found this-
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/12 15:36:15 INFO Master: Started daemon with process name: 27863#localhost.org.xxxxxxxxx.com
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for TERM
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for HUP
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for INT
17/06/12 15:36:15 WARN Utils: Your hostname, localhost.org.xxxxxxxxx.com resolves to a loopback address: 127.0.0.1; using localhost ip instead (on interface eth0)
17/06/12 15:36:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/06/12 15:36:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/12 15:36:16 INFO SecurityManager: Changing view acls to: root
17/06/12 15:36:16 INFO SecurityManager: Changing modify acls to: root
17/06/12 15:36:16 INFO SecurityManager: Changing view acls groups to:
17/06/12 15:36:16 INFO SecurityManager: Changing modify acls groups to:
17/06/12 15:36:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/06/12 15:36:16 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/06/12 15:36:16 INFO Master: Starting Spark master at spark://localhost.org.xxxxxxxxx.com:7077
17/06/12 15:36:16 INFO Master: Running Spark version 2.1.1
17/06/12 15:36:16 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/06/12 15:36:16 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://localhost:8080
17/06/12 15:36:16 INFO Utils: Successfully started service on port 6066.
17/06/12 15:36:16 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/06/12 15:36:16 INFO Master: I have been elected leader! New state: ALIVE
I am trying to figure out why I am unable to start my worker nodes. Could someone help me out with this? Thanks.
Check whether your hostname resolves correctly.
If you're using localhost, make sure it is resolved in your /etc/hosts file.
Let me know if this helps. Cheers.
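Two quick checks matching the symptoms above (a sketch, not a definitive fix): start-all.sh reaches each host listed in conf/slaves over SSH, so the password prompt goes away once key-based login works for the user launching the scripts, and the WARN about the hostname resolving to 127.0.0.1 can be verified directly:
hostname -f
getent hosts "$(hostname -f)"              # check what the hostname actually resolves to
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # skip if a key already exists
ssh-copy-id root@localhost                 # repeat for every host listed in conf/slaves
ssh root@localhost true                    # should now succeed without a password prompt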

Apache Spark: worker can't connect to master but can ping and ssh from worker to master

I'm trying to set up an 8-node cluster on 8 RHEL 7.3 x86 machines using Spark 2.0.1. start-master.sh goes through fine:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host lambda.foo.net --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:26:46 INFO Master: Started daemon with process name: 22181#lambda.foo.net
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:26:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:26:46 INFO SecurityManager: Changing view acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:26:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:26:46 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/12/08 04:26:46 INFO Master: Starting Spark master at spark://lambda.foo.net:7077
16/12/08 04:26:46 INFO Master: Running Spark version 2.0.1
16/12/08 04:26:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/12/08 04:26:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://19.341.11.212:8080
16/12/08 04:26:46 INFO Utils: Successfully started service on port 6066.
16/12/08 04:26:46 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/12/08 04:26:46 INFO Master: I have been elected leader! New state: ALIVE
But when I try to bring up the workers, using start-slaves.sh, what I see in the log of the workers is:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://lambda.foo.net:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:30:00 INFO Worker: Started daemon with process name: 14649#hawk040os4.foo.net
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:30:00 INFO SecurityManager: Changing view acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:30:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:30:00 INFO Utils: Successfully started service 'sparkWorker' on port 35858.
16/12/08 04:30:00 INFO Worker: Starting Spark worker 15.242.22.179:35858 with 24 cores, 1510.2 GB RAM
16/12/08 04:30:00 INFO Worker: Running Spark version 2.0.1
16/12/08 04:30:00 INFO Worker: Spark home: /usr/local/bin/spark-2.0.1-bin-hadoop2.7
16/12/08 04:30:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/12/08 04:30:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://15.242.22.179:8081
16/12/08 04:30:00 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:00 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to lambda.foo.net/19.341.11.212:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.NoRouteToHostException: No route to host: lambda.foo.net/19.341.11.212:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/12/08 04:30:12 INFO Worker: Retrying connection to master (attempt # 1)
16/12/08 04:30:12 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:12 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
So it says "No route to host". But I could successfully ping the master from the worker node, as well as ssh from the worker to the master node.
Why does Spark say "No route to host"?
Problem solved: the firewall was blocking the packets.
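Since the fix was the firewall, a minimal sketch of what that typically looks like on RHEL 7 with firewalld (ports are the Spark defaults visible in the logs above; adjust if yours differ):
# on the master node
sudo firewall-cmd --permanent --add-port=7077/tcp    # master RPC port the workers connect to
sudo firewall-cmd --permanent --add-port=8080/tcp    # master web UI
sudo firewall-cmd --permanent --add-port=6066/tcp    # REST submission server
sudo firewall-cmd --reload
# quick connectivity check from a worker node
nc -zv lambda.foo.net 7077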

Cannot run spark v2.0.0 example on cluster

So I have set up a Spark cluster. But I can't actually get it to work. When I submit the SparkPi example with:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://x.y.129.163:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 2 \
examples/jars/spark-examples_2.11-2.0.0.jar 1000
I get the following from the worker logs:
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java -cp /opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/* -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://mesos-master:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/18 09:20:56 INFO Worker: Started daemon with process name: 23949#mesos-slave-4.novalocal
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for TERM
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for HUP
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for INT
16/09/18 09:20:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/18 09:20:56 INFO SecurityManager: Changing view acls to: root
16/09/18 09:20:56 INFO SecurityManager: Changing modify acls to: root
16/09/18 09:20:56 INFO SecurityManager: Changing view acls groups to:
16/09/18 09:20:56 INFO SecurityManager: Changing modify acls groups to:
16/09/18 09:20:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/09/18 09:21:00 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/18 09:21:00 INFO Utils: Successfully started service 'sparkWorker' on port 55256.
16/09/18 09:21:00 INFO Worker: Starting Spark worker x.y.129.162:55256 with 4 cores, 6.6 GB RAM
16/09/18 09:21:00 INFO Worker: Running Spark version 2.0.0
16/09/18 09:21:00 INFO Worker: Spark home: /opt/spark/spark-2.0.0-bin-hadoop2.7
16/09/18 09:21:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/09/18 09:21:00 INFO WorkerWebUI: Bound WorkerWebUI to x.y.129.162, and started at http://x.y.129.162:8081
16/09/18 09:21:00 INFO Worker: Connecting to master mesos-master:7077...
16/09/18 09:21:00 INFO TransportClientFactory: Successfully created connection to mesos-master/x.y.129.163:7077 after 33 ms (0 ms spent in bootstraps)
16/09/18 09:21:00 INFO Worker: Successfully registered with master spark://x.y.129.163:7077
16/09/18 09:21:00 INFO Worker: Asked to launch driver driver-20160918090435-0001
16/09/18 09:21:01 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker#x.y.129.162:55256" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
16/09/18 09:21:06 INFO DriverRunner: Command exited with status 1, re-launching after 1 s.
16/09/18 09:21:07 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker#x.y.129.162:55256" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
16/09/18 09:21:12 INFO DriverRunner: Command exited with status 1, re-launching after 1 s.
i.e. the job/driver appears to be failing and then retrying indefinitely.
When I look at the stderr of the driver on the worker node I see:
Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker#x.y.129.162:33364" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
========================================
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/18 09:13:18 INFO SecurityManager: Changing view acls to: root
16/09/18 09:13:18 INFO SecurityManager: Changing modify acls to: root
16/09/18 09:13:18 INFO SecurityManager: Changing view acls groups to:
16/09/18 09:13:18 INFO SecurityManager: Changing modify acls groups to:
16/09/18 09:13:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/09/18 09:13:21 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
Exception in thread "main" java.net.BindException: Cannot assign requested address: Service 'Driver' failed after 16 retries! Consider explicitly setting the appropriate port for the service 'Driver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:463)
at sun.nio.ch.Net.bind(Net.java:455)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
(excuse the timestamps, the logs are the same later on)
On the master I have:
# /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 mesos-master
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.y.129.155 mesos-slave-1
x.y.129.161 mesos-slave-2
x.y.129.160 mesos-slave-3
x.y.129.162 mesos-slave-4
# conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=x.y.129.163
SPARK_LOCAL_IP=x.y.129.163
And for the worker I have:
# /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 mesos-slave-4 mesos-slave-4.novalocal
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.y.129.163 mesos-master
# conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=x.y.129.163
SPARK_LOCAL_IP=x.y.129.162
I also disabled all ipv6 from /etc/sysctl.conf.
All daemons are started with the sbin/start-master.sh and sbin/start-slave.sh spark://x.y.129.163:7077 commands.
Update: I attempted the spark-submit again, but without --deploy-mode cluster, and it works! Any idea why it doesn't work in cluster mode?
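Not a confirmed diagnosis, but a detail consistent with client mode working and cluster mode failing: in cluster mode the driver is launched on a worker, and "Cannot assign requested address" on the 'Driver' service normally means the address it tries to bind is not configured on that node. A small sketch to check on the worker that ran the driver (generic commands, not taken from the question):
ip -4 addr show | grep inet     # addresses actually configured on this host
echo "$SPARK_LOCAL_IP"          # must be one of the addresses above (or unset) for the driver to bind
getent hosts mesos-master       # should resolve to x.y.129.163 everywhere, not 127.0.0.1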
