Spark Worker Nodes starting but not showing in WebUI - apache-spark

I am trying to set up a small 3-node Spark cluster using some Raspberry Pis and my main desktop, but I can't get the Pis to talk to my master node (the desktop). The network is configured properly, as I am also running Cassandra (open source, not DSE) on all three nodes. The master web UI only shows my main computer. I can enter the web UI address of each worker node and get its individual web UI page, but the workers don't seem to know about my master node. Each of the slave nodes is listed in my slaves file. I feel like I am missing just one small thing to get this to work. Any suggestions would be much appreciated. Below are some logs and other information I thought might be helpful, while trying to keep this fairly short and concise.
My spark-env.sh on all the nodes is as follows (except that SPARK_LOCAL_IP is adjusted for each node):
export SPARK_WORKER_CORES=6
export SPARK_MASTER_HOST=192.168.0.106
export SPARK_LOCAL_IP=192.168.0.201
Log from a Worker Node:
Spark Command: /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java -cp /home/spark/spark/conf/:/home/spark/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://Palehorse:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/05 03:22:40 INFO Worker: Started daemon with process name: 11065@PiCamp1
17/07/05 03:22:40 INFO SignalUtils: Registered signal handler for TERM
17/07/05 03:22:40 INFO SignalUtils: Registered signal handler for HUP
17/07/05 03:22:40 INFO SignalUtils: Registered signal handler for INT
17/07/05 03:22:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/05 03:22:42 INFO SecurityManager: Changing view acls to: spark
17/07/05 03:22:42 INFO SecurityManager: Changing modify acls to: spark
17/07/05 03:22:42 INFO SecurityManager: Changing view acls groups to:
17/07/05 03:22:42 INFO SecurityManager: Changing modify acls groups to:
17/07/05 03:22:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
17/07/05 03:22:43 INFO Utils: Successfully started service 'sparkWorker' on port 35342.
17/07/05 03:22:44 INFO Worker: Starting Spark worker 192.168.0.201:35342 with 6 cores, 1024.0 MB RAM
17/07/05 03:22:44 INFO Worker: Running Spark version 2.1.1
17/07/05 03:22:44 INFO Worker: Spark home: /home/spark/spark
17/07/05 03:22:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
17/07/05 03:22:45 INFO WorkerWebUI: Bound WorkerWebUI to 192.168.0.201, and started at http://192.168.0.201:8081
17/07/05 03:22:45 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:22:51 INFO Worker: Retrying connection to master (attempt # 1)
17/07/05 03:22:51 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:22:57 INFO Worker: Retrying connection to master (attempt # 2)
17/07/05 03:22:57 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:23:03 INFO Worker: Retrying connection to master (attempt # 3)
17/07/05 03:23:03 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:23:09 INFO Worker: Retrying connection to master (attempt # 4)
17/07/05 03:23:09 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:23:15 INFO Worker: Retrying connection to master (attempt # 5)
17/07/05 03:23:15 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:23:21 INFO Worker: Retrying connection to master (attempt # 6)
17/07/05 03:23:21 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:23:57 INFO Worker: Retrying connection to master (attempt # 7)
17/07/05 03:23:57 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:24:33 INFO Worker: Retrying connection to master (attempt # 8)
17/07/05 03:24:33 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:24:45 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
17/07/05 03:24:45 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)
17/07/05 03:24:45 WARN Worker: Failed to connect to master Palehorse:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:229)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
17/07/05 03:25:09 INFO Worker: Retrying connection to master (attempt # 9)
17/07/05 03:25:09 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:25:45 INFO Worker: Retrying connection to master (attempt # 10)
17/07/05 03:25:45 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:26:21 INFO Worker: Retrying connection to master (attempt # 11)
17/07/05 03:26:21 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:26:57 INFO Worker: Retrying connection to master (attempt # 12)
17/07/05 03:26:57 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:27:09 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
17/07/05 03:27:09 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)
17/07/05 03:27:09 WARN Worker: Failed to connect to master Palehorse:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:229)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
17/07/05 03:27:33 INFO Worker: Retrying connection to master (attempt # 13)
17/07/05 03:27:33 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:28:09 INFO Worker: Retrying connection to master (attempt # 14)
17/07/05 03:28:09 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:28:45 INFO Worker: Retrying connection to master (attempt # 15)
17/07/05 03:28:45 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:29:21 INFO Worker: Retrying connection to master (attempt # 16)
17/07/05 03:29:21 INFO Worker: Connecting to master Palehorse:7077...
17/07/05 03:29:33 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
17/07/05 03:29:33 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)
17/07/05 03:29:33 WARN Worker: Failed to connect to master Palehorse:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:229)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
17/07/05 03:29:57 ERROR Worker: All masters are unresponsive! Giving up.

I was finally able to get the slaves to talk to the master. It seemed to be a combination of two things. One was that my /etc/hosts file mapped the master's hostname to the 127.0.1.1 address. The other did seem to be a problem with start-all.sh, which I worked around by running start-slave.sh spark://<master ip address>:7077 instead.
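The /etc/hosts part of this fix can be checked mechanically. Below is a minimal Python sketch, not part of the original answer, that scans hosts-file text and reports any name other than the usual localhost aliases that is mapped to a loopback address such as 127.0.1.1. The function name loopback_mapped_names and the sample file contents are hypothetical; only the names Palehorse and PiCamp1 and the address 192.168.0.201 come from the question.

```python
import ipaddress


def loopback_mapped_names(hosts_text,
                          ignore=("localhost", "ip6-localhost", "ip6-loopback")):
    """Return (name, address) pairs for names mapped to a loopback address.

    Names listed in `ignore` are skipped, since mapping them to loopback is normal.
    """
    flagged = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a plain IP address; skip the line
        if addr.is_loopback:
            flagged.extend((name, fields[0]) for name in fields[1:]
                           if name not in ignore)
    return flagged


# Hypothetical hosts file resembling the situation described in the answer.
sample = """\
127.0.0.1   localhost
127.0.1.1   Palehorse
192.168.0.201   PiCamp1
"""
print(loopback_mapped_names(sample))  # prints [('Palehorse', '127.0.1.1')]
```

On a real node the contents of /etc/hosts could be passed in instead, for example loopback_mapped_names(open("/etc/hosts").read()). A master hostname that shows up in the result is a candidate for the problem described above.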

I was also able to solve this problem by commenting out the lines for the odd 127.0.0.1 and 127.0.1.1 addresses in my hosts file.
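A related check, offered as a sketch rather than a confirmed diagnosis: the worker log in the question prints the address that the master name resolved to (Palehorse/198.105.254.63:7077), and that address differs from the 192.168.0.106 set in SPARK_MASTER_HOST. The Python snippet below pulls those host/IP pairs out of a log and lists the ones that do not match an expected master IP. The function names and the regular expression are my own assumptions about the log format, based only on the lines shown above.

```python
import re

# Matches the "Connecting to <host>/<ip>:<port> timed out" fragment from the worker log.
_TARGET = re.compile(
    r"Connecting to (?P<host>[^/\s]+)/(?P<ip>[0-9.]+):(?P<port>\d+) timed out"
)


def resolved_targets(log_text):
    """Return the distinct (host, ip, port) triples the worker tried to reach."""
    return sorted({(m["host"], m["ip"], int(m["port"]))
                   for m in _TARGET.finditer(log_text)})


def mismatches(log_text, expected_ip):
    """Return the targets whose resolved IP differs from the expected master IP."""
    return [t for t in resolved_targets(log_text) if t[1] != expected_ip]


# One line copied from the worker log in the question.
log_excerpt = (
    "17/07/05 03:24:45 WARN NettyRpcEnv: Ignored failure: java.io.IOException: "
    "Connecting to Palehorse/198.105.254.63:7077 timed out (120000 ms)\n"
)
print(mismatches(log_excerpt, "192.168.0.106"))
# prints [('Palehorse', '198.105.254.63', 7077)]
```

A non-empty result suggests the worker resolved the master's name to something other than the intended machine, which would fit with the fix of passing the master's IP address to start-slave.sh directly.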

Related

Apache spark Worker: Failed to connect to master master:7077 in Google cloud cluster

I am trying to start an Apache Spark cluster in Google Cloud with 1 master and 4 workers. When I run start-all.sh, it shows all nodes starting without any error.
sparkuser@master:/opt/spark/logs$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
sparkuser@master:/opt/spark/logs$
When I check the log files, the master is running, but none of the workers are. What could be the issue?
spark-env.sh for workers:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_202
export SPARK_HOME=/opt/spark/
export SPARK_MASTER_HOST="master"
export SPARK_LOCAL_IP="127.0.0.1"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_LOG_DIR=/opt/spark/logs
spark-env.sh for master:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_202
export SPARK_HOME=/opt/spark/
export SPARK_MASTER_HOST="master"
export SPARK_LOCAL_IP="127.0.0.1"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_LOG_DIR=/opt/spark/logs
The error in the worker's log file:
Spark Command: /usr/lib/jvm/jdk1.8.0_202/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/11/07 10:20:13 INFO Worker: Started daemon with process name: 18908@worker2
22/11/07 10:20:13 INFO SignalUtils: Registering signal handler for TERM
22/11/07 10:20:13 INFO SignalUtils: Registering signal handler for HUP
22/11/07 10:20:13 INFO SignalUtils: Registering signal handler for INT
22/11/07 10:20:13 WARN Utils: Your hostname, worker2 resolves to a loopback address: 127.0.0.1; using 10.178.0.5 instead (on interface ens4)
22/11/07 10:20:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/11/07 10:20:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/07 10:20:14 INFO SecurityManager: Changing view acls to: sparkuser
22/11/07 10:20:14 INFO SecurityManager: Changing modify acls to: sparkuser
22/11/07 10:20:14 INFO SecurityManager: Changing view acls groups to:
22/11/07 10:20:14 INFO SecurityManager: Changing modify acls groups to:
22/11/07 10:20:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/11/07 10:20:14 INFO Utils: Successfully started service 'sparkWorker' on port 46335.
22/11/07 10:20:14 INFO Worker: Worker decommissioning not enabled.
22/11/07 10:20:15 INFO Worker: Starting Spark worker 10.178.0.5:46335 with 2 cores, 6.8 GiB RAM
22/11/07 10:20:15 INFO Worker: Running Spark version 3.2.2
22/11/07 10:20:15 INFO Worker: Spark home: /opt/spark
22/11/07 10:20:15 INFO ResourceUtils: ==============================================================
22/11/07 10:20:15 INFO ResourceUtils: No custom resources configured for spark.worker.
22/11/07 10:20:15 INFO ResourceUtils: ==============================================================
22/11/07 10:20:15 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
22/11/07 10:20:15 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://worker2.c.apache-spark-project-363713.internal:8081
22/11/07 10:20:15 INFO Worker: Connecting to master master:7077...
22/11/07 10:20:24 INFO Worker: Retrying connection to master (attempt # 1)
22/11/07 10:20:24 INFO Worker: Connecting to master master:7077...
22/11/07 10:20:33 INFO Worker: Retrying connection to master (attempt # 2)
22/11/07 10:20:33 INFO Worker: Connecting to master master:7077...
22/11/07 10:20:42 INFO Worker: Retrying connection to master (attempt # 3)
22/11/07 10:20:42 INFO Worker: Connecting to master master:7077...
22/11/07 10:20:51 INFO Worker: Retrying connection to master (attempt # 4)
22/11/07 10:20:51 INFO Worker: Connecting to master master:7077...
22/11/07 10:21:00 INFO Worker: Retrying connection to master (attempt # 5)
22/11/07 10:21:00 INFO Worker: Connecting to master master:7077...
22/11/07 10:21:09 INFO Worker: Retrying connection to master (attempt # 6)
22/11/07 10:21:09 INFO Worker: Connecting to master master:7077...
22/11/07 10:22:00 INFO Worker: Retrying connection to master (attempt # 7)
22/11/07 10:22:00 INFO Worker: Connecting to master master:7077...
22/11/07 10:22:15 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
22/11/07 10:22:15 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
22/11/07 10:22:15 WARN Worker: Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
at org.apache.spark.deploy.worker.Worker$$anon$1.run(Worker.scala:311)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
... 4 more
22/11/07 10:22:51 INFO Worker: Retrying connection to master (attempt # 8)
22/11/07 10:22:51 INFO Worker: Connecting to master master:7077...
22/11/07 10:23:42 INFO Worker: Retrying connection to master (attempt # 9)
22/11/07 10:23:42 INFO Worker: Connecting to master master:7077...
22/11/07 10:24:33 INFO Worker: Retrying connection to master (attempt # 10)
22/11/07 10:24:33 INFO Worker: Connecting to master master:7077...
22/11/07 10:24:51 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
22/11/07 10:24:51 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
22/11/07 10:24:51 WARN Worker: Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
at org.apache.spark.deploy.worker.Worker$$anon$1.run(Worker.scala:311)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
... 4 more
22/11/07 10:25:24 INFO Worker: Retrying connection to master (attempt # 11)
22/11/07 10:25:24 INFO Worker: Connecting to master master:7077...
22/11/07 10:26:15 INFO Worker: Retrying connection to master (attempt # 12)
22/11/07 10:26:15 INFO Worker: Connecting to master master:7077...
22/11/07 10:27:06 INFO Worker: Retrying connection to master (attempt # 13)
22/11/07 10:27:06 INFO Worker: Connecting to master master:7077...
22/11/07 10:27:24 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
22/11/07 10:27:24 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
22/11/07 10:27:24 WARN Worker: Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
at org.apache.spark.deploy.worker.Worker$$anon$1.run(Worker.scala:311)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
... 4 more
22/11/07 10:27:57 INFO Worker: Retrying connection to master (attempt # 14)
22/11/07 10:27:57 INFO Worker: Connecting to master master:7077...
22/11/07 10:28:48 INFO Worker: Retrying connection to master (attempt # 15)
22/11/07 10:28:48 INFO Worker: Connecting to master master:7077...
22/11/07 10:29:39 INFO Worker: Retrying connection to master (attempt # 16)
22/11/07 10:29:39 INFO Worker: Connecting to master master:7077...
22/11/07 10:29:57 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
22/11/07 10:29:57 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
22/11/07 10:29:57 WARN Worker: Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
at org.apache.spark.deploy.worker.Worker$$anon$1.run(Worker.scala:311)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connecting to master/35.216.27.9:7077 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
... 4 more
22/11/07 10:30:30 ERROR Worker: All masters are unresponsive! Giving up.
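One detail in the worker log above may be worth checking: worker2 binds to 10.178.0.5, while the name master resolves to 35.216.27.9, and both spark-env.sh files set SPARK_LOCAL_IP to 127.0.0.1. The Python sketch below uses only the standard ipaddress module to classify these three addresses. The helper name describe is made up for illustration, and the result is a hint about where to look, not proof of the cause.

```python
import ipaddress


def describe(ip):
    """Classify an IP address as 'loopback', 'private', 'public', or 'other'."""
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:      # checked first: loopback also counts as private
        return "loopback"
    if addr.is_private:
        return "private"
    if addr.is_global:
        return "public"
    return "other"


# Addresses taken from the log and spark-env.sh in the question.
for label, ip in [
    ("worker2 bind address", "10.178.0.5"),
    ("address that 'master' resolved to", "35.216.27.9"),
    ("SPARK_LOCAL_IP", "127.0.0.1"),
]:
    print(f"{label}: {ip} is {describe(ip)}")
```

If the worker sits on a private address while the master's name resolves to a public one, the worker may be trying to reach the master over a path that is not open on port 7077. A master bound to a loopback address would likewise not be reachable from other machines.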

Spark fails to register multiple workers to master

I have been working on creating a Spark cluster using 1 master and 4 workers on Linux.
It works fine with one worker. When I try to add more than one worker, only the first worker registers with the master, while the rest fail with the error below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/08/06 14:17:39 INFO Worker: Started daemon with process name: 24104@barracuda5
18/08/06 14:17:39 INFO SignalUtils: Registered signal handler for TERM
18/08/06 14:17:39 INFO SignalUtils: Registered signal handler for HUP
18/08/06 14:17:39 INFO SignalUtils: Registered signal handler for INT
18/08/06 14:17:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/06 14:17:39 INFO SecurityManager: Changing view acls to: barracuda5
18/08/06 14:17:39 INFO SecurityManager: Changing modify acls to: barracuda5
18/08/06 14:17:39 INFO SecurityManager: Changing view acls groups to:
18/08/06 14:17:39 INFO SecurityManager: Changing modify acls groups to:
18/08/06 14:17:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(barracuda5); groups with view permissions: Set(); users with modify permissions: Set(barracuda5); groups with modify permissions: Set()
18/08/06 14:17:40 INFO Utils: Successfully started service 'sparkWorker' on port 46635.
18/08/06 14:17:40 INFO Worker: Starting Spark worker 10.0.6.6:46635 with 4 cores, 14.7 GB RAM
18/08/06 14:17:40 INFO Worker: Running Spark version 2.1.0
18/08/06 14:17:40 INFO Worker: Spark home: /usr/lib/spark/spark-2.1.0-bin-hadoop2.7
18/08/06 14:17:40 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
18/08/06 14:17:40 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://10.0.6.6:8081
18/08/06 14:17:40 INFO Worker: Connecting to master Cudatest.533gwuzexxzehbkoeqpn4rgs4d.ux.internal.cloudapp.net:7077...
18/08/06 14:17:40 WARN Worker: Failed to connect to master Cudatest.533gwuzexxzehbkoeqpn4rgs4d.ux.internal.cloudapp.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to Cudatest.533gwuzexxzehbkoeqpn4rgs4d.ux.internal.cloudapp.net:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:517)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
Let me know if I have missed something here, or if anyone knows what the solution might be.
Thanks
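The UnresolvedAddressException at the bottom of the trace indicates that the failing worker could not turn the master's hostname into an IP address. A hedged way to test this on each worker is sketched below in Python: it asks the system resolver for the hostname and returns an empty list when resolution fails. The function name resolve_addresses and the injectable resolver parameter are assumptions added for illustration; socket.getaddrinfo and socket.gaierror are standard library names.

```python
import socket


def resolve_addresses(hostname, port=7077, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if it cannot be resolved.

    `resolver` is injectable so the function can be exercised without real DNS.
    """
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved on this machine.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Running resolve_addresses with the master hostname from the log on the worker that registers and on one that fails would show whether the failing machines can resolve the name at all. If they cannot, adding the master to their hosts file or using the master's IP address in the spark:// URL are the usual things to try.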

Apache Spark Failed to connect to master localhost:7077

I am very new to Apache Spark and am trying to run Spark on my local machine.
First I tried to start the master using the following command:
./sbin/start-master.sh
The master started successfully. Then I tried to start the worker using
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
which eventually failed with the following log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/09 17:01:58 INFO Worker: Started daemon with process name: 9301@sumit-Inspiron-5537
17/06/09 17:01:58 INFO SignalUtils: Registered signal handler for TERM
17/06/09 17:01:58 INFO SignalUtils: Registered signal handler for HUP
17/06/09 17:01:58 INFO SignalUtils: Registered signal handler for INT
17/06/09 17:01:58 WARN Utils: Your hostname, sumit-Inspiron-5537 resolves to a loopback address: 127.0.1.1; using 192.168.1.16 instead (on interface wlp2s0)
17/06/09 17:01:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/06/09 17:01:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/09 17:01:59 INFO SecurityManager: Changing view acls to: sumit
17/06/09 17:01:59 INFO SecurityManager: Changing modify acls to: sumit
17/06/09 17:01:59 INFO SecurityManager: Changing view acls groups to:
17/06/09 17:01:59 INFO SecurityManager: Changing modify acls groups to:
17/06/09 17:01:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sumit); groups with view permissions: Set(); users with modify permissions: Set(sumit); groups with modify permissions: Set()
17/06/09 17:01:59 INFO Utils: Successfully started service 'sparkWorker' on port 35827.
17/06/09 17:02:00 INFO Worker: Starting Spark worker 192.168.1.16:35827 with 1 cores, 512.0 MB RAM
17/06/09 17:02:00 INFO Worker: Running Spark version 2.1.1
17/06/09 17:02:00 INFO Worker: Spark home: /home/sumit/spark-2.1.1-bin-hadoop2.7
17/06/09 17:02:00 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
17/06/09 17:02:00 WARN Utils: Service 'WorkerUI' could not bind on port 8082. Attempting port 8083.
17/06/09 17:02:00 INFO Utils: Successfully started service 'WorkerUI' on port 8083.
17/06/09 17:02:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://192.168.1.16:8083
17/06/09 17:02:00 INFO Worker: Connecting to master localhost:7077...
17/06/09 17:02:00 WARN Worker: Failed to connect to master localhost:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
The worker is now stuck in a retry loop. I checked localhost on port 8080 and the master is running properly. Please suggest what can be done to get the worker up and running, since the Spark job cannot run without it. Thank you.

Spark standalone, worker failed to connect to master

I'm having trouble with Spark. I have a Spark standalone cluster with 2 nodes:
master: 121.*.*.22(hostname is iZ28i1niuigZ)
worker: 123.*.*.125(hostname is VM-120-50-ubuntu).
I have edited the slaves file and added 123.*.*.125.
No worker info appears in the WebUI:
WebUI image of spark master
When executing the start script I see the following messages:
spark#iZ28i1niuigZ:~/spark-2.0.1-bin-hadoop2.7$ sh sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.master.Master-1-iZ28i1niuigZ.out
123.*.*.125: starting org.apache.spark.deploy.worker.Worker, logging to /home/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-VM-120-50-ubuntu.out
The spark-env.sh file contents are:
export SPARK_MASTER_IP=121.*.*.22
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORDER_INSTANCES=1
export SPARK_WORKER_MEMORY=1g
export JAVA_HOME=/home/spark/jdk1.8.0_101
On the worker I can see the following output:
Spark Command: /home/spark/jdk1.8.0_101/bin/java -cp /home/spark/spark-2.0.1-bin-hadoop2.7/conf/:/home/spark/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://iZ28i1niuigZ:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/30 20:04:56 INFO Worker: Started daemon with process name: 28287#VM-120-50-ubuntu
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for TERM
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for HUP
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for INT
16/11/30 20:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/30 20:04:56 INFO SecurityManager: Changing view acls to: spark
16/11/30 20:04:56 INFO SecurityManager: Changing modify acls to: spark
16/11/30 20:04:56 INFO SecurityManager: Changing view acls groups to:
16/11/30 20:04:56 INFO SecurityManager: Changing modify acls groups to:
16/11/30 20:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
16/11/30 20:04:57 INFO Utils: Successfully started service 'sparkWorker' on port 41544.
16/11/30 20:04:57 INFO Worker: Starting Spark worker 10.141.120.50:41544 with 1 cores, 1024.0 MB RAM
16/11/30 20:04:57 INFO Worker: Running Spark version 2.0.1
16/11/30 20:04:57 INFO Worker: Spark home: /home/spark/spark-2.0.1-bin-hadoop2.7
16/11/30 20:04:57 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/11/30 20:04:57 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://10.141.120.50:8081
16/11/30 20:04:57 INFO Worker: Connecting to master iZ28i1niuigZ:7077...
16/11/30 20:04:58 WARN Worker: Failed to connect to master iZ28i1niuigZ:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to iZ28i1niuigZ/121.*.*.22:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.ConnectException: Connection refused: iZ28i1niuigZ/121.*.*.22:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/11/30 20:05:08 INFO Worker: Retrying connection to master (attempt # 1)
16/11/30 20:05:08 INFO Worker: Connecting to master iZ28i1niuigZ:7077...
16/11/30 20:05:08 WARN Worker: Failed to connect to master iZ28i1niuigZ:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to iZ28i1niuigZ/121.*.*.22:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.ConnectException: Connection refused: iZ28i1niuigZ/121.*.*.22:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
The /etc/hosts on the master node looks like:
127.0.0.1 localhost
127.0.1.1 localhost.localdomain localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.251.33.226 iZ28i1niuigZ
123.*.*.125 VM-120-50-ubuntu
And the /etc/hosts on the worker node contains the following entries:
10.141.120.50 VM-120-50-ubuntu
127.0.0.1 localhost localhost.localdomain
121.*.*.22 iZ28i1niuigZ
I cannot understand why the worker is unable to connect to the master.
========================================================================
update:
I cannot telnet 123.*.*.125 7077, while I can telnet 123.*.*.125
When I run iptables -L -n, I see the following output:
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Spark worker can not connect to Master

While starting the worker node I get the following error:
Spark Command: /usr/lib/jvm/default-java/bin/java -cp /home/ubuntu/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ip-1-70-44-5:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/16 19:19:10 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/10/16 19:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/16 19:19:11 INFO SecurityManager: Changing view acls to: ubuntu
15/10/16 19:19:11 INFO SecurityManager: Changing modify acls to: ubuntu
15/10/16 19:19:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
15/10/16 19:19:12 INFO Slf4jLogger: Slf4jLogger started
15/10/16 19:19:12 INFO Remoting: Starting remoting
15/10/16 19:19:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker#1.70.44.4:55126]
15/10/16 19:19:12 INFO Utils: Successfully started service 'sparkWorker' on port 55126.
15/10/16 19:19:12 INFO Worker: Starting Spark worker 1.70.44.4:55126 with 2 cores, 2.9 GB RAM
15/10/16 19:19:12 INFO Worker: Running Spark version 1.5.1
15/10/16 19:19:12 INFO Worker: Spark home: /home/ubuntu/spark-1.5.1-bin-hadoop2.6
15/10/16 19:19:12 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/10/16 19:19:12 INFO WorkerWebUI: Started WorkerWebUI at http://1.70.44.4:8081
15/10/16 19:19:12 INFO Worker: Connecting to master ip-1-70-44-5:7077...
15/10/16 19:19:24 INFO Worker: Retrying connection to master (attempt # 1)
15/10/16 19:19:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-5,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#1c5651e9 rejected from java.util.concurrent.ThreadPoolExecutor#671ba687[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:211)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:210)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters(Worker.scala:210)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:288)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/10/16 19:19:24 INFO ShutdownHookManager: Shutdown hook called
I have added the hostnames to the conf/slaves file. I don't know which environment variables to set in spark-env.sh, so right now it's not being used.
Any pointers to the solution?
Also, if I should use spark-env.sh, which environment variables should I set?
Setup details:
2 Ubuntu 14 machines with 2 cores each.
Please advise.
Thanks
So, after some tinkering I found that the slave was not able to communicate with the master on the given port. I changed the security access rules to allow all TCP traffic on all ports, which solved the problem.
To check whether the port is open:
telnet master.ip master.port
The default port is 7077.
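If telnet is not installed on the worker, the same check can be sketched in a few lines of Python. This is only a rough equivalent of the telnet test above: the function name can_connect and the 3 second timeout are my own choices, not anything Spark provides. Run it on the worker against the master's address and port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper, roughly equivalent to `telnet host port`:
    it only tests reachability and says nothing about what is listening.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution errors.
        return False


# Example call (replace the placeholders with your master's address and port):
# can_connect("master.ip", 7077)
```

A False result here means the worker cannot reach that port at all, so the problem is in the firewall, security group or bind address rather than in the Spark configuration itself.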
My spark-env.sh:
export SPARK_WORKER_INSTANCES=2
export SPARK_MASTER_IP=<ip address>
I'm afraid your hostname may be invalid to Spark, and you have to change your spark-env.sh.
You can set the variable SPARK_MASTER_IP to the real IP of the master instead of its hostname.
e.g.
export SPARK_MASTER_IP=1.70.44.5
instead of
export SPARK_MASTER_IP=ip-1-70-44-5
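Before switching to the raw IP, it can help to see what the hostname actually resolves to on each node. Below is a minimal Python sketch for that, assuming IPv4 only; resolve_ipv4 and is_loopback are hypothetical helper names of mine. If the master's hostname resolves to a loopback address (127.x.x.x) on the master itself, or to a different address on the worker, that mismatch is a likely suspect.

```python
import ipaddress
import socket
from typing import List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to.

    Uses the node's normal resolver, so /etc/hosts entries are honoured.
    """
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def is_loopback(address: str) -> bool:
    """Return True for addresses in 127.0.0.0/8 (or ::1 for IPv6)."""
    return ipaddress.ip_address(address).is_loopback


# Example call (run on both the master and the worker, then compare):
# for addr in resolve_ipv4("ip-1-70-44-5"):
#     print(addr, "loopback" if is_loopback(addr) else "routable")
```

If resolution fails, getaddrinfo raises socket.gaierror, which in itself tells you the hostname is unknown on that node.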

Resources