Unable to start Spark's "start-all.sh" on EC2 (rhel7)

I am trying to run standalone Spark 2.1.1 by running /sbin/start-all.sh on an EC2 instance (RHEL 7). Each time it runs, it asks for root@localhost's password, and even though I enter the correct password, it fails with this error: root@localhost's password: localhost: Permission denied, please try again.
Despite this error, when I run jps in the console I can see that the Master is running.
root@localhost# jps
27863 Master
28093 Jps
I also checked the logs and found this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/12 15:36:15 INFO Master: Started daemon with process name: 27863@localhost.org.xxxxxxxxx.com
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for TERM
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for HUP
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for INT
17/06/12 15:36:15 WARN Utils: Your hostname, localhost.org.xxxxxxxxx.com resolves to a loopback address: 127.0.0.1; using localhost ip instead (on interface eth0)
17/06/12 15:36:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/06/12 15:36:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/12 15:36:16 INFO SecurityManager: Changing view acls to: root
17/06/12 15:36:16 INFO SecurityManager: Changing modify acls to: root
17/06/12 15:36:16 INFO SecurityManager: Changing view acls groups to:
17/06/12 15:36:16 INFO SecurityManager: Changing modify acls groups to:
17/06/12 15:36:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/06/12 15:36:16 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/06/12 15:36:16 INFO Master: Starting Spark master at spark://localhost.org.xxxxxxxxx.com:7077
17/06/12 15:36:16 INFO Master: Running Spark version 2.1.1
17/06/12 15:36:16 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/06/12 15:36:16 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://localhost:8080
17/06/12 15:36:16 INFO Utils: Successfully started service on port 6066.
17/06/12 15:36:16 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/06/12 15:36:16 INFO Master: I have been elected leader! New state: ALIVE
I am trying to figure out why I am unable to start my worker nodes. Could someone help me out with this? Thanks.

Check whether your hostname resolves correctly.
If you're using localhost, make sure it has an entry in your /etc/hosts file.
Let me know if this helps. Cheers.
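
A rough way to run this check, offered as a sketch rather than part of the original answer: the log in the question warns that the hostname resolves to the loopback address 127.0.0.1, which typically happens when /etc/hosts lists the machine's own name on the 127.0.0.1 line. The helper below (loopback_names is a hypothetical name) prints every non-localhost name that a hosts file maps to a loopback address, so those names can be moved to the machine's real IP.

```shell
#!/usr/bin/env bash
# loopback_names [FILE]: print each name that FILE (default /etc/hosts)
# maps to a loopback address (127.x.x.x or ::1), skipping the usual
# localhost and ip6-* aliases. The function name is illustrative only.
loopback_names() {
  awk '
    { sub(/#.*/, "") }                      # drop comments
    $1 ~ /^127\./ || $1 == "::1" {
      for (i = 2; i <= NF; i++)
        if ($i !~ /^(localhost|ip6-)/) print $i
    }
  ' "${1:-/etc/hosts}"
}

# Example use:
#   loopback_names               # names to review in /etc/hosts
#   getent hosts "$(hostname)"   # what the hostname resolves to right now
```

If the machine's own hostname shows up in the output, Spark will likely bind to or advertise 127.0.0.1, which other nodes cannot reach.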

Related

How to solve "no org.apache.spark.deploy.worker.Worker to stop" issue?

I am using Spark standalone on Google Cloud, with 1 master and 4 worker nodes. When I start the cluster, I can see the master and workers running. But when I try stop-all.sh, I get the following issue. Maybe this is the reason I cannot run spark-submit. How can I solve this issue? The terminal output is below.
sparkuser@master:/opt/spark/logs$ jps
1867 Jps
sparkuser@master:/opt/spark/logs$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
sparkuser@master:/opt/spark/logs$ jps -lm
1946 sun.tools.jps.Jps -lm
1886 org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
sparkuser@master:/opt/spark/logs$ cat spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
Spark Command: /usr/lib/jvm/jdk1.8.0_202/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/10/13 04:28:23 INFO Master: Started daemon with process name: 1886@master
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for TERM
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for HUP
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for INT
22/10/13 04:28:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/13 04:28:24 INFO SecurityManager: Changing view acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing view acls groups to:
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls groups to:
22/10/13 04:28:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/10/13 04:28:24 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/10/13 04:28:24 INFO Master: Starting Spark master at spark://master:7077
22/10/13 04:28:24 INFO Master: Running Spark version 3.2.2
22/10/13 04:28:25 INFO Utils: Successfully started service 'MasterUI' on port 8080.
22/10/13 04:28:25 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://localhost:8080
22/10/13 04:28:25 INFO Master: I have been elected leader! New state: ALIVE
sparkuser@master:/opt/spark/logs$ stop-all.sh
worker2: no org.apache.spark.deploy.worker.Worker to stop
worker4: no org.apache.spark.deploy.worker.Worker to stop
worker1: no org.apache.spark.deploy.worker.Worker to stop
worker3: no org.apache.spark.deploy.worker.Worker to stop
stopping org.apache.spark.deploy.master.Master
sparkuser@master:/opt/spark/logs$

How to fix the "ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]" in Google cloud VMs

I am trying to create a cluster of VM instances on Google Cloud. There are 4 worker nodes and 1 master node.
Things that I have configured:
Created "sparkuser" and gave it sudo privileges.
Installed the same version of the Java JDK and JRE on all machines and configured the path.
Installed the same version of Scala and Spark.
Added the hosts file entries and hostnames; able to ssh between all machines.
Configured the "spark-env.sh" and "slaves" files in Spark on each machine.
However, when I run the command "start-master.sh", it starts Spark on all the VMs in the cluster, but the jps command shows no master or workers. I then checked the log file in /spark/log.
The log file contains the error, and I tried to solve it in various ways found in the developer community. Unfortunately, I am still not able to solve the issue.
I am adding the log file here:
sparkuser@master:~$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
sparkuser@master:~$ jps
3280 Jps
sparkuser@master:~$ cat /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out.6
cat: /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out.6: No such file or directory
sparkuser@master:~$ cat /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out.5
Spark Command: /usr/lib/jvm/java-11-openjdk-amd64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 35.216.27.9 --port 7100 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/09/30 07:09:21 INFO Master: Started daemon with process name: 3913@master
22/09/30 07:09:21 INFO SignalUtils: Registering signal handler for TERM
22/09/30 07:09:21 INFO SignalUtils: Registering signal handler for HUP
22/09/30 07:09:21 INFO SignalUtils: Registering signal handler for INT
22/09/30 07:09:22 WARN Utils: Your hostname, master resolves to a loopback address: 127.0.0.1; using 10.178.0.3 instead (on interface ens4)
22/09/30 07:09:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/09/30 07:09:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/09/30 07:09:22 INFO SecurityManager: Changing view acls to: sparkuser
22/09/30 07:09:22 INFO SecurityManager: Changing modify acls to: sparkuser
22/09/30 07:09:22 INFO SecurityManager: Changing view acls groups to:
22/09/30 07:09:22 INFO SecurityManager: Changing modify acls groups to:
22/09/30 07:09:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7100. Attempting port 7101.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7101. Attempting port 7102.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7102. Attempting port 7103.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7103. Attempting port 7104.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7104. Attempting port 7105.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7105. Attempting port 7106.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7106. Attempting port 7107.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7107. Attempting port 7108.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7108. Attempting port 7109.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7109. Attempting port 7110.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7110. Attempting port 7111.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7111. Attempting port 7112.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7112. Attempting port 7113.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7113. Attempting port 7114.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7114. Attempting port 7115.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7115. Attempting port 7116.
22/09/30 07:09:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.net.BindException: Cannot assign requested address: Service 'sparkMaster' failed after 16 retries (starting from 7100)! Consider explicitly setting the appropriate port for the service 'sparkMaster' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at java.base/sun.nio.ch.Net.bind0(Native Method)
at java.base/sun.nio.ch.Net.bind(Net.java:459)
at java.base/sun.nio.ch.Net.bind(Net.java:448)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
22/09/30 07:09:23 INFO ShutdownHookManager: Shutdown hook called

In the spark/conf/spark-env.sh file, add the following:
export SPARK_LOCAL_IP="127.0.0.1"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_DIR=/opt/spark/conf/slaves # use-case based path
export SPARK_LOG_DIR=/opt/spark/logs
Also make sure that you are able to SSH between all machines.
If scp between the machines runs without any error, the cluster will start. If SSH works but scp does not, remove the public keys and redo the key exchange.
I hope this works.
It worked for me.
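
One way to narrow this down, offered as a sketch rather than a confirmed fix: the log in the question shows the master launched with --host 35.216.27.9 while the interface ens4 carries 10.178.0.3, and "Cannot assign requested address" usually means the process tried to bind to an address that is not configured on any local interface. On Google Cloud the external IP is typically translated outside the VM and does not appear on its interface. The helper below (addr_is_local is a hypothetical name) checks an address against the output of ip -o addr show.

```shell
#!/usr/bin/env bash
# addr_is_local ADDR: succeed if ADDR is assigned to a local interface.
# Reads `ip -o addr show` style output on stdin, where field 3 is the
# address family and field 4 is ADDRESS/PREFIX. The name is illustrative.
addr_is_local() {
  awk -v want="$1" '
    $3 == "inet" || $3 == "inet6" {
      split($4, a, "/")            # strip the /prefix length
      if (a[1] == want) found = 1
    }
    END { exit !found }            # exit 0 only when the address was seen
  '
}

# Example use on the master:
#   ip -o addr show | addr_is_local 35.216.27.9 \
#     || echo "35.216.27.9 is not on a local interface; binding to it will fail"
```

If the check fails for the address passed to --host, pointing the master at an address or hostname that does resolve to a local interface is the usual next step.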

Apache Spark: worker can't connect to master but can ping and ssh from worker to master

I'm trying to set up an 8-node cluster on 8 RHEL 7.3 x86 machines using Spark 2.0.1. start-master.sh runs fine:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host lambda.foo.net --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:26:46 INFO Master: Started daemon with process name: 22181@lambda.foo.net
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:26:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:26:46 INFO SecurityManager: Changing view acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:26:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:26:46 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/12/08 04:26:46 INFO Master: Starting Spark master at spark://lambda.foo.net:7077
16/12/08 04:26:46 INFO Master: Running Spark version 2.0.1
16/12/08 04:26:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/12/08 04:26:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://19.341.11.212:8080
16/12/08 04:26:46 INFO Utils: Successfully started service on port 6066.
16/12/08 04:26:46 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/12/08 04:26:46 INFO Master: I have been elected leader! New state: ALIVE
But when I try to bring up the workers using start-slaves.sh, this is what I see in the workers' logs:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://lambda.foo.net:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:30:00 INFO Worker: Started daemon with process name: 14649@hawk040os4.foo.net
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:30:00 INFO SecurityManager: Changing view acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:30:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:30:00 INFO Utils: Successfully started service 'sparkWorker' on port 35858.
16/12/08 04:30:00 INFO Worker: Starting Spark worker 15.242.22.179:35858 with 24 cores, 1510.2 GB RAM
16/12/08 04:30:00 INFO Worker: Running Spark version 2.0.1
16/12/08 04:30:00 INFO Worker: Spark home: /usr/local/bin/spark-2.0.1-bin-hadoop2.7
16/12/08 04:30:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/12/08 04:30:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://15.242.22.179:8081
16/12/08 04:30:00 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:00 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to lambda.foo.net/19.341.11.212:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.NoRouteToHostException: No route to host: lambda.foo.net/19.341.11.212:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/12/08 04:30:12 INFO Worker: Retrying connection to master (attempt # 1)
16/12/08 04:30:12 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:12 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
So it says "No route to host". But I could successfully ping the master from the worker node, as well as ssh from the worker to the master node.
Why does spark say "No route to host"?

Problem solved: the firewall was blocking the packets.
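
A successful ping and ssh only show that ICMP and port 22 get through; the Spark master port can still be rejected. A firewall that rejects connections with an ICMP "host prohibited" message commonly surfaces in Java as "No route to host". The sketch below, using a hypothetical helper named port_open, tests the master port directly from a worker. The firewall-cmd lines in the comments assume firewalld is the active firewall on the master (the default on RHEL 7) and are one possible way to open the port, not necessarily what was done here.

```shell
#!/usr/bin/env bash
# port_open HOST PORT [TIMEOUT_SECONDS]: succeed if a TCP connection to
# HOST:PORT can be opened within the timeout (default 3 seconds).
# Relies on bash's /dev/tcp redirection and the coreutils `timeout`
# command. The function name is illustrative only.
port_open() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example use on a worker:
#   port_open lambda.foo.net 7077 && echo "reachable" || echo "blocked or closed"
#
# On the master, if firewalld is in use (an assumption), the port could
# be opened with:
#   firewall-cmd --permanent --add-port=7077/tcp
#   firewall-cmd --reload
```

Workers also listen on their own ports, so traffic between cluster nodes may need to be allowed in both directions.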

Cannot run spark v2.0.0 example on cluster

So I have set up a Spark cluster. But I can't actually get it to work. When I submit the SparkPi example with:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://x.y.129.163:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 2 \
examples/jars/spark-examples_2.11-2.0.0.jar 1000
I get the following from the worker logs:
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java -cp /opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/* -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://mesos-master:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/18 09:20:56 INFO Worker: Started daemon with process name: 23949@mesos-slave-4.novalocal
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for TERM
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for HUP
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for INT
16/09/18 09:20:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/18 09:20:56 INFO SecurityManager: Changing view acls to: root
16/09/18 09:20:56 INFO SecurityManager: Changing modify acls to: root
16/09/18 09:20:56 INFO SecurityManager: Changing view acls groups to:
16/09/18 09:20:56 INFO SecurityManager: Changing modify acls groups to:
16/09/18 09:20:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/09/18 09:21:00 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/18 09:21:00 INFO Utils: Successfully started service 'sparkWorker' on port 55256.
16/09/18 09:21:00 INFO Worker: Starting Spark worker x.y.129.162:55256 with 4 cores, 6.6 GB RAM
16/09/18 09:21:00 INFO Worker: Running Spark version 2.0.0
16/09/18 09:21:00 INFO Worker: Spark home: /opt/spark/spark-2.0.0-bin-hadoop2.7
16/09/18 09:21:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/09/18 09:21:00 INFO WorkerWebUI: Bound WorkerWebUI to x.y.129.162, and started at http://x.y.129.162:8081
16/09/18 09:21:00 INFO Worker: Connecting to master mesos-master:7077...
16/09/18 09:21:00 INFO TransportClientFactory: Successfully created connection to mesos-master/x.y.129.163:7077 after 33 ms (0 ms spent in bootstraps)
16/09/18 09:21:00 INFO Worker: Successfully registered with master spark://x.y.129.163:7077
16/09/18 09:21:00 INFO Worker: Asked to launch driver driver-20160918090435-0001
16/09/18 09:21:01 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker#x.y.129.162:55256" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
16/09/18 09:21:06 INFO DriverRunner: Command exited with status 1, re-launching after 1 s.
16/09/18 09:21:07 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker#x.y.129.162:55256" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
16/09/18 09:21:12 INFO DriverRunner: Command exited with status 1, re-launching after 1 s.
i.e. the job/driver appears to be failing and then retrying indefinitely.
When I look at the stderr from the driver on the worker node I see:
Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker#x.y.129.162:33364" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
========================================
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/18 09:13:18 INFO SecurityManager: Changing view acls to: root
16/09/18 09:13:18 INFO SecurityManager: Changing modify acls to: root
16/09/18 09:13:18 INFO SecurityManager: Changing view acls groups to:
16/09/18 09:13:18 INFO SecurityManager: Changing modify acls groups to:
16/09/18 09:13:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/09/18 09:13:21 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
Exception in thread "main" java.net.BindException: Cannot assign requested address: Service 'Driver' failed after 16 retries! Consider explicitly setting the appropriate port for the service 'Driver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:463)
at sun.nio.ch.Net.bind(Net.java:455)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
(Excuse the timestamps; the logs are the same later on.)
On the master I have:
# /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 mesos-master
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.y.129.155 mesos-slave-1
x.y.129.161 mesos-slave-2
x.y.129.160 mesos-slave-3
x.y.129.162 mesos-slave-4
# conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=x.y.129.163
SPARK_LOCAL_IP=x.y.129.163
And for the worker I have:
# /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 mesos-slave-4 mesos-slave-4.novalocal
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.y.129.163 mesos-master
# conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=x.y.129.163
SPARK_LOCAL_IP=x.y.129.162
I also disabled IPv6 entirely in /etc/sysctl.conf.
All daemons are started with the sbin/start-master.sh and sbin/start-slave.sh spark://x.y.129.163:7077 commands.
Update: I ran spark-submit again, this time without --deploy-mode cluster, and it works! Any idea why it fails in cluster mode?
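
For what it's worth, "Cannot assign requested address" is what bind() reports when the requested address is not configured on any interface of the machine doing the binding, and in cluster mode the driver is launched on one of the workers rather than on the machine running spark-submit. One thing that may be worth ruling out is whether the address the driver tries to bind actually exists on that worker. A minimal sketch of such a check, assuming the iproute2 `ip` tool is installed; has_local_address is a made-up helper name, not part of Spark:

```shell
# has_local_address IP
# Reads `ip -o addr show` output on stdin and succeeds (exit 0) only if IP is
# assigned to one of the listed interfaces. Hypothetical helper, not part of Spark.
has_local_address() {
  awk -v ip="$1" '
    { split($4, parts, "/"); if (parts[1] == ip) found = 1 }  # $4 is ADDRESS/PREFIX
    END { exit(found ? 0 : 1) }
  '
}

# Example, run on a worker (x.y.129.162 is the worker address from the question):
#   ip -o addr show | has_local_address x.y.129.162 && echo "address is local"
```

If the check fails for the SPARK_LOCAL_IP value that ends up in the driver's environment, that would be consistent with the BindException above, though it does not by itself prove that this is the cause.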

Setting Spark master IP

I have a Spark worker that can't connect to its master because of an IP issue.
When I run start-all.sh on the master (whose name is 'pl'), I get the following in the slave log:
16/02/12 21:28:35 INFO WorkerWebUI: Started WorkerWebUI at http://192.168.0.38:8081
16/02/12 21:28:35 INFO Worker: Connecting to master pl:7077...
16/02/12 21:28:35 WARN Worker: Failed to connect to master pl:7077
java.io.IOException: Failed to connect to pl/192.168.0.39:7077
Here is my /etc/hosts file:
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 wk
192.168.0.39 pl
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
It seems the Spark worker is confused between the master's name and its IP address. How should I set this up?
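
One way to see what a given name maps to in a hosts file is to list its entries and flag the loopback ones. The sketch below assumes the standard /etc/hosts layout (address first, then names); addresses_for_host and is_loopback are hypothetical helper names introduced only for illustration:

```shell
# addresses_for_host NAME
# Reads a hosts-format file on stdin and prints each address mapped to NAME.
addresses_for_host() {
  awk -v name="$1" '
    { sub(/#.*/, "") }                                   # strip comments
    { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
  '
}

# is_loopback ADDRESS
# Succeeds for IPv4 127.x.x.x addresses and for the IPv6 loopback ::1.
is_loopback() {
  case "$1" in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: report loopback entries for this machine's own hostname.
#   for addr in $(addresses_for_host "$(hostname)" < /etc/hosts); do
#     is_loopback "$addr" && echo "$(hostname) maps to loopback address $addr"
#   done
```

With the hosts file shown above, wk maps to 127.0.1.1 (a loopback address) and pl maps to 192.168.0.39.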
Another question: looking at the master's logs, it seems the master is listening on a different port (7078) from the one the worker is trying to reach (7077), because it failed to bind to the first port it tried.
romain@pl:~/spark-1.6.0-bin-hadoop2.6/logs$ cat spark-romain-org.apache.spark.deploy.master.Master-1-pl.out
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -cp /home/romain/spark-1.6.0-bin-hadoop2.6/conf/:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip pl --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/12 21:28:35 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/02/12 21:28:35 WARN Utils: Your hostname, pl resolves to a loopback address: 127.0.1.1; using 192.168.0.39 instead (on interface eth0)
16/02/12 21:28:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/12 21:28:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/12 21:28:35 INFO SecurityManager: Changing view acls to: romain
16/02/12 21:28:35 INFO SecurityManager: Changing modify acls to: romain
16/02/12 21:28:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(romain); users with modify permissions: Set(romain)
16/02/12 21:28:36 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
16/02/12 21:28:36 INFO Utils: Successfully started service 'sparkMaster' on port 7078.
16/02/12 21:28:36 INFO Master: Starting Spark master at spark://pl:7078
16/02/12 21:28:36 INFO Master: Running Spark version 1.6.0
16/02/12 21:28:36 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
16/02/12 21:28:36 WARN Utils: Service 'MasterUI' could not bind on port 8081. Attempting port 8082.
16/02/12 21:28:36 INFO Utils: Successfully started service 'MasterUI' on port 8082.
16/02/12 21:28:36 INFO MasterWebUI: Started MasterWebUI at http://192.168.0.39:8082
16/02/12 21:28:36 WARN Utils: Service could not bind on port 6066. Attempting port 6067.
16/02/12 21:28:36 INFO Utils: Successfully started service on port 6067.
16/02/12 21:28:36 INFO StandaloneRestServer: Started REST server for submitting applications on port 6067
16/02/12 21:28:36 INFO Master: I have been elected leader! New state: ALIVE
But what is strange is that the local worker's log looks as if it successfully connected to the local master on port 7077:
16/02/12 21:28:38 INFO Worker: Connecting to master pl:7077...
16/02/12 21:28:38 INFO Worker: Successfully registered with master spark://pl:7077

You can try running netstat -pna | grep 7077 (needs root privileges) on the master to see which process is holding the port.
Maybe you have another driver instance running. If a Java process is holding the port, you can use jps to find out more about it.
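
A minimal sketch of that check, assuming the Linux net-tools netstat column layout (local address in column 4, state in column 6); listeners_on_port is a hypothetical helper name. Unlike a plain grep, it skips connections that merely talk to port 7077 from some other local port:

```shell
# listeners_on_port PORT
# Filters `netstat -pna`-style lines on stdin, keeping only sockets in the
# LISTEN state whose local address ends in :PORT.
listeners_on_port() {
  awk -v port="$1" '$6 == "LISTEN" && $4 ~ (":" port "$")'
}

# Example, run on the master (root is needed to see PIDs of other users' processes):
#   sudo netstat -pna | listeners_on_port 7077
# The last column is PID/program name; for a Java process, `jps -l` lists the
# main class of each running JVM so the PID can be matched to a Spark daemon.
```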
