I'm in trouble with Spark. I have a Spark standalone cluster with 2 nodes,
master: 121.*.*.22(hostname is iZ28i1niuigZ)
worker: 123.*.*.125(hostname is VM-120-50-ubuntu).
I have edited the slaves file and added 123.*.*.125.
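For reference, the conf/slaves file on the master then simply lists one worker per line, roughly like this (the comment line is the stock template header, reproduced here from memory):

# A Spark Worker will be started on each of the machines listed below.
123.*.*.125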
There is no worker info on the master's web UI:
(screenshot of the Spark master web UI, showing no workers)
When executing the start script I see the following messages:
spark@iZ28i1niuigZ:~/spark-2.0.1-bin-hadoop2.7$ sh sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.master.Master-1-iZ28i1niuigZ.out
123.*.*.125: starting org.apache.spark.deploy.worker.Worker, logging to /home/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-VM-120-50-ubuntu.out
The spark-env.sh file contents are:
export SPARK_MASTER_IP=121.*.*.22
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORDER_INSTANCES=1
export SPARK_WORKER_MEMORY=1g
export JAVA_HOME=/home/spark/jdk1.8.0_101
On the worker I can see the following output:
Spark Command: /home/spark/jdk1.8.0_101/bin/java -cp /home/spark/spark-2.0.1-bin-hadoop2.7/conf/:/home/spark/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://iZ28i1niuigZ:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/30 20:04:56 INFO Worker: Started daemon with process name: 28287@VM-120-50-ubuntu
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for TERM
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for HUP
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for INT
16/11/30 20:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/30 20:04:56 INFO SecurityManager: Changing view acls to: spark
16/11/30 20:04:56 INFO SecurityManager: Changing modify acls to: spark
16/11/30 20:04:56 INFO SecurityManager: Changing view acls groups to:
16/11/30 20:04:56 INFO SecurityManager: Changing modify acls groups to:
16/11/30 20:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
16/11/30 20:04:57 INFO Utils: Successfully started service 'sparkWorker' on port 41544.
16/11/30 20:04:57 INFO Worker: Starting Spark worker 10.141.120.50:41544 with 1 cores, 1024.0 MB RAM
16/11/30 20:04:57 INFO Worker: Running Spark version 2.0.1
16/11/30 20:04:57 INFO Worker: Spark home: /home/spark/spark-2.0.1-bin-hadoop2.7
16/11/30 20:04:57 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/11/30 20:04:57 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://10.141.120.50:8081
16/11/30 20:04:57 INFO Worker: Connecting to master iZ28i1niuigZ:7077...
16/11/30 20:04:58 WARN Worker: Failed to connect to master iZ28i1niuigZ:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to iZ28i1niuigZ/121.*.*.22:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.ConnectException: Connection refused: iZ28i1niuigZ/121.*.*.22:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/11/30 20:05:08 INFO Worker: Retrying connection to master (attempt # 1)
16/11/30 20:05:08 INFO Worker: Connecting to master iZ28i1niuigZ:7077...
16/11/30 20:05:08 WARN Worker: Failed to connect to master iZ28i1niuigZ:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to iZ28i1niuigZ/121.*.*.22:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.ConnectException: Connection refused: iZ28i1niuigZ/121.*.*.22:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
The /etc/hosts on the master node looks like:
127.0.0.1 localhost
127.0.1.1 localhost.localdomain localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.251.33.226 iZ28i1niuigZ
123.*.*.125 VM-120-50-ubuntu
And the /etc/hosts on the worker node contains the following entries:
10.141.120.50 VM-120-50-ubuntu
127.0.0.1 localhost localhost.localdomain
121.*.*.22 iZ28i1niuigZ
I cannot understand why the worker is unable to connect to the master.
========================================================================
Update:
I cannot telnet to 123.*.*.125 on port 7077, although I can telnet to 123.*.*.125 itself.
When executing iptables -L -n, I see the following output:
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
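Since iptables shows ACCEPT policies and no rules, a firewall is probably not the cause. Some additional checks that could help narrow this down (the commands below are standard Linux tools, suggested as a sketch; adjust to what is installed):

# on the master node: check which address port 7077 is bound to
ss -tlnp | grep 7077        # or: netstat -tlnp | grep 7077
# if it is bound only to an internal address (e.g. 10.251.33.226) instead of 0.0.0.0,
# a worker connecting over the public IP will get "Connection refused"
# from the worker node: test whether the master's RPC port is reachable at all
nc -zv 121.*.*.22 7077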
Related
Server 1 : Master, Slave Node
Server 2 : Slave Node
Server 3 : Slave Node
When I submit the pi.py example to the master node, many jobs finish with exit code 1.
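(For reference, the submission is roughly of the following form; the master URL is a placeholder rather than the exact command:)

./bin/spark-submit --master spark://<master-host>:7077 examples/src/main/python/pi.py 10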
The log messages on the worker node show the same thing, as below.
However, I don't know the exact reason.
20/03/12 13:21:54 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 7571@803acbaf5fbf
20/03/12 13:21:54 INFO SignalUtils: Registered signal handler for TERM
20/03/12 13:21:54 INFO SignalUtils: Registered signal handler for HUP
20/03/12 13:21:54 INFO SignalUtils: Registered signal handler for INT
20/03/12 13:21:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/12 13:21:55 INFO SecurityManager: Changing view acls to: spark,root
20/03/12 13:21:55 INFO SecurityManager: Changing modify acls to: spark,root
20/03/12 13:21:55 INFO SecurityManager: Changing view acls groups to:
20/03/12 13:21:55 INFO SecurityManager: Changing modify acls groups to:
20/03/12 13:21:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, root); groups with view permissions: Set(); users with modify permissions: Set(spark, root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:285)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to 67f75f899bac:43487
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: 67f75f899bac
at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:202)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:48)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:182)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:168)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:985)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:505)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:416)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:475)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Could you give me some advice?
This means that the worker nodes are not able to reach the master node. Because you are running Spark inside Docker, the worker and master containers must be able to communicate with each other. Update /etc/hosts on all nodes with the correct IP addresses.
You can also update the Docker host's /etc/hosts and mount it into the containers as a volume with -v /etc/hosts:/etc/hosts:rw.
Add --network host to your run command so the container's ports map directly onto the Docker host. Change the hostname of the containers.
Add -e SPARK_MASTER_URL=spark://YOUR_HOST:7077 to your run command.
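A rough sketch of how these options could be combined in a single run command (the image name, container name, and start-script path are placeholders, and in practice you would pick only the options that fit your setup):

docker run -d --name spark-worker \
  --network host \
  -v /etc/hosts:/etc/hosts:rw \
  -e SPARK_MASTER_URL=spark://YOUR_HOST:7077 \
  your-spark-image \
  /opt/spark/sbin/start-slave.sh spark://YOUR_HOST:7077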
I found this thread with a similar problem.
Could you check your ACLs?
I have been working on creating a Spark cluster using 1 master and 4 workers on Linux.
It works fine for one worker. When I try to add more than one worker, only the first worker gets registered with the master, while the rest fail with the error below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/08/06 14:17:39 INFO Worker: Started daemon with process name: 24104@barracuda5
18/08/06 14:17:39 INFO SignalUtils: Registered signal handler for TERM
18/08/06 14:17:39 INFO SignalUtils: Registered signal handler for HUP
18/08/06 14:17:39 INFO SignalUtils: Registered signal handler for INT
18/08/06 14:17:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/06 14:17:39 INFO SecurityManager: Changing view acls to: barracuda5
18/08/06 14:17:39 INFO SecurityManager: Changing modify acls to: barracuda5
18/08/06 14:17:39 INFO SecurityManager: Changing view acls groups to:
18/08/06 14:17:39 INFO SecurityManager: Changing modify acls groups to:
18/08/06 14:17:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(barracuda5); groups with view permissions: Set(); users with modify permissions: Set(barracuda5); groups with modify permissions: Set()
18/08/06 14:17:40 INFO Utils: Successfully started service 'sparkWorker' on port 46635.
18/08/06 14:17:40 INFO Worker: Starting Spark worker 10.0.6.6:46635 with 4 cores, 14.7 GB RAM
18/08/06 14:17:40 INFO Worker: Running Spark version 2.1.0
18/08/06 14:17:40 INFO Worker: Spark home: /usr/lib/spark/spark-2.1.0-bin-hadoop2.7
18/08/06 14:17:40 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
18/08/06 14:17:40 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://10.0.6.6:8081
18/08/06 14:17:40 INFO Worker: Connecting to master Cudatest.533gwuzexxzehbkoeqpn4rgs4d.ux.internal.cloudapp.net:7077...
18/08/06 14:17:40 WARN Worker: Failed to connect to master Cudatest.533gwuzexxzehbkoeqpn4rgs4d.ux.internal.cloudapp.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to Cudatest.533gwuzexxzehbkoeqpn4rgs4d.ux.internal.cloudapp.net:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:517)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
Let me know if I have missed something here, or if anyone knows what might be the solution to this.
Thanks
I am very new to Apache Spark and trying to run spark on my local machine.
First I tried to start the master using the following command:
./sbin/start-master.sh
Which got successfully started. And then I tried to start the worker using
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
which eventually failed with the following log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/09 17:01:58 INFO Worker: Started daemon with process name: 9301@sumit-Inspiron-5537
17/06/09 17:01:58 INFO SignalUtils: Registered signal handler for TERM
17/06/09 17:01:58 INFO SignalUtils: Registered signal handler for HUP
17/06/09 17:01:58 INFO SignalUtils: Registered signal handler for INT
17/06/09 17:01:58 WARN Utils: Your hostname, sumit-Inspiron-5537 resolves to a loopback address: 127.0.1.1; using 192.168.1.16 instead (on interface wlp2s0)
17/06/09 17:01:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/06/09 17:01:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/09 17:01:59 INFO SecurityManager: Changing view acls to: sumit
17/06/09 17:01:59 INFO SecurityManager: Changing modify acls to: sumit
17/06/09 17:01:59 INFO SecurityManager: Changing view acls groups to:
17/06/09 17:01:59 INFO SecurityManager: Changing modify acls groups to:
17/06/09 17:01:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sumit); groups with view permissions: Set(); users with modify permissions: Set(sumit); groups with modify permissions: Set()
17/06/09 17:01:59 INFO Utils: Successfully started service 'sparkWorker' on port 35827.
17/06/09 17:02:00 INFO Worker: Starting Spark worker 192.168.1.16:35827 with 1 cores, 512.0 MB RAM
17/06/09 17:02:00 INFO Worker: Running Spark version 2.1.1
17/06/09 17:02:00 INFO Worker: Spark home: /home/sumit/spark-2.1.1-bin-hadoop2.7
17/06/09 17:02:00 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
17/06/09 17:02:00 WARN Utils: Service 'WorkerUI' could not bind on port 8082. Attempting port 8083.
17/06/09 17:02:00 INFO Utils: Successfully started service 'WorkerUI' on port 8083.
17/06/09 17:02:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://192.168.1.16:8083
17/06/09 17:02:00 INFO Worker: Connecting to master localhost:7077...
17/06/09 17:02:00 WARN Worker: Failed to connect to master localhost:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
This goes into a retry loop. I checked localhost on port 8080 and the master is working properly. Please suggest what can be done in this situation to get the slave up and working, because only then can the Spark job be run. Thank you.
I have a problem connecting to my Spark cluster.
My application (the driver) runs in a local environment and the Spark cluster runs in the cloud. When my application starts, it connects to the master successfully, but it fails to connect to the executors. I think it's a network problem, such as an ACL, but I can't solve it.
Please help me.
These are the error logs:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/14 18:57:25 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 46530@ambari004-airshow-jp2p-dev.lineinfra-dev.com
17/06/14 18:57:25 INFO SignalUtils: Registered signal handler for TERM
17/06/14 18:57:25 INFO SignalUtils: Registered signal handler for HUP
17/06/14 18:57:25 INFO SignalUtils: Registered signal handler for INT
17/06/14 18:57:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/14 18:57:26 INFO SecurityManager: Changing view acls to: irteam,dongyoung
17/06/14 18:57:26 INFO SecurityManager: Changing modify acls to: irteam,dongyoung
17/06/14 18:57:26 INFO SecurityManager: Changing view acls groups to:
17/06/14 18:57:26 INFO SecurityManager: Changing modify acls groups to:
17/06/14 18:57:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(irteam, dongyoung); groups with view permissions: Set(); users with modify permissions: Set(irteam, dongyoung); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:174)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:270)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
... 4 more
Caused by: java.io.IOException: Failed to connect to /10.70.22.192:59291
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection timed out: /10.70.22.192:59291
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
This is a user permission issue, at least that's what the log says.
You should start your Spark job from your on-prem driver node with a user ID that has access to the cluster.
Use an HDFS/Spark-level user to trigger your job.
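For example, something along these lines from the driver node (the user name, master URL, and script name are placeholders):

# submit as the user that owns the Spark/HDFS cluster instead of root
sudo -u spark spark-submit --master spark://<master-host>:7077 my_job.py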
I have a Spark cluster (7 × 2 cores) which is set up on Spark 2.0.2, next to an HDFS cluster.
When I use Jupyter to read some HDFS files, I see the app firing up, using 14 cores and 3, but all the workers fail to launch any task because they cannot connect to a strange "localhost" address on port 35529.
from pyspark.sql import SparkSession  # import added; 'master' and 'appName' are assumed to be defined earlier
spark = SparkSession.builder.master(master).appName(appName).config("spark.executor.instances", 3).getOrCreate()
sc = spark.sparkContext
hdfs_master = "hdfs://xx.xx.xx.xx:8020"
hdfs_path = "/logs/cycliste_debug/2017/2017_02/2017_02_20/23h/*"
infos = sc.textFile(hdfs_master + hdfs_path)
I see the following, which makes me think it is strange that 14 cores are allocated when only 3*2 should be possible (i.e. spark.executor.instances * number of CPUs per node):
Here is the cluster summary :
Executor Summary for app-20170227140938-0009 :
ExecutorID Worker Cores Memory State Logs
1488 worker-20170227125912-xx.xx.xx.xx-38028 2 1024 RUNNING stdout stderr
1489 worker-20170227125954-xx.xx.xx.xx-48962 2 1024 RUNNING stdout stderr
5 worker-20170227125959-xx.xx.xx.xx-48149 2 1024 RUNNING stdout stderr
1486 worker-20170227130012-xx.xx.xx.xx-47639 2 1024 RUNNING stdout stderr
1490 worker-20170227130027-xx.xx.xx.xx-44921 2 1024 RUNNING stdout stderr
1485 worker-20170227130152-xx.xx.xx.xx-50620 2 1024 RUNNING stdout stderr
1487 worker-20170227130248-xx.xx.xx.xx-42100 2 1024 RUNNING stdout stderr
and an example of the error for one worker:
stderr log page for app-20170227140938-0009/1488:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/27 14:37:57 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 5864@vpsxxx.ovh.net
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for TERM
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for HUP
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for INT
17/02/27 14:37:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/27 14:37:58 INFO SecurityManager: Changing view acls to: spark
17/02/27 14:37:58 INFO SecurityManager: Changing modify acls to: spark
17/02/27 14:37:58 INFO SecurityManager: Changing view acls groups to:
17/02/27 14:37:58 INFO SecurityManager: Changing modify acls groups to:
17/02/27 14:37:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
17/02/27 14:38:01 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:174)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:270)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:35529
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:35529
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
I understand there is a simple communication issue between two processes.
So here is the /etc/hosts:
127.0.0.1 localhost
193.xx.xx.xxx vpsxxxx.ovh.net vpsxxxx
Any ideas?
Check if SPARK_LOCAL_IP is set to the correct IP in each slave.
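For example, in conf/spark-env.sh on each slave (use that machine's own routable address; the value below just mirrors the host entry shown above):

# bind the worker to a reachable address instead of a loopback/localhost one
export SPARK_LOCAL_IP=193.xx.xx.xxx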