How to handle dynamic port in Apache Spark / Hadoop Yarn - apache-spark

I have setup a Hadoop cluster with 1 name node and 2 data nodes. I've also installed Yarn and Spark on top of that in the name node.
I notice that whenever I try run the example jar here:
spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 10
I will always get the no route to host exception:
Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:558)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:277)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:926)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:925)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:925)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:957)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.io.IOException: Failed to connect to lnx-pen205/xx.xx.xx.xx:9222
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: lnx-pen205/xx.xx.xx.xx:9222
Caused by: java.net.NoRouteToHostException: No route to host
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
I noticed that the port being used will be randomly assigned during the runtime, the example .jar will work if for example I set the spark.driver.port as 9222 then opening said port with the firewall. But then if any other session is started (for example, pyspark shell), it wouldn't start as the port is already in use.
My question is: How do I allow connections to the ports dynamically defined by Spark/Yarn? I read somewhere that I should disable the firewall, but that does not sound like a good idea.. Thanks in advance.

There's spark.driver.port as well as spark.driver.blockManager.port. Both are starting ranges to spark.port.maxRetries (default 16).
So, you'll need to open at least 32 ports for these.
I did some testing with dynamic Spark ports in Mesos + Docker a few years ago - https://stackoverflow.com/a/56486271/2308683

Related

Apache Spark Cluster\Windows: Getting Connection refused: no further information from worker node

I need help please getting spark working in windows cluster of 3 nodes. I am able to download and configure and run the master node and worker nodes. the worker nodes are registered successfully with the master. Im able to see both worker nodes in the Master UI. When I try to submit a job using:
spark-submit --master spark://IP:7077 hello_world.py
Spark continuously try to start multiple executors but the all failed with code exit 1 and it doesnt stop until I kill it. when I check the log in the UI for each worker Im seeing the following error:
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:424)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:413)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:444)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
... 4 more
**Caused by: java.io.IOException: Failed to connect to <Master DNS>/<Master IP>:56785
at **org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: <Master DNS>/<Master IP>:56785
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Im using Spark: spark-3.3.1-bin-hadoop3
Please help.
Thanks
The application to run

Ignite Yarn and Hortonworks

I'm trying to deploy ignite so that I can use the shared RDD/Dataframe cache for my spark cluster. I've followed the spark install instructions and choose to deploy into my existing yarn cluster running spark. I'm using HDP to deploy spark.
I've already verified that Resource Manager and History server are listening on the ports below and I can telnet to each port. What am I doing wrong? Am I not deploying this the way it is intended?
I'm running:
yarn jar ignite-yarn-2.6.0.jar ./ignite-yarn-2.6.0.jar ../../../cluster.properties
Error below:
18/09/24 22:13:38 INFO client.RMProxy: Connecting to ResourceManager at dev01clus02.dna.local/172.31.31.5:8050
18/09/24 22:13:38 INFO client.AHSProxy: Connecting to Application History server at dev01clus02.dna.local/172.31.31.5:10200
Exception in thread "main" java.lang.RuntimeException: Failed update ignite.
at org.apache.ignite.yarn.IgniteProvider.updateIgnite(IgniteProvider.java:243)
at org.apache.ignite.yarn.IgniteProvider.getIgnite(IgniteProvider.java:93)
at org.apache.ignite.yarn.IgniteYarnClient.getIgnite(IgniteYarnClient.java:194)
at org.apache.ignite.yarn.IgniteYarnClient.main(IgniteYarnClient.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:675)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at org.apache.ignite.yarn.IgniteProvider.updateIgnite(IgniteProvider.java:220)
... 9 more
It looks like a piece of insfrastructure, provided by GridGain for Apache Ignite project, does not work currently. I'll raise the issue.
In the meantime, you can provide IGNITE_PATH property (in config, system properties or env) pointed to unzipped Apache Ignite 2.6 distribution directory to avoid downloading attempts altogether.

spark-submit unable to connect

After running the command
spark-submit --class org.apache.spark.examples.SparkPi --proxy-user yarn --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue default ./examples/jars/spark-examples_2.11-2.3.0.jar 10000
I get this in the output and it keeps on retrying. Where am I going wrong? Am I missing some configuration?
I have created a new user for yarn and running that user.
WARN Utils:66 - Your hostname, ukaleem-HP-EliteBook-850-G3 resolves to a loopback address: 127.0.1.1; using 10.XX.XX.XX instead (on interface enp0s31f6)
2018-06-14 16:50:41 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Warning: Local jar /home/yarn/Documents/Scala-Examples/./examples/jars/spark-examples_2.11-2.3.0.jar does not exist, skipping.
2018-06-14 16:50:42 INFO RMProxy:98 - Connecting to ResourceManager at /0.0.0.0:8032
2018-06-14 16:50:44 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
And in the end, it gives the exception
Exception in thread "main" java.net.ConnectException: Call From ukaleem-HP-EliteBook-850-G3/127.0.1.1 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy9.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:487)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:155)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:155)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:154)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1146)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1518)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:179)
at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:177)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 28 more
2018-06-14 17:10:53 INFO ShutdownHookManager:54 - Shutdown hook called
2018-06-14 17:10:53 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-5bddb7f3-165f-451c-8ab4-bb7729f4237c
EDIT : After adding config files to my spark/conf dir, I get this error now.
The files I added are
*core-site.xml
dfs.hosts
masters
slaves
yarn-site.xml*
And some more. What I understand is that I only need yarn-site.xml to tell spark the location of the yarn cluster. (ids, address, hostname etc).
All this time I had been thinking that even we want to submit a job on Yarn these config need to go in /etc/Hadoop dir not in Spark/conf. Whats the purpose of installing hadoop then (other than communicating)?
And following this question. If the config need to go in spark/conf then HADOOP_CONF_DIR & YARN_CONF_DIR should point to etc/hadoop dir or spark/conf?
INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
18/06/19 11:04:50 INFO retry.RetryInvocationHandler: Exception while invoking getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 38176ms.
java.net.ConnectException: Call From ukaleem-HP-EliteBook-850-G3/127.0.1.1 to svc-hadoop-mgnt-pre-c2-01.jamba.net:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:487)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:155)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:155)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:59)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:154)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1146)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1518)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:179)
at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:177)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 29 more
Assuming you have a fully distributed yarn cluster: your spark-submit script is unable to find the configuration for the yarn resourcemanager (basically the yarn master node). Ensure you have HADOOP_CONF_DIR properly set in your environment, and that it points to your cluster's configuration. Specifically your yarn-site.xml.
Edit: more detail
The hadoop package comes with both server and client software. The server software would be the many daemons that run that make up the cluster. If your workstation is acting as a client (using that term loosely, not fully related to sparks --deploy-mode), then the hadoop client software must know the network locations of the server daemons running in the cluster. If your yarn-site.xml is empty, then it is pulling it's default values from yarn-defauls.xml (which is hard-coded, I believe).
Assuming your cluster is not running in HA mode, and is a mostly default configuration, then your workstation's yarn-site.xml should contain at least an entry like the following:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>rm-host.yourdomain.com</value>
</property>
Obviously, replace the hostname with the hostname where your actual resource manager is running. Of course, any spark interaction with HDFS will require a properly configured hdfs-site.xml, etc.
Some cluster managing software will have something like "generate client configs" (thinking of my cloudera experience specifically), which will give you a .tar.gz with all of the config files correctly populated to access the cluster from an external workstation.
Further recommendations:
If you plan to do spark on yarn a lot in this cluster, spark recommends making sure that you have the external shuffle service configured to launch with your yarn node managers. (Please bear in mind, this config directive would have to be present in the yarn-site.xml where yarn's node manager services are running, not on your workstation.
If you are running this on your local machine,
Update your /etc/hosts file, Enter 127.0.0.1 against your hostname.

Worker failed to connect to master in Spark Apache

I'm deploying a Spark Apache application using standalone cluster manager. My architecture uses 2 Windows machines: one set as a master, and another set as a slave (worker).
Master: on which I run: \bin>spark-class org.apache.spark.deploy.master.Master and this is what the web UI shows:
Slave: on which I run: \bin>spark-class org.apache.spark.deploy.worker.Worker spark://192.*.*.186:7077 and this what what the web UI shows:
The problem is that the worker node can not connect to the master node and shows the following error:
17/09/26 16:05:17 INFO Worker: Connecting to master 192.*.*.186:7077...
17/09/26 16:05:22 WARN Worker: Failed to connect to master 192.*.*.186:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /192.*.*.186:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: /192.*.*.186:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
What can be the case of this error knowing that the firewall is disabled for both machines and I tested the connection between them both (using nmap) and everything is OK! But using telnet I receive this error: Connecting To 192.*.*.186...Could not open connection to the host, on port 23: Connect failed
Can you show me your spark-env.sh conf? This would help to pinpoint your problem.
My first idea is that you need to export SPARK_MASTER_HOST=(master ip) instead of SPARK_MASTER_IP in spark-env.sh file. You need to do it for both master and slave. Also export SPARK_LOCAL_IP for both master and slave.
You need to set your environment path to SPARK_MASTER_HOST & SPARK_LOCAL_HOST to localhost.
SPARK_LOCAL_IP & SPARK_MASTER_IP is now deprecated.

Spark ec2 web services start up error

I am starting spark cluster in amazon EC2. In web UI I could see 2 instances(1 master and 1 slave) running and I am able to ssh to those instances. I opened up port 7077 inbound. When I try to launch the spark shell using the below command, I am getting an error. Any help would be appreciated.
spark-shell --master spark://ec2-54-173-210-192.compute-1.amazonaws.com:7077
Logs:
java.io.IOException: Failed to connect to ec2-54-173-210-192.compute-1.amazonaws.com/54.173.210.192:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: ec2-54-173-210-192.compute-1.amazonaws.com/54.173.210.192:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Not really solving the problem but just a suggestion: If you want a Spark cluster on Amazon, try Spark on Amazon EMR before EC2. EMR lets you launch a managed cluster of any size to which Spark can be added as a preinstalled application. No need to configure hosts/ports yourself.

Resources